Introduction
Model selection is the process of finding or adjusting the model parameters for a classification problem. For BCI systems, model selection is a crucial part of the design. It may include selecting the features, the type of feature extractor, the classifier, the EEG channels, the neurological phenomenon, the frequency band of interest, the values of the classifier’s parameters, and the preprocessing and post-processing components. For example, to find the optimal set of features for a certain BCI, different sets of features are considered. For every set, the performance of the system is calculated, and the resulting performances are compared. The set of features that yields the best performance is then selected. The performance of this best model can then be compared with the performance achieved by similar BCI systems (i.e., systems with the same experimental and evaluation protocols). Therefore, the performance of a self-paced BCI (SBCI) must be evaluated in two cases: 1) during the model selection procedure and 2) when comparing the performance with that of other systems.
The performance of a BCI with discrete states is usually summarized by a confusion matrix. The (i,j) entry of this matrix represents the number of samples from class i that are classified as belonging to class j. A confusion matrix provides valuable information regarding how well each class is classified by the BCI system. It is, however, not usually straightforward to compare different confusion matrices. Evaluation metrics are thus needed to summarize a confusion matrix into a single value. For classification problems with balanced datasets, such as synchronized BCI systems, the overall classification accuracy (OA) is the most common evaluation metric presently used to summarize the performance [SCH07]. For problems with highly imbalanced classes, however, the use of OA is not satisfactory [PRO01].
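As a minimal sketch of the definitions above, the following builds a confusion matrix whose (i,j) entry counts the samples of class i classified as class j, and then summarizes it as OA. The function names and the label vectors are hypothetical and chosen only for illustration.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Entry (i, j) counts the samples of class i classified as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_class, predicted_class in zip(true_labels, predicted_labels):
        matrix[true_class][predicted_class] += 1
    return matrix


def overall_accuracy(matrix):
    """Fraction of all samples that lie on the diagonal of the matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Hypothetical labels for a balanced 2-class problem (illustrative data only).
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=2)
oa = overall_accuracy(cm)
print(cm, oa)
```

The single value returned by overall_accuracy is easy to compare across systems, but it discards the per-class information that the full matrix carries.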
The choice of the evaluation metric is of great importance and is application-dependent. A poorly defined evaluation metric may guide the model selection procedure to a far-from-optimal model, or it may lead to erroneous conclusions when comparing the performances of two BCI systems. As a result, all the effort spent in the design of a sophisticated BCI may be lost simply because of a poor choice of evaluation metric. Recently, the choice of OA as the default evaluation metric has been questioned, even in classification applications with balanced datasets. Specifically, it was shown that for many applications, the area under the receiver operating characteristic curve (AUC) can summarize the performance better than OA [HUA05].
Performance Metrics
Various performance metrics have been proposed in the BCI literature for different problems. As an example, see the performance metrics used for the evaluation of BCI Competition III [BLA06] and the discussion of performance metrics in [MCF06]. It has been stated that comparing different metrics may be difficult, and that there is perhaps no single measure that is ideal for all applications [MCF06]. Some of the performance metrics used in BCI systems are as follows:
1) Overall Accuracy
OA is the fraction of test samples that a BCI system classifies correctly. It has been frequently used in evaluating synchronized BCI systems [YAM06], [BUT06], [INC06], [MUL06], [SEL06], [LEU06], [KAU06]. Its use in SBCIs, however, has so far been limited [BIR02]. This is because, for an SBCI system, OA assigns a very large weight to the more frequent no-control (NC) class and a very small weight to the less frequent intentional-control (IC) classes. This may lead to misleading conclusions about the performance of the system.
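The effect of class imbalance on OA can be illustrated with a small sketch. The counts below are hypothetical: a detector that never activates is tested on data in which the frequent class heavily outnumbers the rare one.

```python
def overall_accuracy(tp, fn, fp, tn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def true_positive_rate(tp, fn):
    """Fraction of rare-class (positive) samples that are detected."""
    return tp / (tp + fn)


# Hypothetical counts: 950 samples of the frequent class, 50 of the rare class.
# The detector never activates, so every rare-class sample is missed.
tp, fn, fp, tn = 0, 50, 0, 950

oa = overall_accuracy(tp, fn, fp, tn)
tpr = true_positive_rate(tp, fn)
print(oa, tpr)
```

Under these assumed counts OA is high even though no rare-class sample is ever detected, which is the kind of misleading conclusion described above.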
2) Information Transfer Rate (ITR)
The information transfer rate (ITR) has been specifically proposed for evaluating the performance of synchronized BCI systems [BLA06b], [BUT06], [WAN06]. The metric is based on the similarities between an SBCI and a communication channel and draws on Shannon’s communication theory. The rationale is that ITR measures the amount of information transferred between two reference points. The output Y of an SBCI is the interpretation (information) of the current state of the brain, and Y conveys this information to the downstream components. It is thus argued in [WOL98] that the amount of information in Y is a useful tool for comparing the results obtained from different synchronized BCI designs. It has also been argued, however, that ITR by itself is not a suitable single evaluation metric for an SBCI system, because the metric can have more than one maximum (see [FAT06] for a detailed discussion).
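The text above does not give a formula for ITR. One widely used definition for synchronized systems gives the bits per trial as B = log2(N) + P log2(P) + (1 - P) log2((1 - P)/(N - 1)), where N is the number of classes and P is the accuracy. The sketch below assumes this definition, and with it that all classes are equally likely and that errors are spread evenly over the wrong classes. The function names and the trial duration are hypothetical.

```python
import math


def bits_per_trial(n_classes, accuracy):
    """Bits per trial under the assumed formula (equiprobable classes)."""
    if n_classes < 2:
        raise ValueError("at least two classes are required")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    bits = math.log2(n_classes)
    # The terms below use the convention 0 * log2(0) = 0.
    if accuracy > 0.0:
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1.0:
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_classes - 1))
    return bits


def itr_bits_per_minute(n_classes, accuracy, trial_seconds):
    """Scale bits per trial by the number of trials per minute."""
    return bits_per_trial(n_classes, accuracy) * 60.0 / trial_seconds


# Hypothetical 4-class system, 90% accuracy, one decision every 4 seconds.
print(itr_bits_per_minute(4, 0.9, 4.0))
```

Under this formula a 2-class system at chance level (P = 0.5) transfers no information, while P = 0 and P = 1 both yield one bit per trial, so a single ITR value does not always identify a single operating point.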
3) Kappa
Cohen’s Kappa coefficient is a measure of agreement between two estimators [COH60]. Since it accounts for chance agreement, it is regarded as a more robust measure than OA [SCH07]. The values of Kappa vary between -1 and +1. Negative values correspond to a performance that is worse than random. Kappa can also be used to compare BCI systems with different numbers of classes, where classification accuracy is hard to use for comparison (see [TOW06] for more discussion). Papers that have used Kappa: [TOW06].
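As a sketch, Kappa can be computed from a confusion matrix with the standard definition kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the same quantity as OA) and p_e is the agreement expected by chance from the row and column totals. The function name and the example matrices are hypothetical.

```python
def cohens_kappa(matrix):
    """Cohen's Kappa of a square confusion matrix (rows: true, cols: predicted).

    Raises ZeroDivisionError in the degenerate case where chance agreement is 1
    (all samples belong to one class and are all assigned to it).
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(n)) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1.0 - expected)


# Hypothetical matrices (illustrative data only).
perfect = [[50, 0], [0, 50]]
chance = [[25, 25], [25, 25]]
never_activates = [[950, 0], [50, 0]]  # OA is 0.95, yet agreement is at chance

for name, m in [("perfect", perfect), ("chance", chance),
                ("never_activates", never_activates)]:
    print(name, cohens_kappa(m))
```

The last matrix shows why Kappa is regarded as more robust than OA: a classifier that always outputs the frequent class gets no credit beyond chance.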
4) ROC curve and the related measures
The receiver operating characteristic (ROC) curve is a popular tool for evaluating systems with imbalanced classes. The ROC curve depicts the relationship between the true positive rate (TPR) and the false positive rate (FPR). Popular methods that use the ROC curve for measuring performance employ one of the following two criteria: 1) the area under the ROC curve (AUC) is used as the fitness of the system [SCH07]; 2) a critical FPR value (FPRCritical) is defined, and the value of TPR at FPRCritical is used as the fitness [BOR04]. The advantage of using the ROC curve over the previous metrics is that it provides a whole range of solutions (in terms of the tradeoff between TPR and FPR).
One problem with using the ROC curve is that when it is plotted over the whole range of TPR and FPR, most SBCI systems produce a curve that is similar to a perfect ROC curve [MAS06]. The other (and perhaps more important) problem is that the ROC curve is computationally more demanding than other evaluation metrics. Several points need to be evaluated before a partial ROC curve is accurate enough for estimating the AUC. Similarly, several points need to be calculated in order to obtain the value of TPR at FPRCritical. Even if the ROC curve is estimated using the more computationally efficient algorithm described in [HUA05], it remains much more time-consuming than the metrics described above, as these only need the value of a single point to assess the performance. When ROC-based metrics are used to evaluate the performance and select a model from thousands of confusion matrices during a model selection procedure, the computational burden becomes problematic. For these reasons, evaluation metrics that summarize the performance based on a single evaluation of a confusion matrix are more desirable during the model selection procedure.
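The two ROC-based criteria can be sketched as follows. The example assumes that the classifier outputs a continuous score per sample and that a higher score means the positive class; the scores, labels, function names and the use of the trapezoidal rule for the area are assumptions made for illustration. Each ROC point corresponds to one decision threshold, that is, one confusion matrix, which is why these measures cost more than a single-point metric.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points from sweeping the threshold from high to low.

    labels must contain at least one positive (1) and one negative (0).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def tpr_at_fpr(points, fpr_critical):
    """Highest TPR among the ROC points whose FPR does not exceed the limit."""
    return max(tpr for fpr, tpr in points if fpr <= fpr_critical)


# Hypothetical classifier scores and true labels (illustrative data only).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

points = roc_points(scores, labels)
print(area_under_curve(points), tpr_at_fpr(points, 0.25))
```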
5) HF-difference
The HF-difference is a newly proposed metric that summarizes the confusion matrix [HUG99]. It is defined as the difference between the true positive rate (TPR) and the percentage of total activations that are incorrect (the false discovery rate (FDR) [BEN95]). The advantage of using the HF-difference is that it is sensitive to the ratio of false positives (FPs) to the total number of detections. The downside is that it does not consider the length of the NC periods.
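Following the definition above, a minimal sketch of the HF-difference is TPR minus FDR, with FDR taken as the number of false activations divided by the total number of activations. Both rates are expressed here as fractions rather than percentages. The function name, the example counts and the handling of the case with no activations are assumptions.

```python
def hf_difference(tp, fn, fp):
    """TPR minus FDR, both as fractions in [0, 1]."""
    tpr = tp / (tp + fn)
    activations = tp + fp
    # Assumption: with no activations at all, FDR is taken to be 0.
    fdr = fp / activations if activations > 0 else 0.0
    return tpr - fdr


# Hypothetical counts: 40 detected events, 10 missed events, 10 false activations.
print(hf_difference(tp=40, fn=10, fp=10))
```

The number of true negatives does not appear anywhere in the computation, which reflects the downside noted above: the same counts give the same value regardless of how long the no-control periods were.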
6) TPR/FPR
The TPR/FPR ratio is another evaluation metric that was recently proposed for 2-class SBCI systems [FAT07, FAT07b]. This metric gives more weight to cases with low FPRs. As a result, during the model tuning process, any model with a high FPR is assigned a low fitness, even if its TPR is high. The downside is that for FPR=0, the metric cannot differentiate between confusion matrices with different TPRs.
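A minimal sketch of this ratio for a 2-class confusion matrix is given below. The function name, the example counts and the value returned when FPR=0 are assumptions; returning infinity in that case makes the limitation at FPR=0 visible, since every nonzero TPR then receives the same fitness.

```python
import math


def tpr_fpr_ratio(tp, fn, fp, tn):
    """Ratio of the true positive rate to the false positive rate."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    if fpr == 0:
        # Assumption: the ratio is undefined at FPR = 0, so report infinity
        # for any nonzero TPR (and 0 when nothing is detected at all).
        return math.inf if tpr > 0 else 0.0
    return tpr / fpr


# Hypothetical counts (illustrative data only).
print(tpr_fpr_ratio(tp=40, fn=10, fp=10, tn=990))  # low FPR, high ratio
print(tpr_fpr_ratio(tp=40, fn=10, fp=200, tn=800))  # same TPR, higher FPR
print(tpr_fpr_ratio(tp=25, fn=25, fp=0, tn=1000))  # FPR = 0
print(tpr_fpr_ratio(tp=45, fn=5, fp=0, tn=1000))  # FPR = 0, higher TPR
```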
Each of these metrics has strengths and weaknesses [SCH07]. However, published SBCI studies do not usually discuss why a particular evaluation metric was chosen for evaluating the performance. Finding suitable evaluation metrics is therefore an important and much-needed area of study for SBCI systems. This need has been emphasized in a recently published technical report on evaluating SBCI systems [MAS06].
Suitable Performance Metrics for Self-paced BCI Systems
Although overall accuracy (OA) is not suitable for classification problems with imbalanced datasets (e.g., self-paced BCI systems), the choice of an alternative evaluation metric is not obvious. Several attempts have been made to define more suitable evaluation metrics for these problems. Examples include weighted overall accuracy (WOA) [ZHU04], receiver operating characteristic (ROC) curves and related measures such as the area under the ROC curve (AUC) [BRA97], and the Kappa coefficient [CHO01]. In the SBCI literature, the evaluation metrics used include overall accuracy [BIR02], HF-difference [GRA04], mutual information (information transfer rate) [KRO05], Kappa [SCH07], AUC [SCH05], and the true positive rate (TPR) at a fixed false positive rate (FPR) [BOR04], [FAT07]. Figure 1 shows the proposed evaluation metrics for synchronized and self-paced BCI systems. As seen, the number of proposed evaluation metrics is significantly higher for self-paced BCI systems than for synchronized BCI systems.
Figure 1. Types of evaluation metrics used in synchronized and self-paced BCI systems.
Acceptable Performance for BCI Systems
It has been stated that an accuracy above 70% allows communication and device control [KUB06]. In [SEL06], it is argued that if a BCI system does not reach at least a 70% accuracy level, it may be frustrating to use. The validity of this claim, however, still needs to be verified.
Online vs. Offline Performance
It has been shown that the performance of subjects during online operation may be significantly lower than their performance when evaluated offline (probably because of a lack of focus) [MUL06].
References
[BEN95] Y. Benjamini and Y. Hochberg, "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing", Journal of the Royal Statistical Society. Series B (Methodological), vol. 57, no. 1, pp. 289-300, 1995.
[BIR02] G. E. Birch, Z. Bozorgzadeh and S. G. Mason, "Initial on-line evaluations of the LF-ASD brain-computer interface with able-bodied and spinal-cord subjects using imagined voluntary motor potentials", IEEE Trans. Neural Syst. Rehabil. Eng., vol. 10, no. 4, pp. 219-224, Dec. 2002.
[BLA06] Blankertz, B., et al., "The BCI Competition III: Validating Alternative Approaches to Actual BCI Problems", IEEE Trans. Neural Systems and Rehab. Eng., Vol. 14, No. 2, June 2006, pp. 153-159.
[BLA06b] Blankertz, B., et al., "The Berlin Brain-Computer Interface: EEG-based communication without subject training", IEEE Trans. Neural Systems and Rehab. Eng., Vol. 14, No. 2, June 2006, pp. 147-152.
[BOR04] J. F. Borisoff, S. G. Mason, A. Bashashati and G. E. Birch, "Brain-computer interface design for asynchronous control applications: improvements to the LF-ASD asynchronous brain switch", IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 985-992, Jun. 2004.
[BRA97] A. P. Bradley, "Use of the area under the ROC curve in the evaluation of machine learning algorithms", Pattern Recognit., vol. 30, no. 7, pp. 1145-1159, 1997.
[BUT06] Buttfield, A., et al., "Towards a robust BCI: error potentials and online learning", IEEE Trans. Neural Systems and Rehab. Eng., Vol. 14, No. 2, June 2006, pp. 164-168.
[CHO01] N. T. Choplin and D. C. Lundy, "The sensitivity and specificity of scanning laser polarimetry in the detection of glaucoma in a clinical setting", Ophthalmology, vol. 108, no. 5, pp. 899-904, May 2001.
[COD06] Cososchi, S., et al., "EEG features extraction for motor imagery", in Proc. IEEE EMBS Int. Conf., New York, USA, Aug. 30 - Sep. 3, 2006, pp. 1142-1145.
[COH60] J. Cohen, "A coefficient of agreement for nominal scales", Educational and Psychological Measurement, vol. 20, no. 1, pp. 37-46, 1960.
[DOY06] Doyle, T. E., et al., "Affective State Control for Neuroprostheses", in Proc. IEEE EMBS Int. Conf., New York, USA, Aug. 30 - Sep. 3, 2006, pp. 1248-1251.
[FAT06] M. Fatourechi, S. G. Mason, G. E. Birch and R. K. Ward, "Is information transfer rate a suitable performance measure for self-paced brain interface systems?", in Proc. IEEE Int. Symp. Signal Processing and Information Technology, pp. 212-216, 2006.
[FAT07] Fatourechi, M., Birch, G. E., and Ward, R. K., "A Self-paced Brain Interface System that Uses Movement Related Potentials and Changes in the Power of Brain Rhythms", Journal of Computational Neuroscience, Vol. 23, No. 1, Aug. 2007, pp. 21-37.
[FAT07b] M. Fatourechi, G. E. Birch and R. K. Ward, "Applying a hybrid genetic algorithm in the design of a self-paced brain interface with a low false positive rate", in Proc. IEEE ICASSP’07, vol. 4, pp. IV-1157 to IV-1160, 2007.
[GRA04] B. Graimann, J. E. Huggins, S. P. Levine and G. Pfurtscheller, "Toward a direct brain interface based on human subdural recordings and wavelet-packet analysis", IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 954-962, Jun. 2004.
[HUA05] J. Huang and C. X. Ling, "Using AUC and accuracy in evaluating learning algorithms", IEEE Trans. Knowled. Data Eng., vol. 17, no. 3, pp. 299-310, 2005.
[HUG99] J. E. Huggins, S. P. Levine, S. L. Bement, R. K. Kushwaha, L. A. Schuh, E. A. Passaro, M. M. Rohde, D. A. Ross, K. V. Elisevich and B. J. Smith, "Detection of Event-Related Potentials for Development of a Direct Brain Interface", J. Clinical Neurophysiol., vol. 16, no. 5, pp. 448-455, Sep. 1999.
[INC06] Ince, N. F., et al., "Using multiresolution space-time-frequency features for the classification motor imagery EEG recordings in a brain computer interface task", in Proc. 14th IEEE Conf. on Signal Processing and Communications Applications, 17-19 April 2006, pp. 1-4.
[KAU06] Kauhanen, et al., "EEG and MEG Brain-computer interface for tetraplegic patients", IEEE Trans. Neural Systems and Rehab. Eng., Vol. 14, No. 2, June 2006, pp. 190-193.
[KRO05] J. Kronegg, S. Voloshynovskiy and P. Pun, "Analysis of bit rate definitions for brain-computer interfaces", in Proc. Int. Conf. on Human-Computer Interaction (HCI'05), Las Vegas, Nevada, 2005.
[KUB06] Kubler, A., et al., "BCI meeting 2005 - Workshop on Clinical issues and applications", IEEE Trans. Neural Systems and Rehab. Eng., Vol. 14, No. 2, June 2006, pp. 131-134.
[LEU06] Leuthardt, E. C., et al., "Electrocorticography-based brain computer interface - the Seattle experience", IEEE Trans. Neural Systems and Rehab. Eng., Vol. 14, No. 2, March 2006, pp. 194-198.
[MAS06] S. G. Mason, J. Kronegg, J. Huggins, M. Fatourechi and A. Schloegl, "Evaluating the performance of self-paced BCI technology", Technical Report, available online: http://www.bci-info.tugraz.at/Research_Info/documents/articles/self_paced_tech_report-2006-05-19.pdf, 2006.
[MCF06] McFarland, D. J., et al., "BCI meeting 2005 - Workshop on BCI Signal Processing: Feature extraction and translation", IEEE Trans. Neural Systems and Rehab. Eng., Vol. 14, No. 2, June 2006, pp. 135-138.
[MUL06] Muller-Putz, G. R., et al., "Steady-state somatosensory evoked potentials: suitable brain signals for brain-computer interfaces?", IEEE Trans. Neural Systems and Rehab. Eng., Vol. 14, No. 2, March 2006, pp. 30-37.
[PRO01] F. Provost and T. Fawcett, "Robust Classification for Imprecise Environments", Mach. Learning, vol. 42, no. 3, pp. 203-231, 2001.
[SCH07] A. Schlögl, J. Kronegg, J. Huggins and S. G. Mason, "Evaluation criteria in BCI research", in Towards Brain-Computer Interfacing (G. Dornhege, J. R. Millan, T. Hinterberger, D. McFarland and K. R. Muller, Eds.), MIT Press, 2007.
[SEL06] Sellers, E. W., et al., "Brain-computer interface research at the University of South Florida cognitive psychophysiology laboratory: the P300 speller", IEEE Trans. Neural Systems and Rehab. Eng., Vol. 14, No. 2, March 2006, pp. 221-224.
[TOW06] Townsend, G., et al., "A Comparison of common spatial patterns with complex band power features in a four-class BCI experiment", IEEE Trans. Biomed. Eng., Vol. 53, No. 4, April 2006, pp. 642-651.
[WAN06] Wang, Y., et al., "A practical VEP-based brain-computer interface", IEEE Trans. Neural Systems and Rehab. Eng., Vol. 14, No. 2, March 2006, pp. 234-239.
[WOL98] J. R. Wolpaw, D. McFarland and G. Pfurtscheller, "EEG-based Communication: Improved Accuracy by Response Verification", IEEE Trans. Rehab. Eng., vol. 6, no. 3, pp. 326-333, 1998.
[YAM06] Yamawaki, N., et al., "An Enhanced time-frequency-spatial approach for motor imagery classification", IEEE Trans. Neural Systems and Rehab. Eng., Vol. 14, No. 2, June 2006, pp. 250-254.
[ZHU04] J. Zhu and T. Yao, "An evaluation of statistical spam filtering techniques", ACM Transactions on Asian Language Information Processing (TALIP), vol. 3, no. 4, pp. 243-269, 2004.





