Experiment and data description
In this study, three datasets were used. Each dataset and its corresponding experimental paradigm are described as follows:

1.
Dataset 1, from an MI task paradigm. The dataset was made available by Dr. Allen Osman of the University of Pennsylvania [30] for a data analysis competition at the Neural Information Processing Systems conference (NIPS 2001) [31]. In this experimental paradigm, subjects were asked to imagine left- or right-hand movement once the letter “L” or “R” was shown on a computer screen, and then performed sustained hand movement imagery for a fixed period of time. The experimental process is illustrated in Fig. 1a. Data from eight subjects were used in this study, and the scalp EEG was recorded from 59 channels (international 10–20 system) at a sampling rate of 100 Hz. For each subject, the total number of trials is 180: 90 trials for left and 90 trials for right. Each trial has a consistent length of 2.25 s.

2.
Dataset 2, from a two-class control paradigm. The experiments were conducted in the Biomedical Functional Imaging and Neuroengineering Laboratory at the University of Minnesota according to a protocol approved by the Institutional Review Board of the University of Minnesota [13]. In this experimental paradigm (Fig. 1b), the screen was initially blank. Two seconds later, a target appeared at one of two locations on the periphery of the screen. At 5 s, a cursor appeared in the center of the screen, and subjects could move it horizontally through imagination of left- or right-hand movement. Provided with visual feedback of the cursor position, subjects could adjust their imagination patterns in real time to control the cursor movement; this is the major difference from the MI task paradigm. A trial ended when the cursor reached the correct target bar, reached the wrong target, or failed to reach any target within 6 s, so each trial lasted 5–11 s. However, not all time points of a trial carry information about the EEG modulation of motor imagery, so only the execution period of each trial, usually 0.5–3 s, was extracted for analysis. Data from eight subjects were used, and the scalp EEG was recorded from 62 channels (international 10–20 system) at a sampling rate of 200 Hz. For each subject, the total number of trials is 180: 90 trials for left and 90 trials for right.

3.
Dataset 3, from a four-class control paradigm. These experiments were also conducted in the Biomedical Functional Imaging and Neuroengineering Laboratory at the University of Minnesota according to a protocol approved by the Institutional Review Board of the University of Minnesota. This paradigm is similar to the two-class control paradigm; the only difference is that it controls the movement of a cursor in four directions: MI of the left hand, right hand, both hands, and neither hand moved the cursor left, right, up, and down, respectively. The experimental process is illustrated in Fig. 1c. Each trial lasts 5–11 s, and the execution period of each trial, usually 0.5–3 s, was extracted for analysis (as in Dataset 2). Data from eight subjects were used, and the scalp EEG was recorded from 62 channels (international 10–20 system) at a sampling rate of 1000 Hz. For each subject, the total number of trials is 280: 70 trials for each direction.
Note that there is a slight difference in electrode setups between Dataset 1 and Datasets 2/3 (the electrode setup of Dataset 3 is the same as that of Dataset 2): three peripheral electrodes were not recorded during the Dataset 1 experiment. The remaining electrode locations were consistent with Dataset 2, so this does not affect the results.
Data processing
Because Dataset 3 had a relatively high sampling rate, we first downsampled its EEG data from 1000 to 200 Hz before signal processing. To match the sampling rate of Dataset 1, Datasets 2 and 3 were also downsampled to 100 Hz in a pilot test; we found that the results did not depend on whether the sampling rates were consistent across datasets. For the results presented here, Datasets 2 and 3 were therefore only downsampled to 200 Hz. The data processing was then carried out in the following steps: spatial filtering, feature extraction, and feature normalization.
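The downsampling step described above can be sketched as follows. This is a minimal illustration, assuming SciPy is available; the paper does not specify the anti-aliasing filter used, so `scipy.signal.decimate`'s default is an assumption, and the function name `downsample_eeg` is ours.

```python
import numpy as np
from scipy.signal import decimate

def downsample_eeg(data, fs_in, fs_out):
    """Downsample EEG (channels x samples) by an integer factor.
    decimate applies an anti-aliasing filter before discarding samples."""
    assert fs_in % fs_out == 0, "only integer decimation factors"
    factor = fs_in // fs_out
    return decimate(data, factor, axis=1, zero_phase=True)

# e.g. Dataset 3: 62 channels, 2 s of data at 1000 Hz -> 200 Hz
eeg = np.random.randn(62, 2000)
eeg_ds = downsample_eeg(eeg, 1000, 200)
```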
Surface Laplacian filtering
Raw scalp EEG has a low signal-to-noise ratio, since it is spatially smeared due to the head volume conductor effect. Thus, in order to improve the quality of the EEG, a surface Laplacian filter [32] was adopted to accentuate localized activity and to reduce diffusion in the multichannel EEG [33, 34]. The formulation of the surface Laplacian filter is as follows:
$$V_{j}^{Lap} = V_{j} - \frac{1}{n}\sum\limits_{k \in S_{j}} V_{k}$$
(1)
where $V_{j}$ represents the target channel, $V_{k}$ stands for a neighboring channel, and $S_{j}$ is the index set of the surrounding channels. Parameter n is the number of neighboring channels; in most cases n is set to 4, whereas n is 2 or 3 when the target channel is at the periphery.
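A direct implementation of Eq. (1) is a per-channel subtraction of the mean of its neighbors. The sketch below assumes a channels-by-samples array and a user-supplied neighbor map (the actual montage-specific neighbor sets are not given in the text):

```python
import numpy as np

def surface_laplacian(eeg, neighbors):
    """Surface Laplacian spatial filter (Eq. 1).
    eeg: array of shape (n_channels, n_samples).
    neighbors: dict mapping channel index j -> list of surrounding
    channel indices S_j (4 in the interior, 2-3 at the periphery)."""
    filtered = np.empty_like(eeg, dtype=float)
    for j in range(eeg.shape[0]):
        s_j = neighbors[j]
        # V_j minus the mean of its n neighboring channels
        filtered[j] = eeg[j] - eeg[s_j].mean(axis=0)
    return filtered

# toy 3-channel example: channel 1 has channels 0 and 2 as neighbors
eeg = np.array([[1.0, 2.0],
                [5.0, 6.0],
                [3.0, 4.0]])
neighbors = {0: [1], 1: [0, 2], 2: [1]}
lap = surface_laplacian(eeg, neighbors)
# channel 1: 5 - (1+3)/2 = 3.0 and 6 - (2+4)/2 = 3.0
```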
Feature extraction
We focused on the frequency range from 5 to 35 Hz, since many studies have indicated that this range mainly conveys the user's intent in MI-based BCIs. Furthermore, event-related desynchronization (ERD) and synchronization (ERS) in the mu band (8–12 Hz) and beta band (18–26 Hz) [11, 13, 35], which occur over the sensorimotor cortex during motor imagination, also fall within this range. Considering that the relevant frequency bands are narrow, we decomposed the entire 5–35 Hz range into 13 partially overlapping sub-bands [36] using a constant-Q scheme, also known as proportional bandwidth. According to this scheme, the 13 sub-bands are 5.25–6.75 Hz, 6.0–7.71 Hz, 6.86–8.82 Hz, 7.84–10.08 Hz, 8.96–11.52 Hz, 10.24–13.16 Hz, 11.70–15.04 Hz, 13.37–17.19 Hz, 15.28–19.64 Hz, 17.46–22.45 Hz, 19.96–25.66 Hz, 22.81–29.32 Hz, and 26.07–33.51 Hz. We were mainly concerned with the power changes in these frequency bands, and therefore extracted the envelope of each sub-band using the Hilbert transform, given that the envelope quantitatively reflects the instantaneous power change of a frequency band. In this study, each envelope was extracted from the data of one trial, and each feature is the average of a single envelope over the trial time.
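The feature extraction pipeline per channel and trial can be sketched as below. The band-pass filter design is not specified in the text, so the fourth-order Butterworth filter here is our assumption; only the 13 sub-band edges and the Hilbert-envelope averaging come from the source.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

# the 13 constant-Q sub-bands listed in the text (Hz)
SUBBANDS = [(5.25, 6.75), (6.0, 7.71), (6.86, 8.82), (7.84, 10.08),
            (8.96, 11.52), (10.24, 13.16), (11.70, 15.04),
            (13.37, 17.19), (15.28, 19.64), (17.46, 22.45),
            (19.96, 25.66), (22.81, 29.32), (26.07, 33.51)]

def subband_envelope_features(signal, fs, bands=SUBBANDS, order=4):
    """For one channel of one trial: band-pass each sub-band
    (Butterworth filter assumed), take the Hilbert-transform
    envelope, and average it over the trial, giving one feature
    per sub-band (13 features per channel)."""
    feats = []
    for lo, hi in bands:
        b, a = butter(order, [lo, hi], btype="bandpass", fs=fs)
        narrow = filtfilt(b, a, signal)
        envelope = np.abs(hilbert(narrow))  # instantaneous amplitude
        feats.append(envelope.mean())
    return np.array(feats)

fs = 200
t = np.arange(0, 2.0, 1 / fs)
trial = np.sin(2 * np.pi * 10 * t)  # 10 Hz test tone mimicking a mu rhythm
feats = subband_envelope_features(trial, fs)
```

For the 10 Hz test tone, the largest feature lands in the sub-band that contains 10 Hz well inside its pass-band (8.96–11.52 Hz).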
Feature normalization
Min–max normalization was adopted here to transform feature values into the range (−1, 1). The normalization was performed according to the formula below.
$$XN = (newMax - newMin)\frac{X - Min}{Max - Min} + newMin$$
(2)
where Min and Max are the minimum and maximum values of a feature, newMin and newMax are the lower and upper bounds of the new range, and X is the feature value to be transformed.
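Eq. (2) translates directly into code; a minimal sketch for one feature column (the function name is ours):

```python
import numpy as np

def min_max_normalize(x, new_min=-1.0, new_max=1.0):
    """Eq. (2): rescale a feature into (new_min, new_max).
    Min and Max are taken over the observed values of the feature."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (new_max - new_min) * (x - x_min) / (x_max - x_min) + new_min

x = np.array([2.0, 4.0, 6.0, 10.0])
xn = min_max_normalize(x)
# maps 2 -> -1.0, 4 -> -0.5, 6 -> 0.0, 10 -> 1.0
```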
Through the above process, we obtained 13 normalized features per trial from a single channel, corresponding to the aforementioned 13 sub-bands. When all channels (59 or 62) were initially used, the total number of features was 767 or 806 (13 × 59 or 13 × 62). During computation, the features from different channels were concatenated into a single feature vector representing one trial.
Channel selection
Performance assessment
Channel selection is an optimization problem that chooses the optimal subset of channels from the full set available. Our ultimate goal is to maximize the classification accuracy of distinct motor imageries in MI-based BCIs, so "optimal" here means achieving the best accuracy with the fewest channels. The best accuracy refers to the optimal testing accuracy in classification, obtained by averaging the testing accuracy over the folds of a 10-fold cross-validation.
Relief and ReliefF
Relief [24] is a widely used feature selection method for binary classification problems, owing to its effectiveness and computational simplicity. ReliefF [37] is an extension of it designed for feature selection in multi-class classification problems. Since Relief and ReliefF share the same algorithmic principle, we describe only ReliefF here; for technical details of Relief, please refer to [24]. The key idea of ReliefF is to estimate the quality of attributes according to how well they distinguish between samples that are near each other, as shown in Fig. 2a. At the beginning, the weights of all features are initialized to zero (line 1). The algorithm then randomly selects a sample R from the training dataset T (line 3), selects the k nearest samples from the same class ("near hits", line 4) and the k nearest samples from each of the other classes ("near misses", lines 5 and 6), and updates the feature weight vector W for all attributes according to Eq. (3). The whole process is repeated m times, where m is a user-defined parameter. From this procedure it is easy to see that a large weight indicates a feature important for classification, and a small weight a less important one. The weight update equation is as follows:
$$\begin{aligned} W(f) & = W(f) - \frac{\sum\nolimits_{j = 1}^{k} d(f,R,H_{j})}{m \cdot k} \\ & \quad + \frac{\sum\nolimits_{C \ne class(R)} \left[ \dfrac{p(C)}{1 - p(class(R))} \sum\nolimits_{j = 1}^{k} d(f,R,M_{j}(C)) \right]}{m \cdot k} \\ \end{aligned}$$
(3)
where W(f) represents the weight of feature f. Parameter k is the number of nearest samples selected, and m is the number of repetitions of the computing process. C denotes any class other than that of R, and p is the prior probability of a class. The function $d(f, R, R_{o})$ computes the difference between sample vector R and sample vector $R_{o}$ at feature f, where $R_{o}$ may be $H_{j}$ (a near hit) or $M_{j}$ (a near miss). For numerical attributes, $d(f, R, R_{o})$ is defined as:
$$d(f,R,R_{o}) = \frac{\left| value(f,R) - value(f,R_{o}) \right|}{\hbox{max}(f) - \hbox{min}(f)}$$
(4)
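Eqs. (3)–(4) can be combined into a compact sketch of the ReliefF weighting loop. This is a minimal illustration, not the paper's implementation: the Manhattan distance for neighbor search and the function name `relieff` are our assumptions.

```python
import numpy as np

def relieff(X, y, m=100, k=5, rng=None):
    """Minimal ReliefF sketch (Eqs. 3-4).
    X: (n_samples, n_features); y: integer class labels.
    Returns one weight per feature."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    span = X.max(axis=0) - X.min(axis=0)      # max(f) - min(f) in Eq. 4
    span[span == 0] = 1.0
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))     # p(C)
    W = np.zeros(p)                            # line 1: weights start at zero
    for _ in range(m):
        i = rng.integers(n)                    # line 3: random target sample R
        R, c = X[i], y[i]
        dist = np.abs(X - R).sum(axis=1)       # Manhattan distance (assumed)
        dist[i] = np.inf                       # exclude R itself
        for cl in classes:
            idx = np.where(y == cl)[0]
            nearest = idx[np.argsort(dist[idx])[:k]]   # k near hits/misses
            d = np.abs(X[nearest] - R) / span          # Eq. 4, per feature
            if cl == c:                        # near hits: decrease weights
                W -= d.sum(axis=0) / (m * k)
            else:                              # near misses: increase, weighted
                W += (prior[cl] / (1 - prior[c])) * d.sum(axis=0) / (m * k)
    return W

# feature 0 separates the two classes; feature 1 is pure noise
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(1, 0.1, (50, 2))])
X[:, 1] = rng.uniform(size=100)
y = np.array([0] * 50 + [1] * 50)
W = relieff(X, y, m=50, k=3, rng=1)
```

As expected, the discriminative feature receives the larger weight.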
IterRelCen
In the Relief (ReliefF) algorithm, the weight computation is affected by two choices: the selection of the target sample and the selection of its nearest neighbors. Both can introduce errors into the weight computation. Specifically: (1) the target sample is chosen randomly from the training dataset. This strategy gives noisy samples (particularly samples far from the center of their class) a chance to be selected, and the use of noisy samples easily biases the weight computation. (2) The selection of nearest neighbors depends on the distances from the target sample, and the distance computation in turn depends on the participating features. In our EEG datasets, the samples have high-dimensional features that may include noisy or redundant components. Such features interfere with the distance computation between samples and cause errors in the feature weights. To address these two problems, we propose an enhanced method, IterRelCen, based on the principle of Relief (ReliefF), shown in Fig. 2b. IterRelCen modifies the Relief (ReliefF) algorithm in two ways. First, the target sample selection rule is adjusted: instead of random selection, samples close to the center of their class are given priority (lines 5 and 6) according to formula (5), since such samples discriminate between classes more accurately. Second, the idea of iterative computation is adopted to eliminate noisy features (lines 1 and 14). In each iteration, the N features with the smallest weights (N is a user-defined parameter that controls the iteration speed) are removed from the current feature set (line 12), and the remaining features are fed into the classifier for accuracy calculation (line 13). This process is repeated until the current feature set is empty.
The features removed first are usually those performing worst in discrimination. As noisy features are gradually removed, the distances computed from the remaining features reflect the relationships between samples more and more accurately.
$$\hbox{min}\, Dist = \left\| S_{i} - Ct \right\|,\;\forall S_{i},\quad where\;\; Ct = \frac{1}{n}\sum\limits_{i = 1}^{n} S_{i}$$
(5)
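The two modifications above can be sketched as follows. The `weight_fn` and `score_fn` arguments stand in for ReliefF weighting and the SVM accuracy step; the function names and the N = 1 removal step in the example are our illustrative choices, not the paper's.

```python
import numpy as np

def class_center_order(X, y):
    """Eq. (5): within each class, order the sample indices by distance
    to the class centroid Ct, so central samples are visited first."""
    order = []
    for cl in np.unique(y):
        idx = np.where(y == cl)[0]
        ct = X[idx].mean(axis=0)                       # Ct = (1/n) sum S_i
        dist = np.linalg.norm(X[idx] - ct, axis=1)     # ||S_i - Ct||
        order.extend(idx[np.argsort(dist)])
    return np.array(order)

def iter_rel_cen(X, y, weight_fn, score_fn, n_remove=1):
    """Skeleton of the IterRelCen loop: re-weight the surviving features,
    drop the N weakest each iteration, and record the accuracy of the
    remaining subset, until the feature set is empty."""
    remaining = list(range(X.shape[1]))
    history = []
    while remaining:
        W = weight_fn(X[:, remaining], y)
        history.append((list(remaining), score_fn(X[:, remaining], y)))
        drop = set(np.argsort(W)[:n_remove])           # N smallest weights
        remaining = [f for i, f in enumerate(remaining) if i not in drop]
    return history

X = np.random.default_rng(0).normal(size=(20, 4))
y = np.array([0, 1] * 10)
hist = iter_rel_cen(X, y,
                    weight_fn=lambda X, y: X.var(axis=0),  # stand-in weights
                    score_fn=lambda X, y: X.shape[1])      # stand-in accuracy
```

With 4 features and one removal per iteration, the loop records 4 subsets before the feature set empties.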
Classification algorithm
A support vector machine (SVM) was used as the classification algorithm in this study. Classical SVM is a technique developed previously [38] to solve two-class classification problems. Its main idea is to separate a two-class dataset by finding the maximum geometric margin between the two classes. Given a training set of instance-label pairs \((X_{i}, y_{i}),\, i = 1, \ldots, l\), where \(X_{i} \in R^{n}\) and \(y_{i} \in \{-1, 1\}\), building the SVM model requires solving the following optimization problem:
$$\begin{aligned} & \mathop {\hbox{min} }\limits_{w,b,\xi}\; \frac{1}{2}w^{T} w + C\sum\limits_{i = 1}^{l} {\xi_{i}} \\ & subject\;to\;\; y_{i} (w^{T} \phi (x_{i}) + b) \ge 1 - \xi_{i}, \\ & \xi_{i} \ge 0,\; i = 1, \ldots, l \\ \end{aligned}$$
(6)
The slack variables \(\xi_{i}\) are introduced in case a small part of the data is not linearly separable. The margin is defined as \(\gamma = 1/(2\left\| w \right\|)\), so the goal is to find the best trade-off between a low training error \(\sum {\xi_{i}}\) and a large margin \(\gamma\). We also adopted a kernel function \(\phi(X_{i})\) to map the training vectors \(X_{i}\) into a higher-dimensional space; combined with the slack variables, most nonlinear problems can thus be transformed into linear ones. Several kernel functions are available, e.g., the linear, polynomial, and radial basis function (RBF) kernels; here, the RBF kernel was used.
Classical SVM can only discriminate between two classes; however, one of our datasets came from the four-class control paradigm. We therefore adopted the "one-against-one" approach [39] to solve this multi-class classification problem: if k is the number of classes, \(k(k - 1)/2\) classifiers are constructed, each trained on the data of one pair of classes.
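As an aside, scikit-learn's `SVC` happens to implement exactly this combination: an RBF-kernel SVM whose multi-class strategy is one-against-one, producing k(k − 1)/2 pairwise classifiers. A toy sketch (the four Gaussian clusters merely stand in for the four MI classes; this is not the paper's data):

```python
import numpy as np
from sklearn.svm import SVC

# four well-separated toy clusters standing in for the four MI states
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 0], [0, 5], [5, 5]])
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in centers])
y = np.repeat(np.arange(4), 30)

# RBF-kernel SVC; its internal multi-class handling is one-against-one,
# i.e. k(k - 1)/2 = 6 pairwise classifiers for k = 4
clf = SVC(kernel="rbf", decision_function_shape="ovo")
clf.fit(X, y)
n_pairwise = clf.decision_function(X).shape[1]  # one score per pair of classes
```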
As shown in Fig. 2c, the channel selection process couples the channel selection method (IterRelCen) with the classification algorithm. Generalization accuracy was estimated by 10-fold cross-validation (10-fold CV) [40], in which the whole dataset is split into ten folds. In each fold, feature selection with IterRelCen is first performed on the training set, yielding a ranking of all features; the selected features are then used to train the SVM model, and the SVM is finally tested on the corresponding features of the test set. The classification accuracy is the average of the testing accuracies over the folds, denoted by the following equation:
$$Acc = \frac{1}{k} \sum\limits_{i = 1}^{k} {acc_{i} } , \,\, where\, acc_{i} = \frac{{Correct\, Num_{test} }}{{Total\, Num_{test} }}$$
(7)
where k is 10, and $acc_{i}$ is the testing accuracy of one fold.
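The per-fold pipeline of Fig. 2c and Eq. (7) can be sketched as below. Note that feature selection runs on the training fold only, so no test data leaks into the selection; `select_features` is a stand-in for IterRelCen, and the stratified splitting is our assumption.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cv_accuracy(X, y, select_features, k=10):
    """Per-fold pipeline: select features on the training fold only,
    train an RBF-SVM, test on the held-out fold, then average the k
    testing accuracies (Eq. 7). select_features returns column indices."""
    accs = []
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    for train, test in skf.split(X, y):
        cols = select_features(X[train], y[train])   # IterRelCen stand-in
        clf = SVC(kernel="rbf").fit(X[train][:, cols], y[train])
        accs.append(clf.score(X[test][:, cols], y[test]))
    return float(np.mean(accs))                       # Eq. (7)

# toy two-class data; the selector keeps the first two feature columns
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 3)), rng.normal(2, 0.3, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
acc = cv_accuracy(X, y, select_features=lambda X, y: [0, 1])
```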
Through the above method, the feature subset achieving the highest accuracy is recognized as the optimal feature set, and channels holding at least one selected feature are considered optimal channels. Note that with this method some optimal channels may retain only a few feature bands (fewer than 13), while others retain all 13.