Experimental protocol
EEG data from the BCI Challenge at the IEEE EMBS NER 2015 conference were chosen for evaluation [35]. Perrin et al. designed the experimental protocol and collected the EEG data [36]. Twenty-six healthy subjects took part in the study (13 males and 13 females; mean age 28.8 ± 5.4 years, range 20–37). All subjects went through five copy-spelling sessions. Each session consisted of 12 five-letter words, except the fifth, which consisted of 20 five-letter words.
All subjects reported normal or corrected-to-normal vision and had no previous experience with the P300 speller paradigm or any other BCI application. EEG data were recorded with 56 passive Ag/AgCl sensors placed according to the extended international 10–20 system. All signals were referenced to a sensor on the nose. The ground electrode was placed on the shoulder, and impedances were kept below 10 kΩ. Signals were sampled at 600 Hz.
To evaluate the generalization ability across subjects, the data from 16 participants were used for training and those from the remaining 10 for testing.
Preprocessing
The downloaded EEG data had already been downsampled to 200 Hz.
Since the previous literature indicates that the information of error-related potentials falls mainly into the theta band and the mu rhythm [17, 25], before further processing we first applied a fourth-order Butterworth bandpass filter (1–20 Hz) to remove the DC component and high-frequency noise [37]. Independent component analysis (ICA) was then applied to the filtered EEG data to remove common artifacts such as eye movements and electrocardiographic (ECG) interference. The EEG data from all channels were then re-referenced to a common average reference (CAR) to further increase the signal-to-noise ratio [38]. Finally, all data points of each epoch between 200 and 1000 ms after feedback onset were selected as one sample.
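For illustration, this preprocessing chain might be sketched in Python with NumPy and SciPy as follows. The arrays below are synthetic stand-ins (not the challenge data), and the ICA artifact-rejection step is only indicated as a comment, since component selection depends on visual or heuristic inspection that is not specified here.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 200  # Hz, sampling rate after downsampling

# Synthetic stand-ins for the recording (56 channels) and feedback onsets
rng = np.random.default_rng(0)
raw = rng.standard_normal((56, 12000))   # channels x samples
feedback_onsets = [1000, 3000, 5000]     # onset sample indices

# Fourth-order Butterworth bandpass, 1-20 Hz (zero-phase to avoid phase lag)
b, a = butter(4, [1 / (fs / 2), 20 / (fs / 2)], btype="bandpass")
filtered = filtfilt(b, a, raw, axis=1)

# ICA artifact removal (eye movements, ECG) would be applied here; component
# rejection requires manual or heuristic selection, so it is omitted.

# Common average reference: subtract the instantaneous mean across channels
car = filtered - filtered.mean(axis=0, keepdims=True)

# One sample per epoch: 200-1000 ms after feedback onset
lo, hi = int(0.2 * fs), int(1.0 * fs)
epochs = np.stack([car[:, t + lo:t + hi] for t in feedback_onsets])
# epochs.shape == (trials, 56, 160)
```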
Feature extraction
Features from the temporal, spectral, and spatial domains were extracted from the EEG signals; a back-propagation (BP) neural network was then used to perform two rounds of dimensionality reduction; finally, the resulting level-3 features from the temporal, spectral, and spatial domains were fed into another BP neural network for classification. The procedure is detailed in Fig. 1.
- Step 1: Extract temporal features \(F_1\) from each EEG channel as the level-1 features.
- Step 2: Use the level-1 features \(F_1\) from the training group to train a BP neural network, which was applied to classify the \(F_1\) features. The derived one-dimensional posterior probabilities were the level-2 features \(F_1'\).
- Step 3: Use the level-2 features \(F_1'\) from all channels to train another BP neural network, which was applied to classify the 56-dimensional level-2 features \(F_1'\), resulting in the one-dimensional level-3 temporal features \(F_1''\).
- Step 4: Extract level-1 features \(F_2\) in the spectral domain and repeat steps 2 and 3 to obtain the one-dimensional spectral features \(F_2''\).
- Step 5: Extract level-1 features \(F_3\) in the spatial domain and repeat steps 2 and 3 to obtain the one-dimensional spatial features \(F_3''\).
- Step 6: Combine the three features \([F_1''\; F_2''\; F_3'']\) from the training group to train a feedforward neural network, which was applied to classify samples from the testing group (see the sketch after this list).
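A minimal executable skeleton of this three-level scheme is given below, using a small scikit-learn MLP as a stand-in for the paper's BP networks; the random feature arrays at the bottom are synthetic placeholders added purely to make the sketch runnable.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_bp(X, y):
    # Stand-in BP network: returns a callable yielding P(positive feedback)
    net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    net.fit(X, y)
    return lambda Z: net.predict_proba(Z)[:, 1]

def reduce_domain(level1, labels):
    # Steps 2-3: per-channel posteriors (level 2), then one value per trial (level 3)
    n_trials, n_channels, _ = level1.shape
    level2 = np.empty((n_trials, n_channels))
    for ch in range(n_channels):
        level2[:, ch] = train_bp(level1[:, ch, :], labels)(level1[:, ch, :])
    return train_bp(level2, labels)(level2)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=40)                    # synthetic trial labels
domains = [rng.standard_normal((40, 56, m)) for m in (164, 12, 161)]  # F1, F2, F3
level3 = np.column_stack([reduce_domain(f, labels) for f in domains])
final_net = train_bp(level3, labels)                    # step 6: classify [F1'' F2'' F3'']
```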
To extract the level-1 features in the different domains, a series of algorithms were implemented, as described below.
① Extraction of the level-1 features in the temporal domain (\(F_1\)): the training data were separated into two classes based on their labels, i.e., positive and negative feedback. \(\bar{y}\), \(y \in \{\text{Positive}, \text{Negative}\}\), denotes the mean of each class. Then the correlation \(R_{xy}\) and covariance \(C_{xy}\) between each sample x and \(\bar{y}\) were computed as the feature set \(F_1\), using
$${R_{xy}}(m) = \left\{ \begin{array}{l} \sum\limits_{j = 0}^{N - m - 1} {{x_{j + m}}{{\overline y }_j}} ,\;\;\;m \ge 0\\ {R_{yx}}( - m),\;\;\;\;\;\;\;\;m < 0 \end{array} \right.$$
(1)
$${C_{xy}}(m) = \left\{ \begin{array}{l} \sum\limits_{j = 0}^{N - m - 1} {({x_{j + m}} - \frac{1}{N}\sum\limits_{i = 0}^{N - 1} {{x_i}} )({{\overline y }_j} - \frac{1}{N}\sum\limits_{i = 0}^{N - 1} {{{\overline y }_i}} )} ,\;\;m \ge 0\\ {C_{yx}}( - m),\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;m < 0 \end{array} \right.$$
(2)
where x denotes each sample, N is the length of each sample, and m is the corresponding lag.
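A minimal NumPy sketch of these two computations is shown below; `x`, `y_pos`, and `y_neg` are illustrative names for one single-channel epoch and the two class-mean waveforms.

```python
import numpy as np

def temporal_features(x, y_bar):
    # Eq. (1): cross-correlation over all lags m = -(N-1) .. N-1
    r_xy = np.correlate(x, y_bar, mode="full")
    # Eq. (2): cross-covariance, i.e., the same sums on mean-removed signals
    c_xy = np.correlate(x - x.mean(), y_bar - y_bar.mean(), mode="full")
    return np.concatenate([r_xy, c_xy])

# Usage: features of one epoch against each class mean (synthetic stand-ins;
# a real epoch is 200-1000 ms of one channel, 160 samples at 200 Hz)
rng = np.random.default_rng(0)
x, y_pos, y_neg = rng.standard_normal((3, 160))
f1 = np.concatenate([temporal_features(x, y_pos), temporal_features(x, y_neg)])
```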
② Extraction of the level-1 features in the spectral domain (\(F_2\)): the extraction followed the approach of Huang et al. [39]. Empirical mode decomposition (EMD) was first performed to decompose the samples from each channel into intrinsic mode functions (IMFs) using
$$x(t) = \sum\limits_{i = 1}^{n} {c_{i} } + r_{n}$$
(3)
where \(c_i\) is the ith IMF, n is the number of IMFs obtained, and \(r_n\) is the residue after the EMD. Then the Hilbert transform was performed on each IMF component:
$$y_{i} (t) = \frac{1}{\pi }\int_{ - \infty }^{\infty } {\frac{{c_{i} (\tau )}}{{t - \tau }}} \,d\tau$$
(4)
The analytic signal \(z_i(t)\) was obtained by:
$$z_{i} (t) = c_{i} (t) + jy_{i} (t) = a_{i} (t)e^{{j\theta_{i} (t)}}$$
(5)
where \(a_i(t)\) and \(\theta_i(t)\) are the instantaneous amplitude and phase, respectively, which were calculated by:
$$a_{i} (t) = \sqrt {y_{i}^{2} (t) + c_{i}^{2} (t)}$$
(6)
$$\theta_{i} (t) = \arctan \frac{{y_{i} (t)}}{{c_{i} (t)}}$$
(7)
Then the instantaneous frequency of the ith IMF component was acquired by taking the derivative of \(\theta_i(t)\) as
$$\omega_{i} (t) = \frac{{d\theta_{i} (t)}}{dt}$$
(8)
Thus, the signal x(t) can be described as below, reflecting its changing amplitude over time and frequency:
$$x(t) = \sum\limits_{i = 1}^{n} {a_{i} (t)e^{{j\int {\omega_{i} (t)dt} }} } = H(\omega ,t)$$
(9)
The Hilbert spectrum for each IMF component was denoted as:
$$H_{i} (\omega ,t) = a_{i} (t)e^{{j\int {\omega_{i} (t)dt} }}$$
(10)
Finally, the relative energy coefficient (E), mean frequency (\(\Phi\)), mean slope (MS), and coefficient of variation (CV) were calculated as below to form the level-1 features in the spectral domain, \(F_{2} = \left[ {E_{i} \; \Phi_{i} \; MS_{i} \; CV_{i} } \right],\; i \in \{1,2, \ldots ,n\}\).
$$E_{i} = \frac{{\int_{ - \infty }^{\infty } {H_{i}^{2} (\omega ,t)d\omega } }}{{\sum\nolimits_{k = 1}^{3} {\int_{ - \infty }^{\infty } {H_{k}^{2} (\omega ,t)d\omega } } }}$$
(11)
$$\Phi _{i} = \frac{1}{N}\sum {\omega_{i} (t)}$$
(12)
$$MS_{i} = \frac{1}{N}\sum {\frac{{dc_{i} (t)}}{dt}}$$
(13)
$$CV_{i} = \frac{{\sigma_{i} }}{{\mu_{i} }}$$
(14)
where \(\mu_i\) and \(\sigma_i\) are the mean and standard deviation of the ith IMF component.
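The spectral feature chain of Eqs. (3)–(14) could be sketched as follows, assuming the third-party PyEMD package for the EMD step and SciPy's `hilbert` for the analytic signal; all variable names are illustrative, and keeping the first three IMFs is an assumption consistent with the 12-dimensional spectral feature (4 features × 3 IMFs) reported below.

```python
import numpy as np
from PyEMD import EMD               # third-party package (pip install EMD-signal)
from scipy.signal import hilbert

def spectral_features(x, fs, n_imfs=3):
    imfs = EMD()(x)[:n_imfs]                 # Eq. (3): keep the first IMFs
    feats = []
    for c in imfs:
        z = hilbert(c)                       # analytic signal, Eqs. (4)-(5)
        a, theta = np.abs(z), np.unwrap(np.angle(z))  # Eqs. (6)-(7)
        omega = np.gradient(theta) * fs      # instantaneous frequency, Eq. (8)
        feats.append([
            np.sum(a ** 2),                  # energy, normalized below (Eq. 11)
            omega.mean(),                    # mean frequency (Eq. 12)
            (np.gradient(c) * fs).mean(),    # mean slope (Eq. 13)
            c.std() / c.mean(),              # coefficient of variation (Eq. 14)
        ])
    feats = np.asarray(feats)
    feats[:, 0] /= feats[:, 0].sum()         # relative energy coefficients
    return feats.ravel()                     # F2 for one channel: 4 x n_imfs values

# Synthetic one-channel epoch, 800 ms at 200 Hz
x = np.random.default_rng(0).standard_normal(160)
f2 = spectral_features(x, fs=200)
```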
③ Extraction of the level-1 features in the spatial domain (\(F_3\)): the extraction was implemented through the six steps below, based on the approach of Ramoser et al. [27]:
- a. Calculate the mean covariance matrices \(\bar{R}_{p}\) and \(\bar{R}_{n}\) for the two classes (positive and negative feedback), and perform the eigenvalue decomposition \(\bar{R}_{p} + \bar{R}_{n} = U_{C} \lambda_{C} U_{C}^{T}\).
- b. Calculate the whitening matrix \(P = \sqrt {\lambda_{C}^{ - 1} } U_{C}^{T}\).
- c. Apply the whitening transformation to each mean covariance matrix: \(S_{i} = P\bar{R}_{i} P^{T},\; i \in \{n, p\}\).
- d. \(S_{n}\) and \(S_{p}\) share common eigenvectors B, i.e., \(S_{i} = B\lambda_{i} B^{T},\; i \in \{n, p\}\).
- e. Each row of the projection matrix \(W = B^{T} P\) is a common spatial pattern filter of the two classes.
- f. The feature set \(F_3\) consisted of \(Y_{i} = W_{i}^{T} X,\; i \in \{1, 2, \ldots, 56\}\).
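Steps (a)–(f) might be sketched in NumPy as below; `X_pos` and `X_neg` are synthetic stand-ins for the per-class epochs, and the trace normalization of each trial covariance follows the common CSP convention [27] rather than a detail stated here.

```python
import numpy as np

def mean_cov(epochs):
    # step (a): average trace-normalized spatial covariance across trials
    return np.mean([E @ E.T / np.trace(E @ E.T) for E in epochs], axis=0)

rng = np.random.default_rng(0)
X_pos = rng.standard_normal((20, 56, 160))   # synthetic: trials x channels x samples
X_neg = rng.standard_normal((20, 56, 160))

R_p, R_n = mean_cov(X_pos), mean_cov(X_neg)

lam, U = np.linalg.eigh(R_p + R_n)           # step (a): eigendecomposition
P = np.diag(lam ** -0.5) @ U.T               # step (b): whitening matrix
S_p = P @ R_p @ P.T                          # step (c): whitened class covariance
_, B = np.linalg.eigh(S_p)                   # step (d): S_n = I - S_p shares B
W = B.T @ P                                  # step (e): rows are spatial filters
Y = W @ X_pos[0]                             # step (f): project one epoch
```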
While each common spatial pattern filter (each row of \(W_{56 \times 56}\)) provides a mathematical means of combining features in the spatial domain, manual selection of filters is usually required to further improve performance [33, 40]; otherwise, classification may overfit in the resulting high-dimensional space [33]. In our method, however, no filter needs to be chosen manually: the neural-network-based dimensionality reduction from level 2 to level 3 allows all spatial filters to be used, bypassing this redundant manual work.
Dimensionality reduction
Each type of feature had a different dimensionality: 164, 12, and 161 dimensions for the temporal, spectral, and spatial features, respectively. Thus, the total length of the level-1 feature vector was 56 × (164 + 12 + 161). For convenience, we write it as 56 × 3 × M, where 56 is the number of channels, 3 is the number of feature types, and M ∈ {164, 12, 161} is the length of the corresponding feature. The whole dimensionality reduction process is illustrated in Fig. 2. The first dimensionality reduction collapsed the level-1 features from a 3D space into level-2 features on a 2D plane by replacing the samples in the level-1 features with posterior probabilities. The dimension of the level-2 features from all channels was then collapsed further, which can be visualized as the linearization of a plane (Fig. 2).
The feedforward BP neural network was used to reduce the dimensionality of the features. By inputting the multi-dimensional level-1 features F, the one-dimensional level-2 features F′ were acquired after dimensionality reduction by
$$F_{i}^{\prime } = {\text{tansig}}(W^{T} F_{i} + b)$$
(15)
where i ∈ {1, 2, 3} denotes the different feature types. \(W^{T}\) and b are the weights and bias of the neural network, respectively, acquired from the training datasets. tansig denotes the hyperbolic tangent sigmoid transfer function that calculates a layer's output from its net input:
$${\text{tansig}}(n) = \frac{2}{{1 + e^{ - 2n} }} - 1$$
(16)
Repeating the same steps with the level-2 features yielded the level-3 features F″.
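As a compact illustration, Eqs. (15)–(16) amount to a single trained tansig unit; the sketch below assumes the weights `W` and bias `b` have already been learned from the training set.

```python
import numpy as np

def tansig(n):
    # Eq. (16): hyperbolic tangent sigmoid (equivalent to np.tanh)
    return 2.0 / (1.0 + np.exp(-2.0 * n)) - 1.0

def reduce_features(F, W, b):
    # Eq. (15): map (trials, d) level-k features to (trials,) level-(k+1)
    return tansig(F @ W + b)

# Usage sketch: collapse 164-dim temporal features of one channel to one value
# F1_prime = reduce_features(F1_channel, W, b)   # W: (164,), b: scalar
```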
Classification
For classification, the feedforward neural network was implemented after obtaining the level-3 features \(p = [F_1''\; F_2''\; F_3'']\). The neural network can be described as
$$Output = {\text{logsig}}(W^{T} p + b)$$
(17)
where Output is the classification result. \(W^{T}\) and b are the weights and bias of the neural network, respectively, obtained from the level-3 features p using the training data. logsig is a transfer function:
$${\text{logsig}}(n) = \frac{1}{{1 + e^{ - n} }}$$
(18)
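Correspondingly, Eqs. (17)–(18) reduce to a trained logistic unit over the three level-3 features; the following sketch assumes `W` and `b` were obtained from the training data.

```python
import numpy as np

def logsig(n):
    # Eq. (18): logistic sigmoid
    return 1.0 / (1.0 + np.exp(-n))

def classify(p, W, b):
    # Eq. (17): probability of positive feedback from p = [F1'' F2'' F3'']
    return logsig(p @ W + b)

# Usage sketch: hard labels from thresholding the output at 0.5
# y_hat = (classify(level3, W, b) > 0.5).astype(int)
```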