Experimental protocol
EEG data from the BCI challenge at the IEEE EMBS NER 2015 conference were chosen for evaluation [35]. Perrin et al. designed the experimental protocol and collected the EEG data [36]. Twenty-six healthy subjects took part in this study (13 males and 13 females; mean age = 28.8 ± 5.4, range 20–37). All subjects went through five copy-spelling sessions. Each session consisted of 12 five-letter words, except the fifth, which consisted of 20 five-letter words.
All subjects reported normal or corrected-to-normal vision and had no previous experience with the P300 speller paradigm or any other BCI applications. EEG data were recorded with 56 passive Ag/AgCl EEG sensors whose placement followed the extended international 10–20 system. Their signals were all referenced to a reference sensor at the nose. The ground electrode was placed on the shoulder, and impedances were kept below 10 kΩ. Signals were sampled at 600 Hz.
To evaluate the generalization ability across subjects, the data from 16 participants were used for training and those from the remaining 10 for testing.
Preprocessing
The downloaded EEG data had already been downsampled to 200 Hz.
Since previous literature indicates that the information of error-related potentials mainly falls into the theta band and mu rhythm [17, 25], before further processing we first applied a fourth-order Butterworth band-pass filter (1–20 Hz) to remove the DC component and high-frequency noise [37]. After that, independent component analysis (ICA) was applied to the filtered EEG data to remove common artifacts, such as eye movements and electrocardiography (ECG). EEG data from all channels were then referenced to a common average reference (CAR) to further increase the signal-to-noise ratio [38]. Finally, all data points of each epoch between 200 and 1000 ms after feedback onset were selected as one sample.
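The filtering, re-referencing, and epoching steps can be sketched with SciPy (the ICA step is omitted here; the synthetic data and function name are our own, while the filter order, pass-band, epoch window, and channel count follow the text):

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 200          # Hz, sampling rate after downsampling
N_CH = 56         # number of EEG channels

def preprocess(eeg, feedback_onset):
    """Band-pass filter, common-average reference, and epoch one trial.

    eeg: (channels, samples) array; feedback_onset: sample index of feedback.
    """
    # 4th-order Butterworth band-pass, 1-20 Hz (zero-phase via filtfilt)
    b, a = butter(4, [1 / (FS / 2), 20 / (FS / 2)], btype="band")
    filtered = filtfilt(b, a, eeg, axis=1)
    # common average reference: subtract the mean over channels at each sample
    car = filtered - filtered.mean(axis=0, keepdims=True)
    # epoch: 200-1000 ms after feedback onset
    start = feedback_onset + int(0.2 * FS)
    stop = feedback_onset + int(1.0 * FS)
    return car[:, start:stop]

# synthetic example: 56 channels, 3 s of noise, feedback at t = 1 s
rng = np.random.default_rng(0)
eeg = rng.standard_normal((N_CH, 3 * FS))
epoch = preprocess(eeg, feedback_onset=FS)
```

At 200 Hz, the 200–1000 ms window yields 160 samples per channel per epoch.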
Feature extraction
Features from the temporal, spectral, and spatial domains were extracted from the EEG signals. A back-propagation (BP) neural network was then used to perform two rounds of dimensionality reduction. Finally, the resulting three features from the temporal, spectral, and spatial domains were fed into another BP neural network for classification. The procedure is detailed in Fig. 1.

Step 1:
Extract the temporal features \(F_1\) from each EEG channel as the level-1 features.

Step 2:
Use the level-1 features \(F_1\) from the training group to train a BP neural network, which was then applied to classify the \(F_1\) features. The derived one-dimensional posterior probabilities were the level-2 features \(F_1'\).

Step 3:
Use the level-2 features \(F_1'\) from all channels to train another BP neural network, which was applied to classify the 56-dimensional level-2 features \(F_1'\), resulting in the one-dimensional level-3 temporal feature \(F_1''\).

Step 4:
Extract the level-1 features \(F_2\) in the spectral domain and repeat steps 2 and 3 to obtain the one-dimensional spectral feature \(F_2''\).

Step 5:
Extract the level-1 features \(F_3\) in the spatial domain and repeat steps 2 and 3 to obtain the one-dimensional spatial feature \(F_3''\).

Step 6:
Combine the three features \([F_1''\ F_2''\ F_3'']\) from the training group to train a feed-forward neural network, which was applied to classify samples from the testing group.
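Steps 1–6 can be sketched as a data flow in NumPy. The per-channel and cross-channel "networks" below are stand-ins with random placeholder weights, used only to make the shapes of the three feature levels concrete; in the actual method they are BP neural networks trained on the training group, and the feature length 164 is the temporal case described later in the text:

```python
import numpy as np

rng = np.random.default_rng(1)

def posterior(x, w, b):
    """Stand-in for a trained BP network: posterior probability of 'error'."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

n_trials, n_ch, m = 100, 56, 164          # level-1 temporal features per channel
F1 = rng.standard_normal((n_trials, n_ch, m))

# Step 2: one network per channel maps M-dim features -> 1 posterior (level 2)
w_ch = rng.standard_normal((m,)) * 0.1
F1p = posterior(F1, w_ch, 0.0)            # shape (n_trials, 56)

# Step 3: one network maps the 56 channel posteriors -> 1 scalar (level 3)
w_sp = rng.standard_normal((n_ch,)) * 0.1
F1pp = posterior(F1p, w_sp, 0.0)          # shape (n_trials,)
```

The same two reductions (steps 4 and 5) are repeated for the spectral and spatial feature sets before the three scalars are concatenated for step 6.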
To extract the level1 features at different domains, a series of algorithms were implemented and described as below.
① Extraction of the level-1 features in the temporal domain (\(F_1\)): the training data were separated into two classes based on their labels, i.e., positive and negative feedback. \(\bar{y},\; y \in \{\text{Positive}, \text{Negative}\},\) denotes the mean of each class. Then the correlation \(R_{xy}\) and covariance \(C_{xy}\) between each sample x and \(\bar{y}\) were computed as the feature set \(F_1\), using
$${R_{xy}}(m) = \left\{ \begin{array}{ll} \sum\limits_{j = 0}^{N - m - 1} {{x_{j + m}}{{\overline y }_j}}, & m \ge 0\\ {R_{yx}}(-m), & m < 0 \end{array} \right.$$
(1)
$${C_{xy}}(m) = \left\{ \begin{array}{ll} \sum\limits_{j = 0}^{N - m - 1} {\Big({x_{j + m}} - \frac{1}{N}\sum\limits_{i = 0}^{N - 1} {{x_i}} \Big)\Big({{\overline y }_j} - \frac{1}{N}\sum\limits_{i = 0}^{N - 1} {{{\overline y }_i}} \Big)}, & m \ge 0\\ {C_{yx}}(-m), & m < 0 \end{array} \right.$$
(2)
where x denotes each sample, N is the length of each sample and m is the corresponding latency.
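Eqs. (1) and (2) can be implemented directly; this sketch (the function names are ours) computes the lagged correlation and covariance between one sample and a class mean:

```python
import numpy as np

def r_xy(x, ybar, m):
    """Eq. (1): lagged cross-correlation between a sample and a class mean."""
    if m < 0:
        return r_xy(ybar, x, -m)          # R_xy(m) = R_yx(-m) for m < 0
    N = len(x)
    return float(np.dot(x[m:N], ybar[: N - m]))

def c_xy(x, ybar, m):
    """Eq. (2): lagged cross-covariance (means removed before correlating)."""
    return r_xy(x - x.mean(), ybar - ybar.mean(), m)

x = np.array([1.0, 2.0, 3.0, 4.0])
ybar = np.ones(4)
print(r_xy(x, ybar, 0))   # 10.0: sum of x against an all-ones class mean
```

Sweeping m over the lags of interest for every channel yields the temporal feature set \(F_1\).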
② Extraction of the level-1 features in the spectral domain (\(F_2\)): the extraction followed the approach of Huang et al. [39]. Empirical mode decomposition (EMD) was first performed to decompose the samples from each channel into intrinsic mode functions (IMFs) using
$$x(t) = \sum\limits_{i = 1}^{n} {c_{i} } + r_{n}$$
(3)
where \(c_i\) is the ith IMF, n is the number of IMFs, and \(r_n\) is the residue after EMD. Then the Hilbert transform was performed on each IMF component:
$$y_{i} (t) = \frac{1}{\pi }\int_{-\infty }^{\infty } {\frac{{c_{i} (\tau)}}{{t - \tau}}}\, d\tau$$
(4)
The analytic signal \(z_i(t)\) was obtained by:
$$z_{i} (t) = c_{i} (t) + jy_{i} (t) = a_{i} (t)e^{{j\theta_{i} (t)}}$$
(5)
where \(a_i(t)\) and \(\theta_i(t)\) are the instantaneous amplitude and phase, respectively, calculated by:
$$a_{i} (t) = \sqrt {y_{i}^{2} (t) + c_{i}^{2} (t)}$$
(6)
$$\theta_{i} (t) = \arctan \frac{{y_{i} (t)}}{{c_{i} (t)}}$$
(7)
Then the instantaneous frequency of the ith IMF component was obtained by taking the derivative of \(\theta_i(t)\):
$$\omega_{i} (t) = \frac{{d\theta_{i} (t)}}{dt}$$
(8)
Thus, the signal x(t) can be described as below, reflecting its changing amplitude along time and frequency:
$$x(t) = \sum\limits_{i = 1}^{n} {a_{i} (t)e^{{j\int {\omega_{i} (t)dt} }} } = H(\omega ,t)$$
(9)
The Hilbert spectrum for each IMF component was denoted as:
$$H_{i} (\omega ,t) = a_{i} (t)e^{{j\int {\omega_{i} (t)dt} }}$$
(10)
Finally, the relative energy coefficient (E), mean frequency (\(\Phi\)), mean slope (MS), and coefficient of variation (CV) were calculated as below to form the level-1 features in the spectral domain, \(F_{2} = \left[ {E_{i} \quad \Phi_{i} \quad MS_{i} \quad CV_{i} } \right],\quad i \in 1, 2, \ldots, n\).
$$E_{i} = \frac{{\int_{-\infty }^{\infty } {H_{i}^{2} (\omega ,t)d\omega } }}{{\sum\nolimits_{k = 1}^{3} {\int_{-\infty }^{\infty } {H_{k}^{2} (\omega ,t)d\omega } } }}$$
(11)
$$\Phi _{i} = \frac{1}{N}\sum {\omega_{i} (t)}$$
(12)
$$MS_{i} = \frac{1}{N}\sum {\frac{{dc_{i} (t)}}{dt}}$$
(13)
$$CV_{i} = \frac{{\sigma_{i} }}{{\mu_{i} }}$$
(14)
where \(\mu_i\) and \(\sigma_i\) are the mean and standard deviation of the ith IMF component.
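The Hilbert stage, Eqs. (4)–(8) together with the mean frequency of Eq. (12), can be sketched with SciPy's analytic-signal helper. EMD itself needs a dedicated implementation (e.g., the PyEMD package), so a pure 10 Hz tone stands in for one IMF here:

```python
import numpy as np
from scipy.signal import hilbert

FS = 200.0
t = np.arange(0, 2.0, 1 / FS)
c = np.sin(2 * np.pi * 10 * t)      # stand-in for one IMF: a 10 Hz component

z = hilbert(c)                      # analytic signal z = c + j*H{c}, Eqs (4)-(5)
a = np.abs(z)                       # instantaneous amplitude, Eq. (6)
theta = np.unwrap(np.angle(z))      # instantaneous phase, Eq. (7)
freq = np.diff(theta) * FS / (2 * np.pi)   # instantaneous frequency, Eq. (8), Hz

mean_freq = freq[20:-20].mean()     # Eq. (12), edges trimmed for boundary effects
```

For this tone the recovered amplitude is ≈ 1 and the mean frequency ≈ 10 Hz; the remaining spectral features (E, MS, CV) are simple statistics of `a`, `freq`, and `c`.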
③ Extraction of the level-1 features in the spatial domain (\(F_3\)): the extraction was implemented in six steps based on the approach of Ramoser et al. [27]

a.
Calculate the mean covariance matrices \(\bar{R}_{p}\) and \(\bar{R}_{n}\) for the two classes (positive and negative feedback), and perform the eigenvalue decomposition \(\bar{R}_{p} + \bar{R}_{n} = U_{C} \lambda_{C} U_{C}^{T}\)

b.
Calculate the whitening transformation matrix \(P = \sqrt {\lambda_{C}^{-1} } U_{C}^{T}\)

c.
Apply the whitening transformation to the mean covariance matrices: \(S_{i} = P\bar{R}_{i} P^{T},\ i \in [n,p]\)

d.
\(S_n\) and \(S_p\) share common eigenvectors B, i.e., \(S_i = B\lambda_i B^T,\ i \in [n, p]\)

e.
Each row of the projection matrix \(W = B^T P\) is a common spatial pattern of the two classes

f.
The feature set \(F_3\) consisted of \(Y_i = W_i^T X,\ (i \in 1, 2, \ldots, 56)\).
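Steps a–e map directly onto NumPy's symmetric eigendecomposition. The sketch below (synthetic covariances and our own function name) returns the projection matrix W; because \(S_p + S_n = I\) after whitening, the eigenvectors of \(S_p\) diagonalize both classes at once:

```python
import numpy as np

def csp(R_p, R_n):
    """Common spatial patterns from two mean covariance matrices (steps a-e)."""
    # a. eigendecomposition of the composite covariance
    lam, U = np.linalg.eigh(R_p + R_n)
    # b. whitening matrix P = sqrt(lambda^-1) U^T
    P = np.diag(1.0 / np.sqrt(lam)) @ U.T
    # c. whiten one class covariance
    S_p = P @ R_p @ P.T
    # d. its eigenvectors B also diagonalize S_n, since S_p + S_n = I
    _, B = np.linalg.eigh(S_p)
    # e. projection matrix: each row is one spatial filter
    return B.T @ P

rng = np.random.default_rng(2)
A, C = rng.standard_normal((2, 6, 6))
R_p = A @ A.T + 0.1 * np.eye(6)          # synthetic positive-class covariance
R_n = C @ C.T + 0.1 * np.eye(6)          # synthetic negative-class covariance
W = csp(R_p, R_n)
D_p = W @ R_p @ W.T                      # diagonal: per-filter class-p variance
```

Applying each row of W to a trial X (step f) then yields the 56 spatially filtered signals that form \(F_3\).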
While the common spatial pattern filters (each row of \(W_{56\times 56}\)) provide a mathematical means of combining features in the spatial domain, manual selection of filters is usually still required to improve performance [33, 40]; otherwise, overfitting can occur in classification because of the high-dimensional feature space [33]. In our method, however, the filters need not be chosen manually: the neural-network-based dimensionality reduction from level 2 to level 3 allows all spatial filters to be used, bypassing the redundant manual work.
Dimensionality reduction
Each type of feature had a different dimensionality: there were 164, 12, and 161 dimensions in the temporal, spectral, and spatial features, respectively. Thus, the total length of the level-1 feature vector was 56 × (164 + 12 + 161). For convenience, we write it as 56 × 3 × M, where 56 is the number of channels, 3 is the number of feature types, and M ∊ [164, 12, 161] is the length of the corresponding feature. The whole dimensionality-reduction process is illustrated in Fig. 2. The first dimensionality reduction collapsed the level-1 features from a 3D space to level-2 features on a 2D plane, by replacing the samples in the level-1 features with posterior probabilities. The dimension of the level-2 features from all channels was then further collapsed, which can be visualized as the linearization of a plane (Fig. 2).
A feed-forward BP neural network was used to reduce the dimensionality of the features. Multidimensional level-1 features F were input, and one-dimensional level-2 features F′ were obtained after dimensionality reduction by
$$F_{i}^{\prime } = {\text{tansig}}(W^{T} F_{i} + b)$$
(15)
where i ∊ [1, 2, 3] indexes the different features. \(W^T\) and b are the weights and bias of the neural network, respectively, acquired from the training datasets. tansig denotes the hyperbolic tangent sigmoid transfer function, which calculates a layer's output from its net input:
$${\text{tansig}}(n) = \frac{2}{{1 + e^{-2n} }} - 1$$
(16)
Repeating the same steps with the level-2 features yielded the level-3 features F″.
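Eq. (16) is numerically identical to the hyperbolic tangent, which a short check confirms:

```python
import numpy as np

def tansig(n):
    """Eq. (16): hyperbolic tangent sigmoid, identical to tanh(n)."""
    return 2.0 / (1.0 + np.exp(-2.0 * n)) - 1.0

x = np.linspace(-3, 3, 7)
```

Since tansig(n) = tanh(n), the reduction of Eq. (15) is simply a one-unit tanh layer applied to each feature vector.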
Classification
For classification, a feed-forward neural network was implemented after obtaining the level-3 features \(p = [F_1''\ F_2''\ F_3'']\). The neural network can be described as
$$Output = {\text{logsig}}(W^{T} p + b)$$
(17)
where Output is the classification result. \(W^T\) and b are the weights and bias of the neural network, respectively, obtained from the level-3 features p using the training data. logsig is a transfer function:
$${\text{logsig}}(n) = \frac{1}{{1 + e^{-n} }}$$
(18)
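A minimal sketch of the classification stage, with random placeholder weights standing in for the trained network (in the method, W and b come from training on the level-3 features of the training group):

```python
import numpy as np

def logsig(n):
    """Eq. (18): logistic sigmoid, mapping net input to a (0, 1) probability."""
    return 1.0 / (1.0 + np.exp(-n))

# level-3 feature vectors for a batch of trials: [F1'' F2'' F3'']
rng = np.random.default_rng(3)
p = rng.uniform(0, 1, size=(10, 3))
W = rng.standard_normal(3) * 0.1     # placeholder weights (trained in practice)
b = 0.0
output = logsig(p @ W + b)           # Eq. (17): one probability per trial
labels = (output > 0.5).astype(int)  # threshold: positive vs negative feedback
```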