Acquisition systems
ANT
The ANT acquisition system (Advanced Neuro Technology, ANT, Enschede, The Netherlands) is composed of a high-density WaveGuard cap system and the corresponding full-band DC amplifier. As shown in Figure 1, the WaveGuard cap has 128 Ag/AgCl electrodes covering all the major brain cortical areas (based on the International 10–20 system), with shielded wires to reduce the influence of external noise. Moreover, three different cap sizes are available to fit the subject's head as well as possible. The full-band DC amplifier can reach a sampling rate of 2048 Hz.
In our experiment, electrode impedance was measured and maintained under 20 kΩ for each channel using electrode gel, and signals were visually checked before recording. The studied electrode positions are the same as those provided by the Emotiv Epoc headset. The overall hardware and software cost is around $30,000–50,000.
The ANT device is provided with the ASA software, which is composed of several main tools: pre-processing, Event-Related Potential (ERP) analysis, source reconstruction using inverse models and time-frequency analysis. All these aspects allow more advanced users, typically researchers or physicians, to study brain signals in depth.
Emotiv EPOC
As announced on the Emotiv website, the Emotiv Epoc headset and its research Software Development Kit mainly include 14 channels (plus CMS/DRL references at the P3/P4 locations), each based on saline sensors. The available channels (also placed according to the International 10–20 system) are depicted in Figure 2. This headset cannot cope with all BCI paradigms with the same success without hardware modification. For instance, the motor imagery paradigm, which requires central electrodes, is expected to yield poor performance. As shown in Figure 3, the headset is completely wireless, and its lithium-based battery provides a large autonomy of 12 hours. The sampling rate can reach 128 Hz. Additionally, a gyroscope generates optimal positional information for cursor and camera controls.
In our experiment, all the standard available electrodes of the Emotiv Epoc headset were used. Electrode impedance was decreased using saline liquid until the level required by the software was reached (in the 10–20 kΩ range) and was checked throughout the experiment. For the research edition, the total cost is $750.
The Emotiv headset is provided with three detection software suites: Expressiv, Affectiv and Cognitiv. The Expressiv suite interprets the user's facial expressions in real time, the Affectiv suite monitors the user's emotional states in real time, and the Cognitiv suite performs standard BCI-like control.
BCI system
P300-based approach
As illustrated in Figure 4, the P300 evoked potential is an involuntary positive potential that arises around 300 ms after the user has perceived a relevant and rare stimulus [16]. It is commonly used in an odd-ball paradigm, in which the user is requested to attend to a random sequence composed of two kinds of stimuli, with one stimulus much less frequent than the other. If the infrequent stimulus is relevant to the user, who is attending to it (e.g. silently counting its occurrences), its appearance elicits a P300 waveform in the user's EEG, mainly located in the parietal areas [17].
Following previous work [13, 18] and inspired by the 6×6 matrix P300-speller text editor [19], we are interested in a four-state BCI, as depicted in Figure 5. Indeed, if mobile applications are to be considered (e.g. control of a prosthesis or augmented interaction in daily life), one cannot afford many states.
The recognition of a given state is quite simple. At the beginning of a trial, one of the four letters is highlighted in green. The subject then has to look at this letter while each row/column is flashed 12 times; the repetitions compensate for the low SNR caused by the disturbances of other brain, muscular and ocular activities. From the intersection of the detected P300 responses, the computer is able to determine which letter/symbol the subject was looking at.
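As an illustration, the intersection step can be sketched in a few lines. The 2×2 layout and the letters below are hypothetical, chosen only to show how a detected row and column jointly identify one of the four states:

```python
# Hypothetical 2x2 layout of the four states (letters are illustrative).
grid = [["A", "B"],
        ["C", "D"]]

def select(detected_row, detected_col):
    """Return the state at the intersection of the detected P300 row/column."""
    return grid[detected_row][detected_col]
```

For example, a P300 detected on the second row and the first column selects the bottom-left state.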
Obviously, this interface is not strictly an odd-ball paradigm: each letter is flashed 50% of the time, which is not really a rare event. However, previous work showed that this approach provides good results, and it was thus used in this paper [18].
For out-of-the-lab applications, the requirement of an external screen to present the stimuli used in this experiment could be problematic. However, thanks to emerging, well-designed VUZIX augmented reality eyewear (Vuzix, Rochester, NY, USA), this problem could be circumvented. As shown in Figure 6, by displaying stimuli on a semi-transparent module containing all the key hardware elements, the device should allow ambulatory P300 applications. Again, the tradeoff of four different states, represented by a letter at each of the four corners of the semi-transparent glasses, appears to be a realistic solution.
P300 pipeline
Considering the Emotiv electrodes on the ANT device (using a common average reference for both devices), the pipeline of the approach (shown in Figure 7) includes several main stages: a temporal high-pass filter, an xDAWN-based spatial filter, epoch averaging and a Linear Discriminant Analysis (LDA) classifier with a voting rule for the final decision. To provide more coherent comparison results, this is the same pipeline as the one developed in our conference paper initiating the comparison of both systems [13]. As discussed in [20], gait-related artefact removal techniques do not bring significantly better performance and are thus not used in this pipeline. Ocular and muscular artefacts are essentially not linked to the P300 task (except the first gaze towards the letter), so their effect is strongly mitigated by the averaging. This procedure was implemented in the open-source OpenVibe software [21].
In order to obtain better performance, the EEG signal is high-pass filtered at 1 Hz using a 4th-order Butterworth filter. By trial and error, we observed that downsampling the data to 32 Hz (the OpenVibe downsampling box applies an anti-aliasing filter to avoid adding noise) allows a better behaviour of the LDA classifier while decreasing the observed noise. Indeed, the P300 potential is mainly located below 16 Hz, whereas an undesired slow biological drift can interfere with the pipeline [22]. This also removes high-frequency noise such as power line interference.
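This temporal pre-processing stage can be sketched with scipy, as a minimal stand-in for the OpenVibe boxes. The sampling rate, channel count and random data below are toy values, not the study's recordings; the two-stage decimation mirrors the anti-aliased downsampling mentioned in the text:

```python
import numpy as np
from scipy.signal import butter, decimate, sosfiltfilt

fs = 2048  # toy value: ANT amplifier rate (the Emotiv Epoc records at 128 Hz)

# 4th-order Butterworth high-pass at 1 Hz, in second-order sections for stability
sos = butter(4, 1.0, btype="highpass", fs=fs, output="sos")

rng = np.random.default_rng(0)
eeg = rng.standard_normal((14, fs * 2))  # 14 channels, 2 s of toy data

# Zero-phase filtering removes the slow drift without shifting the P300 latency
filtered = sosfiltfilt(sos, eeg, axis=1)

# Downsample 2048 Hz -> 32 Hz in two anti-aliased stages (8 x 8 = 64)
downsampled = decimate(decimate(filtered, 8, axis=1, zero_phase=True),
                       8, axis=1, zero_phase=True)
```

Zero-phase filtering is used here so that the averaging and epoching stages downstream see an undistorted P300 latency; a causal filter would be needed for a strictly online pipeline.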
Afterwards, a spatial filter is designed thanks to the xDAWN algorithm [23]. This algorithm aims at magnifying the P300 response by considering both signal and noise, contrary to a common principal component analysis. Target/non-target epochs have to be fed separately into the algorithm. By linearly combining EEG channels, this algorithm defines P300 and noise subspaces; projecting the EEG signals into these subspaces enhances P300 detection. In this paper, three projection components were retained, as the authors basically advise dividing the number of channels by four.
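The core idea can be sketched as a generalized eigenvalue problem: find the channel combinations whose evoked-response power is large relative to the power of the ongoing recording. The sketch below is a simplification, not the full least-squares xDAWN of [23] (which estimates the evoked response by least squares rather than plain averaging); data dimensions are toy values:

```python
import numpy as np
from scipy.linalg import eigh

def xdawn_filters(target_epochs, raw_data, n_components=3):
    """Simplified xDAWN-style spatial filters.

    target_epochs : (n_epochs, n_channels, n_samples) target-locked epochs
    raw_data      : (n_channels, n_total_samples) continuous recording
    """
    # Rough evoked-response estimate: average of the target epochs
    evoked = target_epochs.mean(axis=0)
    c_signal = evoked @ evoked.T / evoked.shape[1]     # evoked covariance
    c_noise = raw_data @ raw_data.T / raw_data.shape[1]  # signal+noise covariance
    # Generalized eigenvalue problem: the largest eigenvalues give the
    # directions where the evoked response dominates the background.
    vals, vecs = eigh(c_signal, c_noise)
    return vecs[:, ::-1][:, :n_components]  # (n_channels, n_components)

rng = np.random.default_rng(1)
epochs = rng.standard_normal((20, 14, 19))  # 20 target epochs, 14 channels
raw = rng.standard_normal((14, 3000))
W = xdawn_filters(epochs, raw)
projected = W.T @ raw  # three "virtual" channels fed to the rest of the pipeline
```

With 14 channels, keeping three components follows the "divide by four" rule of thumb mentioned above.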
Then, as proposed in the OpenVibe software, we use a 600 ms epoch time window, whose beginning is synchronized with the flashed target. In order to obtain a better SNR, groups of two epochs corresponding to a specific row/column are averaged. The flash, no-flash and inter-repetition durations are respectively 0.2 s, 0.1 s and 1 s.
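The epoching and pairwise averaging can be sketched as follows. The signal, flash onsets and channel count are toy values (three channels, as if coming out of the xDAWN projection), not the study's data:

```python
import numpy as np

fs = 32                      # post-downsampling rate (Hz)
epoch_len = int(0.6 * fs)    # 600 ms window -> 19 samples at 32 Hz

def cut_epochs(signal, flash_onsets, length):
    """Cut one fixed-length epoch (channels x samples) per flash onset."""
    return np.stack([signal[:, s:s + length] for s in flash_onsets])

def average_pairs(epochs):
    """Average consecutive pairs of epochs of the same row/column."""
    return epochs.reshape(-1, 2, *epochs.shape[1:]).mean(axis=1)

rng = np.random.default_rng(2)
sig = rng.standard_normal((3, fs * 60))   # 3 virtual channels, 60 s of toy data
onsets = np.arange(12) * 40               # toy flash onsets, in samples
eps = cut_epochs(sig, onsets, epoch_len)  # 12 epochs
grouped = average_pairs(eps)              # 6 two-epoch averages
```

Each two-epoch average then yields one feature vector per row/column for the classifier described next.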
Finally, a 12-fold Linear Discriminant Analysis (LDA) classifier is used to detect whether a P300 was elicited in the brain. In the k-fold approach, the training set is split into k uniform groups; k−1 groups are used to train the LDA classifier, and the test is performed on the remaining group. After performing this k times, the classifier obtaining the best results is chosen. The reported k-fold value is the average of the k training performances. For each two-grouped time window, the output value of the classifier represents the distance to the hyperplane that best separates the target/non-target P300 classes; this value can also be considered a confidence measure. For a given trial, the voting classifier determines which row/column was activated by summing six consecutive LDA outputs (12 repetitions) and choosing the most probable target.
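The voting rule itself is simple enough to sketch directly. The scores below are synthetic stand-ins for LDA hyperplane distances (one per two-epoch average and per row/column), not classifier outputs from the study:

```python
import numpy as np

def vote(lda_scores):
    """Sum the signed LDA distances over the six two-epoch averages for
    each of the six rows (or columns) and pick the most probable target.

    lda_scores : array of shape (6 averaged repetitions, 6 rows/columns)
    """
    return int(np.argmax(lda_scores.sum(axis=0)))

rng = np.random.default_rng(3)
scores = rng.standard_normal((6, 6))
scores[:, 4] += 3.0   # synthetic offset: make index 4 the clear target
chosen = vote(scores)
```

Summing the distances before taking the argmax means a few noisy single-repetition decisions are outweighed by the consistent responses, which is why the test-set accuracy is higher than the single-epoch k-fold rate discussed below.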
Performance evaluation
Performance measures
In this paper, two performance measures are assessed: the k-fold classification rate and the test set classification rate. The former is the single two-grouped epoch classification accuracy, i.e. without any temporal averaging, and is obtained only on the training set. It helps to assess the difficulty of learning the data with a different hardware device, and could be interpreted as an indirect measure of the SNR through their intrinsic correlation: if the SNR is increased, the classification task becomes easier, leading to enhanced performance. Because specific care was taken with the Emotiv Epoc electrode positioning, the effect of misalignment should be strongly mitigated. Furthermore, the P300 response has an inter-subject distribution variability, so this effect should average out across subjects.
The test set classification rate introduces averaging into the decision process: in the P300 pipeline, this is performed by a voting classifier over six consecutive repetitions. This measure thus assesses the overall system performance and may be considered an indication, albeit incomplete, of the perceived usability [24].
Experiment description
In order to compare both devices, two different experimental conditions were tested: sitting on a chair and walking at 3 km/h on a treadmill, a convenient speed for the subjects. The ambulatory condition was considered to detect whether the devices have similar relative performance when realistic movement artefacts are produced; it also assesses whether the recording systems are sufficiently well fixed. To train the classifiers and to assess the entire system for each condition separately, each session was composed of one training set and one test set of 25 randomly chosen trials (around 12 minutes for each session). The total duration of the experiment per device was around one hour and a quarter (including breaks and data checking). For a given subject, recordings were performed on different days, in a random order.
Eight healthy male subjects and one healthy female subject, aged between 24 and 34 years, participated in this experiment. None had any known locomotion-related or P300-disturbing disease or handicap. During the P300 experiment, in both conditions, a 20-inch screen (refresh rate = 60 Hz) was placed about 1.5 m in front of the subject. All procedures were approved by the Université Libre de Bruxelles Internal Review Board and complied with the standards defined in the Declaration of Helsinki.
Statistical analysis
In order to detect whether the Emotiv Epoc headset is competitive with respect to the medical device, our statistical assessment focuses only on comparisons between both devices under each condition, excluding cross-condition analysis. Indeed, the null hypothesis H0 assumes that the Emotiv headset is the best device or is equivalent; we then look for evidence to reject it, which would mean that the headset is likely outperformed.
As applied in [24], although our design follows a repeated-measures analysis of variance (ANOVA) for each performance measure [25, 26], the omnibus F test was not performed. Indeed, the omnibus ANOVA F test is not a necessary condition for controlling the family-wise error rate (FWER), whatever the applied post-hoc tests [26, 27]. In this case, degrees of freedom are spent on somewhat useless statistical tests, i.e. tests that do not really correspond to the research question. Furthermore, the omnibus F test might show no significance while some of the underlying t-tests are significant, leading to a decrease of power and an overall more conservative test. The important point to remember is that researchers cannot keep running different statistical analyses until they obtain the results they desire, as the FWER quickly inflates.
Instead of the widely used procedure, we thus defined only a limited number of a priori comparisons, following the prescriptions of [24, 26, 27]. First, we defined all the pairwise comparisons. We then performed standard paired t-tests, whose single assumption is normality of the data, with a standard alpha level of 5%. Given that the number of comparisons equals the degrees of freedom, FWER inflation is controlled without any further adjustment, which leads to a much more powerful test.
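Such a directional paired comparison can be sketched with scipy. The per-subject accuracies below are hypothetical numbers, not the study's results; the one-sided alternative matches the H0 stated above (Emotiv at least equivalent, rejected if the medical device scores higher):

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(4)
# Hypothetical per-subject classification rates for one condition (9 subjects)
ant = np.clip(0.80 + 0.05 * rng.standard_normal(9), 0.0, 1.0)
emotiv = np.clip(0.65 + 0.05 * rng.standard_normal(9), 0.0, 1.0)

# Paired test: the same subjects are measured with both devices.
# alternative="greater" tests whether ANT outperforms the Emotiv Epoc.
t_stat, p_value = ttest_rel(ant, emotiv, alternative="greater")
reject_h0 = p_value < 0.05  # alpha = 5%, as in the text
```

A paired design is important here: it removes the large between-subject variability of P300 amplitudes from the comparison.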
However, obtaining significant results is not enough, and the effect size is at least as important [24, 26, 28]. Significance only assesses whether there is enough evidence to conclude that a likely effect exists between two or more groups; it does not provide information about the size of this effect. If a difference is significant but trivial, the best method is not really outperforming the others. The normalized, unbiased Hedges' g* effect-size measure tackles this problem by evaluating the effect in a standardized way and by providing rules of thumb about how big the effect is. For instance, absolute Hedges' g* values around 0.1 (≤ 0.16), 0.2 (0.17–0.32), 0.5 (0.33–0.55) or 0.8 (0.56–1.2) respectively denote a trivial, small, medium or large effect according to [29, 30].
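One common formulation of the bias-corrected measure can be sketched as follows (the study used Matlab and the toolbox of [28]; this is an independent numpy sketch, and the accuracy values are hypothetical illustrations, not the paper's data):

```python
import numpy as np

def hedges_g_star(x, y):
    """Cohen's d with pooled SD, corrected for small-sample bias
    (one common formulation of the unbiased Hedges' g*)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) \
                 / (nx + ny - 2)
    d = (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)
    correction = 1.0 - 3.0 / (4.0 * (nx + ny - 2) - 1.0)  # small-sample factor
    return d * correction

# Hypothetical per-subject accuracies for two devices (illustration only)
a = np.array([0.82, 0.78, 0.85, 0.80, 0.79, 0.84, 0.81, 0.77, 0.83])
b = np.array([0.70, 0.66, 0.74, 0.69, 0.68, 0.73, 0.71, 0.64, 0.72])
g = hedges_g_star(a, b)
```

With these illustrative numbers the effect lands well beyond 1.2, i.e. a large effect under the rules of thumb above.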
Furthermore, a single effect-size value is not sufficient, and a 95% confidence interval should be studied. This indicates how well the current effect size estimates the underlying one; obviously, the more data are used, the more precise the interval, allowing more reliable conclusions. To compute all these values, Matlab and a neuroscience toolbox were used [28].