The effects of semantic congruency: a research of audiovisual P300-speller

Background Over the past few decades, many aspects of brain–computer interfaces (BCIs) have been studied. Of particular interest are event-related potential (ERP)-based BCI spellers, which aim to support mental typewriting. BCI systems based on audiovisual stimuli have recently attracted much attention from researchers, and most existing studies of audiovisual BCIs have been based on a semantic incongruent stimuli paradigm. However, no study has reported whether system performance or participant comfort differs between a BCI based on a semantic congruent paradigm and one based on a semantic incongruent paradigm.
Methods The goal of this study was to investigate the effects of semantic congruency on system performance and participant comfort in audiovisual BCI. Two audiovisual paradigms (semantic congruent and incongruent) were adopted, and 11 healthy subjects participated in the experiment. High-density electrical mapping of ERPs and behavioral data were measured for the two stimuli paradigms.
Results The behavioral data indicated no significant difference between the congruent and incongruent paradigms in offline classification accuracy. Nevertheless, eight of the 11 participants reported a preference for the semantic congruent experiment, two reported no difference between the two conditions, and only one preferred the semantic incongruent paradigm. In addition, the results indicated that a higher ERP amplitude was found in the paradigm based on incongruent stimuli.
Conclusions In summary, the semantic congruent paradigm offered better participant comfort and maintained the same recognition rate as the incongruent paradigm. Furthermore, our study suggested that the paradigm design of spellers must take both system performance and user experience into consideration rather than merely pursuing a larger ERP response.

been a study that compares congruent stimuli and incongruent semantic stimuli from the perspective of the BCI speller.
In addition to system performance, the user comfort of spellers is also becoming a higher priority for researchers. Earlier studies demonstrated the potential for physical and mental discomfort [23] during long-term use of spellers. Moreover, semantic congruency may have psychological effects on the comfort, fatigue and mental workload of users. Thus, from the perspective of ergonomics, a well-designed speller should take both system performance and user comfort into account. This study focused on this issue and compared the congruent and incongruent paradigms in terms of both behavioral data and ERPs, to gain insight into the effect of semantic congruency on spellers based on audiovisual stimuli paradigms.

Method
Participants
Eleven healthy participants (5 females) aged 22-33 (mean ± SD, 23.9 ± 1.14 years) took part in the experiment. All participants had normal hearing and normal or corrected-to-normal vision. All participants volunteered to take part and provided written informed consent prior to the experiment.

Stimuli
Two paradigms combining visual and audio stimuli were adopted in this experiment. The visual stimuli were the same in the two audiovisual paradigms, whereas the audio stimuli differed. For the semantic congruent paradigm, the pronunciation of each audio stimulus was congruent with the corresponding visual stimulus. For instance, if the current visual stimulus was the letter 'a', the corresponding audio stimulus was the syllable 'ei' (the pronunciation of the letter 'a'), as shown in Fig. 1(A1). For the semantic incongruent paradigm, the audio stimuli had little relationship with the visual stimulus letters, as shown in Fig. 1(A2). In both paradigms, each audio stimulus and the corresponding visual stimulus were presented simultaneously.
The visual stimuli consisted of four different letters, 'a', 'b', 'c', and 'd', and each letter had a unique color, as Fig. 1 shows. These letters were presented in a random sequence at the center of a 19-inch TFT screen with a refresh rate of 60 Hz. The inter-stimulus interval (ISI) between two adjacent stimuli was 200 ms, and each stimulus lasted 130 ms (the same as the duration of an audio stimulus).
The audio stimuli consisted of two conditions. For the semantic congruent condition, we selected four short spoken syllables ('ei', 'bi', 'si', and 'di') to match the visual stimuli, with 'ei' and 'si' on the left channel and 'bi' and 'di' on the right channel. For the semantic incongruent condition, the syllables 'ti', 'to', 'it', and 'ot' used in previous studies [24,25] were adopted, with 'ti' and 'it' on the left channel and 'to' and 'ot' on the right channel. The duration of each stimulus was also 130 ms, and the ISI was 200 ms in both conditions. The audio stimuli were presented through comfortably positioned in-ear headphones, as Fig. 1 shows.

Experimental procedure
The experimental paradigm was implemented in E-Prime 2.0. Participants were instructed to sit in a comfortable position and keep their eyes fixed on the center of the screen, with minimal eye movements or other muscle artifacts throughout the whole experiment. The experiment for each participant consisted of two paradigms (the semantic congruent paradigm and the semantic incongruent paradigm). Each paradigm was repeated twice, comprising four blocks. In each block, a random sequence of four single-trial stimuli was repeated ten times. The experiment lasted about 8 min, after which the subjects were asked to relax for 5 min before repeating the experiment.
At the end of the experiment, all participants were asked which stimuli paradigm they preferred in terms of comfort. Specifically, the evaluation of comfort in this study took three questions into account: which paradigm is more difficult; which paradigm makes you feel more fatigued; and which paradigm you would choose if there were another experiment. When a paradigm was chosen in two or three of the three questions, the participant was assumed to prefer this paradigm. There is currently no single evaluation method for participant comfort. Garcia et al. [26] compared the user comfort of different P300 speller configurations with the NASA Task Load Index (NASA-TLX), while Ekandem et al. [27] evaluated user comfort with the duration of BCI use. Each of these studies considered only one factor in comfort evaluation, and neither was the best choice. Our study jointly considered the difficulty of the experiment, the fatigue of participants, and their willingness to participate in the experiment again, which was a more convincing choice.
Fig. 1 A1, A2 show the stimuli of the two experiment paradigms, and B1, B2 demonstrate the experimental procedure of the two paradigms. A1, B1 represent the stimuli and procedure of the semantic congruent paradigm; A2, B2 represent the stimuli and procedure of the semantic incongruent paradigm. The coordinate axes in (B1), (B2) represent time. The first character is the target character in the following block, and 2 s is the time interval between the target prompt and the block onset in (B1), (B2). Note that each block consisted of a random sequence of the four stimuli repeated 10 times. The visual stimuli and the auditory stimuli were presented simultaneously.
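The two-out-of-three preference rule described in this section can be sketched as follows. This is only an illustration: the function name `preferred_paradigm` and the answer coding are assumptions, and each answer is assumed to have already been recoded to the paradigm it favours (for example, answering that the incongruent paradigm is more fatiguing counts in favour of the congruent one).

```python
from collections import Counter

def preferred_paradigm(answers):
    """Return the paradigm favoured in at least two of the three questions.

    `answers` is a list of three entries, each "congruent", "incongruent",
    or None when the participant reported no difference for that question.
    Returns None when no paradigm reaches two votes (hypothetical coding).
    """
    votes = Counter(a for a in answers if a is not None)
    for paradigm, n_votes in votes.items():
        if n_votes >= 2:
            return paradigm
    return None

# Example: two of three answers favour the congruent paradigm.
example = preferred_paradigm(["congruent", "incongruent", "congruent"])
```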

Data acquisition and preprocessing
The EEG signal was recorded from 64 Ag/AgCl scalp electrodes placed according to the positions of the extended International 10-20 system, and amplified by a Neuroscan NuAmp amplifier with a sampling rate of 500 Hz. During data acquisition, all channels were referenced to the nose tip, with the ground electrodes placed at the frontal area neighboring the AFz channel. Electrode impedances were kept below 10 kΩ during data acquisition.
In data preprocessing, the raw EEG data were first re-referenced to the average signal of the left and right mastoids. Next, eye potential and motion artifacts were removed by independent component analysis (ICA). After the removal of these artifacts, the data were filtered with a 0.5-40 Hz band-pass filter and down-sampled to 200 Hz for further analysis. Finally, epochs were extracted from −200 to 800 ms relative to each stimulus onset, and the baseline was removed by subtracting the mean value between −200 and 0 ms.
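The final epoching and baseline-correction step can be illustrated with a minimal single-channel sketch. It assumes the signal has already been re-referenced, cleaned, filtered, and down-sampled to 200 Hz; the function name `extract_epoch` and the plain-list data layout are illustrative assumptions, not the authors' implementation.

```python
FS = 200  # sampling rate in Hz after down-sampling, as stated in the text

def extract_epoch(signal, onset_idx, fs=FS, t_min=-0.2, t_max=0.8):
    """Cut one epoch around a stimulus onset and subtract the baseline mean.

    `signal` is a list of samples from one channel and `onset_idx` is the
    sample index of the stimulus onset (both hypothetical inputs).
    """
    n_pre = round(-t_min * fs)   # samples before onset (40 at 200 Hz)
    n_post = round(t_max * fs)   # samples after onset (160 at 200 Hz)
    start, stop = onset_idx - n_pre, onset_idx + n_post
    if start < 0 or stop > len(signal):
        raise ValueError("epoch exceeds the recording")
    epoch = signal[start:stop]
    # Baseline: mean of the pre-stimulus interval (-200 to 0 ms).
    baseline = sum(epoch[:n_pre]) / n_pre
    return [x - baseline for x in epoch]
```

With these settings each epoch holds 200 samples, of which the first 40 form the baseline window.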

Classifiability and classification algorithm
For better classification performance, the channels and the features used for offline classification were selected based on classifiability analysis. Classifiability is a parameter indicating the difference between target and non-target stimuli, and it is usually expressed by the r²-value [28], defined as

$$r^2 = \left( \frac{\sqrt{M_T M_N}}{M_T + M_N} \cdot \frac{\mathrm{mean}(X_T) - \mathrm{mean}(X_N)}{\mathrm{std}(X_T \cup X_N)} \right)^2 \tag{1}$$

where M_T and M_N are the sample sizes of the target and non-target classes, respectively, and X_T and X_N are the selected feature vectors of the target and non-target classes, respectively.
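As a rough illustration, the r²-value for a single feature can be computed as below. The sketch assumes the widely used form r = sqrt(M_T·M_N)/(M_T + M_N) · (mean(X_T) − mean(X_N))/std(X_T ∪ X_N), and it assumes the population standard deviation over the pooled samples; the function name `r_squared` is hypothetical.

```python
from statistics import mean, pstdev

def r_squared(x_target, x_nontarget):
    """Squared point-biserial-style r between target and non-target samples.

    `x_target` and `x_nontarget` hold one feature value per trial.
    Uses the population standard deviation of the pooled data (assumption).
    """
    m_t, m_n = len(x_target), len(x_nontarget)
    pooled_sd = pstdev(list(x_target) + list(x_nontarget))
    if pooled_sd == 0:
        return 0.0  # no variance at all, so the classes cannot differ
    r = ((m_t * m_n) ** 0.5 / (m_t + m_n)
         * (mean(x_target) - mean(x_nontarget)) / pooled_sd)
    return r * r
```

Under these assumptions the value lies between 0 (identical class means) and 1 (perfectly separated classes with no within-class variance).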
Usually, the classification performance of a speller depends not only on the amplitude of the ERP data, but also on the classifiability of the selected ERP features between target and non-target stimuli. Thus, the analysis of the r²-value can provide a mathematical foundation for selecting channels and the features of each channel. Specifically, ERP data from the three time intervals with the greatest average r²-values (180-280, 300-450, and 480-530 ms), down-sampled to 40 Hz, from 10 channels (Cz, Pz, CPz, Oz, PO7, PO8, FC1, FC2, FC5, and FC6) were selected as features, based on the whole dataset. Thus, 10 × 12 = 120 features for each stimulus were used for classification.
Two classic classification algorithms, support vector machine (SVM) and stepwise linear discriminant analysis (SWLDA), were implemented for the binary classification. These two algorithms have been shown to be the most effective classifiers in a previous speller study [29]. The SVM approach has particular advantages in solving small-sample, nonlinear, and high-dimensional pattern recognition problems. SVM separates two patterns with a hyperplane that has the maximum distance from both patterns. A linear kernel was chosen for the SVM model, since a previous study indicated that the linear kernel performed better than nonlinear methods such as the Gaussian kernel [29], and the penalty coefficient was optimized by tenfold cross-validation when training the SVM model. This algorithm is implemented in LibSVM [30]. SWLDA is an extension of LDA that selects the features used for calculation, giving better classification performance than LDA. To predict the target label, the input features were weighted by ordinary least-squares regression. In this way, 60 features were ultimately selected for analysis with a combination of backward and forward stepwise calculations [29,31]. In addition, three quarters of the total data were used for training the classifier, and the remaining quarter was used for testing. To investigate differences in performance between the congruent and incongruent paradigms in detail, single-trial accuracy, character accuracy, and the recognition rate with increasing repetitions were compared by bootstrap t test.
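The feature count can be checked with a short calculation: at 40 Hz, the three intervals contribute 4, 6, and 2 samples per channel. The helper name `samples_in_window` is illustrative only.

```python
def samples_in_window(start_ms, stop_ms, fs_hz):
    """Number of samples covering [start_ms, stop_ms) at sampling rate fs_hz."""
    return round((stop_ms - start_ms) * fs_hz / 1000)

# Time intervals and channels as listed in the text.
WINDOWS_MS = [(180, 280), (300, 450), (480, 530)]
CHANNELS = ["Cz", "Pz", "CPz", "Oz", "PO7", "PO8", "FC1", "FC2", "FC5", "FC6"]

per_channel = sum(samples_in_window(a, b, 40) for a, b in WINDOWS_MS)
n_features = per_channel * len(CHANNELS)
print(per_channel, n_features)  # prints: 12 120
```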
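The text does not state how single-trial classifier outputs were combined across repetitions to obtain the character recognition rate. One common approach, shown here purely as an assumption, is to sum the classifier score of each stimulus over the first k repetitions and select the stimulus with the largest total. The function name and data layout are hypothetical.

```python
def predict_character(scores_per_repetition, n_repetitions):
    """Pick the stimulus with the highest summed classifier score.

    `scores_per_repetition` is a list with one dict per repetition, mapping
    each stimulus label to its classifier score (higher = more target-like).
    Only the first `n_repetitions` repetitions are used.
    """
    totals = {}
    for rep in scores_per_repetition[:n_repetitions]:
        for stimulus, score in rep.items():
            totals[stimulus] = totals.get(stimulus, 0.0) + score
    return max(totals, key=totals.get)

# Made-up scores for the four letters over three repetitions.
reps = [
    {"a": 0.1, "b": 0.5, "c": 0.0, "d": -0.2},
    {"a": 0.9, "b": 0.1, "c": 0.0, "d": 0.0},
    {"a": 0.4, "b": 0.2, "c": 0.1, "d": 0.0},
]
decisions = [predict_character(reps, k) for k in (1, 2, 3)]
```

In this made-up example the decision changes from 'b' after one repetition to 'a' once more repetitions are accumulated, which is the kind of effect a recognition-rate-versus-repetition curve summarizes.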

Statistical method
Bootstrapping-based t tests and ANOVAs were performed to analyze the effect of semantic congruency on system performance and brain activity. As a statistical method introduced by Efron [32], the bootstrapping approach does not depend on the normality of the sample distribution, which is a significant advantage over traditional parametric statistical methods. Specifically, bootstrapping approximates the distribution of the parent population by repeatedly resampling the original samples, usually with 1000 iterations. In this way, the confidence interval at a given significance level can be obtained, which helps evaluate whether the difference between two conditions is significant. In addition, false discovery rate (FDR) correction was performed for multiple comparisons. Further background information about the features and advantages of bootstrapping analysis is provided in Hesterberg et al. [33].
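The resampling idea and the FDR step can be sketched as follows. This is a simplified percentile bootstrap of the mean paired difference together with the Benjamini-Hochberg procedure; the exact bootstrap statistic and FDR variant used in the study are not specified in the text, so both choices, as well as the function names, are assumptions.

```python
import random

def bootstrap_ci_mean_diff(a, b, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences a[i] - b[i]."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Resample the paired differences with replacement n_iter times.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_iter)
    )
    lower = means[int((alpha / 2) * n_iter)]
    upper = means[int((1 - alpha / 2) * n_iter) - 1]
    return lower, upper  # the difference is "significant" if 0 lies outside

def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank  # largest rank passing the BH threshold
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected
```

A per-subject comparison would pass the 11 paired values of the two paradigms as `a` and `b`; for the electrode-by-time comparisons, the resulting p values would then go through `benjamini_hochberg`.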

Result
Behavior data and participant comfort record
Figure 2 shows the grand-averaged character recognition rate. Figure 2a shows the recognition rate as a function of the number of repetitions, and Fig. 2b shows the p values of the t test between the recognition rates of the two paradigms. As depicted in Fig. 2a, the character recognition rate increased with the number of repetitions for both the SWLDA and SVM classification algorithms. For further analysis, the bootstrapping t test was performed to compare the recognition rate between the incongruent and congruent paradigms for the two classification methods. As shown in Fig. 2b, the t test results do not reach the significance level of 0.05, indicating no significant difference between the congruent and incongruent paradigms in the character recognition rates obtained by SWLDA and SVM.
The offline single-trial classification accuracy was determined by SWLDA and SVM, and the paradigm preferences of the 11 subjects were recorded. As Table 1 shows, for the incongruent stimuli paradigm, the average single-trial accuracies across the 11 subjects obtained by SWLDA and SVM are 70.1 and 69.3%, respectively. For the congruent stimuli paradigm, the corresponding average single-trial accuracies are 67.8 and 67.6%, respectively. According to Table 1, there is little difference between the two paradigms in the single-trial classification accuracies obtained by either SWLDA or SVM, which was verified by a 1000-iteration bootstrapping t test. The single-trial accuracies of the two paradigms were compared for SWLDA (t(10) = 1.66, p = 0.12) and for SVM (t(10) = 1.06, p = 0.316). Overall, there is no significant difference between the paradigms in single-trial classification accuracy. Additionally, the recorded comfort information indicates that most subjects preferred the semantic congruent paradigm: 8 of 11 felt more comfortable under the congruent paradigm, two found no difference between the two paradigms, and only one subject preferred the incongruent condition.
Figure 3a shows the significant p values over time for 62 scalp electrodes from the bootstrapping comparison between the ERPs of the congruent and incongruent paradigms. ERPs evoked by both target and non-target stimuli were compared, and the result was corrected by FDR, since hundreds of comparisons were performed simultaneously. To examine the differences of some ERP components in different brain regions, scalp maps at four specific time points, 200, 350, 440, and 620 ms (representing N2, P3, N4, and P5, respectively), are shown at the bottom of Fig. 3a. According to the figure, the main conclusions can be summarized as follows: in terms of spatio-temporal distribution, the paradigm type had a great influence on the brain response for both target and non-target stimuli.
In detail, for non-target stimuli, significant differences were mainly observed in the posterior brain area for the time interval of 190-210 ms, the whole brain area for the time intervals of 275-300 and 490-520 ms, and the anterior brain area for the time interval of 310-340 ms. For target stimuli, significant differences were found mainly in the 200-220 ms interval for the whole brain area, 280-300 ms for the anterior brain area, and 390-410 ms for the posterior brain area. In terms of the scalp maps, the main ERP components differed in different brain areas between target and non-target stimuli. Specifically, for non-target stimuli, there were significant differences in the whole brain area for N2 and P3, and in the posterior brain area for P5. For target stimuli, there were significant differences in almost the whole brain area for N2, the anterior and central brain areas for P3, and the posterior brain area for N4.
Figure 3b depicts the grand-averaged ERP waveforms of the 11 participants at the selected electrodes CZ, POZ, and FC6 for both target and non-target stimuli. As the main ERP components evoked by the oddball paradigm, N2 and P3 can be observed clearly in Fig. 3b. For further analysis, the amplitude and latency of the N2 and P3 components at the three selected electrodes were extracted, and 1000 iterations of the bootstrap t test were performed to compare them between the two paradigms. As shown in Fig. 4, there were significant differences in amplitude for N2 at electrode CZ (t(10) = −2.56, p = 0.03), and for P3 at electrodes CZ (t(10) = 2.82, p = 0.02), POZ (t(10) = 2.31, p = 0.04), and FC5 (t(10) = 3.24, p = 0.01), but no significant difference in latency was observed for either N2 or P3 at the three electrodes.

Classifiability comparison
Calculated using formula (1), the classifiabilities between target and non-target stimuli over time for all 62 electrodes are depicted in Fig. 5a. Classifiabilities were analyzed for both paradigms based on the whole dataset. The average values for the three time intervals selected as features (180-280, 300-450, and 480-530 ms) are depicted as scalp maps in the bottom panel of Fig. 5a. Two major conclusions can be drawn from the data shown in Fig. 5a:
(1) In terms of the spatio-temporal distribution of classifiability, higher classifiability values occur in the time interval of 200-500 ms for both paradigms.
(2) In terms of the scalp maps, higher classifiability values were observed in the posterior brain area for 180-280 ms and in the central brain area for 300-450 ms, for both paradigms. In addition, for the time interval 480-530 ms, higher values were observed in the right and left posterior brain areas for the incongruent paradigm, and only in the posterior brain area for the congruent paradigm.
Figure 5b depicts the bootstrapping t test results for the classifiabilities of the incongruent and congruent paradigms. FDR correction was performed, as hundreds of comparisons were made simultaneously. As Fig. 5b shows, only a few significant differences are observed, around electrode PZ at approximately 370 ms. Beyond that, there was little difference in time or space, so classifiability can account for the comparison result of the offline classification accuracy.

Discussion
The ERP-based speller is one of the most stable communication systems for patients with severe neuromuscular diseases. However, only a few studies have investigated factors that may affect the system performance or user comfort of these systems. In this study, the effect of semantic congruency on audiovisual BCI was investigated in terms of overall system performance and participant comfort. Furthermore, high-density electrical mapping of ERPs was analyzed to explain the obtained results.
First, the t test result for offline classification accuracy suggested that semantic congruency between auditory and visual stimuli had little effect on system performance. However, we found that significantly larger P300 waveforms were obtained with the incongruent paradigm than with the congruent paradigm, which contrasts with earlier findings that larger P300 waveforms lead to higher accuracies [34,35]. Additionally, semantic congruency had a significant influence on participant comfort, since 8 of 11 participants reported that they felt more comfortable in the semantic congruent paradigm. The semantic incongruent paradigm was generally more complex; thus, participants had to maintain greater focus and attention, with stronger brain activity, during an incongruent paradigm experiment than during a congruent one. Naturally, a sense of discomfort gradually emerges as the experiment goes on. Furthermore, the results of the ERP analysis were consistent with this inference. The bootstrapping results in Fig. 3a indicated significant differences around 280-350 ms in almost the whole brain area. Additionally, the main ERP components, P3 and N2, were extracted and compared in Fig. 4. Previous results indicated that the N2 component is associated with the interaction of auditory and visual processing [36]. The P3 component was found to have a close relationship with workload and attention allocation for a given task [37,38]. Therefore, a higher P3 amplitude implies stronger brain activity. These conclusions directly supported our experimental results.
As mentioned above, increasing attention has been paid to user comfort in BCI research, making it an important indicator for BCI evaluations and, more generally, for human-computer interface (HCI) evaluations. Kübler et al. adapted the user-centered design (UCD) concept to BCI research and development, and assessed user satisfaction with questionnaires and visual-analogue scales [39]. Ekandem et al. compared user comfort, experiment time, and preparation time between two different BCI devices from an ergonomic perspective [27]. However, the quantification of user comfort remains a challenging problem because no general evaluation method has been published. The results of our research indicated that a larger P3 amplitude corresponds to poorer participant comfort, while a smaller P3 amplitude corresponds to better participant comfort. This phenomenon revealed a potential relationship between ERP amplitude and participant comfort, which suggests that the evaluation of user comfort in BCI might be studied from the perspective of physiological parameters.
In addition, our research found that a larger ERP amplitude did not lead to higher system performance. A large part of the reason may be that the brain response elicited by non-target stimuli increased together with the brain response to target stimuli, as shown in Fig. 3b. This is an interesting phenomenon and an important research issue, because it indicates that non-target stimuli can also elicit a steady brain response that differs between stimulus paradigms. Furthermore, these findings can provide some suggestions for the future design of speller paradigms. Specifically, the design of speller paradigms should not focus solely on obtaining a large ERP amplitude. First, a larger ERP amplitude may not lead to higher system performance. Second, better system performance and better user comfort may not be achievable at the same time. In other words, the design of a speller paradigm must take both system performance and user comfort into consideration, and keep a balance between the two factors.
Finally, it should be noted that only an offline experiment was implemented in this study, and our results lack a criterion or a detailed scale for participant comfort evaluation. To further explore the difference between the two paradigms, future studies should include an online experiment and a detailed participant comfort evaluation scale.

Conclusion
In conclusion, this study was designed to investigate the effects of semantic congruency between auditory and visual stimuli in an audiovisual speller. Behavioral data and ERP data were recorded for analysis and comparison. The results suggested that although congruency between auditory and visual stimuli in an audiovisual BCI speller had no significant effect on system performance, it had a great influence on participant comfort and on the brain activity of participants. Furthermore, our study suggested that the paradigm design of spellers must take both system performance and user experience into consideration, rather than merely pursuing a larger ERP response.