### Data acquisition

A g.tec EEG system was employed for EEG data acquisition using 16 EEG electrodes positioned at locations Fp1, Fp2, F3, F4, Fz, Fc1, Fc2, Cz, C1, C2, Cp3, Cp4, P1, P2, P5, and P6. Reference electrode was placed over left mastoid. The collected data were sampled with 256 samples/s sampling rate. Moreover, data were filtered using the g.tec amplifier’s bandpass filter (corner frequencies: 2 and 62 Hz) and the amplifier’s notch filter with 58 and 62 Hz corner frequencies.

fTCD data collection was performed using a SONARA TCD system with two 2 MHz transducers placed on the right and left sides of the transtemporal window which is located above the zygomatic arch [17]. Given that middle cerebral arteries (MCAs) are responsible of approximately 80% of brain blood perfusion [18], the fTCD depth was set to the depth of the mid-point of the MCAs which is 50 mm [19].

### Presentation paradigms

We designed two different presentation paradigms to be used with the proposed hybrid BCI. The first paradigm employed motor imagery (MI) tasks while the other paradigm used flickering mental rotation (MR) and word generation (WG) tasks as shown in Fig. 7. For both paradigms, while acquiring EEG and fTCD simultaneously, two tasks and a fixation cross that represents the baseline were presented on the screen. Total of 150 trials were presented to each user and during each trial, a vertical arrow randomly selected one of the three visual icons representing the two tasks and the baseline. The vertical arrow pointed to the selected icon for 10 s and the user was asked to perform the task identified by that arrow until the arrow points to another visual icon.

During the MI-based presentation scheme, a basic MI task was presented to the users as shown in Fig. 7a. In particular, a horizontal white arrow that points to the right represented right arm MI while a horizontal white arrow that points to the left represented left arm MI. The baseline was represented by the fixation cross shown in the middle [20].

During MR/WG presentation paradigm, since MR and WG tasks are known to be differentiated using fTCD only, to make them differentiable in terms of EEG, the visual icons of MR and WG tasks were textured with a flickering checkerboard pattern as seen in Fig. 7b and they flickered at 7 Hz and 17 Hz, respectively, to induce different SSVEPs in EEG [21]. During WG task, the user was asked to silently generate words starting with the letter shown on the screen while during MR task, the user was given two 3D shapes and was asked to mentally rotate one of these shapes and decide if they were identical or mirrored.

### Participants

The local Institutional Review Board (IRB) of University of Pittsburgh approved all the study procedures (IRB number: PRO16080475). All the subjects were consented before starting the experiment. A total of 21 healthy individuals participated in this study. In particular, to assess flickering MR/WG paradigm, data were collected from 11 individuals (3 females and 8 males) with ages ranging from 25 to 32 years while, to test MI paradigm, data were collected from 10 subjects (4 males and 6 females) with ages ranging from 23 to 32 years. None of the subjects participated in the study had a history of heart murmurs, concussions, migraines, strokes, or any brain-related injuries. Each subject attended one session that included 150 trials and each trial lasted for 10 s.

### Feature extraction

In this section, we describe our feature extraction approaches applied to EEG and fTCD signals collected using both MI and flickering MR/WG paradigm.

#### EEG

##### MI paradigm

We employed common spatial pattern (CSP) to analyze EEG MI data [6]. CSP is known to be an efficient feature extraction technique for MI-based EEG BCIs as it can extract EEG spatial patterns that characterize different MI tasks [22]. CSP aims at finding a linear transformation that changes the variance of the observations representing two different classes such that the two classes are more separable [23]. More specifically, CSP learns the optimal spatial filters that result in maximizing the variance of one class in a certain direction, and in the mean time, minimize the variance of the second class in the same direction [24]. These filters can be found by solving the following optimization problem:

$$\text{max} _{W} {\text{tr}}\;W^{T} \varSigma_{c} W$$

(1)

$${\text{s}}.{\text{t}}. \, W^{T} \left( {\varSigma_{\left( + \right)} + \varSigma_{\left( - \right)} } \right)W = 1,$$

where \(\varSigma_{c}\) is the average trial covariance for class \(c \epsilon \left\{ { + , - } \right\}\) and \(W\) the transformation matrix.

Assume each trial data are represented as a matrix \(R^{N \times T}\) where \(N\) is the number of EEG electrodes and \(T\) represents the number of the samples for each electrode. Sample covariance of each trial \(m\) can be calculated as follows:

$$S_{m} = \frac{{{\text{RR}}^{T} }}{{{\text{tr}}\left( {{\text{RR}}^{T} } \right)}}.$$

(2)

Using (2), the average trial covariance can be calculated as given below

$$\sum_{c} = \frac{1}{M}\mathop \sum \limits_{m = 1}^{M} S_{m} ,$$

(3)

where \(M\) is the number of trials belonging to class \(c\).

(1) is solved through simultaneously diagonalizing the covariance matrices \(\varSigma_{c}\) which can be represented as follows:

$$\begin{aligned} W^{{}} \varSigma_{\left( + \right)} W = \varLambda_{\left( + \right)} \hfill \\ W^{T} \varSigma_{\left( - \right)} W = \varLambda_{\left( - \right)} , \hfill \\ {\text{s}}.{\text{t}}. \, \varLambda_{\left( + \right)} + \varLambda_{\left( - \right)} = I \hfill \\ \end{aligned}$$

(4)

where \(\varLambda_{c}\) is a diagonal matrix with eigenvalues \(\lambda_{j}^{c} ,\)\(j = 1,2,3, \ldots N\) on its diagonal.

Solution of (4) is similar to the solution of the generalized eigenvalue problem below:

$$\varSigma_{\left( + \right)} w_{j} = \lambda \varSigma_{\left( - \right)} w_{j}$$

(5)

where \(w_{j}\) is the \(j{\text{th}}\) generalized eigenvector and \(\lambda = \frac{{\lambda_{j}^{\left( + \right)} }}{{\lambda_{j}^{\left( - \right)} }}\). (4) is satisfied when the transformation matrix is equivalent to \(W = \left[ {w_{1} ,w_{2} , \ldots w_{N} } \right]\) and \(\lambda_{j}^{c}\) is given by

$$\lambda_{j}^{c} = w_{j}^{T} \varSigma_{c} w_{j}$$

(6)

where \(\lambda_{j}^{c}\) are the elements on diagonal of \(\varLambda_{c}\). \(\lambda_{j}^{\left( + \right)} + \lambda_{j}^{\left( - \right)} = 1\), since \(\varLambda_{\left( + \right)} + \varLambda_{\left( - \right)} = I\).

It can be noted that a higher value of \(\lambda_{j}^{\left( + \right)}\) will result in a higher variance in the data representing class \(c = +\) when filtered using \(w_{j}\). Given that a high value of \(\lambda_{j}^{\left( + \right)}\) results in a low \(\lambda_{j}^{\left( - \right)}\) value, when filtering the data of class \(c = -\) using \(w_{j}\), a low variance will be obtained. In this study, we solved 3 binary MI selection problems by considering different numbers of eigenvectors. More specifically, MI EEG data were spatially filtered using 1, 2, …., and 8 eigenvectors from both ends of the transformation matrix \(\left( W \right)\). For each trial, log variance of each filtered signal was computed and considered as a feature.

##### MR/WG paradigm

As explained before in our previous study, we used template matching to extract features from EEG data [5]. More specifically, for each class, since each trial is represented by 16 EEG segments collected from 16 electrodes, we extract 16 templates corresponding to the 16 EEG electrodes by averaging EEG training trials over each electrode. To extract features representing each trial, cross-correlations between the segments of that trial and the corresponding 16 templates representing each class were calculated. Maximum cross-correlation score across each of 16 cross-correlations was considered as a feature resulting in a total of 16 features. Given that the problems of interest are binary classification problems, the feature vector representing each trial contained a total of 32 features.

#### fTCD

5-level wavelet decomposition [25] was used to analyze the two fTCD data segments of each trial with Daubechies 4 mother wavelet. To decrease the fTCD feature vector dimensions, instead of considering each wavelet coefficient as a feature, we calculated statistical features for each of the 6 wavelet bands resulting from the wavelet analysis. These features included mean, variance, skewness, and kurtosis [26, 27]. Therefore, each trial was represented by 24 features for each fTCD data segment and a total of 48 features.

### Feature selection and projection

Wilcoxon rank-sum test [28] with a *p* value of 0.05 was employed for the selection of the significant features from both EEG and fTCD feature vectors of MR/WG paradigm while it was used to select only fTCD significant features of MI paradigm. As for MI EEG, the feature vector representing a certain trial was composed of \(2f\) features obtained through transforming the data of the trial using \(f =\) 1, 2, …., and 8 eigenvectors from both ends of the transformation matrix \(\left( W \right)\).

EEG and fTCD feature vectors of each trial were then projected separately into 2 SVM scalar scores (EEG and fTCD evidences). To evaluate the performance of both MI and MR/WG paradigms, these evidences were combined under the Bayesian fusion approach explained in “Bayesian fusion and decision making” section. Performance of the MI hybrid system was evaluated using \(2f\) (2, 4,…, and 16) EEG CSP features. The highest performance measures obtained with and without transfer learning were reported and compared in the results section while for MR/WG paradigm, performance measures with and without transfer learning were calculated and compared only at *p* value of 0.05.

### Bayesian fusion and decision making

We developed a Bayesian fusion approach of EEG and fTCD evidences to infer user intent at a given trial considering three different assumptions [5, 6]. Under assumption (\(A1\)), EEG and fTCD evidences are assumed to be jointly distributed while under assumption (\(A2\)), EEG and fTCD evidences are assumed to be independent. Under assumption (\(A3\)), evidences of EEG and fTCD are assumed to be independent, but they contribute unequally toward taking a right decision. For each binary selection problem, tenfold cross validation was used to define training and testing trials. Our previous work showed that the best performance was achieved under assumption \(A3\) for MI paradigm; therefore, for MI, we utilized the assumption *A*3 in this paper [6]. On the other hand, for flickering WG/MR paradigm, *A*2 and \(A3\) both had high performance without any statistically significant differences [5]. However, \(A3\) is more computationally complex compared to \(A2\); therefore, for WG/MR paradigm, we performed probabilistic fusion under assumption \(A2\) [5].

Given that \(N\) trials are introduced to each participant, these trials are represented by a set of EEG and fTCD evidences \(Y = \left\{ {y_{1} , \ldots y_{N} } \right\}\) where \(y_{k} = \left\{ {e_{k} , f_{k} } \right\}\), \(e_{k}\) and \(f_{k}\) are EEG and fTCD evidences of a test trial \(k\). User intent \(x_{k}\) for the test trial \(k\) can be inferred through joint state estimation using EEG and fTCD evidences which can be represented as follows:

$$\widehat{{x_{k} }} = \arg \mathop {\text{max} }\limits_{{x_{k} }} p\left( {x_{k} |Y = y_{k} } \right)$$

(7)

where \(p(x_{k} |Y)\) is the state posterior distribution conditioned on the observations \(Y\). Using Bayes rule, (7) can be rewritten as

$$\widehat{{x_{k} }} = \arg \mathop {\text{max} }\limits_{{x_{k} }} p(Y = y_{k} |x_{k} ) p\left( {x_{k} } \right)$$

(8)

where \(p(Y|x_{k} )\) is the state conditional distribution of the measurements \(Y\) and \(p\left( {x_{k} } \right)\) is the prior distribution of user intent \(x_{k}\). Since the trials are randomized, \(p\left( {x_{k} } \right)\) is assumed to be uniform. Therefore, (8) can be reduced to

$$\widehat{{x_{k} }} = \arg \mathop {\text{max} }\limits_{{x_{k} }} p(Y = y_{k} |x_{k} ).$$

(9)

\(p(Y|x_{k} )\) of each class can be estimated using the EEG and fTCD evidences computed for the training trials. To infer user intent at a test trial \(k\), Eq. (9) is solved at \(Y = y_{k} .\) Here, the distributions \(p(Y|x_{k} = 1)\) and \(p(Y|x_{k} = 2)\) are evaluated under two assumptions as explained below.

#### Assumption 2: independent distributions

Here, the evidences of EEG and fTCD, conditioned on \(x_{k}\), are assumed to be independent. Therefore, (9) can be rewritten as

$$\widehat{{x_{k} }} = \arg ,\mathop {\text{max} }\limits_{{x_{k} }} p(e = e_{k} |x_{k} )p(f = f_{k} |x_{k} )$$

(10)

where \(p(e|x_{k} )\) and \(p(f|x_{k} )\) are the distributions of EEG and fTCD evidences conditioned on the state \(x_{k}\) respectively. To find \(p(e|x_{k} )\) and \(p(f|x_{k} )\), kernel density estimation (KDE) with Gaussian kernel was employed using evidences of EEG and fTCD of the training trials. Kernel bandwidth was computed using Silverman’s rule of thumb [29]. \(e_{k}\) and \(f_{k}\) are plugged in (10) to infer the user intent of a test trial \(k\) where the user intent \(x_{k}\) that maximizes the likelihood is selected.

#### Assumption 3: weighted independent distributions

Here, we assume that evidences of EEG and fTCD are independent, but they contribute unequally toward taking a right decision. Therefore, we propose weighting \(p(e|x_{k} )\) and \(p(f|x_{k} )\) conditional distributions with weights of \(\alpha\) and \(1 - \alpha\), respectively. (9) can be rewritten as

$$\widehat{{x_{k} }} = \arg ,\mathop {\text{max} }\limits_{{x_{k} }} p(e = e_{k} |x_{k} )^{\alpha } p(f = f_{k} |x_{k} )^{1 - \alpha }$$

(11)

where \(\alpha\) is a weighting factor ranging from 0 to 1. \(p(e|x_{k} )\) and \(p(f|x_{k} )\) are computed as mentioned in “Assumption 2: independent distributions” section. Finding the optimal *α* value is performed through applying a grid search over \(\alpha\) values ranging between 0 and 1 with a step of 0.01.

### Transfer learning algorithm

With the aim of decreasing calibration requirements and improving the performance of the hybrid system, we propose a transfer learning approach that identifies the top similar datasets collected from previous BCI users to a training dataset collected from a current BCI user and uses these datasets to augment the training data of the current BCI user. The proposed transfer learning approach is intended to be used for both MI and flickering MR/WG paradigms. Therefore, the performance of the proposed approach was tested using the 6 binary selection problems of both paradigms.

#### Similarity measure

To apply transfer learning to a certain binary selection problem, for each dataset from previous BCI users, EEG and fTCD feature vectors of trials corresponding to that problem were projected into scalar SVM scores. Therefore, each trial was represented by a scalar EEG SVM score and a scalar fTCD SVM score. Using KDE, 2 EEG class conditional distributions and 2 fTCD class conditional distributions were learnt from these scores. KDE was performed using Gaussian kernel. EEG and fTCD class conditional distributions of the current BCI user were also estimated using his/her training trials.

To measure the similarity between the class conditional distributions of the current BCI user and those of the previous users, Bhattacharyya distance [30], given by (12), was used since it is a symmetric measure that can be applied to general distributions especially if these distributions are diverging from normal distributions and it provides bounds on Bayesian misclassification probability, which overall fits very well to our approach of making Bayesian decisions on binary classification problems using the estimated density functions.

$$d = - \ln \mathop \sum \limits_{i = 1}^{N} P_{i} Q_{i} ,$$

(12)

where \(P\) and \(Q\) are 2 probability distributions and \(N\) is the number of points composing each distribution.

Bhattacharyya distance between EEG class conditional distribution of class \(i \left( {i = 1, 2} \right)\) and the corresponding EEG class conditional distribution of the current BCI user was calculated. Bhattacharyya distance was also calculated between the fTCD class conditional distributions of each previous BCI user and the current BCI user. Sum of these 4 distances (2 EEG distances and 2 fTCD distances) represented the total distance between the current BCI user and a certain previous BCI user.

#### Proposed transfer learning algorithm

The proposed transfer learning approach is described in detail in Figs. 8 and 9. Given a group of previous BCI users where each user is represented by one dataset, the objective is to find the most similar datasets to the training dataset of the current BCI user and to combine the trials from these datasets with small number of training trials from the current user to train a classifier that can predict the labels of the test trials of that user with high accuracy. In particular, for each binary selection problem, the dataset of the current user was divided into training and testing sets. Initially, given that each binary selection problem is represented by 100 trials, we used the first 10 trials from the current BCI user for training the prediction model and the remaining 90 trials for testing. As seen in Fig. 8, features are extracted from training trials of the current user as well as the trials corresponding to the binary problem of interest from each of the previous BCI users. Extracted EEG and fTCD features vary depending on the paradigm used for data collection. In particular, CSP and wavelet decomposition were used to extract features from the data of the MI paradigm while template matching and wavelet decomposition were used to extract features from the data of the flickering MR/WG paradigm as explained in “Feature extraction” section. After applying the feature selection step detailed in “Feature selection and projection” section, EEG and fTCD feature vectors of each trial were projected into 2 scalar SVM scores.

For each class within the binary selection problem of interest, we learnt class conditional distributions of the EEG and fTCD scores obtained from SVM projection as seen in Fig. 8. Distance between class conditional distributions of the current BCI user and those of each of the previous BCI users was computed as explained in “Similarity measure” section. To identify the top similar datasets, these distances were sorted ascendingly. At this point, it was required to decide on how many similar datasets should be considered to train the classifier besides the training trials from the current BCI user. Here, we considered a maximum of 3 datasets to be combined with the training trials of the current BCI user. Through crossvalidation, the number of top similar datasets that maximize the performance accuracy when combined with the training trials of the current user was chosen to be used later to predict test trials of the current BCI user as shown in Fig. 9. Here, for each participant, we used up to 3 datasets to be used for transfer learning. However, the maximum number of datasets could be increased or decreased depending on the needs of the designers. Moreover, the presented framework could be used to identify person-specific maximum number of datasets. For future versions of this algorithm, instead of using a maximum of 3 datasets to be combined with the training trials of the current BCI, such number can be optimized for each subject separately by means of model order selection techniques [31].

To study the impact of the training set size (from the current BCI user) on the performance of the proposed transfer learning approach, we applied the proposed approach on training sets of size ranging from 10 to 90 trials which corresponds to test sets of size ranging from 90 to 10 trials.

### Performance evaluation

For both MI and flickering MR/WG paradigms, to assess the significance of the transfer learning (TL) compared to the no transfer learning case (NTL), for each participant, accuracy and information transfer rate (ITR) [32] were calculated and compared at different number of training trials from the current BCI user. In particular, at every number of training trials, accuracy and ITR were calculated at time points 1, 2….,10 s. For each number of training trials, maximum accuracy and ITR across the 10-s trial length were reported for TL and NTL cases. ITR can be calculated as follows:

$$B = \log_{2} \left( N \right) + P\log_{2} \left( P \right) + \left( {1 - P} \right)\log_{2} \left( {\frac{1 - P}{N - 1}} \right)$$

(13)

where *N* represents the number of BCI selections, *P* represents the classification accuracy, and *B* is the information transfer rate per trial.

To compute the reduction in calibration requirements for each binary problem when using TL compared to NTL case, at each training set size, we formed a vector containing performance accuracies obtained for all participant at that training set size. We statistically compared the accuracy vectors of TL at training set sizes of 10, 20…,90 with accuracy vector obtained for NTL case at maximum training set size (90 trials). Initially, at 10 training trials, we performed one-sided Wilcoxon signed rank test between the accuracy vector of TL with 10 training trials and NTL accuracy vector at 90 training trials. Such statistical comparison is repeated with TL applied at bigger training set sizes until there is no statistically significant difference between the performance of TL and the performance of NTL at 90 trials. The number of trials \(N\) at which that statistical insignificance occurs is used in (14) to compute percentage of reduction.

$${\text{Reduction}}\% = \frac{1}{P}\mathop \sum \limits_{i = 1}^{P} \frac{{{\text{Calibration length}}_{\text{NTL}} \left( i \right) - {\text{Calibration}}\;{\text{length}}_{\text{TL}} \left( i \right)}}{{{\text{Calibration length}}_{\text{NTL}} \left( i \right)}} \times 100\% .$$

(14)

Equation (14) is equivalent to

$${\text{Reduction\% }} = \frac{1}{P}\mathop \sum \limits_{i = 1}^{P} \frac{{N \times {\text{Trial length}}_{N} \left( i \right)_{\text{NTL}} - m \times {\text{Trial length}}_{m} \left( i \right)_{\text{TL}} }}{{N \times {\text{Trial length}}_{N} \left( i \right)_{\text{NTL}} }} \times 100{\text{\% }}$$

(15)

where \(N\) is the maximum number of training trials (\(N\) = 90) from the current BCI user and \(m\) is the minimum number of trials at which TL performance is at least equivalent to NTL performance where \(m\) ranges from 10 to 90 trials.

To guarantee that TL will improve or at least achieve the same average performance accuracy obtained for the NTL case, we checked if the TL average performance accuracy at \(m\) training trials was similar to or outperforms the average performance accuracy of NTL case at 90 training trials. If this condition is not satisfied, we consider statistical comparisons at training set sizes \(> m\) until this condition is satisfied.