2.1 Data collection
In this paper, we reexamine data from a subset of participants originally reported in [18]. Briefly, we recruited 30 patients (aged 65.47 ± 13.4 years, 15 male) with suspicion of neurogenic dysphagia who were referred to routine videofluoroscopic examination at one of two local hospitals. Patients had dysphagia secondary to stroke, acquired brain injury, neurodegenerative disease, and spinal cord injury. Research ethics approval was obtained from both participating hospitals.
The data collection setup is shown in Figure 1.
Sagittal plane videofluoroscopic images of the cervical region were recorded to computer at a nominal 30 frames per second via an analog image acquisition card (PCI1405, National Instruments). Each frame was marked with a timestamp via a software frame counter. A dualaxis accelerometer (ADXL322, Analog Devices) was taped to the participants neck at the level of the cricoid cartilage. The axes of the accelerometer were aligned to the anatomical anteriorposterior (AP) and superiorinferior (SI) axes. Signals from both the AP and SI axes were passed through separate preamplifiers each with an internal bandpass filter (Model P55, Grass Technologies). The cutoff frequencies of the bandpass filter were set at 0.1 Hz and 3 kHz. The amplifier gain was 10. The signals were then sampled at 10 kHz using a data acquisition card (USB NI6210, National Instruments) and stored on a computer for subsequent analyses. A trigger was sent from a custom LabView virtual instrument to the image acquisition card to synchronize videofluoroscopic and accelerometric recordings. The above instrumentation settings replicate those of previous dualaxis swallowing accelerometry studies [7, 9, 13–15, 19, 20].
Each participant swallowed a minimum of two or a maximum of three 5 mL teaspoons of thin liquid barium (40%w/v suspension) while his/her head was in a neutral position. The number of sips that the participant performed was determined by the attending clinician. The recording of dualaxis accelerometry terminated after the participant finished his/her swallows. However, the participant's speechlanguage pathologist continued the videofluoroscopy protocol as per usual. In total, we obtained 224 individual swallowing samples from the 30 participants, 164 of which were labeled as unsafe swallows (as defined below) and 60 as safe swallows.
2.2 Data segmentation
To segment the data for analysis, a speechlanguage pathologist reviewed the videofluoroscopy recordings. The beginning of a swallow was defined as the frame when the liquid bolus passed the point where the shadow of the mandible intersects the tongue base. The end of the swallow was identified as the frame when the hyoid bone returned to its rest position following bolus movement through the upper esophegeal sphincter. The beginning and end frames as defined above where marked within the video recording using a custom C++ program. The cropped video file was then exported together with the associated segments of dualaxes acceleration data. An unsafe swallow was defined as any swallow without airway clearance. Typically, this would include penetration and aspiration. Residue would be considered a situation of swallowing inefficiency that is not unsafe swallowing unless the residue was subsequently aspirated. Backflow is extremely rare in the oropharynx, and would only be classified as unsafe should it lead to penetrationaspiration. This definition of unsafe swallowing is in keeping with the industry standard PenetrationAspiration Scale [21].
2.3 PreProcessing
It has been shown in [12] that the majority of signal power in swallowing vibrations of healthy adults lies below 100 Hz. However, given that we were dealing with patient data, we estimated the bandwidth of each of the 224 swallows as the spectral range from 0 Hz up to the frequency at which 95% of the signal energy was captured. We obtained average bandwidths of 175 ± 73 Hz and 226 ± 84 Hz for the AP and SI axes, respectively. Moreover, spectral centroids were < 70 Hz in both axes, suggesting that there is no appreciable signal energy beyond a few hundred Hz. Therefore, we downsampled all signals to 1 kHz. Vocalization was removed from each segmented swallow according to the normalized crosscorrelation periodicity detector proposed in [7]. Whitening of the accelerometry signals to account for instrumentation nonlinearities was achieved using inverse filtering and autoregressive modeling [15]. Finally, the signals were denoised using a Daubechies8 wavelet (8db) transform with soft thresholding. As detailed in [20], both the decomposition level and the wavelet coefficients were chosen to minimize the reconstruction error within a reduced wavelet subspace. Figures 2 and 3 exemplify preprocessed safe and unsafe swallowing signals, respectively.
2.4 Feature Extraction
Let S be a preprocessed acceleration time series, S = {s
_{2}, s
_{2}, ..., s
_{
n
}}. As in previous accelerometry studies, signal features from multiple domains were considered [13, 16]. The different genres of features are summarized below.

1.
Time Domain Features

The sample mean is an unbiased estimation of the location of a signal's amplitude distribution and is given by,
{\mu}_{s}=\frac{1}{n}\sum _{i=1}^{n}{S}_{i}.
(1)

The variance of a distribution measures its spread around the mean and reflects the signal's power. The unbiased estimation of variance can be obtained as
{\sigma}_{s}^{2}=\frac{1}{n1}\sum _{i=1}^{n}{\left({S}_{i}{\mu}_{s}\right)}^{2}.
(2)

The median is a robust location estimate of the amplitude distribution. For the sorted set S, the median can be calculated as
MED\left(s\right)=\left\{\begin{array}{cc}\hfill {S}_{v}+1,\hfill & \hfill \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}n=2v+1;\hfill \\ \hfill \frac{{s}_{v}+{s}_{v+1}}{2},\hfill & \hfill \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}n=2v.\hfill \end{array}\right.
(3)

Skewness is a measure of the symmetry of a distribution. This feature can be computed as follows.
{\gamma}_{1,s}=\frac{\frac{1}{n}{\sum}_{i=1}^{n}{\left({S}_{i}{\mu}_{s}\right)}^{3}}{{\left(\frac{1}{n}{\sum}_{i=1}^{n}{\left({S}_{i}{\mu}_{s}\right)}^{2}\right)}^{1.5}}.
(4)

Kurtosis reflects the peakedness of a distribution. A high kurtosis value indicates a distribution with a sharp, narrow peak and heavy tails while a low kurtosis value signifies a distribution with a flattened peak and thin tails. This feature was computed as:
{\gamma}_{2,s}=\frac{\frac{1}{n}{\sum}_{i=1}^{n}{\left({S}_{i}{\mu}_{s}\right)}^{4}}{{\left(\frac{1}{n}{\sum}_{i=1}^{n}{\left({S}_{i}{\mu}_{s}\right)}^{2}\right)}^{2}}.
(5)

2.
Frequency Domain Features

The peak magnitude value of the Fast Fourier Transform (FFT) of the signal S was also used as a feature. All the FFT coefficients were normalized by the length of the signal, n.

The centroid frequency of the signal S[15] was estimated as
\widehat{f}=\frac{{\int}_{0}^{{f}_{max}}f{F}_{s}\left(f\right){}^{2}df}{{\int}_{0}^{{f}_{max}}{F}_{s}\left(f\right){}^{2}df},
(6)
where F
_{
s
}(f) is the Fourier transform of the signal S and f
_{
max
}is the Nyquist frequency (effectively 500 Hz after downsampling).

3.
Information TheoryBased Features

The entropy rate [22] of a signal quantifies the extent of regularity in that signal. The measure is useful for signals with some relationship among consecutive signal points. We first normalized the signal S to zeromean and unit variance. Then, we quantized the normalized signal into 10 equally spaced levels, represented by the integers 0 to 9, ranging from the minimum to maximum value. Now, the sequence of U consecutive points in the quantized signal, \u015c=\left\{{\u015d}_{1},{\u015d}_{2},...,{\u015d}_{3}\right\}, was coded using the following equation
{a}_{i}={\u015d}_{i+U1}\cdot 1{0}^{U1}+...+{\u015d}_{i}\cdot 1{0}^{0},
(8)
with i = 1, 2, ..., n  U + 1. The coded integers comprised the coding set A
_{
U
}= {a
_{1}, ..., a
_{
nU+1}}. Using the Shannon entropy formula, we estimated entropy
E\left(U\right)=\sum _{t=0}^{{10}^{U}1}{P}_{{A}_{U}}\left(t\right)\cdot ln{P}_{{A}_{U}}\left(t\right),
(9)
where {p}_{{A}_{U}}\left(t\right) represents the probability of observing the value t in A
_{
U
}, approximated by the corresponding sample frequency. Then, the entropy rate was normalized using the following equation
\hat{E\left(U\right)}=\frac{E\left(U\right)E\left(U1\right)+E\left(1\right)\cdot \beta}{E\left(1\right)},
(10)
where \hat{E\left(U\right)} denotes the normalized entropy, and β was the percentage of the coded integers in A
_{
L
}that occurred only once. Finally, the regularity index ρ ∈ [0, 1] was obtained as
\rho =1min\hat{E\left(U\right)},
(11)
where a value of ρ close to 0 signifies maximum randomness while ρ close to 1 indicates maximum regularity.

To calculate the memory of the signal [13], its autocorrelation function was computed from zero to the maximum time lag (equal to the length of the signal) and normalized such that the autocorrelation at zero lag was unity. The memory was estimated as the time required for the the autocorrelation to decay to 1/e of its zero lag value.

LempelZiv (LZ) complexity [23] measures the predictability of a signal. To compute the LZ complexity for signal S, first, the minimum and the maximum values of signal points were calculated and then, the signal was quantized into 100 equally spaced levels between its minimum and maximum values. Then, the quantized signal, {B}_{1}^{n}=\left\{{b}_{1},{b}_{2},...,{b}_{n}\right\}, was decomposed into T different blocks, {B}_{1}^{n}=\left\{{\psi}_{1},{\psi}_{2},...,{\psi}_{T}\right\}. A block ψ was defined as
\Psi ={B}_{j}^{\ell}=\left\{{b}_{j},{b}_{j+1},...,{b}_{\ell}\right\},1\le j\le \ell \le n.
(12)
The values of the blocks can be calculated as follows:
\Psi =\left\{\begin{array}{cc}\hfill {\psi}_{m}={b}_{1},\hfill & \hfill \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{m}}=1,\hfill \\ \hfill {\psi}_{m+1}={B}_{{h}_{m}+1}^{{h}_{m+1}},\hfill & \hfill m\ge 1,\hfill \end{array}\right.
(13)
where h
_{
m
}is the ending index for ψ
_{
m
}, such that ψ
_{
m+1}is a unique sequence of minimal length within the sequence {B}_{1}^{{h}_{m+1}1}. Finally, the normalized LZ complexity was calculated as
LZ=\frac{T{log}_{100}n}{n}.
(14)
2.5 ReputationBased Classification
Reputation typically refers to the quality or integrity of an individual component within a system of interacting components. The notion of reputation has been widely used to ascertain the health of nodes in wireless networks [24], identify malicious hosts in a distributed system [25] and detect freeriders in peertopeer networks [26], among many other practical applications. Here, we apply the concept of reputation to judiciously combine decisions of multiple classifiers for the purpose of differentiating between safe and unsafe swallows. The general idea is to differentially weigh classifier decisions on the basis of their past performance.
The past performance of the i^{th}classifier is captured via its reputation, {r}_{i}\in \Re ,0\le {r}_{i}\le 1, where 1 signifies a strong classifier (high accuracy) and 0 denotes a weak classifier. Briefly, the classifier is formulated as follows. Let Θ = {θ
_{1}, θ
_{2},..., θ
_{
L
}} be a set of L ≥ 2 classifiers and Ω = {ω
_{1}, ω
_{2}, ..., ω
_{
c
}} be a set of c ≥ 2 class labels, where ω
_{
j
}≠ ω
_{
k
}, ∀j ≠ k. Without loss of generality, Ω ⊂ ℕ. The input of each classifier is the feature vector x\in {R}^{{n}_{i}}, where n
_{
i
}is the dimension of the feature space for the i^{th}classifier θ
_{
i
}, whose output is a class label ω
_{
j
}, j = 1, ..., c. Let p(ω
_{
j
}) be the prior probability of class ω
_{
j
}.

1.
For a classification problem with c ≥ 2 classes, we invoke L ≥ 2 individual classifiers.

2.
After training the L classifiers individually, the respective accuracy of each is evaluated using a validation set and expressed as a real number in [0, 1]. This number is the reputation of the classifier.

3.
For each feature vector, x, in the test set, L decisions are obtained using the L distinct classifiers:
\Omega \left(x\right)=\left\{{\theta}_{1}\left(x\right),{\theta}_{2}\left(x\right),...,{\theta}_{L}\left(x\right)\right\}.
(15)

4.
We sort the reputation values of the classifiers in descending order,
{R}^{*}=\left\{{r}_{{1}^{*}},{r}_{{2}^{*}},...,{r}_{{L}^{*}}\right\},
(16)
such that r
_{1*} ≥ r
_{2*} ≥ ··· ≥ r
_{
L*}. Then, using this set, we rank the classifiers to obtain a reputationordered set of classifiers, Θ*.
{\Theta}^{*}=\left(\begin{array}{c}\hfill {\theta}_{{1}^{*}}\hfill \\ \hfill {\theta}_{{2}^{*}}\hfill \\ \hfill \vdots \hfill \\ \hfill {\theta}_{{L}^{*}}\hfill \end{array}\right).
(17)
The first element of this set corresponds to the classifier with the highest reputation.

5.
Next, we examine the votes of the first m elements of the reputationordered set of classifiers, with
m=\left\{\begin{array}{cc}\hfill \phantom{\rule{1em}{0ex}}\phantom{\rule{2.77695pt}{0ex}}\frac{L}{2},\hfill & \hfill \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}L\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{is}}\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{even}},\hfill \\ \hfill \frac{L+1}{2},\hfill & \hfill \mathsf{\text{if}}\phantom{\rule{2.77695pt}{0ex}}L\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{is}}\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{odd}}.\hfill \end{array}\right.
(18)
If the top m classifiers vote for the same class, ω
_{
j
}, we accept the majority vote and take ω
_{
j
}as the final decision of the system. However, if the votes of the first m classifiers are not equal, we consider the classifiers' individual reputations (Step 2) in arriving at the final decision, as detailed in step 6.

6.
The probability that the combined classifier decision is ω _{
j
}given the input vector x and the individual local classifier decisions is denoted as the posterior probability,
p\left({w}_{j}{\theta}_{1}\left(x\right),{\theta}_{2}\left(x\right),...,{\theta}_{L}\left(x\right)\right)
(19)
which can be estimated using Bayes rule as
p\left({w}_{j}{\theta}_{1},...,{\theta}_{L}\right)=\frac{{\prod}_{i=1}^{L}p\left({\theta}_{i}{w}_{j}\right)p\left({w}_{j}\right)}{{\sum}_{t=1}^{c}{\prod}_{i=1}^{L}p\left({\theta}_{i}{w}_{t}\right)p\left({w}_{t}\right)}.
(20)
when the classifiers are independent. For notational convenience, we have dropped the argument for θ above, but it is understood to be a function of x. The local likelihood functions, p(θ
_{
i
}ω
_{
j
}), are estimated by the reputation values calculated in Step 2. When the correct class is ω
_{
j
}and classifier θ
_{
i
}classifies x into the class ω
_{
j
}, i.e., θ
_{
i
}(x) = ω
_{
j
}, we can write
p\left({\theta}_{i}={w}_{j}{w}_{j}\right)={r}_{i}.
(21)
In other words, p(θ
_{
i
}= ω
_{
j
}ω
_{
j
}) is the probability that the classifier θ
_{
i
}correctly classifies x into class ω
_{
j
}when x actually belongs to this class. This probability is exactly equal to the reputation of the classifier. On the other hand, when the classifier categorizes x incorrectly, i.e., θ
_{
i
}(x) ≠ ω
_{
j
}given that the correct class is ω
_{
j
}, then
p\left({\theta}_{i}\ne {w}_{j}{w}_{j}\right)=1{r}_{i}.
(22)
When there is no known priority among classes, we can assume equal prior probabilities. Hence,
p\left({w}_{1}\right)=p\left({w}_{2}\right)=...=p\left({w}_{c}\right)=\frac{1}{c}.
(23)
Thus, for each class, ω
_{
j
}, we can estimate the a posteriori probabilities as given by (20) using (21), (22), and (23). The class with the highest posterior probability is selected as the final decision of the system and the input subject x is categorized as belonging to this class.
2.6 Classifier evaluation
We ranked the signal features introduced above using the Fisher ratio [27] for univariate separability. In the time domain, mean and variance in the AP axis and skewness in the SI axis were the topranked features. Similarly, in the frequency domain, the peak magnitude of the FFT and the spectral centroid in the AP direction and the bandwidth in the SI direction were retained. Finally, in the information theoretic domain, entropy rate for the SI signal and memory of the AP signal were the highest ranking features. Subsequently, we only examined these feature subsets for classification, i.e., in total 8 different features were selected. For comparison between single and dualaxes classifiers, we also considered classifiers that employed feature subsets (as identified above) from a single axis.
Swallows from all 30 participants were pooled together. Given the disproportion of safe and unsafe samples, we invoked a smooth bootstrapping procedure [28] to balance the classes. All features were then standardized to zero mean and unit variance. Three separate support vector machine (SVM) classifiers [29] were invoked, one for each feature genre (time, frequency and information theoretic). Hence, the feature space dimensionalities for the classifiers were 3 (SVM with time features), 3 (SVM with frequency features) and 2 (SVM with informationtheoretic features).
The use of different feature sets for each classifier increases the likelihood that the classifiers will perform independently [30].
Classifier accuracy was estimated via a 10fold cross validation with a 90%10% split. In each fold, performance on the training set was used to estimate the individual classifier reputations. Classifiers were then ranked according to their reputation values. Without loss of generality, assume r
_{1} ≥ r
_{2} ≥ r
_{3}. If θ
_{1} and θ
_{2} cast the same vote about a test swallow, their common decision was accepted as the final classification. However, if they voted differently, the a posteriori probability of each class was computed using (20) and the maximum a posteriori probability rule was applied to select the final classification.