A brief explanation on the two important techniques (EMD and SVM) employed in this work is provided in the beginning of this section. Later a detailed description of the proposed methodology for the extraction of statistical features and classification of FHR signals as 'normal' and 'at risk' (Figure 2) is given. Finally, the four important stages in this work, namely, data acquisition, preprocessing, feature extraction and classification, are defined.
Empirical Mode Decomposition
EMD, proposed by Huang et al
[33], is a method to decompose nonlinear and nonstationary time series into several monotonic components termed as intrinsic mode functions (IMF) of different time scales. The most appealing nature of EMD is its dependency on the datadriven mechanism which does not require a priori known basis unlike Wavelet and Fourier transform.
The EMD method identifies all the local maxima and minima for a given input signal x(t) which are connected by spline curves to form the upper and the lower envelopes, e
_{up}
(t) and e
_{low}
(t), respectively. The mean of the two envelopes is calculated as m(t) = [e
_{up}
(t) + e
_{low}
(t)]/2 and is subtracted from the signal using q(t) = x(t)  m(t).
An IMF c
_{
i
}
(t) is obtained if q(t) satisfies the two conditions of IMF, these are, the number of extrema and number of zero crossings is either equal or differs at most by one, and the envelopes defined by the local maxima and minima are symmetric with respect to zero mean. This procedure is called as the sifting process. Then x(t) is replaced with the residual r(t) = x(t)q(t). If q(t) is not an IMF, x(t) is replaced with q(t). The above process is repeated until the residual satisfies the stopping criterion called sum of difference, S_{D} shown in equation (1), which is generally set between 0.2 and 0.3.
{S}_{D}={\displaystyle \sum _{t=0}^{T}\frac{{q}_{k1}(t){q}_{k}(t){}^{2}}{{q}_{k1}^{2}(t)}}
(1)
At the end of this process the signal x(t) would result in N IMFs and a residue signal as in equation (2).
x(t)={\displaystyle \sum _{n=1}^{N}{c}_{n}(t)+{r}_{N}(t)}
(2)
Where n represents the order of IMFs, n = 1 to N and r
_{
N
}denotes the final residue which can also be considered as an IMF, shown in equation (3).
{x}_{d}(t)={\displaystyle \sum _{i=1}^{{N}_{f}}{c}_{i}(t)}
(3)
Here, N
_{
f
}is the total number of IMFs including the residue. The signal x(t) is decomposed such that the lowerorder components represent fast oscillation modes and higherorder components represent slow oscillation modes. A detailed explanation of the method is provided in [40].
Support Vector Machine
SVM is a supervised learning tool that can be used for pattern classification. The main goal of SVM is to construct an optimal hyperplane as the decision surface in such a way that the margin of separation between the closest data points belonging to different classes is maximized. SVM is based on the principle of structural risk minimization method [46]. In a binary classification problem which is of interest in this work, each one of the set of points belongs to either one of the two classes.
Consider a training set {\{({x}_{i},{d}_{i})\}}_{i=1}^{l} where x
_{
i
}is the input pattern for the i^{th}sample and d
_{
i
}∈{1,+1} is the corresponding desired output and l is the number of observations. If the input patterns belonging to two different classes are linearly separable then there exists a hyperplane that maximizes the margin of separation. For the optimal hyperplane the Euclidian norm of the weight vector w is minimum and at the same time satisfies the constraints in equation (4).
{d}_{i}({w}^{T}{x}_{i}+b)\ge 1{\forall}_{i}
(4)
This is a constrained quadratic optimization problem that may be solved using the method of Lagrange multipliers. Pattern classification problems in real life are not linearly separable. Here, SVMs depend on two mathematical operations: nonlinear mapping of an input vector into a highdimensional feature space and construction of an optimal hyperplane for separating the features.
Nonlinear mapping is performed in accordance with the Cover's theorem [46] on the separability of patterns. Nonlinearly separable patterns in the input space when transformed to a high dimensional feature space, they can be linearly separable with high probability. Therefore for each input pattern vector x_{i} in the m_{0} dimensional input space we define a vector consisting of a set of realvalued functions{ψ
_{
i
}(x)  i = 1,2,..m
_{1}, as shown by ψ(x) = [ψ
_{1}(x), ψ
_{2}(x),.......ψ
_{m}
_{1}]^{T} that map the m_{0} dimensional input points to m_{1} dimensional new feature space. An optimal hyperplane in the feature space is found as in equation (5).
\begin{array}{c}\text{Minimizing}{\scriptscriptstyle \frac{1}{2}}{w}^{T}w+C{\displaystyle \sum _{i=1}^{n}{\zeta}_{i}}\\ \text{subjectto:}{d}_{i}({w}^{T}\psi ({x}_{i})+b)\ge 1{\zeta}_{i}\\ \text{and}{\zeta}_{i}\ge 0{\forall}_{i}\end{array}
(5)
ζ
_{
i
}are called slack variables, a set of nonnegative scalar variables; they measure the deviation of a data point from the ideal condition of pattern separability. Parameter C is a user specified positive value that controls the tradeoff between maximizing the margin and minimizing the error. ψ(x
_{
i
}) is the nonlinear mapping of input patterns from input space to feature space. The optimal discriminating function is given by equation (6).
f(x)=sign\left[{\displaystyle \sum _{i=1}^{n}{d}_{i}{\alpha}_{i}({\psi}^{T}({x}_{i})\psi (x))+b}\right]
(6)
The coefficients α
_{
i
}are derived from the maximization of dual Lagrangian as in equation (7).
\begin{array}{l}Q(\alpha )={\displaystyle \sum _{i=1}^{n}{\alpha}_{i}}\frac{1}{2}{\displaystyle \sum _{i=1}^{n}{\displaystyle \sum _{j=1}^{n}{\alpha}_{i}{\alpha}_{j}{d}_{i}{d}_{j}({\psi}^{T}({x}_{i})\psi (}}{x}_{j}))\\ \text{subjectto:}{\displaystyle \sum _{i=1}^{n}{\alpha}_{i}{d}_{i}=0}\\ \text{and}0\le {\alpha}_{i}\ge C{\forall}_{i}\end{array}
(7)
The points for which α
_{
i
}> 0, are called support vectors. The term ψ^{T}(x
_{
i
})ψ(x
_{
j
}) represents the inner product of two vectors in the feature space. We may introduce the innerproduct kernel denoted byK (x
_{
i
}, x
_{
j
}), written as shown in equation (8).
K({x}_{i},{x}_{j})={\psi}^{T}({x}_{i})\psi ({x}_{j})
(8)
A kernel function is a function in the input space and hence, we may use the innerproduct kernel K (x
_{
i
}, x
_{
j
})to construct the optimal hyperplane without having to perform explicitly the nonlinear mapping [42, 46].
For classification problems dealing with medical data where the numbers of data in different classes are unbalanced, some researchers [1, 28, 47] have proposed the use of different penalty parameters in the SVM formulation as in equation (9).
Minimizing,
\begin{array}{l}\frac{1}{2}{w}^{T}w+{C}^{}{\displaystyle \sum _{i;{y}_{i}=1}^{n}{\zeta}_{i}+{C}^{+}{\displaystyle \sum _{i;{y}_{i}=+1}^{n}{\zeta}_{i}}}\\ \text{subjectto:}{d}_{i}({w}^{T}\psi ({x}_{i})+b)\ge 1{\zeta}_{i}\\ \text{and}{\zeta}_{i}\ge 0{\forall}_{i}\end{array}
(9)
C^{+} and C^{} are the penalty parameters used to penalize more heavily the undesired type of error, and the errors related to the class with the smallest population [47].
Data Acquisition
CTG is a routine noninvasive fetal monitoring tool based on ultrasound Doppler combined with an external pressure transducer to record uterine activity. In this technique a transducer placed on the mother's abdomen transmits an ultrasound beam towards the fetal heart. The FHR is derived from the Doppler shifted echoes created by contractions of the fetal heart. An autocorrelation method is used to compare successive heart signals and test for similarity. CTG signals used in this research work were recorded at Universiti Kebangsaan Malaysia Medical Center (UKMMC). Data acquisition was carried out with the approval of UKMMC's ethical committee and after obtaining informed consent from all subjects. All 15 subjects were with singleton pregnancy and gestation age ranging from 34 weeks to 40 weeks. Since, the proposed study required both normal and abnormal patterns of FHR we had two different setups. The CTG signals with normal patterns were recorded from the day care clinic using an antepartum fetal monitor (Philips FM 20) and a software (Trium CTG Light 2.0 from Trium Analysis Online GmbH) as in Figure 3a. A galvanic isolator was used between the fetal monitor and the computer for safety purpose. CTG signals with abnormal patterns were obtained from the archived data on a Huntleigh server (Sonicaid™ Centrale, Labour Management System) at UKMMC, using the Export Utility Software as shown in Figure 3b. These data were recorded with the help of a fetal monitor (Philips Series 50IP). All the signals acquired using the above mentioned commercially available softwares had a sampling frequency of 4 Hz.
Preprocessing and Feature Extraction using EMD
Preprocessing
Recorded FHR signals may possess missing beats and spiky artifacts (Figure 4a) due to the displacement of the transducer because of maternal or fetal movement and stress induced after the onset of labour. These artifacts are generally present and difficult to eliminate from the source. Missing beats in FHR can be about 20%  40% of the data, especially during the final stages of labour [28, 40]. For this reason, the quality of the FHR signal is estimated based on the number of missing beats and a poor quality signal is not subjected to further analysis [40]. In this work, missing beats are removed using a recursive algorithm and high frequency noises are suppressed using a method based on EMD explained in [40]. Once missing data segments are eliminated, the FHR signal is decomposed using EMD into several monotonic components (IMFs) as explained earlier. It is reported that the lower order IMFs which may contain noise have zero mean [36, 40, 35]. Hence, a statistical ttest is used to determine the high frequency noise components which are then separated from the signal in the EMD domain (equation (10)).
\begin{array}{l}{H}_{0}:mean({c}_{{P}_{M}S}(t))=0\\ {H}_{1}:mean({c}_{{P}_{M}S}(t))\ne 0\end{array}
(10)
Where, {c}_{{P}_{M}S}(t) is the M thorder partial sum of the IMFs.
This test is applied first on the lowest order IMF and later on the partial sum {c}_{{P}_{M}S}(t)for M = 1, 2,... as shown in equation (11), until a partial sum {c}_{{P}_{t}S}(t) is obtained, where the mean significantly deviates from zero.
{c}_{{P}_{M}S}(t)={\displaystyle \sum _{i=1}^{M}{c}_{i}(t)}
(11)
The above procedure indicates the number of IMFs that can be considered as noise, P
_{
t
}(noise order). Since this technique might result in over smoothing of the FHR signal, the noise order is estimated using equation (12) [40].
{P}_{f}=\mathrm{min}({P}_{t},3)
(12)
Then, the first P
_{
f
}components are eliminated from the set of IMFs for a given FHR signal. A denoised FHR signal (Figure 4b) can be obtained by applying the partial construction method on the remaining components.
Feature Extraction
For each of the remaining components (Figure 5) a standard deviation is estimated (equation (13)).
\begin{array}{c}SD({C}_{i})={\left[\frac{1}{N1}{\displaystyle \sum _{j=1}^{N}{[{C}_{i,j}M({C}_{i})]}^{2}}\right]}^{\frac{1}{2}}\\ \text{for}i={P}_{f}+\text{1to}N\end{array}
(13)
These standard deviations (noise order, P
_{
f
}+1 onwards) are considered as statistical features of the FHR signal and are used as inputs to the SVM classifier. A similar approach has been employed in extracting statistical features from the wavelet coefficients, obtained from discrete wavelet transform (DWT) [1, 48, 49].
Classification using SVM
SVM Classifier
The classification stage follows the feature extraction stage. The main objective of this stage is to classify the FHR signals as normal or at risk represented as '+1' and '1' respectively. The entire process involves a training set and a testing set of data instances. A training set consists of features extracted from the decomposed FHR signal, also called the attributes, and the labels '+1' (normal) or '1' (at risk) which are the target outputs. SVM should produce a model to classify the data instances in the testing set which consists of only features. The classification depends on the innerproduct kernel used that produces different learning machines and hence different decision boundaries [46]. In this work, based on the statistical features extracted from FHR signals the radial basis function (RBF) kernel was used (equation (14)) to obtain reliable results. Parameters C and γ are specified by the user.
k({x}_{i},{x}_{j})=\gamma \mathrm{exp}{\left(\left{x}_{i}{x}_{j}\right\right)}^{2}
(14)
The following procedure was employed during the classification stage:

(a)
Scaling the training and testing data sets.

(b)
General grid search method is considered an intractable problem and estimating accuracy for all possible combinations of C and γ is a time consuming process. Therefore, exponentially increasing values were considered initially to find a better possible set of values for C and γ that yielded better accuracy. Finally, C and γ values thus obtained are varied slightly to gain the best possible accuracy. The estimated set of values for C and γ are 4 and 2, respectively.

(c)
Training SVM using the chosen C and γ values to achieve the best crossvalidation accuracy (CSV) possible.

(d)
Predicting the output of the testing set.

(e)
Estimating the accuracy of classification.
Since unbalanced data are used in this work the ratio of C^{+}/C^{} are set to the inverse of the corresponding cardinalities of the classes.
Description of the Data Set
A total of 129 FHR signals of 20 minutes duration were collected from pregnant women at gestation ages of 34 to 40 weeks. Missing beats were eliminated from all signals using the recursive algorithm. CTG traces were submitted to two experienced obstetricians (20 years in practice) for visual inspection who were asked to classify the traces as 'normal' (+1) or 'at risk' (1). From the 129 records 29 showed disagreement between experts, therefore were eliminated. Out of the remaining 100 signals, 90 (30 normal and 60 at risk) signals were considered for the purpose of training and testing so that the normal and at risk signals ratio can be maintained at 1:2 during training and also in estimating the crossvalidation accuracy. Hence, the proposed methodology was tested on 90 FHR recordings having mutually agreed interpretation from two experienced obstetricians.
From the 90 FHR signals in the data set, Dataset, a training set and a testing set consisting of 60 and 30 data instances, respectively, were created. The training set had 20 'normal' and 40 'at risk' data instances and the testing set was composed of 10 'normal' and 20 'at risk' samples. The training set was further divided into five subsets, as explained later in the results section, in order to obtain more reliable results. Data instances in the Dataset consisted of a maximum of 10 statistical features (standard deviation) extracted from the remaining decomposed components (IMFs) of the FHR signal, after the separation of noisy components (preprocessing method). Since the number of remaining decomposed components of 90 FHR signals in the set varied from 7 to 10, a maximum of 10 input features or parameters were considered. Features in the Dataset extracted from 90 FHR signals were used as inputs to the SVM classifier.