Classification of caesarean section and normal vaginal deliveries using foetal heart rate signals and advanced machine learning algorithms

Background Visual inspection of cardiotocography traces by obstetricians and midwives is the gold standard for monitoring the wellbeing of the foetus during antenatal care. However, inter- and intra-observer variability is high with only a 30% positive predictive value for the classification of pathological outcomes. This has a significant negative impact on the perinatal foetus and often results in cardio-pulmonary arrest, brain and vital organ damage, cerebral palsy, hearing, visual and cognitive defects and in severe cases, death. This paper shows that using machine learning and foetal heart rate signals provides direct information about the foetal state and helps to filter the subjective opinions of medical practitioners when used as a decision support tool. The primary aim is to provide a proof-of-concept that demonstrates how machine learning can be used to objectively determine when medical intervention, such as caesarean section, is required and help avoid preventable perinatal deaths. Methods This is evidenced using an open dataset that comprises 506 controls (normal virginal deliveries) and 46 cases (caesarean due to pH ≤ 7.20—acidosis, n = 18; pH > 7.20 and pH < 7.25—foetal deterioration, n = 4; or clinical decision without evidence of pathological outcome measures, n = 24). Several machine-learning algorithms are trained, and validated, using binary classifier performance measures. Results The findings show that deep learning classification achieves sensitivity = 94%, specificity = 91%, Area under the curve = 99%, F-score = 100%, and mean square error = 1%. Conclusions The results demonstrate that machine learning significantly improves the efficiency for the detection of caesarean section and normal vaginal deliveries using foetal heart rate signals compared with obstetrician and midwife predictions and systems reported in previous studies.


Background
CTG was initially developed as a screening tool to predict foetal hypoxia [7]. However, there is no evidence to suggest that there has been any improvement in perinatal deaths since the introduction of CTG into clinical practice 45 years ago. It is generally agreed that 50% of birth-related brain injuries are preventable, with incorrect CTG interpretation leading the list of causes [8]. Equally, over interpretation of CTG is common and the direct cause of unnecessary caesarean sections, which costs the NHS £1700 for each caesarean performed compared with £750 for a normal vaginal delivery. It is therefore generally agreed that predicting adverse pathological outcomes and diagnosing pathological outcomes earlier clearly have important consequences, for both health and the economy. One interesting approach is machine learning.
Warrick et al. [9] developed a system for the classification of normal and hypoxic foetuses by modelling the FHR and uterine contraction (UC) signal pairs as an input-output system to estimate their dynamic relation in terms of an impulse response function [10]. The authors report that their system can detect almost half of the pathological cases 1 h and 40 min prior to delivery with a 7.5% false positive rate. Kessler et al. [11] on the other hand, using 6010 high risk deliveries, combined CTG with ST waveform to apply timely intervention for caesarean or vaginal delivery, which they report, reduced foetal morbidity and mortality [12].
In comparative studies, Huang et al. [13] compared three different classifiers; a decision tree (DT), an artificial neural network (ANN), and discriminant analysis (DA). Using the ANN classifier, it was possible to obtain a 97.78% overall accuracy. This was followed by the DT and DA with 86. 36 and 82.1% accuracy respectively. The sensitivity and specificity values were not provided making accuracy alone an insufficient performance measure for binary classifiers. This is particularly true in evaluations where datasets are skewed in favour of one class with significant differences between prior probabilities.
In a similar study, Ocak et al. [14] evaluated an SVM and genetic algorithm (GA) classifier and reported 99.3 and 100% accuracies for normal and pathological cases respectively. Similar results were reported in [15,16]. Again, Sensitivity and Specificity values where not provided in these studies. Meanwhile Menai et al. [17] carried out a study to classify foetal state using a naive bayes (NB) classifier with four different feature selection (FS) techniques: Mutual information, correlation-based, ReliefF, and information gain. The study found that the NB classifier in conjunction with features produced using the ReliefF technique produce the best results when classifying foetal state with 93.97, 91.58, and 95.79% for accuracy, sensitivity and specificity, respectively. While the results are high, the dataset is multivariate and highly imbalanced. Alternative model evaluation metrics for multi-class data, such as micro-and macro-averaging, and micro and macro-F-measure, would provide a more informed account of model performance. Furthermore, an appropriate account of how the class skew problem was addressed is missing.
The adaptive boosting (AdaBoost) classifier was adopted in a study by Karabulut et al. [18] who report an accuracy of 95.01%-again no sensivity or specificity values were provided. While Spilka et al. who are the current forerunners of pioneering work in machine learning and CTG classification [6], used a random forest (RF) classifier in conjunction with latent class analysis (LCA) [19] and reported sensitivity and specificity values of 72 and 78% respectively using the CTG-UHB dataset [3]. Producing slightly better results in [20] using the same dataset, Spilka et al. attempted to detect hypoxia using a C4.5 decision tree, naive bayes, and SVM. The SVM produced the best results using a tenfold cross validation method achieving 73.4% for sensitivity and 76.3% of specificity.

Methods
This section describes the dataset adopted in this study and discusses the steps taken to pre-process the data and extract the features from raw FHR signals. The section is then concluded with a discussion on the feature selection technique and dimensionality reduction.

CTG data collection
Chudacek et al. [3] conducted a comprehensive study that captured intrapartum recordings between April 2010 and August 2012. The recordings were collected from the University Hospital in Brno (UHB), in the Czech Republic by obstetricians with the support of the Czech Technical University (CTU) in Prague. These records are publically available from the CTU-UHB database, in Physionet [3].
The CTU-UHB database contains 552 CTG recordings for singleton pregnancies with a gestational age less than 36 weeks that were selected from 9164 recordings. The STAN S21/S31 and Avalon FM 40/50 foetal monitors were used to acquire the CTG records. The dataset contains no prior known development factors (i.e. they are ordinary clean obstetrics cases); the duration of stage two labour is less than or equal to 30 min; foetal heart rate signal quality is greater than 50% in each 30 min' window; and the pH umbilical arterial blood sample is available. In the dataset, 46 caesarean section deliveries are included and the rest are ordinary clean vaginal deliveries. Figure 1 shows a scatter plot of the dataset with eclipses defining the separation between both case and control groups. Note that in this study 46 cases are classified as a caesarean delivery due to pH ≤7.20-acidosis, n = 18; pH >7.20 and pH ≤7.25-foetal deterioration, n = 4; or clinical decision without evidence of pathological outcome measures, n = 24 as defined in [14]. Table 1 provides details of the outcome measures used in the CTU-UHB database.
The ID column in Table 1 maps to the file number in the CTU-UHB database, while the second column provides details about the mothers age. The pH column describes the umbilical artery pH value for each case and BDef provides information on the base deficit in extracellular fluid. The PCO 2 describes the partial pressure of carbon dioxide. BE gives values for the base excess and finally Apgar scores are a subjective evaluation  of the outcome of the delivery. For a more in-depth discussion of the dataset and these parameters please refer to [3]. Each recording begins no more than 90 min before delivery. Each CTG record contains the FHR time series (measured in beats per minute) and uterine contraction (UC) signal-each sampled at 4 Hz. The FHR was obtained from an ultrasound transducer attached to the abdominal wall. In this study only the FHR signal is only considered in this study since it provides direct information about the foetal state.

Pre-processing
Each of the 552 FHR signal recordings were filtered using a 6th order low-pass butterworth filter with fc = 4 Hz and a cut-off frequency of 0.034 Hz. To correct the phase distortion introduced by a one-pass filter, a two-pass filter (forwards and reverse) was used to filter each of the signals. Noise, and missing values were removed using cubic Hermite spline interpolation [21].

FHR features
This section describes the statistical, higher-order statistical and higher-order spectral features extracted from the FHR signals.

Morphological features
The initial set of features considered are those defined by the international federation of gynecology and obstetrics 3 (FIGO) and the National Institute for Health and Care Excellence 4 (NICE). Consider a raw FHR time series signal X with length N, where X = {x n , n = 1, 2…, N}, in which the virtual baseline mean (VBM), x is defined as: Such that x can be used to remove accelerations and decelerations (if x n > 10 + x then: x n =x + 10; if x n < −10 + x then: x n =x + −10 from the FHR signal so that the real baseline FHR (RBL) can be derived [22]: where H and L are the upper and lower limits of the time series signal respectively, X is the signal and N is the length of the signal.
Using the RBL, FIGO accelerations and decelerations can be extracted. These are features commonly used by obstetricians to monitor the interplay between the sympathetic and parasympathetic systems. Accelerations and decelerations within X represent the transient increases and decreases (±15 bpm) that last for 15 s or more [23]. In the case of accelerations, this typically indicates adequate blood delivery and is reassuring for the obstetrician. Calculating accelerations in the signal is defined by: where X is the signal, x i is the ith element of X, RBL is the real baseline defined in (2), and D is the duration of time in which x i remains above RBL +15.
In contrast decelerations represent temporary decreases (−15 bpm) in FHR below the RBL that last for 15 s or more, which can indicate the presence of possible pathological outcomes such as, umbilical cord compression, hypoxia or acidosis [18]. The decelerations in the signal are calculated as: where x i is the ith element of signal X, RBL is the real baseline, and D is the time duration in which x i remains below RBL-15.
Short and long-term variability (STV and LTV respectively) are further indicators used by obstetricians. The presence of both suggests an intact neuromodulation of the FHR and normal cardiac function and is one of the most reassuring measures in neonatal care [24]. When STV or LTV decreases or is absent, it can be a significant indicator for the presence of hypoxia or acidosis. Therefore, they are both considered to be important predictors. STV is calculated according to the following equation: where M is the number of minutes contained in the X signal and R t is defined as: where H is the number of subintervals in 60 s (in this case H = 60/K), K is the sample frequency (4 Hz) multiplied by 2.5 s and S j is the average value of 2.5 s for a subinterval j = {1, 2,…,H}.
In contrast, LTV is defined as the difference between the minimum and maximum value in a 60-s block and is averaged to the duration of the signal if it is more than 1 min long. LTV is defined as: where N is the length of the X signal, b is 240 samples (60-s blocks for a 4 Hz sample frequency).
Collectively, RBL, accelerations, decelerations, STV and LTV define the five main FIGO/NICE features used by obstetricians and midwives and are subsequently consider as predictors for separating caesarean section and normal vaginal deliveries in this study.

Time series features
FIGO feature sets are often extended in automated CTG analysis to include patterns in the signal that are not easily identifiable through visual inspection. Two useful timeseries features that have been heavily utilized in medical signal processing are root mean squares (RMS) and sample entropy (SampEn). RMS is a useful feature for estimating short term variability between accelerations and deceleration [25] and is commonly described for a signal X with length N as: This feature is particularly good at estimating sympathetic/parasympathetic dominance where the later, in a similar way to decelerations, can indicate the presence of possible pathological incidences, such as hypoxia and acidosis.
Whereas, sample entropy, quantifies the nonlinear dynamics of the FHR and the loss of complexity in the FHR signal. Previous studies have reported that it is a worthwhile feature for determining if the foetus is deprived of oxygen [26]. Sample entropy is the negative natural logarithm of the conditional probability that a dataset of length N, having repeated itself for m samples within a tolerance of r, will also repeat itself for m + 1 samples. Based on the calculation in [27] the time series X that contains N points, x i , x 2 . . . , x N subsequences can be defined by length m, and given by: This allows the following quantity to be defined: , where j ranges from 1 to N−m, and j � = i, and set: The parameter SampEn(m, r) is then defined as: Which can be estimated by the statistic: where N is the length of the X signal, m is the length of sequences to be compared, and r is the tolerance for accepted matches.

Frequency domain features
To overcome signal quality variations in the FHR signal, due to electrode placement and the physical characteristics of subjects [28], frequency domain features have been studied using power spectral density (PSD) computed using fast fourier transform (FFT): where X(f) contains the information for the signal and x(t) is obtained from X(f) using the inverse of the fourier transformation: The most notable feature is the peak frequency (FPeak) within the PSD, which has been used extensively in heart rate variability studies [29]. It is regarded as a useful measure of variability and normal sympathetic and parasympathetic function. It describes the dominant frequency in the PSD that has the maximum spectral power. In this study, peak frequency is derived using Welch's method [30]: where s i (i) is the power of the spectrum at bin i. As shown later in the paper, this feature has good discriminative capacity as a confounding coefficient.

Non-linear features
Poincare plots are a geometrical representation of a time series that is also used extensively to measure heart rate variability [20]. This paper shows that it has excellent discriminatory capacity in CTG analysis. Unlike HRV where it is commonly used, in FHR the difference between two beats is given as NN rather than the RR interval. A line of identity is used as a 45° imaginary diagonal line on the plot and the points falling on the line have the property NN n = NN n+1 [31]. Three coefficients of the Poincare plot, SD1 (the standard deviation of points perpendicular to the axis of line-of-identity), SD2 (the standard deviation of points along the axis of line-of-identity) and SDRatio are used as features to describe the cloud of points in the plot. Fundamentally, SD1 and SD2 are directly related to the standard deviation of NN interval (SDNN) and the standard deviation of the successive difference of the NN interval (SDSD) that is given by: where Y NN (0) and Y NN (1) describe the autocorrelation function for lag-0 and lag-1 of the NN interval, respectively. The mean of NN intervals is NN . Equation 16 shows that SD1 and SD2 measures are derived from the correlation and mean of the NN intervals time series with lag-0 and lag-1.
The SD1 feature is an index of instantaneous recording of the beat-to-beat short-term variability (the parasympathetic action) and SD2 describes the long-term variability (the sympathetic action). SD1 and SD2 are combined to form the ratio of SD1/SD2 that shows the relation between short and long-term variations of NN intervals: It is also possible to detect the existence of chaos in the FHR signal since the foetal heartbeat fluctuates on different time scales and has the property of being self-similar. In this study, the box-counting dimension is used to estimate the dynamics of the FHR [32]. It is a quantitative measure of the morphological properties of a signal and its capacity that is determined by covering the signal with N boxes of side length r. The minimal number of optimally sized boxes required to cover the complete signal describes the box-counting dimension coefficient such that: where D is the box counting fractal dimension of the object, r is the side length of the box, and N(r) is the smallest number of boxes of side r to cover the time series signal.
Long-term time-correlations or self-affinity measures of the FHR signal have also proven in previous studies to be useful for separating normal and pathological cases [33]. In this study, detrend fluctuation analysis (DFA) is performed where the returned exponent value indicates the presence or absence of fractal properties, i.e. self-similarity. The DFA probes the signal at different time scales and provides a fractal scaling exponent x. First the times series is integrated as follows: where y(k) is the cumulative sum of the ith sample and X avg is the mean value of the entire signal. Windows are derived from y(k) of equal length n and linear approximations y n are found using least squares fit (this represents a trend in a given window). The average fluctuation F(n) of the signal around the trend is given by: The calculations are repeated for all values of n. In this instance the primary focus is the relation between F(n) and the size of the window n. In general F(n) increases with the size of window n.

Feature selection
Feature selection is performed using the recursive feature eliminator algorithm (RFE) [34]. In this study a feature set was derived from the raw FHR signals based on the feature definitions described and a model fit generated using the RFE algorithm (refer to Algorithm 1) [34].
Each feature within this set is ranked using its importance to the model where S is a sequence of ordered numbers, which are candidate values for the number of features to retain (S 1 > S 2,... ). This process is repeated and the S i top ranked features are retained. The model is refit and the performance is reassessed. The top S i features are used to fit the final model.

Synthetic minority oversampling technique
In a two class balanced dataset the prior probabilities will be equal for each. This is not the case for the CTU-UHB dataset given there are 506 controls (majority class) and 46 cases (minority class). Classifiers are more sensitive to detecting the majority class and less sensitive to the minority class and this leads to biased classification [35]. Therefore, given a random sample taken from the dataset, the probability of a classifier classifying a foetus observation as a control will be much higher (91.6%-506/552) than the probability of it classifying a foetus observation as a case (8.3%-46/552). This imposes a higher cost for misclassifying the minority (predicting that a foetus is normal and the outcome being pathological) than the majority class, (predicting a foetus is pathological and the outcome being normal).
In order to address this problem, it is necessary to resample the dataset [36]. Various resampling techniques are available, and these include under sampling and over sampling. Under sampling reduces the number of records from the majority class to make it equal to the minority class-in this instance it would mean removing 460 records leaving us with a very small dataset. In contrast, data in the minority class can be increased using oversampling. In this study, the synthetic minority over-sampling technique (SMOTE) as defined in Algorithm 2 is used rather than reducing the dataset further [37]. Several studies have shown that the SMOTE technique effectively solves the class skew problem [35,[38][39][40][41][42]. Using SMOTE the minority class (cases) is oversampled using each minority class record, in order to generate new synthetic records along line segments joining the k minority class nearest neighbours. This forces the decision region of the minority class to become more general and ensures that the classifier creates larger and less specific decision regions, rather than smaller specific ones. In [37] the authors indicated that this approach is an accepted technique for solving problems related to unbalanced datasets.

Deep learning classifier
Deep learning neural network architectures have recently proven to be very powerful classifiers [43]. To the best of our knowledge, this algorithm has not been used in CTG studies, and this paper is thus the first to consider its use in automated CTG analysis. A multi-layer feedforward neural network architecture is used based on theoretical proofs in [44]. The supervised training phase is based on uniform adaptive optimized initialization that is determined by the size of the network. A Tansigmoid nonlinear activation function f is utilized and defined as: where x i and w i represent the firing neuron's input values and their weights, respectively; and α denotes the weighted combination.
The multinomial distribution is adopted with the cross-entropy loss function, which is typically used for classification in deep learning. For each training example j, the objective is to minimize a loss function: where W is the collection {W i } 1:N −1 , W i denotes the weight matrix connecting layers i and i + 1 for a network of N layers. Similarly B is the collection {b i } 1:N −1 , where b i denotes the column vector of biases for layer i + 1. In the case of cross entropy, the loss function can be calculated by: where t (j) and O (j) are the predicted and actual outputs, respectively, training example j, y represents the output units, and O the output layer.
The process used in this study to minimize the loss function defined in (22) is stochastic gradient descent (SGD) (refer to Algorithm 3) [45].
To address the problem of overfitting the dropout regularization technique proposed in [45] is used. This ensures that during forward propagation, when a given training example is used, the activation of each neuron in the network is suppressed within probability P. This coefficient is typically <0.2 for input neurons and <=0.5 for hidden neurons. Dropout allows an exponentially large number of models to be averaged as an ensemble, which helps prevent overfitting and improve generalization.
Momentum and learning rate annealing are used to modify back-propagation to allow prior iterations to influence the current version. In particular a velocity vector, v, is defined to modify the updates: where θ describes the parameters W and B, µ the momentum coefficient, and α the learning rate. Using the momentum parameter helps to avoid local minima and any associated instability [46]. Learning rate annealing is used to gradually reduce the learning rate α t to "freeze" into a local minima in the optimized landscape and is based on the principles described in [47].

Fishers linear discriminant analysis classifier
Before the more advanced random forest classification model is considered this section discusses the fishers linear discriminant analysis (FLDA) classifier as a baseline classification model. FLDA finds a linear combination of features that determines the direction along which the two classes are best separated. In this study the criterion proposed by Fisher is used which is the ratio of between-class to within-class variances. The data is projected onto a line, and the classification is performed in this one-dimensional space. The projection maximizes the distance between the means of the two classes while minimizing the variance within each class: where α is the bias, W is calculated using Fishers LDA, and X is the training data without class labels such that f y ≥ 0 for normal records and <0 for pathological records. W is derived from X such that the within class scatter matrix S W is minimized by: where C is the number of classes, X i is the set of all points that belong to class i, µ i is the mean of class i, and X k is the kth point of X i . The between class scatter matrix S B is maximized by: where C is the number of classes, N i is the total number of points that belong to class i, µ i is the mean of class i, and µ is the overall mean, i.e. the mean of the data when all classes are considered together.

Random forest classifier
Random forest (RF) classifiers have featured widely in biomedical research [11,[48][49][50]. They are based on an ensemble of many randomized decision-trees that are used to vote on the classification outcome. Many studies have shown that they give classification accuracies comparable with the best current classifiers on many datasets. They are able to handle data with a large number of features. Those features that are important for classification are determined through the calculation of an importance score for each feature. Each decision-tree is randomized using a bootstrap statistical resampling technique, with random feature selection [51].
Given an M feature set, trees are constructed using m features randomly selected from the feature set at each node of the tree. The best split is calculated using these m features, which continues until the tree is fully grown without pruning. The procedure is repeated for all trees in the forest using different bootstrap samples of the data. Classifying new samples can then be achieved using a majority vote. The approach combines bagging with decision tree classifiers to achieve this (refer to Algorithm 4).

Performance measures
k-fold cross validation is used as a prediction metric with fivefolds and 1 and 30 repetitions, respectively. The average performance obtained from 30 simulations is utilized. This number is considered, by statisticians, to be an adequate number of iterations to obtain an acceptable average. Let C k denote the indices of the observations in part k, and n k the number of observations in k: if n is a multiple of K, then n k = n/K. Compute:  16:89 where and x i is the fit for observation i, obtained from the data with part k removed.
Sensitivity (true positives) and specificity (true negatives) measure the predictive capabilities of classifiers in binary classification tests. Sensitivities refer to the true positive rate or recall rate (pathological cases). Specificities measure the proportion of true negatives (normal cases). Sensitivities are considered higher priority than specificities, in this study. It is important to predict a pathological case rather than miss-classify a normal case. To evaluate the performance of classifiers fitted to imbalanced datasets the F-Measure is a useful metric that combines precision and recall into a single value with equal weighting on both measures [52].
The area under the curve (AUC) is an accepted performance metric that provides a value equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (this obviously assumes that positive ranges higher than negative). This has been chosen, as it is a suitable evaluation method for binary classification. Consider a classifier that gives estimates according to p(C i |x), it is possible to obtain values {a 1 , . . . , a n1 ; a i = p(C 1 |x), x i ∈ C 1 } and {b 1 , . . . , b n2 ; b i = p(C 2 |x), x i ∈ C 2 } and use them to measure how well separated the distributions of p(x) for class C 1 and C 2 patterns are [53].
Using the estimates, {a 1 , . . . , a n1 , b 1 , . . . , b n1 } they can be ranked in increasing order. The class C 1 test points can be summed to see that the number of pairs of points, one from class C 1 and one from C 2 with p(x) smaller for class C 2 than the p(x) value for class C 1 , is: where r i is the ranked estimate, S o is the sum of the ranks of the class C 1 test patterns. Since there are n 1 n 2 pairs, the estimate of the probability that a randomly chosen class C 2 pattern has a lower estimated probability of belonging to class C 1 than a randomly chosen class C 1 is: This is equivalent to the area under the ROC which provides an estimate obtained using the rankings alone and not threshold values to calculate it [51].

Results
This section presents the classification results for control and case records using the CTU-UHB dataset. The features extracted from the FHR signals are used to model each of the classifiers. The performance is measured using sensitivity, specificity, AUC, F-Measure and MSE values.

Using all features from original data
In the first evaluation, all the features in the feature set are used to train the FLDA, RF and DL classifiers.

Classifier performance
The results in Table 2 show that the sensitivities for all classifiers are very low, while corresponding specificities are high. This is expected given that there are a limited number of case records from which the classifiers can learn a suitable fit. The F-Measure is a good metric when using imbalanced datasets and provides a better indication of classifier performance than sensitivity, specificity and AUC. As can be seen the F-Measure for the FLDA and RF are low with slightly better results produced by the DL (only slight better than chance). It is clear that the models are capable of classifying control records but not case records. This is because there are 506 controls and only 46 cases from which the classifiers can learn, which is significantly lower. The AUC values are relatively low for the FLDA with slightly higher values for the RF and higher values again for the DL. Yet, the Sensitivities, which are considered more important in this study, are all low. Table 3 shows the error estimate for fivefold cross-validation using both 1 and 30 repetitions.
The errors are consistent with the expected MSE base-rate of 8.3% (46 pathological/552 FHR records) with the exception of the DL which produced an MSE = 3%.

Model selection
The receiver operator characteristic (ROC) curve is a useful graphic that shows the cutoff values for the false and true positive rates. It is particularly useful in binary classification to illustrate classifier performance. In the current evaluation, Fig. 2 shows that the FLDA performed poorly. The RF and DL classifiers produced slightly better results, which reflect the sensitivity, specificity and AUC values in Table 2.
The primary reason for the low sensitivities (despite the AUC for the RF and DL being relatively high) is that there are insufficient case records to model the class. This is in contrast to the classification of control records that are skewed in its favour. This causes significant problems in machine learning. As such, re-sampling the classes in the absence of real pathological cases is a conventional way of addressing this problem [54].

Using all features from synthetic minority over-sampling technique data
The 46 case records are re-sampled using the SMOTE algorithm. The SMOTE algorithm generates a new dataset containing extra cases derived from the minority class while reducing the majority class samples accordingly. Figure 3 shows the separation of classes following oversampling. Compared with Fig. 1 it is clear that both case and control data are now evenly distributed between the two groups. There is significant overlap between case and controls and no two feature combinations were able to increase this decision boundary any further than that presented in Fig. 3.

Classifier performance
Using the new SMOTEd feature set (300 cases and 300 controls-empirically this distribution produced the best Sensitivity, Specificity, AUC, F-Measure and MSE results), Table 4 indicates that the Sensitivities for all models improved (90% in most cases). This is however at the expense of lower specificities (10% decreases). The results are encouraging given that accurately predicting cases is more important than predicting controls. The F-Measure acts as a support metric in this evaluation and produces encouraging results in the RF and DL classifiers. Table 5 shows a marked improvement in error rates in all classifiers except the FLDA, which has increased by 12%. In the case of the DL classifier, the results indicate a 1.7% error rate, which is significantly less than the expected MSE base-rate of 50% (300 cases/600 FHR records).

Model selection
The ROC curve in Fig. 4 illustrates that all the models have significantly improved with the exception of the FLDA where the overall performance remained more or less the same.  26 Fergus et al. BioMed Eng OnLine (2017) 16:89 The results show that adopting the SMOTE oversampling technique improves classifier performance. While oversampling data is not ideal, it is an accepted technique within many clinical studies when skewed datasets need to be normally distributed [54][55][56].
The remaining evaluations build on these results with a particular focus on dimensionality reduction.    16:89 Using RFE selected features from SMOTE data

Recursive feature extraction (RFE)
Using the RFE algorithm, each feature is assessed to determine their discriminatory capacity. Figure 5 shows the cross-validation results using various feature combinations.
The results indicate that the optimal number of features is eight as can been seen in Table 6. The eight ranked features are DFA, RMS, FPeak, Acc, SD2, SDRatio, SAMPEN, and Dec. The following evaluation determines whether this reduced feature set can improve on or maintain the previous set of results.

Classifier performance
Looking at Table 7 it can be seen that most of the classifiers perform slightly worse using the eight features in terms of sensitivity. This is with exception to the RF classifier, which can maintain similar results using the reduced feature set.
The MSE values, reported in Table 8, for all but the RF classifier (whose error more or less stayed the same) were slightly worse than in the previous evaluation.

Model selection
In this final evaluation, the ROC curve in Fig. 6 illustrates that there are no real improvements on the previous evaluation for the FLDA and DL, but that the RF performs very well with a reduced set of features.

Discussion
Obstetricians and midwives visually inspect CTG traces to monitor the wellbeing of the foetus during antenatal care. However, inter-and intra-observer variability and low positive prediction is accountable for the 3.6 million babies that die each year. This paper, presented a proof-of-concept using machine learning and FHR signals as an ambulatory   decision support to antenatal care. The results indicate that it is possible to provide high predictive capacity when separating normal vaginal deliveries and caesarean section deliveries and in many cases produce much better results than those reported in previous studies (see Table 9). Using the original unbalanced dataset the best classifier (DL classifier) achieves SE = 0%, SP = 99%, AUC = 87%, and F-Measure = 52%. While the specificity values are high, all sensitivity values are below 3%. The low Sensitivity is attributed to the disproportionate number of normal records compared with pathological records and the fact that unbalanced datasets in general cause bias in favour of the majority class. The minimum error rate MSE = 3% was achieved by the DL using 30 repetitions. This relatively small MSE appeared to be a good error rate. However, the classifiers were simply classifying by minimizing the probability of error, in the absence of sufficient evidence to help them to classify otherwise.
The SMOTE algorithm using all 13 features significantly improved the Sensitivity values for all classifiers. While oversampling is not ideal, it is a way to solve the class skew  problem that is widely used in medical data analysis [40,[61][62][63][64]. The best classification algorithm is again the DL classifier, which achieves SE = 94%, SP = 91%, AUC = 100%, F-Measure = 100% and MSE = 2%. The reason for this is that the algorithm has the ability to extract complex non-linear patterns generally observed in physiological data like FHR signals. Through the extraction of these patterns, the DL algorithm uses relatively simpler linear models for data analysis tasks, such as classification. The DL generalizes, and finds the global minima, which allows it to generate learning patterns and relationships beyond immediate neighbours in data. It is able to provide much more complex representations of data by extracting representations directly from unsupervised data without domain knowledge or inference. Using the RFE algorithm as a feature selection technique the algorithm eliminated five features from the original 13 that were considered to have very low discriminatory capacity. The remaining eight features were used to fit the models and the results show that the RF achieved the best overall results with SE = 91%, SP = 91%, AUC = 98%, F-Measure = 91% and MSE = 11%. The primary reason for these good results is that the RF algorithm is based on an ensemble of many randomized decision-trees that are used to vote on the classification outcome. They are able to handle data with a very large number of features (although the feature set in this study is not particularly large) and those features that are important for classification can be determined through the calculation of an importance score for each feature. The score metric based on voting is similar to the approach adopted in k-nearest neighbour classification and the voting mechanism to classify new data points based on the majority surrounding data points of a particular class. The DL classifier performed worse on the reduced dataset but still produces better results than several studies discussed in this paper [6,20,57,60].

Conclusions and future work
The primary aim in this paper was to evaluate a proof-of-concept approach to separating caesarean section and normal vaginal deliveries using FHR signals and machine learning. The results show that using a deep learning classifier it is possible to achieve 94% for Sensitivity, 91% for Specificity, 99% for AUC, 100% for F-score, and 1% for Mean Square Error. This shows significant improvements over the 30% positive predictive value achieved by obstetricians and midwives and warrants further investigation as a potential decision support tool for use alongside the current CTG gold standard.
Nonetheless, despite the encouraging results reported, the study needs further evaluation using truly independent data to fully assess its value. In future work this will be made possible by soliciting support for clinical trials and utilising other open datasets that have adopted a similar study design. Other important work will include regression analysis, using a larger number of classes to predict the expected pathological event, in terms of the number of hours or days to delivery, not just whether the outcome is likely to be a caesarean section or a normal vaginal delivery. We also need to integrate and use the clinical data provided with this study in future analysis tasks.
It will also be important to evaluate different parameter adjustment settings, particularly in the case of the DL algorithm to determine if the results can be further improved. Automatic feature detection will also be explored using the DL to extract features from the raw FHR signals.