Experimental data description
The HS data used in this paper comprise three categories: HFrEF, HFpEF and normal. The HS signals of HF patients were acquired at University-Town Hospital of Chongqing Medical University using the HS acquisition system (Patent No.: CN2013093000306700) at a sampling frequency of 11,025 Hz. HF samples were collected from 42 HFrEF and 66 HFpEF patients, and all HFrEF and HFpEF diagnoses were confirmed by cardiologists. All patients signed informed consent forms before participating in this study, and the study was approved by the Ethical Commission of Chongqing University. The normal HS recordings were obtained from the PhysioNet/Computing in Cardiology Challenge 2016, which comprises nine databases from different research groups; all recordings in the dataset were resampled to 2000 Hz. The dataset includes 2435 normal HS recordings collected from 1297 healthy subjects; details can be found in [36, 37]. In this paper, 1286 recordings were randomly selected as the normal group.
Signal preprocessing
HS preprocessing is essential for achieving good identification performance. In this study, preprocessing consists of three steps, described as follows.
Resampling
In general, HS mainly comprises two components: the first HS (S1) and the second HS (S2). S1 is a transient low-frequency acoustic signal, mainly between 10 and 200 Hz, produced by the vibrations of the heart chambers, heart valves and blood during systole. S2 is produced at the end of systole, following the closure of the semilunar (aortic and pulmonary) valves [27, 38]. S2 has a higher pitch than S1, with a frequency range between 20 and 250 Hz [39]. Since the original sampling frequency would incur a high computational cost, all recordings were down-sampled to 600 Hz, in accordance with the Nyquist sampling theorem.
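As a minimal sketch of this step (assuming a recording stored as a numpy array; the polyphase resampling routine is our choice, not necessarily the authors'), the down-sampling can be performed as follows:

```python
import numpy as np
from scipy.signal import resample_poly

fs_original = 11025   # acquisition rate of the HF recordings (2000 Hz for the PhysioNet data)
fs_target = 600       # common rate used for all recordings

recording = np.random.randn(10 * fs_original)                # placeholder 10 s signal
downsampled = resample_poly(recording, fs_target, fs_original)
# resample_poly applies an anti-aliasing filter, so the 600 Hz output still
# covers the 10-250 Hz band of S1 and S2 required by the Nyquist criterion.
```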
S1 marking and segmentation
In order to standardize the input length for the model, the following strategy was used to obtain HS frames. Two main steps are involved: marking the S1 onset and segmenting the HS with a fixed frame length.
Marking S1 onset
Locating the boundaries of HS components is the critical operation in segmentation. A cardiac period contains four states, namely S1, systole, S2 and diastole. Since S1 marks the start of a cardiac cycle, the S1 onset is taken as the boundary between frames.
In this paper, a logistic regression-based hidden semi-Markov model (LR-HSMM) is used to localize the onset of S1. The LR-HSMM, developed by Springer et al. [40] and verified by Liu et al. [36], is widely regarded as the state-of-the-art method for HS segmentation and for marking the onset of cycles, and it is highly robust when processing noisy recordings. To preserve more details of the HS, signal denoising was skipped in this study. Owing to the advantages of the LR-HSMM, the onset of S1 can be located accurately, as shown by the dotted line in Fig. 2.
Segmenting HS with a fixed frame length
The mechanical activity of the heart is captured within one cardiac period [41]. Moreover, the interval features may vary between cycles. In view of these two factors, period-synchronous segmentation with a fixed frame length was applied in this study. The duration of a cardiac cycle is about 0.6–0.8 s, so the frame length was fixed at 1.6 s, which covers approximately two cardiac cycles. As depicted in Fig. 7a, frames were segmented at intervals of one cardiac cycle. Because the frame length exceeds a single period, adjacent frames inherently overlap, as exemplified in Fig. 7b. A total of 23,120 HS frames were segmented: 7670 HFrEF, 7710 HFpEF and 7740 normal.
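A minimal framing sketch under these settings (600 Hz sampling, 1.6 s frames, one frame started at every S1 onset) is given below; the onset indices are assumed to come from the LR-HSMM step, and the function name is illustrative.

```python
import numpy as np

def frame_heart_sound(signal, s1_onsets, fs=600, frame_len_s=1.6):
    """Cut one fixed-length frame starting at each detected S1 onset."""
    frame_len = int(frame_len_s * fs)                 # 960 samples per frame
    frames = [signal[onset:onset + frame_len]
              for onset in s1_onsets
              if onset + frame_len <= len(signal)]    # drop incomplete tail frames
    return np.array(frames)                           # shape: (n_frames, 960)
```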
Normalization
Normalization is necessary to eliminate differences in HS amplitude caused by differing acquisition locations and individual variation among subjects [15, 16]. All frames used in this paper were normalized by the following formula:
$$X = \frac{x - x_{\min}}{x_{\max} - x_{\min}}.$$
(1)
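In code, Eq. (1) amounts to a per-frame min–max scaling to [0, 1] (a straightforward sketch; the function name is ours):

```python
import numpy as np

def min_max_normalize(frame):
    """Rescale a single HS frame to the range [0, 1] as in Eq. (1)."""
    return (frame - frame.min()) / (frame.max() - frame.min())
```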
RNN-based structures
RNN models, including LSTM and GRU, were used in this work to learn deep features from HS. In this part, the RNN, LSTM and GRU are described as follows.
RNN
Generally, neural networks assume that inputs and outputs are independent of each other, whereas in reality outputs often depend on previous inputs. Unlike other deep learning models, the RNN is a network with memory that can be used to process time-sequence data. The hidden-layer state \(h^{(t)}\) depends on both the previous hidden output \(h^{(t - 1)}\) and the current input \(x^{(t)}\). It can be expressed as:
$$h^{(t)} = f(Ux^{(t)} + Wh^{(t-1)} + b),$$
(2)
where \(U\), \(W\) and \(b\) represent the input weights, hidden-unit weights and bias, respectively. In theory, RNNs can mine information from arbitrarily long sequences, but in practice they are limited to just a few steps. For engineering applications, the improved RNN variants LSTM and GRU are widely used.
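As a small illustration of Eq. (2) (with \(\tanh\) chosen as the activation \(f\), which is an assumption, and shapes purely illustrative):

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, b):
    """One recurrent update of Eq. (2): h_t = f(U x_t + W h_{t-1} + b)."""
    return np.tanh(U @ x_t + W @ h_prev + b)

# toy dimensions: 4-dimensional input, 3-dimensional hidden state
U, W, b = np.zeros((3, 4)), np.zeros((3, 3)), np.zeros(3)
h_t = rnn_step(np.ones(4), np.zeros(3), U, W, b)
```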
LSTM
As an advanced version of the general RNN, LSTM was first proposed by Hochreiter and Schmidhuber [42] and later improved by Graves [43]. It alleviates the exploding-weight and vanishing-gradient problems that arise from recursion over long-term time correlations.
The LSTM architecture contains a cluster of cyclically connected memory cells, and each LSTM unit is equipped with an input gate, a forget gate and an output gate. These gates control how internal states are retained or discarded. The structure of an LSTM unit is shown in Fig. 8a. The equations of an LSTM cell, from inputs to outputs, are specified as follows:
$$g^{(t)} = \sigma (b_{g} + U_{g} x^{(t)} + W_{g} h^{(t - 1)} ),$$
(3)
$$f^{(t)} = \sigma (b_{f} + U_{f} x^{(t)} + W_{f} h^{(t - 1)} ),$$
(4)
$$o^{(t)} = \sigma (b_{o} + U_{o} x^{(t)} + W_{o} h^{(t - 1)} ),$$
(5)
$$s^{(t)} = f^{(t)} s^{(t - 1)} + g^{(t)} \sigma (b + Ux^{(t)} + Wh^{(t - 1)} ),$$
(6)
$$h^{(t)} = \tanh (s^{(t)} )o^{(t)} ,$$
(7)
where \(\sigma\) represents the sigmoid function, which keeps the gate values between 0 and 1, and \(g^{(t)}\), \(f^{(t)}\), \(o^{(t)}\), \(s^{(t)}\) denote the external input gate, forget gate, output gate and cell state, respectively. \(b\), \(U\) and \(W\) denote the biases, input weights and recurrent weights, respectively.
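For clarity, Eqs. (3)–(7) can be transcribed directly into a single-step update (a numpy sketch following the paper's notation; parameter names and shapes are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, s_prev, p):
    """One LSTM step; p is a dict of the weights/biases used in Eqs. (3)-(7)."""
    g = sigmoid(p["b_g"] + p["U_g"] @ x_t + p["W_g"] @ h_prev)   # input gate,  Eq. (3)
    f = sigmoid(p["b_f"] + p["U_f"] @ x_t + p["W_f"] @ h_prev)   # forget gate, Eq. (4)
    o = sigmoid(p["b_o"] + p["U_o"] @ x_t + p["W_o"] @ h_prev)   # output gate, Eq. (5)
    s = f * s_prev + g * sigmoid(p["b"] + p["U"] @ x_t + p["W"] @ h_prev)  # cell state, Eq. (6)
    h = np.tanh(s) * o                                           # hidden output, Eq. (7)
    return h, s
```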
After the LSTM layers, a fully connected layer with a softmax function is applied for classification. The softmax function is as follows:
$${\text{softmax}}(x_{i}) = \frac{\exp(x_{i})}{\sum\nolimits_{j} \exp(x_{j})},$$
(8)
where \(x_{i}\) is the \(i\)th output of the preceding layer.
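As a quick numerical illustration of Eq. (8) (the logit values below are arbitrary):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])                 # outputs of the previous layer
probs = np.exp(logits) / np.sum(np.exp(logits))    # ~[0.66, 0.24, 0.10], sums to 1
```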
GRU
GRU, a special variant of the LSTM network, was proposed by Cho et al. [44] in 2014. The GRU structure is a simplification of the LSTM, with only two gates and no separate memory cell. A single update gate \(z^{(t)}\), which replaces the input gate and the forget gate of the LSTM, is used to update the current output state. Furthermore, the reset gate \(r^{(t)}\) is introduced to control the direct influence of the previous hidden state on the candidate state. The update gate and reset gate are described as follows:
$$z^{(t)} = \sigma (b_{z} + U_{z} x^{(t)} + W_{z} h^{(t - 1)} ),$$
(9)
$$r^{(t)} = \sigma (b_{r} + U_{r} x^{(t)} + W_{r} h^{(t - 1)} ),$$
(10)
and the state of the hidden layer \(h^{(t)}\) is computed as below:
$$h^{(t)} = z^{(t)} h^{(t - 1)} + (1 - z^{(t)} )\tilde{h}^{(t)} ,$$
(11)
where \(\tilde{h}^{(t)} = \tanh (b_{h} + U_{h} x^{(t)} + W_{h} r^{(t)} h^{(t - 1)} )\); \(U\) and \(W\) are the weight matrices of the corresponding gates (indicated by the subscripts), and \(b\) represents the bias. Figure 8b shows the structure of the GRU unit.
As with the LSTM, the output states of the GRU are classified using a softmax function (Eq. (8)).
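Analogously to the LSTM sketch above, Eqs. (9)–(11) correspond to the following single-step update (again a numpy sketch with illustrative parameter names):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step; p is a dict of the weights/biases used in Eqs. (9)-(11)."""
    z = sigmoid(p["b_z"] + p["U_z"] @ x_t + p["W_z"] @ h_prev)      # update gate, Eq. (9)
    r = sigmoid(p["b_r"] + p["U_r"] @ x_t + p["W_r"] @ h_prev)      # reset gate,  Eq. (10)
    h_tilde = np.tanh(p["b_h"] + p["U_h"] @ x_t
                      + p["W_h"] @ (r * h_prev))                    # candidate state
    return z * h_prev + (1 - z) * h_tilde                           # new hidden state, Eq. (11)
```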
Methods compared
FCN: An FCN with a softmax output layer has been used for time-series classification [45]. The model comprises three convolutional blocks with 128, 256 and 128 filters and kernel sizes of 8, 5 and 3, respectively. Each block is followed by a batch normalization layer and a ReLU layer. A global average pooling layer is then added before the softmax layer to reduce the number of weights. The model is trained for 50 epochs with a batch size of 64 and a learning rate of 0.001.
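A tf.keras sketch of this comparison model under the stated settings is shown below; the input length of 960 samples (1.6 s × 600 Hz) and the three output classes follow the paper, while the optimizer and loss are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_fcn(input_len=960, n_classes=3):
    inputs = layers.Input(shape=(input_len, 1))
    x = inputs
    for filters, kernel in [(128, 8), (256, 5), (128, 3)]:   # three conv blocks
        x = layers.Conv1D(filters, kernel, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.GlobalAveragePooling1D()(x)                   # reduces weights before softmax
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# training as described in the text: model.fit(X_train, y_train, epochs=50, batch_size=64)
```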
SVM: A one-versus-one SVM classifier with a radial basis function kernel is adopted, and a grid search is used for parameter tuning. Following Ref. [46], we extracted multiple types of features from the HS of the HFrEF, HFpEF and normal groups. Three features with a P-value less than 0.001 in Tamhane's T2 one-way ANOVA are chosen as the feature vector for the SVM. To keep this paper compact, the hand-crafted feature selection and analysis are presented in the "Appendix" at the end of the paper.
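A scikit-learn sketch of this baseline is given below; the placeholder feature matrix stands in for the three hand-crafted features selected in the Appendix, and the grid values are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.random.randn(300, 3)               # placeholder: three selected features per frame
y = np.random.randint(0, 3, size=300)     # placeholder labels: HFrEF / HFpEF / normal

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
# sklearn's SVC already uses a one-versus-one scheme for multi-class problems.
clf = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
clf.fit(X, y)
```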
LSTM: A structure with two layers and 64 hidden units per layer is adopted. Details are given in the results.
GRU: The proposed method; the two-layer recurrent classifier is sketched below.
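The recurrent classifiers compared here can be sketched as follows in tf.keras; the 64 units per layer follow the text, while the input length, class count and output head are our assumptions, and switching the cell type toggles between the LSTM baseline and the proposed GRU.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_rnn_classifier(cell, input_len=960, n_classes=3):
    return models.Sequential([
        layers.Input(shape=(input_len, 1)),
        cell(64, return_sequences=True),   # first recurrent layer
        cell(64),                          # second recurrent layer
        layers.Dense(n_classes, activation="softmax"),
    ])

lstm_model = build_rnn_classifier(layers.LSTM)   # LSTM baseline
gru_model = build_rnn_classifier(layers.GRU)     # proposed GRU model
```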