Performance analysis of remote photoplethysmography deep filtering using long short-term memory neural network

Botina-Monsalve, Deivid; Benezeth, Yannick; Miteran, Johel

doi:10.1186/s12938-022-01037-z

Research
Open access
Published: 19 September 2022

Deivid Botina-Monsalve ORCID: orcid.org/0000-0003-3871-6875¹,
Yannick Benezeth¹ &
Johel Miteran¹

BioMedical Engineering OnLine volume 21, Article number: 69 (2022)

Abstract

Background

Remote photoplethysmography (rPPG) is a technique developed to estimate heart rate using standard video cameras and ambient light. Because multiple sources of noise deteriorate the quality of the signal, conventional filters such as bandpass and wavelet-based filters are commonly used. However, some signal alterations remain after conventional filtering, even though an experienced eye can easily identify them.

Results

We studied a long short-term memory (LSTM) network in the rPPG filtering task to identify these alterations using many-to-one and many-to-many approaches. We used three public databases in intra-dataset and cross-dataset scenarios, along with different protocols to analyze the performance of the method. We demonstrate that the network can be easily trained with a set of 90 signals totaling around 45 min. We also show that the LSTM performance is stable across six state-of-the-art rPPG methods.

Conclusions

This study experimentally demonstrates the superiority of the LSTM-based filter over conventional filters in an intra-dataset scenario. For example, we obtain on the VIPL database an MAE of 3.9 bpm, whereas conventional filtering improves performance on the same dataset from 10.3 bpm to 7.7 bpm. The cross-dataset approach reveals a dependence of the network on the average signal-to-noise ratio of the rPPG signals: the closer the signal-to-noise ratio values of the training and testing sets, the better the results. Moreover, it was demonstrated that a relatively small amount of data is sufficient to successfully train the network and outperform the results obtained by classical filters. More precisely, we have shown that about 45 min of rPPG signal could be sufficient to train an effective LSTM deep-filter.

Background

Electrocardiography (ECG) and photoplethysmography (PPG) are two methods used to measure physiological parameters of the body, such as heart rate (HR) and heart rate variability (HRV), with which it is possible to monitor the behavior of the heart. ECG measures the electrical field caused by heart activity, whereas PPG measures variations in light absorption in tissues due to the pulsatile nature of the cardiovascular system and the variation in blood volume [1]. Heart rate monitoring can be conducted by invasive methods such as pulmonary artery catheterization [2], or by noninvasive methods, which are classified as contact-based or non-contact-based. ECG and PPG perform contact-based HR measurements; they may cause hygiene issues or discomfort, or even be impossible to use on fragile skin. To address these disadvantages, Verkruysse et al. [3] demonstrated that PPG signals can be measured remotely with a standard video camera, using ambient light as the illumination source. This technique, known as remote photoplethysmography (rPPG), offers the advantage of measuring the same parameters as PPG in an entirely remote way. In fact, rPPG is the non-contact equivalent of the reflective mode of PPG, using ambient light as the source and a camera as the receptor. The camera captures the subtle skin color variations that blood volume changes produce in the light reflected by the skin. Several image and signal processing steps then yield a pulse signal, also called the rPPG signal.

With rPPG or PPG signals, several biomedical parameters can potentially be measured: heart rate, pulse rate variability, breathing rate, vascular occlusion, peripheral vasomotor activity, and blood pressure by pulse transit time [4, 5]. Likewise, the applications are multiple, and some examples are mixed reality [6], physiological measurements of car drivers [7], living skin segmentation [8], control of vital signs in the elderly and newborns [9], and face anti-spoofing [10] to name a few.

The first approach implemented to estimate rPPG signals was based only on the green channel [3]. Then, approaches based on blind source separation techniques were proposed, e.g., PCA, ICA, EVM, PVM [11,12,13,14,15], and others based on a light tissue interaction model to determine a projection vector, e.g., PbV, POS, and Chrom [16,17,18]. In-depth state-of-the-art reviews of these rPPG signal estimation techniques are presented in [19,20,21]. More recently, some methods have adopted the strong modeling ability of deep neural networks for physiological measurements from video sequences [22,23,24]. The main advantage of these methods is that they achieve good results without the need for the designer to analyze the problem in depth [25]. With hand-crafted methods, it is necessary to detect and track the region of interest (ROI) through the frames, combine the RGB channels, filter them, and estimate physiological parameters such as heart rate or respiration rate. In deep-learning-based measurement, on the other hand, a pipeline-based framework is no longer necessary. Consequently, deep-learning-based methods are less prone to error propagation in their pipeline. However, recent work has focused on heart rate measurement performance rather than understanding [25]. Consequently, the limitations of the system are not always clear. Besides, it is well known that the choice of training dataset is critical.

Usually, the rPPG signal estimated from any of these methods is noisy due to the estimation technique, illumination variations, internal noise of the digital camera, and motion. Consequently, once the rPPG signals are acquired, unnecessary information such as frequencies outside the normal physiological range of interest is removed by a filtering process. The smoothing operation is commonly performed by a bandpass filter (BP) [12, 26,27,28,29,30], sometimes by a wavelet-based filter (WV) [13, 31,32,33,34], and recently, by the Savitzky–Golay filter (SG) [35,36,37]. Although these methods do smooth the rPPG signals, they do not necessarily remove particular signal alterations, which can, however, be easily identified by experts. Figure 1 shows the conventional pipeline used in the literature. The rPPG signal extracted from the video is smoothed through a classical filter for subsequent estimation of physiological parameters. However, even after the filtering process, irregular shapes of the rPPG signal are observed (see signal parts in green boxes in Fig. 1). The remaining alterations in the rPPG signal may affect the accuracy of heart rate measurements and, more seriously, prevent further advanced analysis of rPPG signals based on peak detection and pulse shape characteristics of the temporal signal. For example, the authors in [38] measured HRV by estimating the time elapsed between consecutive peaks of an rPPG signal to estimate emotional states. In this particular application, a noisy rPPG signal with false peaks would lead to erroneous measurement of HRV and, consequently, a misinterpretation of the emotional state. Therefore, there is a need to improve the accuracy of heart rate measurement and the rPPG signal quality.
In this work, we want to benefit from the advantages of deep-learning-based methods, specifically by carrying out an in-depth study on the use of recurrent neural networks to improve the filtering of rPPG signals, proposing different protocols that allow a better understanding of these networks in the filtering application.
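As an illustration of the conventional bandpass step described above, the following is a minimal sketch, not the filter used in the cited works. It assumes an ideal frequency-domain mask and a pass band of 0.7 to 3.5 Hz (42 to 210 bpm); the function name `dft_bandpass` and both cut-off values are hypothetical choices, and the naive DFT is only practical for short signals.

```python
import cmath
import math


def dft_bandpass(signal, fs, low_hz=0.7, high_hz=3.5):
    """Ideal bandpass: zero every DFT bin outside [low_hz, high_hz].

    The cut-offs are illustrative assumptions, not values from the paper.
    """
    n = len(signal)
    # Forward DFT (naive O(n^2) implementation, standard library only).
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # Bins k and n - k carry the same physical frequency.
        freq = min(k, n - k) * fs / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0j
    # Inverse DFT; the mask is symmetric, so the result is real.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


# Usage: a 1.2 Hz pulse (72 bpm) with an offset and a slow 0.1 Hz drift,
# sampled for 10 s at 25 fps.
fs = 25
pulse = [math.sin(2 * math.pi * 1.2 * t / fs) for t in range(250)]
raw = [p + 0.5 + 0.8 * math.sin(2 * math.pi * 0.1 * t / fs) for t, p in enumerate(pulse)]
filtered = dft_bandpass(raw, fs)
```

The offset and drift are removed because they lie outside the pass band. Any alteration whose energy lies inside the pass band passes through such a filter unchanged, which is the limitation discussed above.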

Recurrent Neural Networks (RNNs) are commonly used in applications where the structure embedded in the data sequences carries valuable information, much like an expert would identify the wrong peaks in an rPPG signal. However, these networks suffer from the well-known vanishing/exploding gradients problem during backpropagation, and because of this, RNNs cannot store information for a very long time. The long short-term memory (LSTM), on the other hand, is a network with memory blocks in its recurrent connections, which retain information for longer periods and avoid the problem present in RNNs [23, 39]. LSTM networks have been used successfully in the literature to process medical signals such as ECG, proving the potential of this type of network [40,41,42]. In [43], the authors propose a method of PPG denoising based on a bidirectional recurrent denoising auto-encoder (BRDAE) to retain the recurrent information in the PPG signal. The network training and testing are performed on an artificial noise-augmented PPG database, along with additional PPG signals acquired from subjects during their daily routine. Slapnicar et al. [44] propose an LSTM-based method to enhance reconstructed rPPG signals obtained by the POS method. For this purpose, they apply a bandpass filter and a two-step wavelet filter before a 2-layer LSTM network that follows a many-to-one sequence-to-sequence approach. Although the results are satisfactory, the method requires two classical filtering stages before the LSTM network, when a single LSTM filtering stage could be sufficient. To the best of our knowledge, the only approach in the literature that filters rPPG signals directly with an LSTM-based filter is the one proposed in [45], where the authors proposed a two-layer LSTM model to filter the rPPG signals in the MMSE-HR database.
The authors affirm that a deep neural network requires thousands of training samples, and they address this problem by training the network on synthetic signals based on sine signals with random Gaussian noise and then fine-tuning the network on the database. However, although the proposed synthetic signals manage to simulate rPPG signals with frequencies corresponding to the heart rate, they do not contain the characteristic shape of signals from a real acquisition scenario, preventing the network from learning the subtle details of the signal, such as preserving the dicrotic notch shape or suppressing double peaks as described in [46]. Table 1 shows the publications mentioned above organized according to the type of signal and the method used for their respective objective.

Table 1 Publications related to classical and machine learning methods for filtering ECG, PPG, and rPPG signals

In summary, these previous works filter PPG or rPPG signals with different approaches, and their authors agree that it is necessary to obtain these signals with the best possible quality. Although these works already represent considerable progress in extracting quality rPPG signals, there is still room for improvement. In addition, despite the excellent performance of biomedical signal filtering techniques based on machine learning (e.g., [43, 47]), they are still rarely used compared to more classical techniques (such as bandpass filtering). We believe that this is due to the lack of studies investigating the advantages, limitations, and sensitivity of these methods. There is no in-depth study on the specific aspects of neural-network-based filtering, namely the influence of the amount of training data, the influence of the input signal quality, and the influence of the dataset used for training and testing. To our knowledge, our paper is the first to present such a study on biomedical signals and, in particular, on rPPG signals.

Indeed, in deep-learning-based applications, the amount of data used during the training of a network plays a fundamental role in its performance. For example, in computer vision applications, generalization tends to improve with the size of the training sets [48]. However, there is no definitive answer to whether the amount of data during training improves all deep-learning applications. For this reason, it is critical to define in each deep-learning-based application if there is a dependency on the amount of data used during the training. In this way, it is possible to better understand the limitations and sensitivities of these networks.

In [47], we proposed the LSTM-DF filter to be used on rPPG signals estimated by the PVM method in the MMSE database [49], where we treated the filtering of these signals as a sequence-to-sequence regression problem: we fed multiple inputs (samples) to the LSTM-DF filter to generate multiple outputs. This sequence-to-sequence approach is known as many-to-many (LSTM-MTM). The other sequence-to-sequence approach is the many-to-one (LSTM-MTO), where multiple inputs give a single output, like the one used in [44]. In this article, we analyze the performance of LSTM recurrent networks in rPPG signal filtering. Using the LSTM deep-filter, we develop an in-depth study of the network performance, adding experiments and approaches to determine the limitations and sensitivities of using these networks, allowing us to understand the best configuration to train an LSTM-based model to filter rPPG signals.
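The difference between the two sequence-to-sequence approaches can be illustrated by how training pairs are built from a noisy rPPG signal and its ground truth. The sketch below is only an assumption about the data layout: the function name `make_windows`, the window length, the step, and the choice of the last sample as the MTO target are hypothetical and not taken from the paper.

```python
def make_windows(noisy, clean, window, step, mode):
    """Build (input, target) pairs for a sequence-to-sequence filter.

    mode="mtm": the target is the whole clean window (many-to-many).
    mode="mto": the target is a single clean sample (many-to-one); using
    the last sample of the window is an assumption made for illustration.
    """
    if len(noisy) != len(clean):
        raise ValueError("noisy and clean signals must have the same length")
    pairs = []
    for start in range(0, len(noisy) - window + 1, step):
        stop = start + window
        x = noisy[start:stop]
        if mode == "mtm":
            y = clean[start:stop]
        elif mode == "mto":
            y = [clean[stop - 1]]
        else:
            raise ValueError("mode must be 'mtm' or 'mto'")
        pairs.append((x, y))
    return pairs


# Usage with toy data: ten samples, windows of four samples, step of two.
noisy = list(range(10))
clean = [10 * v for v in noisy]
mtm_pairs = make_windows(noisy, clean, window=4, step=2, mode="mtm")
mto_pairs = make_windows(noisy, clean, window=4, step=2, mode="mto")
```

In the many-to-many layout every window yields as many target samples as input samples, whereas in the many-to-one layout the same window yields one target sample.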

To the best of our knowledge, this is the first work where the performance of LSTM recurrent networks for rPPG signal filtering is studied. The main contributions of this work are as follows. (1) Using three public-domain databases in intra-dataset and cross-dataset scenarios, we experimentally demonstrate the advantages of using LSTM networks in rPPG signal filtering. (2) We present a comparison between two sequence-to-sequence LSTM-based filters, namely the MTO and MTM approaches, where MTM proved to be the most successful method. (3) We compare the use of the LSTM-based filter with six state-of-the-art rPPG signal estimation methods: PVM, POS, PbV, G-R, Chrom, and Green, where we observed high stability of the LSTM-based filter with respect to input data obtained with different methods. (4) Interestingly, we also show that a relatively small dataset can be enough to train the network and that there is a significant dependence on the average signal-to-noise ratio of the training signals.

The remainder of the paper is organized as follows: in “Results” and “Discussion” sections, we present the outcome of the proposed experiments and a discussion. After “Conclusions” section, in “Methods” section, we explain the three public databases used, the protocols proposed to understand the limitations and sensitivities of the LSTM-based network, and the two approaches of the LSTM-based filter, namely MTO and MTM. Finally, we show the evaluation metrics and the implementation details of the network used.

Results

In this section, we present the results of a series of experiments where we compare three classical filters (bandpass, wavelet, and Savitzky–Golay) with two LSTM-based deep-filter approaches, MTO and MTM. We use fivefold subject-independent cross-validation on the MMSE-HR, VIPL-HR, and COHFACE databases. To evaluate the filtering methods rigorously, we use the mean absolute error (MAE) and the Pearson correlation coefficient (r) to describe the quality of the heart rate estimations, and the signal-to-noise ratio (SNR) and the template match correlation (TMC) to describe the signal quality. Metrics are calculated with a 15-s sliding window with a 0.5-s step. SNR, TMC, and r are to be maximized, while MAE has to be minimized. For ease of data visualization, the MAE metric is presented in the figures with the vertical axis inverted, so that the best performances for the four metrics are always at the top of the graphs. Best results in the intra-dataset and cross-dataset experiments are presented in bold red. Metrics, databases, and protocols are explained in detail in “Methods” section.
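The windowing and the two heart rate metrics mentioned above can be sketched as follows. This is an illustrative sketch using the standard definitions of MAE and the Pearson coefficient, not the authors' evaluation code; the function names and the toy heart rate values are hypothetical, and SNR and TMC are omitted because their definitions are given in the “Methods” section.

```python
import math


def sliding_windows(n_samples, fs, window_s=15.0, step_s=0.5):
    """Return (start, stop) sample indices of each analysis window."""
    win = int(round(window_s * fs))
    bounds = []
    i = 0
    while True:
        # Each start is recomputed from the time in seconds so that a
        # fractional step (e.g. 12.5 samples at 25 fps) does not drift.
        start = int(round(i * step_s * fs))
        if start + win > n_samples:
            break
        bounds.append((start, start + win))
        i += 1
    return bounds


def mean_absolute_error(estimated, reference):
    """MAE between estimated and reference heart rates (same unit, e.g. bpm)."""
    return sum(abs(e - r) for e, r in zip(estimated, reference)) / len(estimated)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Usage: a 20 s signal at 30 fps, and toy per-window heart rates in bpm.
windows = sliding_windows(n_samples=600, fs=30)
hr_est = [70.0, 72.0, 75.0, 80.0]
hr_ref = [71.0, 72.0, 73.0, 84.0]
mae = mean_absolute_error(hr_est, hr_ref)
r = pearson_r(hr_est, hr_ref)
```

In practice, one heart rate estimate would be computed per window for both the filtered rPPG signal and the ground truth, and the two resulting sequences would be compared with these functions.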

Intra-dataset

Figure 2 presents the results associated with the heart rate measurement metrics in the MMSE-HR, VIPL-HR, and COHFACE databases. The MAE values are on the left and r on the right; the upper part shows the results of the k-fold cross-validation and the lower part the final evaluation. Figure 3 depicts the metrics related to the signal quality, with the SNR coefficient on the left and the TMC coefficient on the right.

Table 2 contains the rPPG-SNR average of the signals acquired in each method, where the same 2,256 subjects have a total rPPG signal duration of 1,114.5 min for the six sets. As the second part of this experiment, six state-of-the-art methods for rPPG signal estimation: PVM, POS, PbV, G-R, Chrom, and Green, were chosen to be applied to the VIPL-HR images. Figure 4 compares the six state-of-the-art rPPG signal acquisition methods after using the MTM filter. Tables 3 and 4 present the cross-validation and final evaluation results, respectively, for the six methods (best results in bold).

Table 2 RPPG-SNR average in the VIPL-HR signals acquired by the methods: PVM, POS, PbV, G-R, Chrom and Green

Table 3 Intra-dataset, cross-validation results for rPPG signals estimated from VIPL-HR dataset by PVM, POS, PbV, G-R (GR), Chrom and Green methods

Table 4 Intra-dataset, final evaluation results for rPPG signals estimated from VIPL-HR dataset by PVM, POS, PbV, G-R (GR), Chrom and Green methods

Cross-dataset

Figure 5 shows the results of the metrics related to the estimation of the heart rate and signal quality during the cross-dataset experiment for the MMSE-HR, VIPL-HR, and COHFACE databases. The first row contains the MAE values, the second r, the third the SNR coefficient, and the fourth the TMC coefficient.

Amount of training data

Table 5 contains the parameters of the rPPG signals used in this experiment: the average rPPG-SNR, the total duration of the signals in minutes, and the number of signals. Note that the average rPPG-SNR is similar for the seven sub-sets; this ensures that the results obtained in this experiment are not affected by the quality of the signals but only by the quantity. Figure 6 depicts the results of this experiment. On the left side of the figure are the metrics MAE and r, and on the right side, SNR along with TMC. MAE, SNR, and TMC are reported as averages with their confidence intervals.

Table 5 Parameters of the rPPG signals in amount of training data

rPPG-SNR dependence

The characteristics of the rPPG signals used in this experiment are presented in Table 6: the average SNR of the rPPG signals, the total duration, and the number of signals. Note that the duration and number of signals for the VIPLA sets are balanced, while the largest variation is found in rPPG-SNR due to the quality of the videos and the rPPG estimation method. Figure 7 presents the metrics MAE and r on the left and SNR with TMC on the right for the first part of this experiment, where we trained the LSTM network in VIPLAQi, i=[0,1,2,3,4] and tested in VIPLB.

Table 6 Characteristics of the rPPG signals in rPPG-SNR dependence protocol

As the second part of this experiment, the LSTM network is trained in VIPLB to be tested in VIPLAQi, i=[0,1,2,3,4]. Figure 8 depicts the results given by this second part of the current experiment. The first row has the MAE metric, the second one r, the third one SNR, and the final one TMC.

In the following section, we will discuss the results presented.

Discussion

Based on the tables and figures presented in “Results” section, this section analyzes in detail the behavior of the LSTM network in each of the proposed experiments.

In the intra-dataset experiment, Figs. 2 and 3 show that MTO and MTM present good results in the MMSE-HR database compared with classical filters. The BP and WV filters have results very close to those obtained by the LSTM-based filter approaches in MAE and r. We believe this is due to the good quality of the videos and scenarios within the database, which is reflected in the minimal difference in MAE and r between the raw rPPG signal and all the filtering methods analyzed. In the VIPL-HR database, the MTO and MTM approaches always have the best results, with MTM being the better of the two. The good performance of the LSTM-based filter in this particular dataset is due to the large number of signals available, which allows the network to learn from a greater variety of signals than in the MMSE-HR and COHFACE databases. In the COHFACE dataset, the MTM approach is the LSTM-based filter that gives the best results, with the only exception of r in the final evaluation, where the value $r=0.19$ is very close to the best result $r=0.22$. The poor MAE and r values of the raw rPPG signal can be associated with the compression of the videos in this specific database. It is important to note that for the three databases, the best SNR and TMC values are those obtained by the two approaches of the LSTM network. However, MTM increases the quality of the signals to a greater extent.

In Tables 3 and 4, it can be seen that the LSTM-based filter always gives the best results compared with the filtering methods used in the literature. In MMSE, VIPL, and COHFACE databases, SNR and TMC metrics show significant improvements in signal quality when using the MTM approach. On the other hand, the MTO approach is only slightly higher than MTM in the MAE and r metrics. Finally, Fig. 4 shows that the filter is stable and able to work with rPPG signals from different algorithms. As a conclusion of this experiment, we could say that in an intra-dataset scenario, both sequence-to-sequence approaches outperform the results provided by the filters found in the literature; however, MTM turns out to be better than MTO, especially when it comes to improving the quality of the signal.

The cross-dataset protocol is particularly interesting because, based on Fig. 5, there is a dependence between the signals used during training and testing. This behavior may be because each dataset has rPPG signals of a specific quality. For example, the MMSE database has an rPPG-SNR average of $7.65\pm 4$ dB, indicating that the signals are mostly of good quality. VIPL-HR has a larger amount of data, and its rPPG-SNR average is $1.04\pm 4$ dB. COHFACE, on the other hand, has an rPPG-SNR average of $-0.96\pm 4$ dB. This signal quality difference between the three datasets is also visible in Fig. 9. Therefore, if there is a dependence between the quality of the signals used for training and testing: first, training the network on high-quality signals (MMSE-HR) and testing on low-quality signals (COHFACE) or vice versa should give poor performance, and second, training on a broad spectrum of good and poor quality signals (VIPL-HR) should give good performance.

Analyzing the metrics in Fig. 5, we notice that when training in VIPL-HR and testing in MMSE-HR, LSTM-MTO performs better in r, SNR, and TMC than other filters, but in MAE, it is the second-best value after WV. LSTM-MTM improves the signal quality metrics but fails to outperform the other methods in the heart rate measurement. When training in VIPL-HR and testing in COHFACE, the LSTM-based filter outperforms the other filters, and the MTM approach is the better of the two proposed. Training in MMSE-HR and testing in VIPL-HR allows the LSTM-MTM filter to present the best result in MAE, a value very close to the best values given by BP and WV in r, and the best results in SNR and TMC. LSTM-MTO, on the other hand, gives values close to the best values in MAE and r but decreases the values of SNR and TMC. Continuing with the training in MMSE-HR but testing in COHFACE, we can see that LSTM-MTO has the best results for the MAE and r metrics, but some values are lower in SNR and TMC. LSTM-MTM, on the other hand, behaves in the opposite way: it improves on the other filters in SNR and TMC, and although it improves the result of r, it does not manage to surpass the BP and WV filters in MAE. Therefore, neither of the two proposed methods is sufficiently robust to outperform the classical filters on all four metrics.

LSTM-DF-MTM proves to be more sensitive to the noise of the training signals; this is visible when we test in VIPL and MMSE after training in COHFACE. LSTM-DF-MTO, on the other hand, is more stable under the same conditions.

Therefore, the cross-dataset scenario shows that for the LSTM-based filter, there is a dependence on training and test data related to the quality of rPPG signals, reflected in the rPPG-SNR value of the databases. We corroborated this hypothesis in the rPPG-SNR experiment.

Regarding the amount of training data experiment (Fig. 6), the many-to-one approach shows that when the amount of data decreases from 100% to 50% and 25%, MAE and r remain stable; when it decreases to 10% and 5%, MAE is stable but the value of r starts to decrease; finally, with 1% of the data, both MAE and r worsen. Concerning signal quality metrics, the behavior is similar; MTO still has higher values than the classic filters in all configurations, although with 5% and 1% of the data, SNR and TMC show lower values than those obtained between 100% and 10% of the data.

In HR measurement, the MTM filter shows a higher sensitivity to the amount of training data than the MTO. Even though the MAE value remains relatively constant, r starts to decrease at percentages of 25% and below. In the case of signal quality metrics, the many-to-many approach shows a higher performance than many-to-one regardless of the amount of training data, and it is always higher than the conventional filters.

Both sequence-to-sequence approaches can be trained with a set similar to the VIPL005 (45 min, 90 signals) and start to obtain better results than conventional filters in a test set with similar rPPG-SNR values.

In the HR measurement metrics presented in the rPPG-SNR dependence experiment (Fig. 7), the MTO filter does present a dependence on the rPPG-SNR during training: MAE values improve from VIPLAQ0 to VIPLAQ3, and the results decrease only a little in VIPLAQ4. The r metric presents a similar behavior, with better results in the training sets with low values of rPPG-SNR than in the sets with higher ones. In terms of signal quality, the variation in both SNR and TMC results is small, but the lower values agree with the cleaner VIPLAQ0 and the noisier VIPLAQ4.

In the MTM filter, the dependence is more evident. Regarding HR measurement, MTM is more sensitive than the MTO approach to training on signals with the lowest and highest indices of rPPG-SNR (VIPLAQ4 and VIPLAQ0, respectively). The highest performances are obtained with VIPLAQ2 and VIPLAQ3 for MAE and r. In signal quality, SNR and TMC only seem to be sensitive to the training set VIPLAQ4 because for the other four training sets, their values are constant and even higher than those presented by the MTO approach. As a conclusion of the first part of this experiment, we can say that using the VIPLB test set, whose rPPG-SNR average is $1.09\pm 4.28$ dB, the two best training sets are VIPLAQ2 and VIPLAQ3, which are also the closest in rPPG signal quality, with rPPG-SNR averages of $1.89\pm 6.23$ dB and $0.73\pm 5.56$ dB, respectively. Therefore, there is a dependence on the rPPG-SNR average in the training and testing sets; specifically, the rPPG-SNR in the training set must be similar to that of the test set to obtain good results.

In the second part of the rPPG-SNR dependence experiment, Fig. 8 shows that for both MTO and MTM, results are best on the test set VIPLAQ0 and degrade progressively toward the test set VIPLAQ4. The only exception is the TMC given by MTM, whose values are independent of the testing set. For the rest of the metrics, all values with their standard deviation get worse when the testing set gets noisier. In HR measurement, the performance of MTO decreases more abruptly than that of MTM. On the other hand, interestingly, if the LSTM training can be ensured in a data set with a balance of high and low average rPPG-SNR signals, there will be an improvement in the heart rate measurement and in the quality of the rPPG signals compared to classical filters.

Conclusions

In this article, we analyzed the performance of an LSTM network for rPPG filtering using multiple protocols. Two sequence-to-sequence filter approaches were evaluated: many-to-one and many-to-many. We used three public databases in different experiments from which we can draw the following conclusions. The amount of training data experiment showed that a relatively low number of signals is needed to train the LSTM network efficiently. It was shown that even a dataset of approximately 90 signals totaling 45 min could be sufficient as a training set. In an intra-dataset scenario, both MTO and MTM outperform the conventional filters, but it is the MTM approach that gives the best results.

We presented a comparison between the six state-of-the-art methods for rPPG estimation: PVM, POS, PbV, G-R, Chrom, and Green after using the LSTM-based filter. The results suggest stable values better than conventional filters, where using the LSTM-based filter on rPPG signals acquired by the PbV method gave the best performance.

The cross-dataset and rPPG-SNR dependence protocols are perhaps the most interesting experiments as they reflect a clear dependence of the LSTM-based filter on the rPPG-SNR average of the training and testing sets. Concerning this aspect, experiments showed that the rPPG-SNR average of the training set should be as close as possible to that of the test set. This ensures the LSTM-based filter outperforms classical filters even in a cross-dataset scenario.

The experiments show that an LSTM-based filter is a better alternative than the classical filters, improving not only the heart rate measurement but also the quality of the signal. Finally, as future work, we are using the LSTM-based filter for rPPG signals acquired in the near-infrared (NIR) spectrum to improve the quality of the signals, and we will evaluate the benefits of our LSTM-DF for improving the performance of methods based on a precise analysis of rPPG signals, such as the analysis of HRV for the estimation of emotional states.

Methods

Data

We used the following three public databases in our experiments.

MMSE-HR

The Multimodal Spontaneous Emotion Corpus – Heart Rate database (MMSE-HR) [49] includes videos with many facial movements and expressions. The database has been built for further investigation in emotion recognition. Forty subjects are recorded performing 102 tasks. The resolution of each video is 1040 x 1392 pixels with a frame rate of 25 fps, and the BIOPAC 150 data acquisition system was used to obtain the blood pressure ground truth at 1 kHz. The duration of each video is between 30 and 60 s.

VIPL-HR

The research in this paper uses the VIPL-HR database collected by the Institute of Computing Technology, Chinese Academy of Sciences [22, 50]. The database contains 107 subjects recorded by three different instruments in nine scenarios: stable, motion, talking, dark, bright, long distance, exercise, phone stable, and phone motion. Although this database contains 752 near-infrared videos, we only consider the 2378 visible light videos. The resolutions of the videos are between 960 x 720 and 1920 x 1080, at 25 and 30 fps, respectively. The ground truth photoplethysmography signals were recorded using the CONTEC CMS60C BVP sensor at 60 Hz.

COHFACE

In the COHFACE database [51], four 1-min videos with different luminance were acquired from 40 subjects, generating 160 videos with a resolution of 640 x 480 pixels and a frame rate of 20 Hz. The BVP ground truth signals of photoplethysmography were acquired at 256 Hz.

Ground truth selection

In the databases used, some ground truth signals present anomalies due to the movement of the subjects or failures in the acquisition devices. These inconsistencies (gaps and false peaks) usually happen at the beginning and end of the acquisitions and less frequently during the measurement. However, due to the nature of LSTM-based networks, it is not advisable to perform the training procedure with ground truth signals that present these types of problems. Therefore, a ground truth selection step is necessary. For this purpose, ground truth signals are checked individually to keep only the continuous segment with a reliable ground truth morphology. In most signals, it was sufficient to remove the first and last seconds, while in a small group a considerable part of the signal was removed. To maintain the reproducibility of the studies carried out in this article, supplementary files with the information of the signals that were cropped in each database are attached to this document (Additional files 1, 2, 3). Once only the reliable ground truth signals are selected, the parameters of the rPPG signals acquired by the PVM method (PVM-rPPG) [13] from the three databases (MMSE-HR, VIPL-HR, and COHFACE) are those listed in Table 7. rPPG-SNR is the average of the SNR coefficient of rPPG signals. Figure 9 depicts examples of rPPG signals from the three databases used. Note how the quality of the acquired rPPG signals varies for each database. This is due to the nature of the scenarios where the videos were acquired for each database, i.e., luminance, movement, and other factors present during the acquisition.

Table 7 Parameters of the PVM-rPPG signals presented in the MMSE-HR, VIPL-HR, and COHFACE datasets

It is essential to mention that the VIPL-HR database was designed to estimate rPPG signals and contains a large amount of data recorded in various scenarios. Due to this, its average rPPG-SNR is 1.07 dB. MMSE-HR and COHFACE, on the other hand, have a considerably smaller amount of data and differ from each other: MMSE-HR is the database with the best quality in its rPPG signals, with an average rPPG-SNR of 7.65 dB, and COHFACE, with an average rPPG-SNR of -0.96 dB, is the most challenging database. These negative values are perhaps due to the level of compression of the videos or the luminance in the scenarios.

Protocols

In the following, we will describe the experiments carried out to validate the use of an LSTM network for filtering rPPG signals. We conducted a classical intra-database evaluation followed by a cross-dataset evaluation. Additionally, we present experiments to study in more detail the advantages and limitations of using an LSTM-based network for rPPG filtering. First, we demonstrate that it is not necessary to use a large amount of data to train the network successfully and that there is a clear dependence of the rPPG-SNR average in the rPPG signals during training. It is important to mention that there is no overlap between training and testing data in any of the experiments presented.
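One way to guarantee that training and testing data never overlap in a fivefold subject-independent cross-validation is to assign whole subjects, not individual signals, to folds. The following is a minimal sketch under that assumption; the function name `subject_kfold`, the round-robin assignment, and the subject identifiers are hypothetical and do not describe the authors' actual partition.

```python
def subject_kfold(subject_ids, k=5):
    """Split signal indices into k folds so that no subject is shared
    between the training and testing part of any fold.

    subject_ids[i] is the subject that signal i belongs to. Subjects are
    assigned to folds round-robin in sorted order (an illustrative choice).
    """
    unique_subjects = sorted(set(subject_ids))
    folds = []
    for fold in range(k):
        test_subjects = set(unique_subjects[fold::k])
        test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
        train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
        folds.append((train_idx, test_idx))
    return folds


# Usage: ten hypothetical subjects with two signals each.
subject_ids = ["s%d" % (i // 2) for i in range(20)]
folds = subject_kfold(subject_ids, k=5)
```

Because the split is made on subjects, two signals recorded from the same person always end up on the same side of a fold.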

Figure 10 shows how the databases were partitioned to study the intra- and cross-dataset evaluations. Panel A lists the databases used: COHFACE, MMSE-HR, and VIPL-HR. Panel B contains the rPPG signals acquired by PVM [13], POS [17], PbV [16], G-R [17], Chrom [18], and Green [3]. Panel C presents the intra-dataset and cross-dataset evaluations with metrics measurement. Figure 11 takes up the PVM-VIPL signals generated in Fig. 10 to study the influence of the amount of data and of the noise in the rPPG signals during the training of the LSTM network. The protocol related to the amount of training data is later called amount of training data (top panel), and the protocol related to noise is called rPPG-SNR dependence (bottom panel).