A machine learning-based approach for constructing remote photoplethysmogram signals from video cameras

Medicine and Health

R. C. Ontiveros, M. Elgendi, et al.

Discover how Rodrigo Castellano Ontiveros, Mohamed Elgendi, and Carlo Menon have developed a cutting-edge machine learning model that enhances the accuracy of remote photoplethysmography signals extracted from videos. This breakthrough could revolutionize remote healthcare by making contact-free heart-signal monitoring as reliable as traditional methods!

Introduction
Remote photoplethysmography (rPPG) uses video cameras to detect blood volume changes non-invasively and is applied to heart rate monitoring, stress detection, sleep analysis, and hypertension assessment. However, rPPG accuracy is degraded by motion artifacts, illumination changes, and skin tone variations, which introduce noise and distort signal morphology compared with contact PPG (cPPG). The study aims to improve rPPG quality by constructing an rPPG waveform that closely resembles cPPG, enabling more accurate extraction of physiological information such as pulse rate variability and preserving morphological features (e.g., systolic/diastolic peaks and derivatives) relevant for cardiovascular assessment. The authors target the limitations of prior rPPG approaches, which often optimize for a specific parameter (e.g., HR) at the expense of waveform fidelity, leading to noisy or incomplete signals. Best practices in PPG acquisition and processing are considered to mitigate common artifacts.
Literature Review
Prior work has largely focused on estimating specific physiological parameters from rPPG rather than reconstructing the raw waveform. Studies have compared HR and HRV from rPPG to cPPG references, and others targeted blood pressure, oxygen saturation, or atrial fibrillation detection. Parameter-specific training (e.g., for HR) can sacrifice other waveform components such as the diastolic peak and derivative-based features informative for cardiovascular disease risk. One earlier study attempted rPPG restoration by correspondence with contact PPG but used a private dataset without activity distinctions and evaluated with limited similarity metrics (Pearson r, cosine similarity). The present work advances the literature by: (1) using three public datasets (PURE, LGI-PPGI, MR-NIRP) with varied activities and conditions; (2) leveraging multiple classic rPPG extraction methods (CHROM, LGI, ICA, POS) as inputs; and (3) evaluating with complementary metrics spanning time and frequency domains (DTW, Pearson r, RMSE, and |ΔHR|).
Methodology
Datasets and ethics: Three public datasets were used. (1) LGI-PPGI: Six participants (5 male, 1 female) performing Rest, Talk, Gym (bicycle ergometer), and Rotation; video via a Logitech HD C270 at 25 fps; cPPG via a CMS50E at 60 Hz; mixed lighting (Talk recorded outdoors). (2) PURE: Ten participants (8 male, 2 female) performing Steady, Talk, Slow Translation (~7% face height/s), Fast Translation (~14% face height/s), Small Rotation (~20°), and Medium Rotation (~35°); video via an eco274CVGE at 640×480 and 30 fps; cPPG via a CMS50E at 60 Hz; daylight, ~1.1 m camera distance. (3) MR-NIRP indoor: Eight participants (6 male, 2 female; skin tones: 1 Asian, 4 Indian, 3 Caucasian) performing Still and Motion (including talking and head movements); video via a FLIR Blackfly BFLY-U3-23S6C-C at 640×640 and 30 fps; cPPG via a CMS 50D+ at 60 Hz. All datasets are public with prior ethical approvals and informed consent; secondary use of de-identified data did not require additional IRB approval at ETH Zurich.
rPPG extraction: Facial regions of interest (ROIs) were detected using pyVHR with MediaPipe landmarks. Thirty 30×30-pixel ROIs were used: forehead (10 landmarks), left cheek (10), and right cheek (10). RGB averages were computed per ROI and passed to four rPPG algorithms: CHROM, LGI, POS, and ICA, chosen for their effectiveness in separating pulse-related color changes from other variations.
Preprocessing: rPPG outputs (from CHROM, LGI, POS, ICA) and cPPG were resampled to the same frame rate, detrended, and filtered with a 6th-order Butterworth bandpass (0.65–4 Hz). Low-variance rPPG signals were removed. Signals were segmented into non-overlapping 10-s windows and min–max normalized. Spatiotemporal maps underwent histogram equalization, which improved performance.
Frequency-domain analysis: Welch's method was applied per window to estimate power spectral density; the highest peak within 39–240 BPM was taken as HR. Autocorrelation-based alternatives produced minimal differences in |ΔHR|.
Model and training: Training used PURE only. Each ~1-min video was subdivided into 10-s fragments, yielding 339 samples. For each 10-s sample (30 fps), RGB sequences of 300 time steps were transformed into four input signals via POS, CHROM, LGI, and ICA. A 5-fold cross-validation with 80/20 train/test splits within PURE was performed.
Architecture: The model takes aggregated ROI signals from the forehead, left cheek, and right cheek. It consists of four LSTM+dropout blocks with decreasing units (from 90 down to 1), followed by a dense layer. The Adam optimizer with ReduceLROnPlateau was used, and the loss objectives included RMSE and Pearson correlation (r).
Evaluation metrics: Per-window comparisons to cPPG used Dynamic Time Warping (DTW) to account for timing misalignment, Pearson's r, RMSE, and the absolute HR difference |ΔHR| from Welch-based HR. Averages across windows produced per-video metrics.
Statistics: Non-parametric Friedman tests assessed differences across methods, followed by post hoc Nemenyi tests with Bonferroni correction (adjusted alpha 0.003) for pairwise comparisons.
Baselines and comparators: Outputs from LGI, CHROM, POS, ICA, a GREEN channel baseline, and GRGB were compared against the proposed model across datasets and activities.
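To ground the extraction stage, the following is a minimal sketch of one of the four input algorithms, POS (plane-orthogonal-to-skin), applied to per-frame ROI color means. The ~1.6 s window and the two projection axes follow the published POS formulation; the function name, epsilon guards, and overlap-add details are illustrative, not taken from the authors' code.

```python
import numpy as np

def pos_rppg(rgb, fps=30, win_sec=1.6):
    """Sketch of POS. rgb: (T, 3) array of per-frame ROI means (R, G, B)."""
    T = len(rgb)
    w = int(win_sec * fps)                            # ~1.6 s sliding window
    h = np.zeros(T)
    for t in range(T - w + 1):
        c = rgb[t:t + w].astype(float)
        cn = c / (c.mean(axis=0) + 1e-9)              # temporal normalization per channel
        s1 = cn[:, 1] - cn[:, 2]                      # projection axis 1:  G - B
        s2 = -2 * cn[:, 0] + cn[:, 1] + cn[:, 2]      # projection axis 2: -2R + G + B
        p = s1 + (s1.std() / (s2.std() + 1e-9)) * s2  # alpha-tuned combination
        h[t:t + w] += p - p.mean()                    # overlap-add into the output
    return h
```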
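The preprocessing chain (detrend, 6th-order Butterworth band-pass at 0.65–4 Hz, 10-s windowing, min–max normalization) maps directly onto SciPy primitives. A minimal sketch, assuming a 1-D signal already resampled to a common rate; zero-phase filtering via filtfilt is an assumption, as the paper does not specify the filtering direction.

```python
import numpy as np
from scipy.signal import butter, detrend, filtfilt

def preprocess(sig, fs=30.0, band=(0.65, 4.0), win_sec=10):
    """Detrend and band-pass a 1-D signal, then cut it into
    non-overlapping 10-s windows, each min-max normalized to [0, 1]."""
    sig = detrend(np.asarray(sig, dtype=float))
    b, a = butter(6, band, btype="bandpass", fs=fs)  # 6th-order Butterworth
    sig = filtfilt(b, a, sig)                        # zero-phase filtering (assumed)
    n = int(win_sec * fs)
    windows = [sig[i:i + n] for i in range(0, len(sig) - n + 1, n)]
    return [(w - w.min()) / (w.max() - w.min() + 1e-9) for w in windows]
```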
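The frequency-domain HR estimate (highest Welch PSD peak between 39 and 240 BPM) is likewise a few lines; the nperseg choice below is illustrative.

```python
import numpy as np
from scipy.signal import welch

def hr_bpm(window, fs=30.0, lo=39, hi=240):
    """HR = frequency of the highest PSD peak within [lo, hi] BPM."""
    f, pxx = welch(window, fs=fs, nperseg=min(len(window), 256))
    band = (f >= lo / 60.0) & (f <= hi / 60.0)       # BPM -> Hz
    return 60.0 * f[band][np.argmax(pxx[band])]      # Hz -> BPM
```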
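A Keras sketch consistent with the architecture described above: four stacked LSTM+dropout blocks with unit counts decreasing from 90 to 1, a dense output layer, Adam with ReduceLROnPlateau, and a loss combining RMSE with (1 − Pearson r). The paper states only the endpoints (90 down to 1); the intermediate unit counts, dropout rate, and equal loss weighting here are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import callbacks, layers, models

def pearson_loss(y_true, y_pred):
    """1 - Pearson r per sample, averaged over the batch."""
    yt = y_true - tf.reduce_mean(y_true, axis=1, keepdims=True)
    yp = y_pred - tf.reduce_mean(y_pred, axis=1, keepdims=True)
    r = tf.reduce_sum(yt * yp, axis=1) / (
        tf.norm(yt, axis=1) * tf.norm(yp, axis=1) + 1e-9)
    return 1.0 - tf.reduce_mean(r)

def combined_loss(y_true, y_pred):
    """RMSE plus (1 - r); equal weighting is an assumption."""
    rmse = tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))
    return rmse + pearson_loss(y_true, y_pred)

def build_model(steps=300, channels=4, units=(90, 45, 15, 1)):
    """In: 300 time steps x 4 channels (POS, CHROM, LGI, ICA).
    Out: one constructed waveform sample per time step."""
    m = models.Sequential([layers.Input(shape=(steps, channels))])
    for u in units:                                   # four LSTM+dropout blocks
        m.add(layers.LSTM(u, return_sequences=True))
        m.add(layers.Dropout(0.2))                    # dropout rate assumed
    m.add(layers.Dense(1))                            # dense head
    m.compile(optimizer="adam", loss=combined_loss)
    return m

# Training would attach the learning-rate schedule mentioned in the text, e.g.:
# model.fit(x, y, callbacks=[callbacks.ReduceLROnPlateau(patience=5)])
```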
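For the time-domain evaluation, DTW scores waveform similarity while tolerating the timing shifts that penalize plain RMSE. A textbook O(n·m) implementation (the paper does not specify which DTW variant or library was used):

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic-time-warping distance between two 1-D signals."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])           # local mismatch
            D[i, j] = cost + min(D[i - 1, j],         # insertion
                                 D[i, j - 1],         # deletion
                                 D[i - 1, j - 1])     # match
    return D[n, m]
```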
Key Findings
- Across datasets (PURE for training/validation; LGI-PPGI and MR-NIRP as out-of-distribution tests), the proposed model consistently achieved the lowest DTW, indicating better morphological similarity and temporal alignment to cPPG than CHROM, POS, LGI, ICA, GREEN, and GRGB. Differences versus POS and CHROM were often statistically significant for DTW; for Pearson r, significance depended on the dataset (clearly superior on PURE; often comparable on LGI-PPGI and MR-NIRP).
- Activity-wise analysis (Rest, Talk, Translation, Rotation, Gym): the model was best for DTW across all activities. For Pearson r, it was top-performing in all activities except Gym, where POS achieved a higher r. Reported peak correlations reach r = 0.84 in Translation and r = 0.77 in Rotation.
- Frequency domain (|ΔHR|): on PURE, the model achieved |ΔHR| = 0.52 BPM (best average). On LGI-PPGI, POS and the model performed best with non-significant differences (POS |ΔHR| = 5.07 BPM, model |ΔHR| = 6.15 BPM), both markedly better than the GREEN baseline (|ΔHR| = 16.09 BPM). On MR-NIRP, the model performed best with |ΔHR| = 7.45 BPM.
- RMSE vs. DTW discrepancy on MR-NIRP: although the model sometimes showed higher RMSE due to misalignment between predicted rPPG and cPPG, DTW and visual inspection indicated better morphology and less noise than POS and CHROM.
- Overall, the proposed model produced more robust, less noisy rPPG signals, maintaining waveform morphology across varying motion and lighting conditions (including outdoor and movement scenarios).
Discussion
The study targeted waveform construction rather than single-parameter estimation (e.g., HR) to preserve physiologically relevant morphology, including systolic/diastolic peaks and derivative features. Incorporating DTW alongside RMSE and Pearson r allowed evaluation that accounts for timing shifts and waveform shape. The model generalized from PURE to LGI-PPGI and MR-NIRP, maintaining superior DTW and competitive r despite domain shifts (lighting, motion, devices). Discrepancies between RMSE and DTW on MR-NIRP reflect temporal misalignment rather than morphological inferiority; alignment via cross-correlation prior to RMSE could reconcile these metrics. Compared to prior works that focused on HR or SpO2, or used private/small datasets, this study used three public datasets with activity distinctions and assessed both time- and frequency-domain performance. While HR-specific models (e.g., ETA-rPPG, Siamese-rPPG) can achieve very low |ΔHR|, they risk losing other waveform information; the proposed approach balances HR estimation with broader morphological fidelity, enabling potential extraction of additional physiological parameters. The method showed robustness across activities, especially under motion (Talk, Rotation, Translation), and produced cleaner rPPG than POS/CHROM, which exhibited higher variability and noise in challenging conditions.
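As a rough illustration of the alignment idea raised here (not part of the paper's pipeline), one could shift the prediction by the lag that maximizes cross-correlation with the reference before computing RMSE; max_lag and the overlap handling are illustrative choices.

```python
import numpy as np

def aligned_rmse(pred, ref, max_lag=30):
    """Shift pred by the lag (|lag| <= max_lag samples) that maximizes
    cross-correlation with ref, then compute RMSE on the overlap."""
    pred, ref = np.asarray(pred, float), np.asarray(ref, float)
    xc = np.correlate(pred - pred.mean(), ref - ref.mean(), mode="full")
    lags = np.arange(-(len(ref) - 1), len(pred))      # lag of pred relative to ref
    keep = np.abs(lags) <= max_lag
    lag = int(lags[keep][np.argmax(xc[keep])])
    p = pred[max(lag, 0):len(pred) + min(lag, 0)]     # overlapping segment of pred
    r = ref[max(-lag, 0):len(ref) + min(-lag, 0)]     # overlapping segment of ref
    n = min(len(p), len(r))
    return float(np.sqrt(np.mean((p[:n] - r[:n]) ** 2)))
```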
Conclusion
The proposed LSTM-based model constructs rPPG waveforms from facial video that closely resemble contact PPG. Trained only on PURE and evaluated on PURE, LGI-PPGI, and MR-NIRP, it outperformed classical methods (POS, LGI, CHROM, ICA, GRGB, GREEN) in DTW and was superior or comparable in Pearson r and |ΔHR| across datasets and activities. The resulting signals are more robust to motion and environmental variations, supporting applications in contact-free health monitoring, telemedicine, driver monitoring, and biometric authentication. Future work includes extending evaluation to additional physiological parameters (beat-to-beat HR, oxygen saturation), exploring signal alignment to harmonize RMSE with DTW, and assessing robustness across lighting settings and recording devices.
Limitations
- Training data came exclusively from the PURE dataset; while generalization to LGI-PPGI and MR-NIRP was good for DTW and often for r, improvements were not always statistically significant across all metrics and activities.
- RMSE can be inflated by temporal misalignment between predicted rPPG and cPPG, potentially underestimating performance; alignment steps (e.g., cross-correlation) were not integrated prior to RMSE calculation.
- Dataset sizes are modest (6–10 participants per dataset), with limited diversity in activities and recording contexts per dataset; out-of-distribution generalization could benefit from broader training data.
- The loss did not explicitly target HR or other clinical endpoints; while waveform fidelity supports multiparameter estimation, the study did not exhaustively validate downstream parameter accuracy beyond HR.
- Environmental factors (lighting, camera type) and skin tone variations remain challenging and warrant further systematic evaluation for fairness and robustness.