Machine learning workflow for edge computed arrhythmia detection in exploration class missions
C. Mani, T. S. Paul, et al.
NASA’s upcoming exploration-class missions (e.g., Lunar Gateway, future Mars missions) will face prolonged communication delays, limited ground medical support, and long periods of isolation, so healthcare must be autonomous and onboard. Spaceflight and microgravity are linked to musculoskeletal, hematologic, ocular, pulmonary, and especially cardiovascular alterations, including changes in atrial structure and cardiac contractility that predispose to arrhythmias. Given the frequency of cardiac rhythm disturbances in prior space programs and their potential mission-threatening complications (e.g., stroke, heart failure), there is a need for real-time, interpretable diagnostic systems that operate at the point of care. The study aims to develop and validate a modular, interpretable machine learning (ML) pipeline that classifies normal sinus rhythm (NSR), atrial fibrillation (AFIB), and atrial flutter (AFL) from wearable 2‑lead ECG, and to deploy the model in the Open Neural Network Exchange (ONNX) format for edge computing under deep-space constraints. The central questions are whether an interpretable, feature-based ML model can achieve clinically useful accuracy for arrhythmia detection under realistic noise conditions, and whether it can be deployed efficiently on edge devices to support autonomous medical decision-making.
Prior work and operational context indicate that spaceflight induces arrhythmogenic conditions via hemodynamic shifts, hypobaric/hypoxic atmospheres, radiation-induced inflammation and fibrosis, electrolyte disturbances, and structural/electrophysiologic remodeling. Arrhythmias have been among the more common medical issues in spaceflight, reinforcing the need for monitoring and early detection. Edge computing architectures (e.g., AWS Greengrass, Azure IoT Edge, Google Cloud for Edge) and ML runtimes (TFLite, OpenVINO, ONNX Runtime) can enable onboard, resource-constrained inference. NASA’s Exploration Medical Capability (ExMC) clinical decision support system (CDSS) concept emphasizes onboard, real-time diagnostic autonomy with requirements for transparency and interpretability, especially for non-physician crew responders. Wearables (including consumer devices) can detect tachycardia but often function as black boxes, limiting trust and clinical validation in critical contexts. The authors advocate traditional ML over deep learning (DL) for this application because of the interpretability requirement and the lower computational load under edge constraints. In the broader ECG classification literature, multiple models (including CNNs and DenseNets) report macro F1 scores in the 0.70–0.89 range on various datasets, often trading performance for computational complexity. The present work positions a feature-based, interpretable ML approach with nested cross-validation and ONNX edge deployment as a competitive and operationally practical alternative.
Pipeline overview: A modular workflow with seven components—Databases, Pre-processing, Denoising, Feature Extraction, Model Development, Evaluation, and Edge Inference—was implemented, enabling independent updates to data sources, features, and models and seamless transition to deployment via ONNX.
Data sources: Two database types were used. (1) Denoising/stress datasets to simulate noise characteristics analogous to space environments and exercise conditions: MIT-BIH Noise Stress Test Database; MIT-BIH ST Change Database; EPHNCPG stress test recordings. (2) Arrhythmia datasets with cardiologist annotations (AAMI standards) for NSR, AFIB, AFL: MIT-BIH Arrhythmia Database; MIT-BIH Atrial Fibrillation Database; MIT-BIH Normal Sinus Rhythm Database; Long-Term Atrial Fibrillation Database; Chapman University and Shaoxing People’s Hospital Database; China Physiological Signal Challenge 2018. Across sources, 742 hours of ECG were processed into non-overlapping 30‑s samples, with reported totals (Table 1) of 78,083 NSR, 10,488 AFIB, and 513 AFL samples.
Pre-processing: ECG leads were kept separate; patient IDs were preserved as metadata. Samples were downsampled to 128 Hz (respecting Nyquist), cleaned of invalid values relative to rolling averages, mean-centered, and scaled to [−1, 1]. Shorter samples were zero-padded to 30 s. Each sample received a single rhythm label (NSR, AFIB, AFL); in conflicts, tachycardia labels were prioritized, and non-target tachycardia samples were discarded. Metadata tracked patient, temporal order, lead, and rhythm labels.
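The pre-processing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `preprocess` is hypothetical, polyphase resampling from SciPy is assumed as the downsampling method, peak normalization is assumed as the scaling rule, and the cleaning of invalid values against rolling averages is left out.

```python
import math

import numpy as np
from scipy.signal import resample_poly

TARGET_FS = 128   # Hz, target rate stated in the text
WINDOW_S = 30     # seconds per sample, stated in the text


def preprocess(ecg, fs):
    """Resample to 128 Hz, mean-center, scale to [-1, 1], zero-pad to 30 s.

    Hypothetical helper: the resampling and scaling choices are assumptions.
    """
    ecg = np.asarray(ecg, dtype=float)
    # Polyphase resampling low-pass filters before decimating,
    # which avoids aliasing when the rate is reduced.
    g = math.gcd(int(fs), TARGET_FS)
    x = resample_poly(ecg, TARGET_FS // g, int(fs) // g)
    x = x - x.mean()                    # mean-center
    peak = np.max(np.abs(x))
    if peak > 0:
        x = x / peak                    # largest excursion becomes +/-1
    n = TARGET_FS * WINDOW_S            # 3840 points per 30 s sample
    out = np.zeros(n)
    out[: min(n, x.size)] = x[:n]       # zero-pad short inputs, trim long ones
    return out


# Example: a 20 s recording at 360 Hz becomes a 30 s, 128 Hz sample.
raw = np.sin(2 * np.pi * 1.0 * np.arange(7200) / 360.0) + 0.5
sample = preprocess(raw, fs=360)
```

Padding after scaling keeps the padded region at exactly zero, so downstream steps can tell padding from signal.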
Denoising: Noise bands were identified via FFT (SciPy) by detecting significant components <4 Hz (baseline wander, respiration, motion) and >50–60 Hz (powerline/instrumentation). A weighted Variational Frequency Mode Decomposition (VFMD) approach was applied due to superior QRS preservation and SNR improvement. Signals were resampled at 256 Hz for decomposition; modes were distributed from 0–128 Hz (Nyquist) using K=32 modes (~4 Hz each). Reconstruction used sub-bands covering 0–60 Hz (15 modes) to retain key ECG components (P/T waves <15 Hz; QRS 30–60 Hz) while removing identified noise.
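The band bookkeeping in this step can be illustrated with a short sketch. It does not implement the decomposition itself. It only shows an FFT-based estimate of how much signal power lies in the low and high noise bands, plus the mode arithmetic quoted above. The function name `band_power_fractions`, the 4 Hz and 50 Hz cut-offs used as defaults, and the synthetic test signal are assumptions made for illustration, and NumPy's FFT is used in place of SciPy's.

```python
import numpy as np


def band_power_fractions(x, fs, low_cut=4.0, high_cut=50.0):
    """Fraction of spectral power below low_cut and above high_cut (Hz).

    Hypothetical helper; the cut-offs follow the bands named in the text.
    """
    x = np.asarray(x, dtype=float)
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    total = power.sum()
    low = power[freqs < low_cut].sum() / total
    high = power[freqs > high_cut].sum() / total
    return low, high


# Mode bookkeeping from the text: 32 modes spread over 0-128 Hz at fs = 256 Hz.
fs = 256
n_modes = 32
mode_width_hz = (fs / 2) / n_modes          # 4 Hz per mode
modes_kept = int(60 / mode_width_hz)        # 15 modes cover 0-60 Hz

# Synthetic 30 s signal: 10 Hz "ECG-band" tone, 0.5 Hz drift, 60 Hz hum.
t = np.arange(30 * fs) / fs
signal = (np.sin(2 * np.pi * 10 * t)
          + np.sin(2 * np.pi * 0.5 * t)
          + 0.5 * np.sin(2 * np.pi * 60 * t))
low_frac, high_frac = band_power_fractions(signal, fs)
```

A large `low_frac` or `high_frac` would flag baseline wander or powerline interference before any modes are selected for reconstruction.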
Feature extraction: Seventeen interpretable features were computed spanning HRV (e.g., min/max heart rate, mean heart rate, RR-based indices including rMSSD, SDNN/SDRR, SDSD, pNN20/pNN50/PRR20/PRR40) and morphology (P/Q/R/S/T amplitudes, QRS delineation). QRS detection used convolutional peak detection with Gaussian kernels; wave delineation used discrete wavelet transforms.
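Several of the HRV features named above follow directly from the R-peak times. The sketch below uses the standard textbook definitions. The function name `hrv_features`, the dictionary keys, and the example peak times are illustrative assumptions, and the authors' exact definitions may differ (for example, in how mean heart rate is averaged).

```python
import numpy as np


def hrv_features(r_peak_times_s):
    """Compute a few standard HRV indices from R-peak times in seconds."""
    rr = np.diff(np.asarray(r_peak_times_s, dtype=float)) * 1000.0  # RR in ms
    d = np.diff(rr)                       # successive RR differences in ms
    hr = 60000.0 / rr                     # instantaneous heart rate in bpm
    return {
        "mean_hr": hr.mean(),
        "min_hr": hr.min(),
        "max_hr": hr.max(),
        "sdnn": rr.std(ddof=1),           # SD of RR intervals
        "rmssd": np.sqrt(np.mean(d ** 2)),  # root mean square of differences
        "sdsd": d.std(ddof=1),            # SD of successive differences
        "pnn20": 100.0 * np.mean(np.abs(d) > 20),  # % of |diff| > 20 ms
        "pnn50": 100.0 * np.mean(np.abs(d) > 50),  # % of |diff| > 50 ms
    }


# Made-up peak times giving RR intervals of about 800, 800, 900, 800 ms.
features = hrv_features([0.0, 0.8, 1.6, 2.5, 3.3])
```

Irregular RR intervals, as in AFIB, raise rMSSD and the pNN percentages, which is why these indices are readable by a clinician as well as usable by a classifier.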
Model selection and training: Initial trials with multinomial regression were superseded by a decision tree classifier to satisfy interpretability and NASA CDSS guidance. A nested, stratified cross-validation scheme was used: an outer 80/20 stratified split for train/holdout; inner stratified k-fold CV (with an embedded 3-fold CV for hyperparameter tuning via randomized search) on the training subset. Multiple feature subsets and model instances were compared, recording feature importances. Reproducibility was ensured by storing random seeds; all samples from a patient/recording stayed within a single split; temporally adjacent samples were not split across folds; at least one completely new ECG source/patient type was reserved for testing; rhythm class proportions were not forced to match across splits to reflect clinical prevalence.
Evaluation: Aggregated confusion matrices from 10-fold CV and the holdout test were used to compute accuracy, precision, recall, and F1 by class; macro-F1 was the average of class-wise F1. A minimum F1 of 0.75 per class was required. ROC curves (one-vs-rest) and AUROC were generated and compared to a baseline logistic regression pipeline.
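The class-wise metrics and the per-class threshold can be derived from an aggregated confusion matrix as in the sketch below. The matrix values are made up for illustration and are not results from the study. The row and column convention (rows are true labels, columns are predictions) and the function name are assumptions.

```python
import numpy as np

F1_THRESHOLD = 0.75  # minimum per-class F1 stated in the text


def per_class_metrics(cm):
    """Precision, recall, F1 per class, and macro F1, from a confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)   # correct / all predicted as the class
    recall = tp / cm.sum(axis=1)      # correct / all truly in the class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, f1.mean()


# Illustrative counts only (order: NSR, AFIB, AFL).
cm = [[90, 5, 5],
      [10, 80, 10],
      [0, 10, 40]]
precision, recall, f1, macro_f1 = per_class_metrics(cm)
meets_threshold = bool(np.all(f1 >= F1_THRESHOLD))
```

Because macro F1 weights each class equally, a rare class such as AFL affects the score as much as NSR does, which suits an imbalanced dataset.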
Selected model and hyperparameters: The chosen decision tree used information gain, max_depth=9, min_samples_split=30, min_samples_leaf=5, and no class weighting per the non-proportional training rule.
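In scikit-learn terms, the reported settings would look roughly like the sketch below. It assumes that "information gain" maps to `criterion="entropy"` and that "no class weighting" maps to `class_weight=None`. The training data and feature names are synthetic placeholders, not the study's features.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 17))                       # synthetic feature matrix
y = (X[:, 0] > 0).astype(int) + (X[:, 1] > 1).astype(int)
feature_names = [f"feature_{i}" for i in range(17)]  # placeholder names

clf = DecisionTreeClassifier(
    criterion="entropy",      # information gain (assumed mapping)
    max_depth=9,
    min_samples_split=30,
    min_samples_leaf=5,
    class_weight=None,        # no class weighting
    random_state=0,
)
clf.fit(X, y)

# The fitted tree can be printed as nested if/else rules for human review.
rules = export_text(clf, feature_names=feature_names, max_depth=2)

# Leaves are the nodes with no children; each must hold at least 5 samples.
leaf_mask = clf.tree_.children_left == -1
smallest_leaf = int(clf.tree_.n_node_samples[leaf_mask].min())
```

The depth and leaf-size limits cap the number of rules a reviewer has to read, which is the interpretability argument for choosing a tree.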
Edge inference: The scikit-learn model was converted to ONNX and executed with ONNX Runtime for platform-agnostic edge deployment (Android device). The full signal-processing workflow from development was reused operationally to feed live data to the model for real-time inference.
- The selected decision tree achieved a reported macro F1-score of 0.899. The abstract gives class-wise F1 scores of NSR 0.993, AFIB 0.938, and AFL 0.767; a cross-validation results table instead lists AFIB 0.840, with the NSR and AFL values unchanged.
- ROC performance exceeded a logistic regression baseline in all classes. AUROC for the developed model: NSR 0.988, AFIB 0.902, AFL 0.912. AUROC for the logistic regression baseline: NSR 0.769–0.885, AFIB 0.780, AFL 0.658–0.752, where the ranges reflect differing values quoted in different parts of the source.
- Weighted confusion matrices showed high true positive rates for NSR and strong performance for AFIB and AFL in both train and test sets; example rates noted include NSR ≈0.998–0.999, AFIB ≈0.89–0.90, AFL ≈0.90–0.902.
- Reported accuracies include NSR 98.9%, AFIB 96.8%, AFL 89.7% (elsewhere a table lists test set accuracy NSR ≈99.4%, AFIB 98.6%, AFL 80.3%).
- Important predictive features emphasized P-wave amplitude metrics, PRR20/PRR40, mean and max heart rate, and RR variability measures (e.g., SDRR/SDSD), aligning with clinical understanding of AFIB/AFL atrial activity and tachycardia patterns.
- ONNX deployment enabled edge inference on Android. The abstract reports 9.2 s per sample for the ONNX-translated pipeline, while elsewhere the end-to-end processing time from raw data to inference is given as 2.9 s. Both figures are shorter than the 30 s sample window, which supports the feasibility of near-real-time onboard diagnostics.
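Because macro F1 is the plain average of the class-wise F1 scores, the reported value of 0.899 can be checked against the two sets of class-wise figures quoted in the results. The short calculation below does this. The variable names are illustrative, and the numbers are copied from the results list.

```python
# Class-wise F1 scores as quoted in the results list.
abstract_f1 = {"NSR": 0.993, "AFIB": 0.938, "AFL": 0.767}
cv_table_f1 = {"NSR": 0.993, "AFIB": 0.840, "AFL": 0.767}


def macro_f1(scores):
    """Unweighted mean of the class-wise F1 scores."""
    return sum(scores.values()) / len(scores)


macro_abstract = round(macro_f1(abstract_f1), 3)   # 0.899
macro_cv_table = round(macro_f1(cv_table_f1), 3)   # 0.867

# Both sets clear the 0.75 per-class minimum; AFL is the closest to it.
all_above_threshold = all(
    v >= 0.75 for s in (abstract_f1, cv_table_f1) for v in s.values()
)
```

The reported macro F1 of 0.899 matches the abstract's class-wise values, whereas the table's AFIB figure of 0.840 would imply a macro F1 of about 0.867.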
The study demonstrates that a modular, interpretable ML pipeline using HRV and morphological ECG features can accurately classify NSR, AFIB, and AFL under noise conditions relevant to spaceflight. The model outperformed a logistic regression baseline in AUROC across all classes and satisfied a per-class F1 threshold, supporting its robustness in class-imbalanced settings (notably limited AFL). Emphasizing interpretable features (e.g., P-wave characteristics, heart rate variability) aligns with clinical reasoning, facilitating verification by crew medical officers and integration into decision support workflows required by NASA for autonomy and transparency. The authors argue that in the deep-space context, the costs of false positives/negatives are relatively low given continuous monitoring and human-in-the-loop review, making high F1 and AUROC particularly meaningful. Edge deployment via ONNX illustrates operational practicality: platform-agnostic, low-resource inference supports real-time onboard diagnostics with redundancy across devices, consistent with mission constraints on power, computing, and communications.
This work presents a complete, modular ECG analytics pipeline—from denoising and interpretable feature extraction to nested CV-driven model selection and ONNX-based edge deployment—for detecting NSR, AFIB, and AFL in resource-constrained deep-space contexts. The decision tree achieved strong macro F1 and AUROC, outperforming a logistic regression baseline, while maintaining interpretability and operational feasibility on an Android edge device. The approach is extensible: the modular framework can incorporate additional features, pathologies, and model architectures as needed. Future work includes expanding to additional cardiovascular conditions, integrating multi-modal biometrics for holistic health monitoring, refining edge optimization, and validating with spaceflight-representative or in-flight data to further assess generalizability and robustness.
A principal limitation is the lack of open-access ECG data collected during space missions; consequently, the model was trained and evaluated on Earth-based datasets that may not fully replicate in-space physiological and noise conditions. Class imbalance (notably limited AFL samples) poses challenges despite stratification and F1-based evaluation. Reported performance metrics and processing times vary across sections, indicating potential inconsistencies that warrant further standardized benchmarking. Additional validation on diverse, prospective datasets and under operational constraints is needed.