A framework for evaluating clinical artificial intelligence systems without ground-truth annotations

Medicine and Health

D. Kiyasseh, A. Cohen, et al.

Discover how SUDO, developed by Dani Kiyasseh, Aaron Cohen, Chengsheng Jiang, and Nicholas Altieri, is revolutionizing the evaluation of clinical AI systems. This innovative framework leverages real-world data to highlight unreliable predictions and biases, paving the way for the ethical deployment of AI in medicine.
Introduction

Clinical AI systems are typically trained on a source dataset and evaluated on a held-out set under the assumption that held-out data are representative of data encountered after deployment. In practice, data in the wild often differ due to distribution shift and frequently lack ground-truth labels, making it difficult to assess model reliability, select among models, and evaluate algorithmic bias. Over-reliance on confidence scores is problematic because models can be confidently wrong, and calibration methods may fail under shift and still require labels to validate effectiveness. Prior work estimating overall performance with unlabeled data has been largely model-centric and assumes representativeness of held-out data, overlooking data-centric deployment decisions. To address these challenges, the authors propose SUDO (pseudo-label discrepancy), a framework to evaluate AI systems on data in the wild without ground-truth labels by quantifying class contamination across model output probability intervals and using discrepancies in performance of auxiliary classifiers trained with different pseudo-label assignments to infer reliability, guide model selection, and assess bias.

Literature Review

Existing approaches often assume that high-confidence predictions are reliable, despite evidence that neural networks can assign high confidence to incorrect predictions. Calibration techniques such as Platt scaling and deep ensembles can adjust confidence but degrade under distribution shift and require labels for evaluation. Other work estimates model performance with unlabeled data in semi-supervised or transfer settings, but tends to be model-centric, assumes held-out data represent deployment data, or focuses on verification-bias corrections. Related ideas include reverse testing and reverse validation, which pseudo-label unlabeled data and train a classifier to compare models, but typically assign a single pseudo-label per sample and operate across the entire probability range for model selection only, not for identifying unreliable predictions. SUDO differs by cycling through all possible pseudo-labels within discretized probability intervals to quantify class contamination, enabling data-centric decisions (identifying unreliable predictions), model selection on unlabelled data, and fairness assessment without ground-truth labels.

Methodology

SUDO is a five-step framework for evaluating probabilistic AI systems on unlabelled, out-of-distribution data:

  1. Deploy the model to obtain predicted probabilities s ∈ [0, 1] for data in the wild.
  2. Discretize the probability distribution into predefined intervals (e.g., deciles).
  3. For each interval, sample m data points from the wild data and assign a pseudo-label (e.g., class 0). Retrieve an equal number of training samples with ground-truth labels from the opposite class (class 1) to form a binary classification task. Repeat by pseudo-labelling the same sampled wild data as the alternate class (class 1) and pairing with labelled samples from class 0.
  4. Train a lightweight classifier (e.g., logistic regression, random forest) to distinguish the pseudo-labelled wild samples from the labelled opposite-class samples. Repeat multiple times (e.g., k = 5) with different random seeds to ensure robustness.
  5. Evaluate each classifier on a held-out labelled set using suitable metrics (e.g., AUC). The performance indicates the plausibility of the pseudo-label for that interval. Compute the pseudo-label discrepancy (SUDO) as the difference in performance between classifiers trained with alternate pseudo-label assignments for the same wild samples and interval. A larger absolute discrepancy implies lower class contamination and more reliable predictions for that interval; the sign indicates which class is more likely.

Reliability-completeness curve: Reliability is defined by averaging absolute SUDO values at selected low/high probability thresholds, and completeness as the fraction of wild data covered by predictions beyond those thresholds. Varying the thresholds generates a reliability-completeness (RC) curve, and the area under it (AURCC) supports model comparison without labels.

Datasets and implementations: The framework is evaluated on (i) Stanford Diverse Dermatology Images (DDI) as wild data with models trained on HAM10000, (ii) Camelyon17-WILDS histopathology patches with hospital shift (train/validation/test splits, with the test hospital unseen), (iii) Multi-Domain Sentiment as a text modality with simulated model overconfidence, (iv) Flatiron Health ECOG Performance Status (ECOG PS) labelled and unlabelled clinical notes, and (v) simulated Gaussian data with class imbalance, third-class contamination, and label noise in the held-out evaluation set. Image representations were extracted offline to amortize cost. Probability intervals and sampling sizes were adapted to data density (e.g., more granular intervals where predictions concentrate). Classifiers were typically logistic regression; alternate classifiers yielded similar trends.

Survival analysis: For the unlabelled ECOG PS cohorts, Kaplan–Meier survival curves were analyzed for groups defined by SUDO-derived reliability thresholds to externally validate clinical plausibility, censoring at the last visit and focusing on the first line of therapy.

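The interval-level procedure in steps 1–5 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the auxiliary classifier is a hand-rolled logistic regression (the paper also uses random forests), and the 2-D Gaussian demo data, `m = 50`, and `k = 5` are hypothetical choices.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, epochs=300):
    """Minimal logistic regression via gradient descent, standing in for the
    lightweight auxiliary classifier."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y                       # gradient of the log-loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def accuracy(w, b, X, y):
    return float(np.mean(((X @ w + b) > 0).astype(int) == y))

def sudo_for_interval(wild_X, train0_X, train1_X, held_X, held_y, m=50, k=5, seed=0):
    """Pseudo-label discrepancy (SUDO) for one probability interval.

    wild_X: unlabelled wild samples whose predicted probabilities fall in the interval.
    train0_X / train1_X: labelled training samples of class 0 / class 1.
    held_X, held_y: labelled held-out set used to score the auxiliary classifiers.
    """
    rng = np.random.default_rng(seed)
    perf = {0: [], 1: []}
    for _ in range(k):                                   # repeat for robustness
        sample = wild_X[rng.choice(len(wild_X), size=m, replace=False)]
        for pseudo in (0, 1):                            # cycle both pseudo-labels
            opp = train1_X if pseudo == 0 else train0_X  # opposite labelled class
            opp_sample = opp[rng.choice(len(opp), size=m, replace=False)]
            X = np.vstack([sample, opp_sample])
            y = np.r_[np.full(m, pseudo), np.full(m, 1 - pseudo)]
            w, b = train_logreg(X, y)
            perf[pseudo].append(accuracy(w, b, held_X, held_y))
    # Positive SUDO: pseudo-labelling the interval as class 1 yields the more
    # plausible auxiliary classifier, i.e. the interval leans toward class 1.
    return float(np.mean(perf[1]) - np.mean(perf[0]))

# Toy demo on synthetic 2-D Gaussians (illustrative only): the "wild" samples
# are drawn from the class-1 distribution, so SUDO should be clearly positive.
rng = np.random.default_rng(7)
c0 = rng.normal(-2.0, 1.0, size=(200, 2))
c1 = rng.normal(+2.0, 1.0, size=(200, 2))
wild = rng.normal(+2.0, 1.0, size=(200, 2))
held_X = np.vstack([c0[:100], c1[:100]])
held_y = np.r_[np.zeros(100), np.ones(100)]
sudo = sudo_for_interval(wild, c0[100:], c1[100:], held_X, held_y)
```

Because the wild samples genuinely belong to class 1, the classifier trained under the class-1 pseudo-label recovers the true decision boundary (high held-out accuracy), while the class-0 assignment pits two overlapping clusters against each other (near-chance accuracy), producing a large positive discrepancy.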
Key Findings

• SUDO reliably proxies model performance and identifies unreliable predictions under distribution shift across modalities.

  • Stanford DDI (dermatology): Two pre-trained models (DeepDerm and HAM10000) degraded on DDI (AUC=0.56 and 0.67). SUDO strongly correlated with the proportion of positives per probability interval: |ρ|≈0.94 (p<0.005) for DeepDerm and ρ≈0.76 (p<0.008) for HAM10000, revealing class contamination without labels. Reliability-completeness analysis ranked models consistently with labelled performance: AURCC=0.62 (DeepDerm) vs. 0.86 (HAM10000).
  • Algorithmic bias on DDI: Stratifying by Fitzpatrick I–II vs. V–VI within low-probability intervals, SUDO indicated performance disparities (SUDO (AUC) = 0.60 vs. 0.58) consistent with labelled metrics (NPV=0.83 vs. 0.78), suggesting bias favoring lighter skin tones without needing labels.
  • Camelyon17-WILDS (histopathology): Despite hospital shift, the model achieved average accuracy≈0.85. SUDO correlated with positives-per-interval (ρ≈0.79, p<0.005), identifying reliable/unreliable regions of the probability spectrum.
  • Overconfident NLP (Multi-Domain Sentiment): After overtraining to induce overconfidence, SUDO remained well-correlated with performance (Supplementary Fig. 1), outperforming naive reliance on confidence magnitudes.
  • Flatiron ECOG PS (clinical notes): The supervised NLP model achieved precision=0.97, recall=0.92, AUC=0.93 on labelled data. SUDO correlated with positives-per-interval (ρ≈0.95, p<0.005). On unlabelled wild data, SUDO-derived cohorts showed clinically consistent survival separation: low vs. high ECOG PS groups had median survival ≈2.07 vs. 0.95 years, comparable to labelled cohorts ≈1.87 vs. 0.68 years. SUDO correlated with median survival across intervals (ρ≈0.97, p<0.005).
  • Simulations: SUDO remained robust under class imbalance and the presence of a third unseen class (ρ>0.87, p<0.005), but degraded with high label noise in the held-out evaluation set (ρ dropping from 0.99 to 0.33 when 50% of labels were flipped).

• Sensitivity analyses showed SUDO is stable across sampling sizes (as low as 50 samples per interval), classifier types (logistic regression, random forest), and metrics, though the sign of SUDO in highly contaminated intervals may flip; the absolute value is the key indicator of reliability.
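The AURCC used above to rank models can be computed from a handful of threshold settings. The sketch below uses hypothetical inputs (a uniform spread of probabilities and made-up |SUDO| values); it illustrates the construction, not the paper's exact implementation.

```python
import numpy as np

def reliability_completeness(probs, thresholds, sudo_low, sudo_high):
    """Sketch of the reliability-completeness (RC) curve and its area (AURCC).

    probs: model output probabilities on the wild (unlabelled) data.
    thresholds: list of (low, high) probability cut-offs.
    sudo_low / sudo_high: |SUDO| estimates for the regions below the low
    cut-off and above the high cut-off, one pair per threshold setting.
    """
    points = []
    for (lo, hi), s_lo, s_hi in zip(thresholds, sudo_low, sudo_high):
        reliability = (abs(s_lo) + abs(s_hi)) / 2.0            # average |SUDO|
        completeness = float(np.mean((probs <= lo) | (probs >= hi)))
        points.append((completeness, reliability))
    points.sort()                                              # order by completeness
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Trapezoidal area under the RC curve: a higher AURCC means a better
    # reliability-completeness trade-off, comparable across models without labels.
    aurcc = sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2.0
                for i in range(len(xs) - 1))
    return points, aurcc

# Hypothetical demo: widening the thresholds raises completeness (more
# predictions kept) while the average |SUDO|, i.e. reliability, drops.
probs = (np.arange(100) + 0.5) / 100
thresholds = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7)]
points, aurcc = reliability_completeness(probs, thresholds,
                                         sudo_low=[0.9, 0.8, 0.6],
                                         sudo_high=[0.85, 0.75, 0.55])
```

Comparing two candidate models then reduces to computing each model's AURCC on the same wild data and preferring the larger value, mirroring the DeepDerm (0.62) vs. HAM10000 (0.86) ranking reported above.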
Discussion

The study addresses the challenge of evaluating clinical AI systems on unlabelled, shifted data by introducing SUDO, which quantifies class contamination within model probability intervals via pseudo-label discrepancies in auxiliary classifiers. Across image and text modalities and multiple datasets with known distribution shifts, SUDO consistently correlates with true performance and clinically meaningful outcomes, enabling three practical applications without ground-truth labels: (i) flagging unreliable predictions for human review, (ii) selecting among AI systems via reliability-completeness trade-offs (AURCC), and (iii) assessing algorithmic bias across patient subgroups. These results mitigate limitations of confidence-based or calibration-only methods under shift and extend beyond prior reverse testing approaches by supporting data-centric deployment decisions. The ability to align SUDO with survival outcomes in ECOG PS further supports its practical interpretability. While SUDO depends on clean held-out labels for auxiliary evaluation and is interval-based rather than per-instance, it provides a scalable triage mechanism for large unlabelled cohorts and a principled basis for post-deployment monitoring and auditing.

Conclusion

SUDO offers a general framework to evaluate clinical AI systems on unlabelled, out-of-distribution data by leveraging pseudo-label discrepancies to estimate class contamination and prediction reliability. It enables identification of unreliable predictions, model selection without labels, and fairness assessment across subgroups, with demonstrated utility on dermatology images, histopathology patches, and clinical notes. Future directions include using SUDO for automated distribution shift detection to trigger model retraining, improving robustness to label noise in held-out evaluation sets, extending to multi-class settings by cycling all pseudo-labels, and expanding applications to domains with prevalent label noise and broader scientific disciplines beyond healthcare.

Limitations

• SUDO operates primarily at the level of probability intervals and is not designed to provide per-instance reliability without sufficient samples to train auxiliary classifiers.

• Requires a held-out labelled dataset of sufficient quality; performance degrades with high label noise in this evaluation set.

• The framework has not been exhaustively tested across all possible distribution shifts; certain severe shifts or altered class-conditional relationships may reduce utility.

• Directionality of SUDO (sign) may be unstable in highly contaminated intervals with small samples; absolute magnitude should be emphasized.

• When unlabelled datasets are small and readily annotatable, SUDO may be excessive; it is most valuable for large-scale unlabelled data as a triage mechanism.

• External validation may rely on domain knowledge (e.g., survival associations) which may not be available in all settings.
