Medicine and Health
Improving the accuracy of medical diagnosis with causal machine learning
J. G. Richens, C. M. Lee, et al.
The study asks why existing machine learning approaches struggle with differential diagnosis and proposes a causal reformulation. Traditional model-based and deep learning diagnostic systems are associative: they identify diseases correlated with symptoms, a signal that can be confounded by shared causes and so lead to suboptimal or unsafe diagnoses. Clinicians, by contrast, seek causal explanations for a patient's symptoms. The authors hypothesize that diagnosis is fundamentally a counterfactual inference task, and that ranking disease hypotheses by counterfactual causal responsibility will improve diagnostic accuracy, especially for complex differentials and rare diseases. They present counterfactual diagnostic algorithms and evaluate them, on a large set of clinical vignettes, against a state-of-the-art associative algorithm and a cohort of practicing doctors, demonstrating clinical relevance and potential impact for healthcare systems in which diagnostic errors are common.
Prior diagnostic algorithms, from Bayesian model-based approaches (e.g., Bayesian networks with noisy-OR structure) to modern deep learning, perform associative inference: they estimate P(Disease | Evidence) and rank diseases by posterior probability. While effective in simple causal scenarios, associative inference is vulnerable to confounding and cannot, in general, disentangle correlation from causation. The clinical reasoning literature, by contrast, emphasizes diagnosis as causal explanation, and Pearl's causal hierarchy (association, intervention, counterfactual) places counterfactual inference at the top, enabling reasoning about the causes of an individual patient's symptoms. The paper illustrates the danger with a confounding example: in pneumonia data, asthma can appear to be a protective factor because asthmatic patients receive more aggressive treatment and infection severity is unobserved, so an associative learner draws a dangerous conclusion. Despite the extensive use of Bayesian networks in medicine and the recognized role of causality in clinical reasoning, the authors note a lack of modern causal analysis in model-based diagnosis and motivate a counterfactual approach grounded in structural causal models (SCMs).
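The asthma-pneumonia confounding described above can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: all probabilities and the treatment structure are invented for this example, not taken from the paper. It shows how marginalising over a treatment variable that depends on asthma makes asthma look protective observationally, even though asthma is harmful at every fixed treatment level.

```python
# Minimal numeric sketch of the asthma-pneumonia confounding example.
# All probabilities are invented for illustration. Assumed structure:
#   Asthma -> Treatment (asthmatics get aggressive care),
#   Asthma -> Death, Treatment -> Death.
P_TREAT = {0: 0.3, 1: 1.0}                 # P(aggressive care | asthma)
P_DEATH = {(0, 0): 0.10, (1, 0): 0.20,     # P(death | asthma, treatment)
           (0, 1): 0.04, (1, 1): 0.05}

def p_death_given_asthma(a):
    """Observational P(death | asthma=a), marginalising over treatment."""
    t1 = P_TREAT[a]
    return (1 - t1) * P_DEATH[(a, 0)] + t1 * P_DEATH[(a, 1)]

obs = {a: p_death_given_asthma(a) for a in (0, 1)}
# Asthma *looks* protective in the data: P(death | asthma) < P(death | no asthma),
# yet at every fixed treatment level asthma raises the death probability.
print(obs)
```

An associative ranker sees only `obs` and concludes asthma lowers pneumonia mortality; the structural model, which holds treatment fixed, shows the opposite.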
The authors propose redefining diagnostic ranking via causal responsibility measures that satisfy three desiderata: (I) consistency with the disease posterior likelihood; (II) causality (diseases that cannot cause the observed symptoms should not be diagnoses); and (III) simplicity (prefer diseases that explain more symptoms). They introduce two counterfactual measures: (1) expected disablement, the expected number of the patient's present symptoms that would be absent if one intervened to cure the candidate disease (D=F), capturing necessary-cause aspects and the likely benefit of treating D alone; and (2) expected sufficiency, the expected number of observed symptoms that would persist if all causes other than the candidate disease were switched off, capturing sufficient-cause aspects. Counterfactual probabilities are computed in the framework of structural causal models using Pearl's do-calculus. The disease models are Bayesian networks (BNs) realized as SCMs, with diseases, risk factors, and symptoms represented as binary nodes. Noisy-OR assumptions enable efficient modeling of multi-causal symptom generation, and twin diagnostic networks allow counterfactuals to be computed efficiently by standard inference within a single SCM. Closed-form expressions for expected disablement and expected sufficiency are derived for three-layer naive noisy-OR BNs (Theorem 2; details in the Supplementary Notes).

Experimental design: the test set comprised 1671 independently authored and verified clinical vignettes (simulated patient presentations including symptoms, medical history, and demographics). The BN disease model (three-layer naive noisy-OR) was specified independently of the test set, with disease and risk-factor priors taken from epidemiological data and conditional probabilities elicited from multiple independent medical sources and clinicians.
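To make the expected-disablement computation concrete, here is a toy sketch on a two-disease, two-symptom noisy-OR SCM. This is not the paper's implementation or its closed-form Theorem 2 expression: the disease/symptom names and all parameters are invented, and exact enumeration over the exogenous noise variables stands in for the twin-network computation (abduction on the noise, then intervention do(D=F), then counting cured symptoms).

```python
import itertools

# Toy noisy-OR SCM (all names and numbers are hypothetical).
# A symptom fires iff its leak noise fires or an active disease's edge noise fires:
#   S = Leak_S OR_d (D_d AND U_{d,S}),  U_{d,S} ~ Bernoulli(LAMBDA[d, S]).
PRIOR = {"flu": 0.10, "cold": 0.20}                       # disease priors
LAMBDA = {("flu", "fever"): 0.9, ("flu", "cough"): 0.7,   # edge strengths
          ("cold", "fever"): 0.2, ("cold", "cough"): 0.8}
LEAK = {"fever": 0.01, "cough": 0.05}
DISEASES, SYMPTOMS, EDGES = list(PRIOR), list(LEAK), list(LAMBDA)

def enumerate_worlds():
    """Yield (prob, diseases, edge_noise, leak_noise) over all latent states."""
    for bits in itertools.product([0, 1], repeat=len(DISEASES) + len(EDGES) + len(SYMPTOMS)):
        d = dict(zip(DISEASES, bits[:len(DISEASES)]))
        u = dict(zip(EDGES, bits[len(DISEASES):len(DISEASES) + len(EDGES)]))
        l = dict(zip(SYMPTOMS, bits[len(DISEASES) + len(EDGES):]))
        p = 1.0
        for k, b in d.items():
            p *= PRIOR[k] if b else 1 - PRIOR[k]
        for e, b in u.items():
            p *= LAMBDA[e] if b else 1 - LAMBDA[e]
        for s, b in l.items():
            p *= LEAK[s] if b else 1 - LEAK[s]
        yield p, d, u, l

def symptoms(d, u, l, do=None):
    """Structural equations; `do` maps a disease to a forced value."""
    d = {**d, **(do or {})}
    return {s: int(l[s] or any(d[k] and u[(k, s)] for k in DISEASES))
            for s in SYMPTOMS}

def expected_disablement(candidate, evidence):
    """E[# present symptoms switched off by do(candidate=F)], via abduction."""
    present = [s for s, v in evidence.items() if v]
    num = den = 0.0
    for p, d, u, l in enumerate_worlds():
        if symptoms(d, u, l) != evidence:
            continue                               # abduction: keep matching worlds
        den += p
        cf = symptoms(d, u, l, do={candidate: 0})  # intervention: cure candidate
        num += p * sum(1 for s in present if not cf[s])
    return num / den

evidence = {"fever": 1, "cough": 1}
for dz in DISEASES:
    print(dz, round(expected_disablement(dz, evidence), 3))
```

Expected sufficiency is computed analogously, except the intervention switches off all *other* causes (other diseases and the leaks) and counts the present symptoms that persist. The paper's closed forms avoid this exponential enumeration for three-layer noisy-OR networks.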
Algorithms compared: (a) an associative baseline that ranks diseases by posterior P(D|E), and (b) the counterfactual algorithms that rank by expected disablement and expected sufficiency. A cohort of 44 doctors provided differential diagnoses for subsets of the vignettes. For each vignette, the algorithms produced full rankings; top-k accuracy and the position of the true disease in the ranking were evaluated, stratified by disease rareness. The counterfactual and associative algorithms shared an identical disease model; only the querying/ranking method differed.
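The two evaluation metrics described above are straightforward to compute from full rankings. A minimal sketch, with hypothetical disease names and rankings standing in for the vignette data:

```python
def top_k_accuracy(rankings, truths, k):
    """Fraction of vignettes whose true disease appears in the top k of the ranking."""
    return sum(t in r[:k] for r, t in zip(rankings, truths)) / len(truths)

def mean_position(rankings, truths):
    """Mean 1-based rank of the true disease across vignettes."""
    return sum(r.index(t) + 1 for r, t in zip(rankings, truths)) / len(truths)

# Hypothetical full rankings for three vignettes (names invented):
rankings = [["flu", "cold", "strep"],
            ["cold", "strep", "flu"],
            ["strep", "flu", "cold"]]
truths = ["flu", "strep", "cold"]

print(top_k_accuracy(rankings, truths, 1))  # 1/3: only vignette 1 is a top-1 hit
print(top_k_accuracy(rankings, truths, 2))  # 2/3: vignettes 1 and 2
print(mean_position(rankings, truths))      # (1 + 2 + 3) / 3 = 2.0
```

In the study both metrics are computed per rareness stratum, which is a simple `groupby` over a vignette-level rareness label before applying the same functions.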
- Overall accuracy: Doctors averaged 71.40% accuracy. The associative algorithm achieved 72.52%, placing in the top 48% of doctors. The counterfactual algorithm achieved 77.26%, placing in the top 25% of doctors and achieving expert-level accuracy.
- Top-1 improvement: For k=1 (top ranked disease), the counterfactual algorithm achieved a 2.5% higher accuracy than the associative algorithm.
- Rare diseases: Improvements were pronounced for rarer conditions. The counterfactual algorithm ranked the true disease higher than the associative algorithm in 29.2% of rare and 32.9% of very-rare cases.
- Ranking positions: Across the 1671 vignettes, the counterfactual approach reduced the mean position of the true disease relative to the associative ranking in most rareness strata, with the counterfactual ranking outperforming or tying the associative ranking on many vignettes (full stratified statistics are reported in the paper).
- Equivalence of counterfactual measures: Expected disablement and expected sufficiency produced nearly identical accuracies on this test set.
- Backward compatibility: Performance gains were achieved without changing the disease model structure or parameters; only the ranking criterion changed.
The findings support the hypothesis that diagnosis is fundamentally a causal, counterfactual inference task. Associative rankings can be misled by confounding, whereas counterfactual measures better capture whether a disease explains observed symptoms and what would happen under interventions. The counterfactual algorithms substantially outperformed both the associative baseline and the average clinician, especially for rare and very-rare diseases where diagnostic errors are more prevalent and consequential. Analyses comparing doctor and algorithm performance across vignette difficulty suggest complementarity: doctors tended to do better on simpler cases, while the algorithm showed larger gains on more challenging cases. The backward-compatible nature of the approach provides a practical pathway to improve existing Bayesian diagnostic systems in medicine and beyond. These results bolster the case for embedding causal and counterfactual reasoning into clinical decision support to achieve expert-level performance.
The paper introduces a causal definition of diagnosis and two counterfactual diagnostic measures—expected disablement and expected sufficiency—implemented via twin diagnostic networks over noisy-OR Bayesian/structural causal models. Without altering the underlying disease model, counterfactual ranking markedly improves diagnostic accuracy over associative ranking, achieving expert-level performance and notable gains for rare diseases. This provides the first empirical evidence, in a clinical task, of counterfactual methods surpassing associative approaches. Future work should focus on integrating causal and counterfactual reasoning more broadly into machine learning for healthcare, advancing methods to learn causal models from data, and exploring deep generative/causal models to further enhance decision-making and generalize across settings.
- Use of simulated clinical vignettes rather than real-world EHR data may limit generalizability to clinical practice, although vignettes reduce labeling and confounding issues common in EHRs.
- Counterfactual inference requires structural modeling assumptions; counterfactuals cannot be identified from observational data alone. The approach depends on the correctness of the SCM/noisy-OR assumptions and parameter elicitation.
- The disease model parameters were derived from epidemiological sources and expert elicitation, which can introduce biases and uncertainty.
- The study compares ranking criteria within a fixed model; performance may vary with different model structures, parameterizations, or domains not tested here.
- While improvements were shown across rareness strata, detailed clinical outcomes (e.g., impact on treatment decisions or patient harm reduction) were not evaluated.