
Medicine and Health
Comparison of NLP machine learning models with human physicians for ASA Physical Status classification
S. B. Yoon, J. Lee, et al.
This study evaluates natural language processing (NLP) models for automating ASA-PS classification from free-text pre-anesthesia evaluation summaries. The fine-tuned ClinicalBigBird model developed by Soo Bin Yoon and colleagues achieved a macro AUROC of 0.915, matching or exceeding board-certified anesthesiologists on several metrics and outperforming residents, suggesting a practical route toward more consistent, streamlined clinical workflows.
Introduction
The American Society of Anesthesiologists Physical Status (ASA-PS) classification, a fundamental scoring system used to evaluate comorbidities and predict perioperative mortality and morbidity, is widely used in anesthetic guidelines for non-anesthesia care, ambulatory surgery, and pre-procedure evaluations. This has enabled anesthesiologists to provide patients with the benefits of sedation or analgesia while minimizing the associated risks. The ASA-PS classification has also significantly influenced the healthcare system, particularly billing and reimbursement by health insurance companies. However, only poor to moderate agreement has been observed when healthcare professionals across different departments and patient groups apply the ASA-PS system, and the resulting inconsistencies have hindered its objective use. Moreover, significant discrepancies persist across patient scenarios despite a 2014 update that provided approved examples for each ASA-PS class. Thus, a reliable tool that can assign ASA-PS classes accurately by extracting meaningful data from unstructured patient information is needed.
Recent advances in natural language processing (NLP) have improved management of unstructured medical text. GPT-4 has demonstrated exceptional accuracy on general medical examinations but shows lower performance on specialized tasks, indicating a need for targeted improvements. BioClinicalBERT, trained on biomedical corpora, showed promising ASA-PS classification performance (macro AUROC 0.845) but is limited to 512 tokens. ClinicalBigBird can process up to 4,096 tokens and leverage long-term dependencies, showing advantages in clinical QA and document classification. Such specialized models may enhance privacy and reduce misclassification by utilizing longer context and domain knowledge.
This study aims to develop an NLP-based ASA-PS classification model using free-text pre-anesthesia evaluation summaries and to compare its performance with that of anesthesiology residents and board-certified anesthesiologists, with the goal of enabling an automatic, objective framework for perioperative risk prediction, shared decision-making, and resource allocation.
Literature Review
Prior work indicates variability and limited inter-rater reliability in ASA-PS assignment despite guideline updates with examples. Large language models (LLMs) such as GPT-4 achieve high accuracy on broad medical exams (e.g., USMLE) but can underperform on domain-specialized tasks. BioClinicalBERT, pretrained on biomedical text, previously achieved macro AUROC 0.845 for ASA-PS classification but is constrained by a 512-token input limit, necessitating truncation/segmentation of long notes. ClinicalBigBird, capable of handling up to 4,096 tokens and pretrained on large medical corpora, has shown strong performance on long clinical text tasks (clinical QA and document classification), suggesting potential advantages for comprehensive pre-anesthesia narratives. FastText can process all text but lacks medical pretraining, limiting performance. These findings motivate evaluating domain-pretrained, long-context transformers for ASA-PS classification and benchmarking against human clinicians.
Methodology
Study design: Observational, retrospective study approved by the IRB of Seoul National University Hospital (approval 2306-167-1444) with consent waived. Adhered to STARD guidelines.
Data source and cohort: All surgical cases at Seoul National University Hospital from Oct 2004 to May 2023 (n = 717,389). Exclusions: no pre-anesthesia records (n = 101,032), no ASA-PS scores (n = 5,468), ASA-PS VI/brain death (n = 168). Eligible cases: n = 610,721.
Dataset splits: By calendar period—training (Oct 2004–Dec 2022; n = 593,765), tuning (Jan–Mar 2023; n = 10,444), and test (Apr–May 2023; n = 6,512). For tuning and test, stratified random subsampling of 120 adult cases per class (I, II, III, IV-V). During labeling, cases with insufficient information were excluded, and any patients overlapping with the test set were removed from training/tuning to ensure disjoint sets. Final sets used for modeling: training n = 593,510; tuning n = 426; test n = 460.
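To make the splitting and subsampling procedure concrete, below is a minimal pandas sketch of calendar-based splitting, per-class subsampling, and patient-disjoint filtering. The file name and column names (patient_id, asa_class, case_date) are illustrative assumptions, not the authors' actual schema.

```python
import pandas as pd

# Hypothetical input: one row per surgical case with assumed column names.
cases = pd.read_csv("preanesthesia_cases.csv", parse_dates=["case_date"])

# Calendar-based splits, mirroring the study design.
train = cases[cases["case_date"] < "2023-01-01"]
tune = cases[(cases["case_date"] >= "2023-01-01") & (cases["case_date"] < "2023-04-01")]
test = cases[cases["case_date"] >= "2023-04-01"]

# Class-balanced subsample: up to 120 cases per ASA-PS class (I, II, III, IV-V).
def balanced_subsample(df, n_per_class=120, seed=42):
    return (
        df.groupby("asa_class", group_keys=False)
          .apply(lambda g: g.sample(n=min(n_per_class, len(g)), random_state=seed))
    )

tune_sub = balanced_subsample(tune)
test_sub = balanced_subsample(test)

# Keep sets patient-disjoint: drop test-set patients from training and tuning.
test_patients = set(test_sub["patient_id"])
train = train[~train["patient_id"].isin(test_patients)]
tune_sub = tune_sub[~tune_sub["patient_id"].isin(test_patients)]
```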
Reference labels: A consensus committee of five board-certified anesthesiologists provided reference labels for the tuning and test sets.
Data preparation: Pre-anesthesia evaluation summaries (free-text) were translated from mixed Korean/English to English using a proprietary translator. Byte-pair encoding was used for tokenization. Regular expressions and manual review ensured removal of any residual ASA-PS mentions in the notes. ASA-PS classes IV and V were merged (IV-V), and the emergency modifier “E” was removed, resulting in four classes (I, II, III, IV-V).
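As an illustration of the scrubbing and label-merging steps, the following sketch removes residual ASA-PS mentions with a regular expression and maps classes IV and V to a single label while dropping the "E" modifier. The pattern and label encoding are assumptions for demonstration; the authors' exact expressions are not given in this summary.

```python
import re

# Illustrative pattern for residual ASA-PS mentions (e.g., "ASA 2", "ASA-PS III", "ASA class IV E").
ASA_PATTERN = re.compile(
    r"\bASA(?:-PS)?\s*(?:class\s*)?(?:[1-6]|I{1,3}|IV|VI|V)\s*E?\b",
    flags=re.IGNORECASE,
)

def scrub_asa_mentions(note: str) -> str:
    """Remove residual ASA-PS mentions so the model cannot read the label off the note."""
    return ASA_PATTERN.sub("", note)

# Merge ASA-PS IV and V into one "IV-V" class; the emergency modifier "E" is discarded.
LABEL_MAP = {"I": 0, "II": 1, "III": 2, "IV": 3, "V": 3}

def encode_label(raw_label: str) -> int:
    return LABEL_MAP[raw_label.rstrip("E").strip()]
```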
Models: Two domain models were developed—ClinicalBigBird (long-context transformer pretrained on medical corpora) and BioClinicalBERT. Pipeline had three sequential stages: (1) masked language modeling (self-supervised) on training texts; (2) supervised learning with input-output pairs; (3) fine-tuning on the tuning dataset. Hyperparameters tuned via grid search; 80:20 holdout for pre-finetuning stages; 5-fold cross-validation with grid search for fine-tuning. Cross-entropy loss and AdamW optimizer; early stopping by validation loss. GPT-4 was evaluated via static few-shot prompting (five demonstrations, one per class) selected by the consensus committee; for each case, 10 iterations were run to obtain class probability distributions.
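A minimal sketch of the supervised fine-tuning stage using the Hugging Face Trainer is shown below. The checkpoint name and hyperparameter values are illustrative assumptions; the study tuned its hyperparameters by grid search with 5-fold cross-validation, so the exact configuration differs.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

# Assumed public checkpoint for a clinically pretrained BigBird; the paper's
# exact starting weights may differ.
CHECKPOINT = "yikuan8/Clinical-BigBird"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=4)

def tokenize(batch):
    # Long-context model: pre-anesthesia notes up to 4,096 tokens fit without severe truncation.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

args = TrainingArguments(
    output_dir="asa_ps_bigbird",
    learning_rate=2e-5,               # illustrative; the paper tuned this by grid search
    per_device_train_batch_size=4,
    num_train_epochs=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the checkpoint with the lowest validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# `train_ds` and `tune_ds` are assumed Hugging Face datasets with "text" and "label"
# columns, already mapped through `tokenize`. Cross-entropy loss and AdamW are the
# Trainer defaults for sequence classification.
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=tune_ds,
#                   tokenizer=tokenizer,
#                   callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])
# trainer.train()
```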
Evaluation: Metrics included AUROC, AUPRC, sensitivity, specificity, precision, recall, and F1-score at class level, aggregated as macro, micro, and weighted (prevalence-based) averages. Per-group physician performance (residents n=3; board-certified n=5) was derived from the group mode, or from the mean rounded to the nearest class when all raters disagreed. Confusion matrices were generated. Subgroup analyses stratified cases by note length around the median token count. Uncertainty quantification used Monte Carlo Dropout (1000 forward passes) to estimate aleatoric (entropy of mean predictions under input noise) and epistemic (mean variance) uncertainty. Model interpretability used SHAP force plots to visualize token contributions. Statistical tests: DeLong for AUROC, Mann-Whitney U for AUPRC; 95% CIs by bootstrap (1000 iterations of 4000 samples). Significance threshold p < 0.05. Implementation in Python 3.10.0. Code available at: https://github.com/jipyeong-lee/ASA-PS-NLP-vs-Human-Physicians/.
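The Monte Carlo Dropout procedure can be sketched as follows: dropout is kept active at inference, many stochastic forward passes are collected, and uncertainty is summarized from the resulting class-probability distributions. This is a generic implementation of the technique, not the authors' code; in particular, the input-noise component of the aleatoric estimate and the 1000-pass setting are simplified here.

```python
import torch

def mc_dropout_uncertainty(model, inputs, n_passes=100):
    """Monte Carlo Dropout: keep dropout active at inference and repeat forward passes.

    Returns per-example aleatoric uncertainty (entropy of the mean class distribution)
    and epistemic uncertainty (predictive variance across passes, averaged over classes).
    """
    model.train()  # keeps dropout layers stochastic during inference
    passes = []
    with torch.no_grad():
        for _ in range(n_passes):
            logits = model(**inputs).logits
            passes.append(torch.softmax(logits, dim=-1))
    probs = torch.stack(passes)                    # (n_passes, batch, n_classes)
    mean_probs = probs.mean(dim=0)                 # (batch, n_classes)
    aleatoric = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)
    epistemic = probs.var(dim=0).mean(dim=-1)      # variance across passes, averaged over classes
    return aleatoric, epistemic
```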
Key Findings
- Dataset: 717,389 surgeries (Oct 2004–May 2023). Final modeling sets: training n=593,510; tuning n=426; test n=460. Training set skewed toward ASA-PS I/II (~90%); tuning/test were class-balanced after subsampling.
- ClinicalBigBird overall performance (test set): macro AUROC 0.915 (95% CI 0.909–0.920), micro AUROC 0.914 (0.909–0.919), weighted AUROC 0.912 (0.905–0.918). Macro specificity 0.901 (0.897–0.906), precision 0.732 (0.717–0.746), F1 0.716 (0.701–0.730). Class-wise AUROC: I 0.952; II 0.884; III 0.877; IV-V 0.946. Class-wise F1: I 0.750; II 0.772; III 0.688; IV-V 0.654.
- ClinicalBigBird vs BioClinicalBERT: ClinicalBigBird had higher micro-averaged AUROC (p = 0.010) and lower aleatoric/epistemic uncertainty. BioClinicalBERT macro AUROC 0.899; generally lower sensitivity/precision/F1 across classes.
- ClinicalBigBird vs GPT-4: Comparable macro/micro AUROCs overall, but ClinicalBigBird outperformed GPT-4 notably on longer inputs and in ASA-PS I and II for long texts. GPT-4 showed overestimation tendencies, with high overcalling of ASA-PS I and II to higher classes in confusion matrices.
- Human physicians vs model: ClinicalBigBird exceeded board-certified anesthesiologists in specificity (0.901 vs 0.897), precision (0.732 vs 0.715), and F1 (0.716 vs 0.713), all p < 0.01. Versus residents, ClinicalBigBird showed higher sensitivity (0.723 vs 0.598), specificity (0.901 vs 0.868), and F1 (0.716 vs 0.633), all p < 0.001.
- Error patterns: Residents labeled 63.26% of cases as ASA-PS II (conservative bias). Board-certified anesthesiologists underestimated ASA-PS II as I (33.33%) and ASA-PS III as I/II (33.13%). ClinicalBigBird reduced underestimation for ASA-PS II and III (to 5.85% and 25.15%, respectively) but overestimated ASA-PS I more often (from 1.35% to 32.00%). GPT-4 strongly overestimated (ASA-PS I 77.33%; II 22.22%).
- Critical boundary (II vs III): Misassignment rates (I/II vs III/IV-V or vice versa) were 13.48% (board-certified anesthesiologists), 21.96% (residents), 11.74% (ClinicalBigBird), 14.12% (BioClinicalBERT), and 11.95% (GPT-4).
- Subgroup by note length: ClinicalBigBird achieved higher AUROC/AUPRC on longer-than-median texts (weighted AUROC 0.931; AUPRC 0.853) than shorter texts (weighted AUROC 0.908; AUPRC 0.820). For longer inputs, ClinicalBigBird significantly outperformed GPT-4 in ASA-PS I and II and BioClinicalBERT in II–IV-V.
- Interpretability: SHAP showed high-impact tokens aligned with clinical guidelines (e.g., hypothyroidism, moyamoya disease, infarction) driving class predictions.
- Inter-rater agreement: Fleiss’ kappa—board-certified anesthesiologists 0.743 (0.731–0.754), residents 0.480 (0.463–0.498), GPT-4 (10 runs) 0.612 (0.601–0.623).
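For readers unfamiliar with the agreement statistic reported above, Fleiss' kappa is computed from per-case category counts across raters. Below is a toy example using statsmodels (not the study's data).

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy ratings: rows are cases, columns are raters,
# values are assigned classes (0 = ASA-PS I, 1 = II, 2 = III, 3 = IV-V).
ratings = np.array([
    [1, 1, 2, 1, 1],
    [2, 2, 2, 3, 2],
    [0, 1, 1, 1, 0],
    [3, 3, 3, 3, 2],
])

# aggregate_raters converts rater-wise labels into per-case category counts,
# the input format expected by fleiss_kappa.
counts, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa: {fleiss_kappa(counts):.3f}")
```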
Discussion
The proposed NLP approach using pre-anesthesia evaluation summaries achieved performance comparable to, and in several metrics exceeding, that of board-certified anesthesiologists, while outperforming anesthesiology residents. This suggests that an automated system could improve consistency in ASA-PS assignment and mitigate long-standing inter-rater variability.
The boundary between ASA-PS II and III is clinically pivotal for perioperative planning, non-anesthesia care, and ambulatory surgery eligibility. Misclassification at this boundary can lead to inappropriate preoperative testing and compromise safety. The ClinicalBigBird model demonstrated lower misassignment rates across this boundary compared to physicians and other models, indicating potential to standardize decision-making.
Error analyses revealed different biases: residents tended to conservatively label as ASA-PS II, potentially reflecting limited experience; board-certified anesthesiologists sometimes underestimated by overlooking well-controlled comorbidities. ClinicalBigBird, trained on consensus labels and equipped to process long texts without fatigue, reduced underestimation for ASA-PS II/III, though it showed increased overestimation for ASA-PS I.
ClinicalBigBird’s advantages stem from domain pretraining and long-context processing (up to 4096 tokens), enabling comprehensive use of lengthy pre-anesthesia notes. In contrast, BioClinicalBERT’s 512-token limit necessitates truncation or segmentation, likely contributing to its lower performance and higher uncertainty. GPT-4, evaluated via static few-shot prompting without task-specific fine-tuning, tended to overestimate ASA-PS and to misinfer underlying conditions; domain-specific pretraining and fine-tuning might improve its performance, but in this study it was less well suited to ASA-PS classification than ClinicalBigBird.
Overall, integrating such NLP models into perioperative workflows could provide objective, reproducible ASA-PS scoring to support risk prediction, resource allocation, and shared decision-making, potentially improving patient safety and operational efficiency.
Conclusion
An NLP-based model using free-text pre-anesthesia evaluation summaries can classify ASA-PS with performance similar to or better than board-certified anesthesiologists and superior to anesthesiology residents. By enhancing consistency and inter-rater reliability, this approach could streamline perioperative workflows and support objective risk stratification, pending validation across diverse clinical settings.
Limitations
- Ground truth subjectivity: ASA-PS inherently involves subjective judgment. Despite consensus labeling and substantial inter-rater agreement (Fleiss’ kappa 0.743), residual uncertainty remains; performance metrics should be interpreted as approximate.
- Single-institution data: Models were developed/validated using data from a single South Korean hospital; generalizability may be limited due to differences in demographics, documentation styles, and practice patterns.
- Adult-only cohort: Pediatric cases were excluded; extension to pediatric populations requires dedicated validation.
- Data modality scope: Only refined pre-anesthesia evaluation summaries were used; incorporating broader unprocessed records (outpatient histories, nursing notes, admissions, labs) could improve generalization and clinical impact.
- Translation effects: Translating notes from Korean to English may have affected accuracy.
- Limited human rater sample: Five board-certified anesthesiologists and three residents may not represent broader clinician populations; larger, multi-center studies are needed.
- GPT-4 setup limitations: Static few-shot prompting was used without domain-specific pretraining or fine-tuning, potentially underestimating GPT-4’s capabilities; dynamic prompting could improve robustness.
- Metric prevalence dependence: Metrics like F1 may not fully reflect performance in settings with different ASA-PS distributions than the balanced test/tuning sets.