Introduction
The American Society of Anesthesiologists' Physical Status (ASA-PS) classification system is central to assessing patient comorbidities before sedation and analgesia, but inconsistency among raters hinders its objective use. This study addresses that limitation by developing and validating natural language processing (NLP) models that classify ASA-PS from pre-anesthesia evaluation summaries. The ASA-PS classification is integral to anesthetic guidelines and affects patient care, billing, and reimbursement, yet previous studies have reported poor to moderate inter-rater agreement among healthcare professionals, and significant discrepancies persist even after a 2014 update that added examples for each ASA-PS class. This motivates a reliable tool that can assign ASA-PS classes accurately from unstructured patient information. Recent advances in NLP, particularly models such as Generative Pretrained Transformer (GPT)-4, BioClinicalBERT, and ClinicalBigBird, offer potential solutions for processing unstructured medical text. GPT-4 achieves high accuracy on general medical tasks, but its performance on specialized tasks is lower and requires domain-specific improvement. BioClinicalBERT has shown promise in ASA-PS classification but is constrained by its token-processing capacity, whereas ClinicalBigBird's larger token capacity may overcome this limitation by leveraging long-range dependencies in the text. This study therefore develops NLP-based ASA-PS classification models from pre-anesthesia evaluation summaries, compares their performance with that of anesthesiology residents and board-certified anesthesiologists, and aims to provide an automated, objective framework for risk prediction and resource allocation in perioperative medicine.
Literature Review
The literature review highlights the inconsistencies and limitations of the existing ASA-PS classification system. Studies have shown poor to moderate inter-rater agreement among healthcare professionals, even after a 2014 update to the guidelines, and this inconsistency affects clinical decision-making, resource allocation, and billing. Recent research has explored NLP techniques for medical text analysis, showing promise in improving the accuracy and efficiency of clinical tasks. Models such as BioClinicalBERT and ClinicalBigBird have been applied successfully in similar contexts, although token-handling capacity remains a practical constraint. These models are pre-trained on biomedical corpora, which provides domain-specific knowledge and can yield better performance than general-purpose language models. This study builds on that prior work by directly comparing the performance of several state-of-the-art NLP models with that of human physicians in a real-world clinical setting.
Methodology
This observational study used data from 717,389 surgical cases at a tertiary hospital (October 2004 to May 2023). Cases with missing pre-anesthesia records, missing ASA-PS scores, or ASA-PS VI were excluded, leaving 610,721 eligible cases. The data were split into training (October 2004 to December 2022), tuning (January 2023 to March 2023), and test (April 2023 to May 2023) sets. For the tuning and test sets, 120 cases per ASA-PS class (I, II, III, IV-V) were randomly sampled, and board-certified anesthesiologists created consensus reference labels for these sets. Three NLP models were evaluated against these labels: ClinicalBigBird, BioClinicalBERT, and GPT-4. ClinicalBigBird and BioClinicalBERT were developed in a three-stage process: (1) masked language modeling on the training dataset to learn relationships between words; (2) supervised learning on the training inputs and labels; and (3) fine-tuning on the tuning dataset; GPT-4 was applied with few-shot prompting rather than fine-tuning. Hyperparameters were tuned using grid search and cross-validation. Model evaluation metrics included AUROC, AUPRC, sensitivity, specificity, precision, recall, F1-score, and confusion matrices, and aleatoric and epistemic uncertainties were also calculated. A subgroup analysis stratified the test set by the median length of the pre-anesthesia evaluation summaries. The SHAP (SHapley Additive exPlanations) method was used for model interpretability. The DeLong test compared AUROCs, the Mann-Whitney U-test compared AUPRCs, and Fleiss' kappa measured inter-rater reliability among physicians and across GPT-4 iterations. Statistical significance was set at p < 0.05.
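As a concrete illustration of the fine-tuning stage, the sketch below fine-tunes a BigBird-style clinical checkpoint for four-class ASA-PS classification with the Hugging Face Transformers library. This is a minimal sketch under stated assumptions: the checkpoint name, example summaries, label encoding, and hyperparameter values are placeholders, not the authors' configuration, which was selected by grid search and cross-validation.

```python
# Minimal sketch, assuming a public Clinical-BigBird checkpoint on the Hugging Face Hub
# and a four-class label encoding (0 = ASA-PS I, 1 = II, 2 = III, 3 = IV-V).
# Checkpoint name, example texts, and hyperparameters are placeholders, not the study's settings.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

CHECKPOINT = "yikuan8/Clinical-BigBird"  # assumed checkpoint; swap in the model actually used

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=4)

# Toy stand-ins for the training and tuning splits of pre-anesthesia summaries.
train_data = {
    "text": [
        "62-year-old with well-controlled hypertension scheduled for cholecystectomy.",
        "Healthy 30-year-old undergoing elective knee arthroscopy.",
    ],
    "label": [1, 0],
}
tune_data = {
    "text": ["78-year-old with severe COPD and a recent myocardial infarction."],
    "label": [3],
}

def tokenize(batch):
    # BigBird's sparse attention accepts long inputs (up to 4,096 tokens),
    # so full summaries need not be truncated to BERT's 512-token limit.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

train_ds = Dataset.from_dict(train_data).map(tokenize, batched=True)
tune_ds = Dataset.from_dict(tune_data).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="asa_ps_clinicalbigbird",
    learning_rate=2e-5,              # assumed value; the study tuned hyperparameters by grid search
    per_device_train_batch_size=4,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=tune_ds,
    tokenizer=tokenizer,             # enables dynamic padding via the default data collator
)
trainer.train()
```

In the full three-stage pipeline, a masked language modeling pass over the institutional training notes (for example with `AutoModelForMaskedLM` and a masking data collator) would precede this supervised step; that stage is omitted here for brevity.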
Key Findings
The ClinicalBigBird model achieved the highest performance, with macro- and micro-averaged AUROCs exceeding 0.91. It outperformed BioClinicalBERT (p = 0.010 for the micro-averaged AUROC) and performed comparably to GPT-4. Importantly, ClinicalBigBird demonstrated superior performance compared to board-certified anesthesiologists, achieving higher specificity (0.901 vs. 0.897), precision (0.732 vs. 0.715), and F1-score (0.716 vs. 0.713) (all p < 0.01). It also surpassed anesthesiology residents in sensitivity, specificity, and F1-score (all p < 0.001). Analysis of the confusion matrices revealed that anesthesiology residents frequently classified cases as ASA-PS II, while board-certified anesthesiologists tended to underestimate ASA-PS II and misclassify ASA-PS III; ClinicalBigBird showed improved performance in these categories. GPT-4 exhibited a marked tendency to overestimate ASA-PS I and II. Subgroup analysis showed that ClinicalBigBird achieved better AUROCs and AUPRCs for longer pre-anesthesia evaluation summaries. Feature importance analysis using SHAP values highlighted keywords such as "hypothyroidism," "moyamoya disease," and "infarction" as significant contributors to ASA-PS classification. The Fleiss' kappa values for board-certified anesthesiologists, anesthesiology residents, and GPT-4 were 0.743, 0.480, and 0.612, respectively. Aleatoric and epistemic uncertainties were lower for ClinicalBigBird than for BioClinicalBERT.
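For reference, the sketch below shows how the headline quantities above can be computed: one-vs-rest AUROCs with macro and micro averaging, a confusion matrix of predicted versus reference classes, and Fleiss' kappa for agreement across raters. The arrays are toy placeholders standing in for test-set labels, model probabilities, and rater assignments; they are not study data.

```python
# Minimal evaluation sketch with toy data (0 = ASA-PS I, 1 = II, 2 = III, 3 = IV-V).
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.preprocessing import label_binarize
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

y_true = np.array([0, 1, 2, 3, 1, 2])   # reference labels from the anesthesiologist consensus
y_prob = np.array([                     # model class probabilities (one row per case)
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.15, 0.60, 0.20],
    [0.05, 0.10, 0.25, 0.60],
    [0.20, 0.50, 0.20, 0.10],
    [0.10, 0.20, 0.50, 0.20],
])

# Macro-averaged one-vs-rest AUROC directly; micro-averaged AUROC via binarized labels.
macro_auroc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
y_true_bin = label_binarize(y_true, classes=[0, 1, 2, 3])
micro_auroc = roc_auc_score(y_true_bin, y_prob, average="micro")

# Confusion matrix: rows are reference classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_prob.argmax(axis=1), labels=[0, 1, 2, 3])

# Fleiss' kappa: each row is one case, each column is one rater's assigned class.
ratings = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [2, 3, 2],
    [3, 3, 3],
    [1, 2, 1],
    [2, 2, 2],
])
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table)

print(f"macro AUROC={macro_auroc:.3f}, micro AUROC={micro_auroc:.3f}, Fleiss' kappa={kappa:.3f}")
print(cm)
```

The DeLong comparison of correlated AUROCs and the aleatoric/epistemic uncertainty estimates are not shown here, as they require implementations beyond scikit-learn and statsmodels.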
Discussion
This study demonstrates the potential of NLP models, specifically ClinicalBigBird, to classify ASA-PS accurately from unstructured pre-anesthesia evaluation summaries, achieving performance comparable to, and in some cases exceeding, that of experienced anesthesiologists. This has significant implications for improving the consistency and objectivity of ASA-PS classification and potentially reducing inter-rater variability. The superior performance of ClinicalBigBird compared to the other models may be attributable to its ability to handle longer input sequences and its pre-training on a large medical text corpus. The model's accuracy could streamline clinical workflows, reducing the burden on healthcare professionals and improving resource allocation, and its ability to consistently process all available information reduces the risk of bias and of overlooked clinical details that is inherent in human assessment. Nevertheless, the inherent subjectivity of ASA-PS classification remains a challenge, and the model's tendency to misclassify ASA-PS III as ASA-PS IV-V warrants further investigation. Further research should focus on addressing these model limitations and expanding the dataset to improve generalizability.
Conclusion
This study demonstrates that an NLP-based model, specifically ClinicalBigBird, can achieve ASA-PS classification performance comparable to or exceeding that of board-certified anesthesiologists. This approach offers a promising route to more objective and efficient ASA-PS classification, potentially reducing inter-rater variability. Future research should focus on validating the model across diverse populations and clinical settings, expanding the input data to include additional medical records, and exploring techniques to further enhance model interpretability and reduce misclassification errors, particularly in differentiating ASA-PS III from IV-V.
Limitations
The study's limitations include the use of data from a single institution in South Korea, which may limit the generalizability of the findings to other populations and healthcare systems. The relatively small number of participating physicians may not fully capture the variability in human classification. The study included only adult patients, excluding the pediatric population. The translation of pre-anesthesia evaluations from Korean to English may have introduced some error. The static few-shot prompting used for GPT-4 may have limited its performance relative to the fine-tuned models. Finally, the prevalence-dependent metrics may not fully reflect model performance in settings with different ASA-PS class distributions.