Efficient Detection of Stigmatizing Language in Electronic Health Records via In-Context Learning: A Comparative Analysis and Validation Study

Medicine and Health

H. Chen, M. Alfred, et al.

This study by Hongbo Chen, Myrtede Alfred, and Eldan Cohen examines the effectiveness of in-context learning (ICL) for identifying stigmatizing language in Electronic Health Records. ICL outperformed established zero-shot and few-shot methods while using substantially less labeled data, highlighting its potential for reducing bias in healthcare documentation.

Introduction
Stigmatizing language in Electronic Health Records (EHRs) poses significant risks to patient care, perpetuating biases, disrupting therapeutic relationships, and reducing treatment adherence. Previous research predominantly employed resource-intensive supervised machine learning techniques for detection, necessitating large annotated datasets. In-context learning (ICL), where a pre-trained large language model adapts based on instructions and examples, offers a data-efficient alternative. This study aims to evaluate ICL's effectiveness in detecting stigmatizing language in EHRs, particularly in data-scarce environments. The prevalence of stigmatizing language in EHRs is substantial, ranging from 42.8% to 59.1% in various note types, according to previous studies. Such language can be explicit (derogatory terms) or implicit (subtle judgments in tone or phrasing). The detrimental impact on patient care includes biased clinical decisions, diminished trust in clinicians, reduced treatment engagement, and increased avoidance of healthcare. The manual detection of stigmatizing language is labor-intensive and costly, requiring clinical and linguistic expertise. Supervised machine learning approaches, while effective, require large annotated datasets, which are not always available. ICL offers a potential solution to overcome data scarcity and the need for extensive manual annotation. Prompt engineering plays a crucial role in ICL's success. This study will compare ICL's performance with established zero-shot and few-shot methods (textual entailment and SetFit), and a fully supervised fine-tuning approach. A novel prompting strategy, the ‘Stigma Detection Heuristic Prompt,’ will be introduced to enhance ICL’s capabilities. Fairness considerations are crucial, and the study will assess algorithmic bias across protected attributes (sex, age, and race) using the equality of opportunity criterion.
Literature Review
Several studies have explored automated detection of stigmatizing language in EHRs using supervised machine learning techniques like logistic regression and BERT classifiers. However, these approaches rely heavily on large, annotated datasets, which are often difficult and expensive to obtain. The inherent subtleties and contextual dependencies of stigmatizing language make annotation a particularly challenging task. Research in social psychology highlights the connection between implicit biases and language use, indicating that automatically detecting stigmatizing language in clinical notes is vital for identifying and mitigating unconscious biases in healthcare. The impact of stigmatizing language on patient care is well documented, with studies showing negative effects on treatment adherence, therapeutic relationships, and overall quality of care. The 21st Century Cures Act has increased patient access to EHRs, further emphasizing the importance of addressing stigmatizing language in these records. While supervised machine learning approaches have shown promise, their reliance on large, labeled datasets remains a significant limitation. In-context learning (ICL) presents an appealing data-efficient alternative, enabling large language models (LLMs) to adapt to tasks using minimal labeled data.
Methodology
This study utilized a dataset of 5,043 EHR sentences from Beth Israel Deaconess Medical Center, annotated for stigmatizing language (64.4% stigmatizing, 35.6% non-stigmatizing). The dataset was created by Harrigian et al. and extracted from MIMIC-IV. The performance of ICL was compared against zero-shot textual entailment (ROBERTA-M, BART-M, DEBERTA-M, and BERT-M), few-shot SetFit (MPNET-V2, E5-V2, ROBERTA-V2, MINILM-V2, BERT-BIO-ST, and BIOLORD), and supervised fine-tuning (ROBERTA, BERT, BERT-BIO, ROBERTA-XLM, and BERT-BIOMED) approaches.

The ICL approach employed four prompting strategies: Generic, Chain of Thought, Clue and Reasoning Prompting, and the novel Stigma Detection Heuristic Prompt. Five LLMs were used for ICL: LLAMA-3, FLAN-T5, GEMMA-2, MISTRAL-0.2, and BIO-LLAMA-3. For the few-shot settings, experiments were repeated five times with different random seeds and the results averaged. For supervised fine-tuning, hyperparameters were tuned on a validation set of 500 sentences and models were trained with the AdamW optimizer; for ICL, a validation set of 30 data points was used to tune the prompt format.

Performance was evaluated using accuracy, precision, recall, F1-score, and AUC-ROC (for non-ICL approaches), with 95% confidence intervals calculated from ten train-test-validation splits. Paired t-tests were used to compare top-performing models, with the Bonferroni correction applied for multiple comparisons. Model fairness was evaluated using the equality of opportunity criterion, assessing true positive rate (TPR), false positive rate (FPR), and F1-score disparities across sex, age, and race subgroups; the largest absolute TPR disparity for each demographic attribute was reported for each model.
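To make the ICL setup concrete, the sketch below shows how a heuristic-style prompt with a few in-context demonstrations might be assembled and scored in Python with the Hugging Face transformers text-generation pipeline. The heuristic wording, model checkpoint, and label parsing are illustrative assumptions, not the authors' exact Stigma Detection Heuristic Prompt or pipeline.

# Minimal sketch of few-shot ICL for stigma detection. The heuristic text,
# model checkpoint, and label parsing are assumptions for illustration only.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

HEURISTIC = (
    "You are reviewing sentences from clinical notes for stigmatizing language. "
    "Consider cues such as doubting the patient's credibility, negative personality "
    "descriptors, or blame for non-adherence. "
    "Answer with 'stigmatizing' or 'non-stigmatizing' only.\n\n"
)

def build_prompt(sentence, demonstrations):
    # demonstrations: list of (sentence, label) pairs shown in-context
    demos = "".join(f"Sentence: {s}\nLabel: {l}\n\n" for s, l in demonstrations)
    return HEURISTIC + demos + f"Sentence: {sentence}\nLabel:"

def classify(sentence, demonstrations):
    prompt = build_prompt(sentence, demonstrations)
    output = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    completion = output[len(prompt):].strip().lower()
    if completion.startswith("non-stigmatizing"):
        return "non-stigmatizing"
    if completion.startswith("stigmatizing"):
        return "stigmatizing"
    return None  # unparseable response; handle separately

In a zero-shot run the demonstrations list would simply be empty, while the few-shot settings described in the study would supply a handful of labeled sentences per class (4, 8, or 16).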
Key Findings
In the zero-shot setting, the best ICL model (GEMMA-2 with the Stigma Detection Heuristic Prompt) achieved an F1 score of 0.858 (95% CI [0.854, 0.862]), an 18.7% improvement over the best textual entailment model (DEBERTA-M, F1=0.723). In the few-shot setting, the best ICL model (LLAMA-3 with the Stigma Detection Heuristic Prompt) showed F1 score improvements of 21.2%, 21.4%, and 12.3% over the leading SetFit models with 4, 8, and 16 annotations per class, respectively. Using only 32 labeled instances, the best ICL model achieved an F1 score of 0.901, only 3.2% lower than the 0.931 obtained by ROBERTA, the best supervised fine-tuning model, which was trained on 3,543 labeled instances. Fairness evaluation revealed that supervised fine-tuning models exhibited greater bias, with larger absolute TPR disparities for sex (up to 0.062), age (up to 0.136), and race (up to 0.055) than the ICL models. Paired t-tests confirmed that the advantage of the top ICL model over the best zero-shot and few-shot baselines was statistically significant (P<.001). The comparison between the best ICL model and the best supervised fine-tuning model showed that while ROBERTA had a higher F1 score and precision, LLAMA-3 had significantly higher recall.
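For readers who want to trace the reported percentages, the short sketch below reproduces the relative-improvement arithmetic from the summary figures and shows a generic equality-of-opportunity check (largest absolute TPR gap across subgroups). It is an illustrative calculation, not the authors' evaluation code.

# Relative improvements recomputed from the reported F1 scores.
best_icl_zero_shot = 0.858   # GEMMA-2 + Stigma Detection Heuristic Prompt
best_entailment = 0.723      # DEBERTA-M
print((best_icl_zero_shot - best_entailment) / best_entailment)   # ~0.187 -> 18.7%

best_icl_few_shot = 0.901    # LLAMA-3, 32 labeled instances
best_supervised = 0.931      # ROBERTA, 3,543 labeled instances
print((best_supervised - best_icl_few_shot) / best_supervised)    # ~0.032 -> 3.2%

# Equality of opportunity: largest absolute TPR disparity across subgroups
# of one demographic attribute (assumes binary 0/1 labels and predictions,
# and at least one positive example per subgroup).
def true_positive_rate(y_true, y_pred):
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    return sum(p for _, p in positives) / len(positives)

def max_tpr_disparity(y_true, y_pred, groups):
    # groups: one demographic label (e.g., an age bracket) per example
    tprs = []
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        tprs.append(true_positive_rate([y_true[i] for i in idx],
                                       [y_pred[i] for i in idx]))
    return max(tprs) - min(tprs)

Under the equality-of-opportunity criterion, a smaller maximum TPR gap (such as those reported for the ICL models) indicates more equal detection rates across demographic subgroups.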
Discussion
The study demonstrates ICL's effectiveness in detecting stigmatizing language in EHRs, significantly outperforming established zero-shot and few-shot approaches. The novel Stigma Detection Heuristic Prompt further enhances ICL's performance. The results highlight ICL's data efficiency: it achieves performance comparable to supervised methods with substantially less labeled data, which is particularly relevant in healthcare, where acquiring large annotated datasets for tasks such as stigmatizing language detection is challenging. The stronger fairness metrics of the ICL models suggest a potential advantage in mitigating algorithmic bias, a crucial consideration in sensitive clinical applications. The observed variability in ICL performance across prompting strategies underscores the importance of careful prompt engineering. The findings suggest that ICL could be a valuable tool for enhancing bias detection in clinical documentation while reducing resource demands. Future research should explore the application of ICL in diverse clinical settings and investigate its performance on datasets covering a broader range of stigmatizing language expressions. Further research should also examine the potential impact of integrating ICL into the workflow of AI scribes and of giving patients access to this feedback.
Conclusion
This study demonstrates that ICL is a robust and data-efficient method for detecting stigmatizing language in EHRs, outperforming existing zero-shot and few-shot approaches. The novel Stigma Detection Heuristic Prompt improves ICL’s performance. ICL also shows promise in mitigating algorithmic bias. Future work should focus on broader dataset validation and integration into clinical workflows, including applications for AI scribes and direct patient feedback.
Limitations
The study used a single dataset from a specific institution, limiting generalizability. The dataset focused primarily on explicit forms of stigmatizing language, potentially underrepresenting implicit forms. The binary classification framework may not adequately capture the nuanced nature of stigmatizing language. The study did not explore supervised fine-tuning of decoder-based LLMs, and the fairness evaluation did not incorporate sociodemographic factors beyond sex, age, and race. Finally, the study did not directly investigate bias mitigation techniques for supervised fine-tuning.