Efficient Detection of Stigmatizing Language in Electronic Health Records via In-Context Learning: A Comparative Analysis and Validation Study

H. Chen, M. Alfred, et al.

This study by Hongbo Chen, Myrtede Alfred, and Eldan Cohen examines the effectiveness of in-context learning (ICL) for identifying stigmatizing language in Electronic Health Records. ICL outperformed zero- and few-shot baselines and approached fully supervised performance while using far less labeled data, highlighting its potential for reducing bias in healthcare documentation.

Introduction
Electronic Health Records (EHRs) are central to clinical communication but can embed stigmatizing language that undermines trust, contributes to poorer care, and perpetuates bias. Stigmatizing language may be explicit (e.g., derogatory terms) or implicit through tone and framing. Prior work shows substantial prevalence of stigmatizing language in clinical notes and links such language to negative provider attitudes and treatment decisions. Manual annotation to detect stigma is costly and requires domain expertise. Existing automated approaches have focused on supervised learning, which needs large annotated datasets—often impractical for subtle, context‑dependent stigma. In‑context learning (ICL) with instruction‑tuned LLMs offers a data‑efficient alternative, potentially well suited to nuanced language detection. This study evaluates whether ICL can accurately and fairly detect stigmatizing language in EHRs under zero‑ and few‑shot conditions and compares ICL to textual entailment, SetFit, and fully supervised fine‑tuning.
Literature Review
Prior studies document the prevalence and harms of stigmatizing language in EHRs, including effects on clinician perceptions, adherence to guidelines, and patient trust. Bias manifests across domains such as substance use, mental health, diabetes, and obesity. Early automated approaches used supervised learning (e.g., logistic regression, BERT) requiring thousands of labels (Sun et al.; Harrigian et al.; Barcelona et al.). However, supervised models can reflect dataset imbalances and exhibit fairness gaps across protected groups (sex, race, age), as shown in broader healthcare ML literature (e.g., chest X‑ray classifiers). Recent advances in prompting and ICL suggest instruction‑tuned LLMs can perform competitively in low‑label regimes. Prompt engineering techniques (e.g., Chain of Thought, CARP) improve reasoning for text classification. Yet, no prior work has systematically evaluated ICL for detecting stigmatizing language in EHRs or examined fairness in this context. This study addresses these gaps and introduces a task‑tailored Stigma Detection Guided Prompt based on domain heuristics and linguistic cues (questioning credibility, disapproval, stereotyping, difficult patient, unilateral decision‑making).
Methodology
Dataset: 5,043 sentences (3,249 stigmatizing; 1,794 non-stigmatizing) sampled from 4,710 discharge summaries for 4,259 patients in MIMIC-IV (Beth Israel Deaconess Medical Center, 2008–2019). Sentences were identified via keyword matching (e.g., non-compliant, agitated) with symmetric context windows of 15 tokens. Sentences were labeled by annotators (1 clinician, 2 assistants); disagreements were resolved via discussion; average pairwise Cohen's kappa > 0.9. Demographics were available for sex, age group (0–25, 26–50, 51–75, 75+), and race (White, Black, Hispanic, Asian, Other/Unknown). The original dataset had ethics approval; this secondary analysis was exempt (TCPS2 Article 2.2).

Approaches compared (illustrative code sketches for each approach, the evaluation statistics, and the fairness check appear after this section):
- Zero-shot textual entailment: Transformer NLI models treat the sentence as a premise and the stigmatizing/non-stigmatizing label as a hypothesis (models: roberta-large-mnli (ROBERTA-M), bart-large-mnli (BART-M), deberta-large-mnli (DEBERTA-M), bert-base-multilingual-cased-multinli (BERT-M)).
- Few-shot SetFit: Contrastive fine-tuning of sentence transformers on small labeled pairs followed by a classifier head (models: paraphrase-mpnet-base-v2 (MPNET-V2), e5-base-v2 (E5-V2), stsb-roberta-base-v2 (ROBERTA-V2), paraphrase-multilingual-MiniLM-L12-v2 (MINILM-V2), S-BioBert-snli-multinli-stsb (BERT-BIO-ST), BioLORD-2023 (BIOLORD)); evaluated with 4, 8, and 16 annotations per class; sampling repeated 5 times with different seeds and averaged.
- Fully supervised fine-tuning: Encoder-only PLMs (roberta-base (ROBERTA), bert-base-uncased (BERT), biobert-v1.1 (BERT-BIO), xlm-roberta-base (ROBERTA-XLM), BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext (BERT-BIOMED)) fine-tuned on 3,543 training sentences (~70%), a 500-example validation set (~10%), and a held-out test set (~20%). Hyperparameters were tuned via grid search (learning rate, batch size, weight decay); AdamW optimizer; early stopping after 5 epochs without validation-loss improvement.
- In-Context Learning (ICL): Instruction-tuned LLMs (Meta-Llama-3-8B-Instruct, flan-t5-large, gemma-2-9b-it, Mistral-7B-Instruct-v0.2, Bio-Medical-Llama-3-8B). A 30-example validation set was used to refine prompt formatting. Four prompting strategies: Generic; Chain of Thought (two-stage reasoning then answer); CARP (list clues, then reasoning, then decision); and the Stigma Detection Guided Prompt (task-tailored: focuses on expert-flagged keywords in context and provides five common linguistic stigma patterns). Few-shot ICL included exemplars (and, for COT/CARP, exemplars with reasoning/clues) subject to model context limits.

Evaluation: Binary classification assessed with accuracy, precision, recall, F1, and ROC-AUC where applicable (AUC was not computed for ICL because its outputs are non-probabilistic). For robustness, models were trained, tuned, and evaluated on 10 distinct train/validation/test splits with disjoint 1,000-example test sets; means and 95% CIs are reported. Paired t-tests (F1, precision, recall) compared the top models across approaches, with Bonferroni correction for 15 comparisons setting α=0.003. Few-shot experiments were averaged across seeds.

Fairness: An equal-performance criterion assessed subgroup disparities (TPR, FPR, F1) across sex (male/female), age (0–25, 26–50, 51–75, 75+), and race (White, Black, Hispanic, Asian, Other/Unknown). Disparity was computed as the difference between groups (sex) or between each subgroup and the median (age, race); the largest absolute disparities per attribute are reported and the highest observed disparities are summarized across approaches.
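The zero-shot textual entailment baseline can be reproduced with the Hugging Face zero-shot-classification pipeline, which turns each candidate label into an NLI hypothesis against the sentence as premise. Below is a minimal sketch assuming roberta-large-mnli (the ROBERTA-M baseline); the hypothesis template and label wording are illustrative assumptions, not the study's exact prompt.

```python
# Zero-shot textual entailment: the sentence is the NLI premise and each
# candidate label is inserted into the hypothesis_template as a hypothesis.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="roberta-large-mnli")

sentence = "Patient was non-compliant with medication regimen."
labels = ["stigmatizing", "non-stigmatizing"]  # label wording is an assumption

result = classifier(
    sentence,
    candidate_labels=labels,
    hypothesis_template="This sentence is {}.",  # illustrative template
)
predicted_label = result["labels"][0]  # labels come back sorted by score
print(predicted_label, round(result["scores"][0], 3))
```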
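A few-shot SetFit run on a handful of labeled sentences per class might look like the sketch below, using the setfit library's classic SetFitTrainer API (newer setfit releases expose a Trainer/TrainingArguments interface instead) with paraphrase-mpnet-base-v2 (MPNET-V2). The tiny dataset and hyperparameters are illustrative assumptions.

```python
# Few-shot SetFit: contrastive fine-tuning of a sentence transformer on
# labeled sentence pairs, then a lightweight classification head.
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Tiny illustrative training set; the study used 4, 8, or 16 labels per class.
train_ds = Dataset.from_dict({
    "text": [
        "Patient was non-compliant with medication regimen.",
        "Patient claims the pain is unbearable.",
        "Patient reports taking medication as prescribed.",
        "Patient describes intermittent chest pain.",
    ],
    "label": [1, 1, 0, 0],  # 1 = stigmatizing, 0 = non-stigmatizing
})

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20,  # contrastive pairs generated per example
)
trainer.train()

preds = model(["Patient insists the symptoms are real."])
print(preds)
```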
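The fully supervised baseline follows the standard encoder fine-tuning recipe (Trainer defaults to AdamW). A minimal sketch with roberta-base and early stopping on validation loss after 5 evaluations without improvement; the placeholder data and hyperparameter values are assumptions, not the grid-searched settings from the study.

```python
# Supervised fine-tuning of an encoder-only PLM for binary classification.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder data; the study used ~3,543 training and 500 validation sentences.
train_ds = Dataset.from_dict({
    "text": ["Patient was non-compliant with treatment.", "Patient reports improved sleep."],
    "label": [1, 0],
})
val_ds = Dataset.from_dict({"text": ["Patient claims to have taken all doses."], "label": [1]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = train_ds.map(tokenize, batched=True)
val_ds = val_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="stigma-roberta",
    learning_rate=2e-5,              # placeholder; tuned by grid search in the study
    per_device_train_batch_size=16,
    weight_decay=0.01,
    num_train_epochs=20,
    eval_strategy="epoch",           # older transformers versions: evaluation_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],  # stop after 5 epochs w/o improvement
)
trainer.train()
```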
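Zero-shot ICL with the Stigma Detection Guided Prompt can be sketched as follows, using Meta-Llama-3-8B-Instruct via the transformers chat template. The prompt wording here paraphrases the five linguistic patterns named in the paper and is an assumption, not the study's exact prompt.

```python
# Zero-shot in-context learning with an instruction-tuned LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative guided prompt: focus on the flagged keyword in context and
# list common stigmatizing patterns, then request a one-word verdict.
guided_prompt = (
    "You are reviewing sentences from hospital discharge summaries. "
    "Decide whether the sentence uses stigmatizing language about the patient, "
    "paying attention to the flagged keyword in its context. Common stigmatizing "
    "patterns include: questioning the patient's credibility, expressing disapproval, "
    "stereotyping, portraying the patient as difficult, and unilateral decision-making. "
    "Answer with exactly one word: 'stigmatizing' or 'non-stigmatizing'.\n\n"
    "Sentence: Patient was non-compliant with medication regimen.\nAnswer:"
)

messages = [{"role": "user", "content": guided_prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=10, do_sample=False)
answer = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(answer.strip().lower())
```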
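The statistical comparison of top models can be reproduced with paired t-tests over the 10 matched test splits and a Bonferroni-adjusted threshold (α = 0.05 / 15 ≈ 0.003). A minimal sketch with hypothetical per-split F1 scores:

```python
# Paired t-test across the 10 matched test splits, Bonferroni-corrected.
from scipy import stats

# Hypothetical per-split F1 scores for two models evaluated on the same splits.
f1_icl        = [0.899, 0.903, 0.897, 0.905, 0.900, 0.902, 0.898, 0.904, 0.901, 0.899]
f1_supervised = [0.930, 0.933, 0.928, 0.935, 0.929, 0.932, 0.927, 0.934, 0.931, 0.930]

t_stat, p_value = stats.ttest_rel(f1_icl, f1_supervised)

alpha = 0.05 / 15  # Bonferroni correction for 15 pairwise comparisons
print(f"t = {t_stat:.3f}, p = {p_value:.2e}, significant = {p_value < alpha}")
```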
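The fairness check under the equal-performance criterion can be sketched as below: compute per-subgroup TPR/FPR, take the male/female difference for sex, and take each subgroup's deviation from the median rate for age and race. Column names and the toy data are illustrative assumptions.

```python
# Equal-performance fairness check: per-subgroup TPR/FPR disparities.
import pandas as pd

def rates(group: pd.DataFrame) -> pd.Series:
    # True/false positive rates within one demographic subgroup.
    tp = ((group.pred == 1) & (group.label == 1)).sum()
    fn = ((group.pred == 0) & (group.label == 1)).sum()
    fp = ((group.pred == 1) & (group.label == 0)).sum()
    tn = ((group.pred == 0) & (group.label == 0)).sum()
    return pd.Series({"TPR": tp / max(tp + fn, 1), "FPR": fp / max(fp + tn, 1)})

# Illustrative predictions with demographic attributes attached.
df = pd.DataFrame({
    "label": [1, 1, 0, 0, 1, 0, 1, 0],
    "pred":  [1, 0, 0, 1, 1, 0, 1, 0],
    "sex":   ["F", "M", "F", "M", "F", "M", "F", "M"],
    "race":  ["White", "Black", "White", "Asian", "Black", "White", "Asian", "Black"],
})

# Sex: disparity is the absolute difference between the two groups.
by_sex = df.groupby("sex")[["pred", "label"]].apply(rates)
sex_disparity = (by_sex.loc["M"] - by_sex.loc["F"]).abs()

# Race (and age): disparity is each subgroup's deviation from the median rate.
by_race = df.groupby("race")[["pred", "label"]].apply(rates)
race_disparity = (by_race - by_race.median()).abs().max()

print(sex_disparity, race_disparity, sep="\n")
```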
Key Findings
- Zero-shot: ICL outperformed textual entailment. The best ICL configuration (GEMMA-2 with the Stigma Detection Guided Prompt) achieved F1=0.858 (95% CI [0.854, 0.862]) and accuracy=83.5%, exceeding the best entailment model (DEBERTA-M, F1=0.723 [0.718, 0.728], accuracy=73.8%) by 0.135 F1 (~18.7% relative; the relative gains are reproduced in the short sketch after this list). Paired t-tests showed significant gains in F1, precision, and recall (P<.001).
- Few-shot: ICL surpassed SetFit at all label budgets. Using the Stigma Detection Guided Prompt with LLAMA-3, F1 scores were 0.874 (4 shots/class), 0.892 (8), and 0.901 (16). The best SetFit results were 0.721 (MINILM-V2, 4), 0.735 (E5-V2, 8), and 0.802 (E5-V2, 16). Relative improvements were ~21.2%, ~21.4%, and ~12.3% (all P<.001).
- Prompting strategies: The Stigma Detection Guided Prompt yielded the highest mean accuracy and F1 across zero- and few-shot ICL, outperforming Generic, COT, and CARP. Zero-shot mean accuracy varied by 11.3 percentage points across prompts; few-shot mean accuracy gaps were 6.8–8.1 points depending on shot count.
- Supervised vs ICL: Fully supervised RoBERTa achieved F1=0.931 (95% CI [0.924, 0.938]) and ROC-AUC=0.973, while the best ICL configuration reached F1=0.901 (95% CI [0.895, 0.907]) with only 32 labels (16/class), underscoring ICL's label efficiency. RoBERTa had higher precision; ICL had higher recall (P<.001 for both differences).
- Fairness: Supervised fine-tuning showed larger disparities than ICL. Mean largest absolute TPR disparities across supervised models were 0.037 (sex), 0.077 (age), and 0.046 (race); ICL models' mean largest TPR disparities were ≤0.007 (sex), ≤0.010 (age), and ≤0.013 (race). Highest observed supervised disparities: TPR up to 0.051 (sex), 0.108 (age), 0.064 (race); FPR up to 0.039/0.037/0.043; F1 up to 0.064/0.086/0.064. ICL's highest disparities remained below 0.018–0.026 across metrics and attributes, indicating more equitable performance.
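As a quick check on the reported relative improvements, each percentage follows from (F1_new - F1_baseline) / F1_baseline. A short sketch reproducing the zero- and few-shot comparisons above:

```python
# Relative F1 improvement of ICL over each baseline: (new - base) / base.
def relative_gain(new: float, base: float) -> float:
    return (new - base) / base

print(f"zero-shot: {relative_gain(0.858, 0.723):.1%}")  # ~18.7%
print(f"4-shot:    {relative_gain(0.874, 0.721):.1%}")  # ~21.2%
print(f"8-shot:    {relative_gain(0.892, 0.735):.1%}")  # ~21.4%
print(f"16-shot:   {relative_gain(0.901, 0.802):.1%}")  # ~12.3%
```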
Discussion
Findings demonstrate that instruction-tuned LLMs with in-context learning detect stigmatizing EHR language more effectively than established zero- and few-shot baselines and approach fully supervised performance with orders of magnitude fewer labels. This addresses the core research question of whether ICL can provide accurate and fair stigma detection under data scarcity: ICL achieved strong F1 and superior subgroup parity compared with supervised fine-tuning, which exhibited larger TPR/FPR/F1 disparities, likely reflecting demographic imbalances in training data. The Stigma Detection Guided Prompt, which embeds domain heuristics and focuses attention on high-risk keywords in context, was pivotal, consistently improving ICL performance over Generic, COT, and CARP prompts. Practically, ICL's non-probabilistic outputs preclude threshold-independent metrics such as ROC-AUC and prevent decision-threshold tuning; however, its high recall and fairness profile suggest utility as a supportive screening tool to flag potentially stigmatizing language. Clinically, integrating such tools can support bias-aware documentation, provided deployment addresses clinician workflow, acceptance, and privacy. Overall, ICL offers a data-efficient and equitable solution for stigma detection with strong potential for real-world adoption.
Conclusion
ICL significantly improves detection of stigmatizing language in EHRs compared with zero‑shot textual entailment and few‑shot SetFit approaches and performs competitively with fully supervised fine‑tuning using far fewer labels. The newly proposed Stigma Detection Guided Prompt enhances ICL by encoding clinical stigma heuristics and focusing on contextualized high‑risk terms. ICL also exhibits more equitable performance across sex, age, and race subgroups than supervised models. Future work should: (1) validate across larger, multi‑institutional datasets and note types; (2) develop datasets capturing implicit stigma beyond keyword curation; (3) extend labeling to uncertainty and stigma subtype dimensions; (4) explore fairness‑aware training and bias mitigation; (5) investigate fine‑tuning of decoder LLMs when labels are available; and (6) study human‑AI collaboration, usability, and workflow integration for clinical deployment.
Limitations
- Single-site dataset (Beth Israel Deaconess, ED discharge summaries) limits generalizability; demographics and institutional practices may bias results.
- Keyword-based curation focuses on explicit stigmatization and may underrepresent subtle, context-dependent stigma; the Stigma Detection Guided Prompt's keyword focus may further bias toward explicit cases.
- Binary labels omit ambiguity and stigma subtypes; the lack of an "uncertain/requires context" category may conflate borderline cases.
- Fine-tuning of decoder-only models (e.g., Llama) was not evaluated due to limited labels; their performance remains underexplored.
- Fairness analysis was limited to sex, age, and race; other attributes (socioeconomic status, education, geography) were not assessed.
- Bias mitigation techniques for supervised models (e.g., diverse sampling, fairness constraints) were not systematically evaluated.
- ICL models produce non-probabilistic outputs, limiting threshold calibration and threshold-independent metrics.
- Proprietary LLMs (e.g., GPT-4) were not evaluated due to data-sharing restrictions; comparative generalizability is unknown.
- Human-AI collaboration aspects (automation bias, usability, workflow integration) were not assessed in user studies.