
Medicine and Health
Natural language processing system for rapid detection and intervention of mental health crisis chat messages
A. Swaminathan, I. López, et al.
Discover how Crisis Message Detector-1 (CMD-1), developed by a team including Akshay Swaminathan and Jonathan H. Chen, streamlines mental health crisis triage. This NLP system substantially reduces response times and improves triage accuracy in telehealth services, helping ensure that urgent messages are swiftly identified and managed.
~3 min • Beginner • English
Introduction
Suicide rates have risen globally and in the United States, with many individuals seeking help via crisis hotlines and messaging platforms that face capacity and triage challenges. First-come, first-served queuing can bury urgent, high-risk messages beneath less urgent ones. Prior research suggests ML can assist triage, but many systems are not integrated into healthcare workflows. The authors developed CMD-1, an NLP system to detect potential crisis messages (suicidal/homicidal ideation, domestic violence, non-suicidal self-injury) from patients using a large tele-mental health platform. The goal was to surface high-risk messages to a crisis response team to reduce response times while maintaining human oversight for all surfaced and unsurfaced messages. The study reports retrospective and large-scale prospective validations and describes workflow integration.
Literature Review
The paper references prior ML efforts for crisis triage. Xu et al. reported high PPV and sensitivity for non-crisis cases (0.984 and 0.942) and precision/recall of 0.649/0.870 for crisis cases. Crisis Text Line developed classifiers identifying suicidal risk and ongoing self-harm with recall of 0.89, and demonstrated reduced wait times for high-risk texters (from 8 min median to 3 min). The authors position CMD-1 as integrated within a provider’s clinical workflow, aiming to extend these approaches with near real-time detection and clinician notification.
Methodology
Design followed TRIPOD guidelines. Data source: Cerebral, a national tele-mental health provider with HIPAA-compliant chat. Messages sent by patients between 04/01/2021 and 10/31/2022 were considered. Datasets: training and retrospective test sets from 04/01/2021–04/01/2022 (random 60:40 split; 721 training, 481 retrospective test), and a prospective test set from 10/01/2022–10/31/2022 (102,471 patient messages). Ethics: analysis of routinely collected EHR data; Stanford IRB deemed not human subjects research.
Inclusion criteria and crisis terms filter: Given <1% crisis prevalence, a manually curated, lemmatized crisis terms filter (275 words/phrases, e.g., “feel terrible,” “hopelessness,” “negative thoughts”) was used to enrich the sample for likely crisis messages and improve labeling efficiency. For the training/retrospective sets, messages without any filter term were excluded; in prospective deployment, only messages passing the filter were scored by the model, but all messages comprised the test set for evaluating the filter itself. From 10,063,900 messages (04/01/2021–04/01/2022), a random 200,000 were sampled; after excluding care team messages and applying the filter, 3,969 unique patient messages were eligible for labeling. Labeling proceeded in two stages: (1) a random sample of 596 messages (17 crises, 2.8%); (2) enrichment using a crisis tracker to label messages sent up to 7 days before documented crises (606 messages; 365 crises, 60%). Total labeled set: 1,202 messages (32% crises). Before prospective deployment, the model was retrained on all labeled messages.
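The paper does not reproduce the full 275-term list or describe the filter's implementation in code; the sketch below shows one way such a lemmatized phrase filter could work, using spaCy and a few of the example phrases quoted above (all names and details are illustrative assumptions).

```python
# Hypothetical sketch of a lemmatized crisis-terms filter; the real filter has
# 275 manually curated words/phrases, only a few of which are shown here.
import spacy

nlp = spacy.load("en_core_web_sm")

# Example phrases mentioned in the paper (illustrative subset only).
CRISIS_TERMS = ["feel terrible", "hopelessness", "negative thoughts"]

def lemmatize(text: str) -> str:
    """Lowercase and lemmatize a string so surface variants match the term list."""
    return " ".join(tok.lemma_.lower() for tok in nlp(text) if not tok.is_punct)

# Pre-lemmatize the term list once, so e.g. "negative thoughts" matches "negative thought".
LEMMATIZED_TERMS = [lemmatize(term) for term in CRISIS_TERMS]

def passes_crisis_filter(message: str) -> bool:
    """Return True if any lemmatized crisis term appears in the lemmatized message."""
    lemmas = lemmatize(message)
    return any(term in lemmas for term in LEMMATIZED_TERMS)

# Only messages passing the filter would be scored by the downstream model.
print(passes_crisis_filter("I've been having negative thoughts lately"))  # True
print(passes_crisis_filter("Can we reschedule my appointment?"))          # False
```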
Outcome: Binary indicator of whether a message warranted crisis specialist attention based on possible suicidal ideation, homicidal ideation, domestic violence, or non-suicidal self-injury, using standard definitions. Ambiguous messages (e.g., “I need help,” “I’m so depressed”) were considered to warrant attention.
Labeling and reliability: Three trained labelers (with crisis specialist oversight) labeled the training/retrospective sets; perfect concordance was 92.8% on 300 random messages, with discrepancies adjudicated by crisis specialists. In prospective testing, crisis specialists labeled surfaced messages (true/false positives) via the Slack UI; false negatives that were not surfaced were captured through standard escalation channels and recorded; the remaining unsurfaced messages were presumed true negatives by policy.
Predictors and modeling: Features were derived from message text using TF-IDF on uni- to 5-grams. Model: L1-regularized logistic regression (Lasso) for feature selection and prediction. Hyperparameters were tuned via 10-fold cross-validation on 60% of the training set to minimize a misclassification cost function reflecting a stakeholder-defined FP:FN cost ratio of 1:20. The probability threshold was chosen to minimize misclassification cost; under perfect calibration, a 1:20 cost ratio implies a cost-minimizing threshold of 1/(1+20) ≈ 0.048, but deployment used a lower, more sensitive threshold of 0.01.
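A minimal sketch of the described pipeline (TF-IDF over 1- to 5-grams feeding an L1-regularized logistic regression, tuned by 10-fold cross-validation against the 1:20 cost function) using scikit-learn; the hyperparameter grid, variable names, and data handling are assumptions, not the authors' code.

```python
# Sketch of the modeling step: TF-IDF (uni- to 5-grams) + L1 logistic regression,
# tuned to minimize a cost function with a 1:20 FP:FN cost ratio.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

FP_COST, FN_COST = 1, 20  # stakeholder-defined misclassification costs

def misclassification_cost(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp * FP_COST + fn * FN_COST

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 5))),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000)),
])

# 10-fold CV over an assumed grid of regularization strengths.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1, 10]},  # illustrative grid
    scoring=make_scorer(misclassification_cost, greater_is_better=False),
    cv=10,
)
# search.fit(train_messages, train_labels)      # message texts and 0/1 crisis labels

# Under perfect calibration, the cost-minimizing probability threshold is
# FP_COST / (FP_COST + FN_COST) = 1/21 ~= 0.048; deployment used 0.01.
THRESHOLD = 0.01
# is_crisis = search.predict_proba(new_messages)[:, 1] >= THRESHOLD
```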
Performance evaluation: Metrics included AUC (DeLong CI), sensitivity, specificity, PPV, NPV (Clopper-Pearson CIs). Prospective metrics calculated overall and by day/week. Calibration assessed via slope/intercept and curves. Failure analysis reviewed all false negatives; those filtered out were examined for missing phrases. Subgroup analysis evaluated performance for patients aged 18–21.
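A brief sketch of how the confusion-matrix metrics with exact (Clopper-Pearson) 95% CIs and a calibration slope/intercept might be computed; the counts are placeholders, and the logistic recalibration shown is one common formulation rather than necessarily the authors' exact procedure (the DeLong AUC interval is omitted).

```python
# Sensitivity/specificity/PPV/NPV with exact Clopper-Pearson 95% CIs, plus a
# simple calibration slope/intercept fit. All numbers are placeholders.
import numpy as np
import statsmodels.api as sm
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact binomial confidence interval for k successes out of n trials."""
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

def report(name, successes, total):
    lo, hi = clopper_pearson(successes, total)
    print(f"{name}: {successes / total:.3f} (95% CI {lo:.3f} to {hi:.3f})")

tp, fp, fn, tn = 370, 190, 9, 101_900   # placeholder confusion-matrix counts

report("Sensitivity", tp, tp + fn)      # TP / (TP + FN)
report("Specificity", tn, tn + fp)      # TN / (TN + FP)
report("PPV",         tp, tp + fp)      # TP / (TP + FP)
report("NPV",         tn, tn + fn)      # TN / (TN + FN)

def calibration_slope_intercept(y_true, y_prob, eps=1e-6):
    """Regress outcomes on the logit of predicted risk; slope near 1 and
    intercept near 0 indicate good calibration."""
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    logit = np.log(p / (1 - p))
    fit = sm.Logit(np.asarray(y_true), sm.add_constant(logit)).fit(disp=0)
    intercept, slope = fit.params
    return slope, intercept

# slope, intercept = calibration_slope_intercept(test_labels, predicted_probs)
```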
System deployment/workflow: The system comprised an alerter and listener (AWS Lambda). Every 10 minutes, recent messages passed through the crisis terms filter and model; predicted crises were posted to Slack (message text, patient/chart link, timestamps, and labeling buttons). Crisis specialists reviewed and labeled posts; the listener captured labels for analysis and potential retraining.
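As a rough illustration of the alerter step (scheduled jobs that score recent messages and post predicted crises to Slack), the sketch below uses a hypothetical incoming-webhook URL and placeholder fetch/filter/model objects; the paper's actual AWS Lambda code, labeling buttons, and listener are not reproduced.

```python
# Hypothetical alerter job: each run, score messages from the last 10 minutes and
# post predicted crises to a Slack channel. fetch_recent_messages,
# passes_crisis_filter, model, and SLACK_WEBHOOK_URL are placeholders.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
THRESHOLD = 0.01

def run_alerter(fetch_recent_messages, passes_crisis_filter, model):
    """One run of the (e.g., 10-minute) alerter job."""
    for msg in fetch_recent_messages(minutes=10):
        if not passes_crisis_filter(msg["text"]):
            continue  # only filter-passing messages are scored by the model
        prob = model.predict_proba([msg["text"]])[0, 1]
        if prob >= THRESHOLD:
            requests.post(SLACK_WEBHOOK_URL, json={
                "text": (
                    f"Possible crisis message (p={prob:.2f})\n"
                    f"Patient chart: {msg['chart_url']}\n"
                    f"Sent at: {msg['sent_at']}\n"
                    f"> {msg['text']}"
                )
            })
```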
Key Findings
- Data characteristics: Training set 721 messages (32% crises); retrospective test set 481 (32% crises); prospective test set 102,471 messages (0.55% crises) from 32,803 patients. In prospective testing, 9,795 messages passed the terms filter (5.4% crises) and 92,676 did not (0.03% crises). Messages passing the filter were longer (median 170 chars vs 83).
- Retrospective test performance (threshold 0.01): AUC 0.82 (95% CI: 0.78–0.86); sensitivity 0.99 (0.955–0.998); specificity 0.12 (0.085–0.158); PPV 0.35 (0.309–0.4); NPV 0.95 (0.831–0.994). Calibration slope 0.48 (p<0.001); intercept 0.43 (p<0.001).
- Prospective test performance: Overall AUC 0.975 (95% CI: 0.966–0.984); sensitivity 0.975 (0.958–0.986); specificity 0.97 (0.966–0.973); PPV 0.66 (0.626–0.692); NPV 0.99 (0.997–0.999). Calibration slope 0.96 (p<0.001); intercept approximately −1.43 (p<0.001). Day-wise ranges: sensitivity 0.89–1.00; specificity 0.994–0.999; PPV 0.48–0.85; NPV 0.999–1.000.
- Crisis terms filter (prospective): sensitivity 0.993 (0.982–0.998); specificity 0.910 (0.908–0.911); PPV 0.0567 (0.052–0.062); NPV 0.99996 (≈1.00).
- Subgroup (age 18–21, N messages=5,579; passing filter=525; event rate 0.72%): AUC 0.98 (0.9756–0.9936); sensitivity 0.976 (0.871–0.999); specificity 0.965 (0.944–0.979); PPV 0.702 (0.566–0.816); NPV 0.998 (0.988–1). Calibration slope 1.21 (p<0.001); intercept −1.34 (p<0.001).
- Failure analysis: 17 false negatives identified; 4 did not pass the terms filter due to missing phrases (e.g., “ER,” “tired of being alive,” “not exist,” “not feeling okay”).
- Response times: With CMD-1 (Oct 1–31, 2022), median time from message sent to CMD-1 post 8–11 min (avg IQR 5.1 min); CMD-1 post to crisis specialist triage 0–1 min (avg IQR 1.1 min); total message sent to triage 8–13 min (avg IQR 4.7 min). Pre-deployment response times averaged over 9 hours.
Discussion
CMD-1 addressed the need for rapid identification and triage of crisis messages in a high-volume tele-mental health setting, where first-come, first-served queues can delay attention to high-risk individuals. Integrated into clinical workflows and leveraging near real-time alerts to crisis specialists, CMD-1 achieved high sensitivity and specificity prospectively and substantially reduced response times from hours to minutes. This rapid triage is clinically meaningful, as timely intervention can avert suicide attempts. The system’s deployment required collaboration across clinicians, data scientists, and engineers to balance misclassification costs (favoring sensitivity), ensure usability (Slack-based alerts), and build robust infrastructure.
Differences between retrospective and prospective performance (higher AUC and specificity prospectively) are attributed to sampling methodology and class imbalance: the retrospective sets were enriched with messages from the 7 days before crisis events, making non-crisis messages more similar to crises, while the prospective set contained many more true non-crisis messages that were shorter and easier to distinguish. Supporting analyses of median predicted probabilities and a down-sampling experiment showed that AUCs aligned across settings. Failure analysis highlighted coverage gaps in the terms filter. Overall, CMD-1 demonstrates that ML can be safely and effectively integrated to augment human triage, improving operational efficiency and time-to-intervention for high-risk patients.
Conclusion
The study presents CMD-1, an NLP-based, clinically integrated system that accurately detects potential crisis messages at scale and markedly accelerates crisis specialist response times. Contributions include a two-stage filter-plus-logistic regression approach, rigorous retrospective and prospective validations, integration into real-world workflows with human oversight, and evidence of generalization to a vulnerable youth subgroup. Future work should evaluate downstream clinical outcomes (e.g., hospitalization, ED utilization), perform ablative analyses to optimize and expand the crisis terms filter, enhance model PPV via alternative embeddings or models (e.g., word2vec, SVMs, random forests, large language models), and continue calibration and threshold tuning for differing event rates and populations.
Limitations
- Reliance on a crisis terms filter may exclude crisis messages lacking listed phrases, leading to false negatives; some missing terms were identified (e.g., “ER,” “tired of being alive”). An ablative study could quantify filter impacts and guide expansion.
- Small, enriched training/retrospective sets (32% crises vs ~0.6% in deployment) can cause miscalibration when applied to populations with different event rates; while CMD-1 is used as a classifier with a low threshold and maintained high sensitivity/specificity, miscalibration may inflate predicted risks and increase false positives.
- Moderate PPV implies ~40% of surfaced messages are false positives, increasing review burden. There is room to improve PPV without sacrificing sensitivity using alternative features/models.
- Retrospective sampling (including messages before crises) may reduce apparent performance vs. deployment due to harder negative examples; population shifts between time periods (e.g., gender distribution, diagnosis mix, time in treatment) may also affect performance.
- False negatives in the prospective set were captured via standard escalation processes; while policy mandates routing all crises to specialists, any breakdowns could bias FN ascertainment.