Introduction
Suicide rates are rising globally, with an increase of more than 30% in the United States between 2000 and 2020, underscoring the urgent need for efficient mental health crisis intervention. Individuals in crisis frequently turn to messaging channels such as crisis hotlines and online chat services, yet limited triage capacity often results in substantial delays, making it essential to rapidly identify high-risk messages amid growing message volume. Existing crisis hotlines struggle with capacity: the NAMI HelpLine saw a 60% increase in requests between 2019 and 2021 yet still had a 25% drop rate despite tripling its staff, while the National Suicide Prevention Lifeline responded to only about 30% of incoming chats and 56% of text messages in 2020. The first-come-first-served queuing common on many platforms exacerbates the problem, potentially delaying crucial intervention for the highest-risk individuals. Machine learning (ML) offers a promising way to address these challenges: emerging research has demonstrated that ML can automate the triage of crisis messages, with models achieving high positive predictive value and sensitivity for identifying both crises and non-crises. To address this need, this study presents the development and validation of Crisis Message Detector-1 (CMD-1), a novel NLP system designed to help clinicians identify potential crisis messages on a large national telehealth platform serving over 200,000 patients. CMD-1 aims to shorten response times for patients experiencing suicidal or homicidal ideation, domestic violence, or non-suicidal self-injury, facilitating timely intervention and potentially reducing negative health outcomes.
Literature Review
Prior research has explored applying machine learning to crisis message triage. Xu et al. developed a model with high positive predictive value (0.984) and sensitivity (0.942) for non-crisis cases, and precision and recall of 0.649 and 0.870, respectively, for crisis cases. Crisis Text Line built a system of two binary classifiers to rapidly flag messages indicating suicidal risk and self-harm, achieving a recall of 0.89. These studies demonstrate the feasibility of ML-based crisis message triage, but further development and validation are needed, particularly in real-world clinical settings and at scale. This research builds on that foundation by evaluating the performance of CMD-1 within a large, national telehealth platform, offering insights into its practical implementation and impact on response times.
Methodology
This study used three datasets: a training set, a retrospective test set, and a prospective test set. The training and retrospective test sets contained messages sent between April 1, 2021, and April 1, 2022, split 60:40 between training and testing. The prospective test set comprised messages from October 1, 2022, to October 31, 2022. Because crisis messages are rare (under 1% of traffic), a crisis terms filter was applied to make data labeling more efficient: messages containing none of the filter terms were excluded. The filter initially held 275 common crisis-related words and phrases and was iteratively updated as more data was labeled. The training set contained 721 messages (32% true crises) and the retrospective test set 481 messages (32% true crises). The resulting model was a two-stage NLP system: the first stage applied the keyword filter, and the second used L1-regularized (Lasso) logistic regression for prediction, with features derived from TF-IDF scores over unigrams through 5-grams. Hyperparameters were tuned with 10-fold cross-validation on the training portion (the 60% split), minimizing a misclassification cost that weighted false negatives 20 times more heavily than false positives. The final probability threshold for classifying a message as a crisis was set at 0.01, reflecting stakeholder input on the cost of missed crises. The prospective test set included 102,471 messages (0.55% true crises). Model performance was evaluated using AUC, sensitivity, specificity, PPV, NPV, and calibration metrics. Response times from message sent to triage were analyzed in the prospective period and compared with times before CMD-1 deployment. Finally, a subgroup analysis assessed model performance for patients aged 18-21 years, a high-risk and potentially linguistically distinct population.
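To make the pipeline concrete, below is a minimal sketch of the two-stage system described above, using scikit-learn. Everything beyond the details stated in the text (unigram-to-5-gram TF-IDF, an L1 penalty, the 1:20 false-positive-to-false-negative weighting, and the 0.01 threshold) — including the stand-in keyword list and all names — is an illustrative assumption, not the authors' implementation:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in for the study's 275-term crisis filter; the real list is not public.
CRISIS_TERMS = ["suicide", "kill myself", "self harm", "hurt myself"]

def passes_keyword_filter(message: str) -> bool:
    """Stage 1: keep only messages containing at least one crisis term."""
    text = message.lower()
    return any(term in text for term in CRISIS_TERMS)

# Stage 2: TF-IDF over unigrams through 5-grams feeding an L1-regularized
# (Lasso) logistic regression. class_weight encodes the 1:20
# false-positive-to-false-negative cost ratio described in the text.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 5))),
    ("lasso_lr", LogisticRegression(penalty="l1", solver="liblinear",
                                    class_weight={0: 1, 1: 20})),
])

def triage(messages: list[str], threshold: float = 0.01) -> list[str]:
    """Flag messages whose predicted crisis probability meets the 0.01
    threshold, after the keyword pre-filter. Assumes `model` is fitted."""
    candidates = [m for m in messages if passes_keyword_filter(m)]
    if not candidates:
        return []
    probs = model.predict_proba(candidates)[:, 1]
    return [m for m, p in zip(candidates, probs) if p >= threshold]

In practice the pipeline would first be fitted on the labeled training messages, e.g. model.fit(train_texts, train_labels), with the regularization strength selected by 10-fold cross-validation as described above.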
Key Findings
On the retrospective test set, the model achieved an AUC of 0.82 (95% CI: 0.78-0.86), sensitivity of 0.99 (95% CI: 0.96-1.00), and PPV of 0.35 (95% CI: 0.309-0.400). The prospective test set yielded markedly better results: an AUC of 0.98 (95% CI: 0.966-0.984), sensitivity of 0.98 (95% CI: 0.96-0.99), and PPV of 0.66 (95% CI: 0.626-0.692). Daily median response times in the prospective period ranged from 8 to 13 minutes after CMD-1 deployment, a dramatic reduction from the pre-deployment average of over 9 hours. In the prospective test set, the crisis terms filter alone had high sensitivity (0.993) but low PPV (0.057), highlighting the added value of the machine learning stage. Analysis of false negatives revealed that several messages were missed because they lacked filter keywords or used less common phrasings of crisis. Subgroup analysis for patients aged 18-21 years in the prospective test set also showed strong performance (AUC = 0.98, sensitivity = 0.98, PPV = 0.70).
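For readers less familiar with these metrics, the sketch below shows how each reported quantity is computed from a set of predictions; the labels and scores are toy values, not study data. It also illustrates how the deliberately low 0.01 threshold buys sensitivity at the cost of PPV:

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy labels and predicted probabilities -- NOT study data.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])            # 1 = true crisis
y_score = np.array([0.92, 0.004, 0.30, 0.65, 0.002, 0.88, 0.05, 0.008])
y_pred = (y_score >= 0.01).astype(int)                  # the 0.01 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # share of true crises that were flagged
specificity = tn / (tn + fp)   # share of non-crises correctly not flagged
ppv = tp / (tp + fp)           # share of flagged messages that were real crises
npv = tn / (tn + fn)           # share of unflagged messages that were non-crises
auc = roc_auc_score(y_true, y_score)  # threshold-free ranking quality
print(sensitivity, specificity, ppv, npv, auc)  # 1.0 0.6 0.6 1.0 1.0

Here every borderline score clears the threshold, so sensitivity is perfect while two false positives pull PPV down — the same trade-off the study's 1:20 cost weighting was designed to make.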
Discussion
The study's results demonstrate that CMD-1 improves both the speed and the accuracy of crisis message triage. The reduction in response time from over 9 hours to a daily median of 8-13 minutes is clinically significant, potentially enabling faster intervention for individuals in acute distress. High sensitivity across both retrospective and prospective evaluations confirms that the model reliably identifies crisis messages. However, the lower PPV, particularly in the retrospective set, highlights the need for ongoing refinement to reduce false positives and improve the efficiency of crisis specialist workflows. The model's apparently better performance on the prospective test set should be interpreted with care: the retrospective set was artificially enriched with crises while the prospective set reflects natural prevalence, so metrics such as PPV are not directly comparable across the two. The integration of CMD-1 into the existing clinical workflow proved practical and valuable, demonstrating the feasibility of leveraging ML systems to enhance mental health service delivery and accelerate patients' access to crucial support. This study provides strong evidence supporting the application of NLP and ML in improving response to mental health crises.
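One way to see why enrichment matters (a standard identity, not a result from the study): by Bayes' rule, PPV depends directly on the prevalence p of crises in the evaluated sample,

\[
\mathrm{PPV} \;=\; \frac{\mathrm{sens}\cdot p}{\mathrm{sens}\cdot p \;+\; (1-\mathrm{spec})\,(1-p)} .
\]

Holding sensitivity and specificity fixed, a set enriched to 32% crises and one at the natural 0.55% prevalence will generally yield different PPVs, so the retrospective and prospective figures describe the model under different operating conditions.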
Conclusion
CMD-1 demonstrates the potential of NLP-based machine learning models to transform mental health crisis response. The drastic reduction in response times and the model's high sensitivity underscore its clinical utility. Future research should focus on raising the model's PPV to reduce false positives and improve efficiency, potentially by exploring alternative NLP techniques or incorporating additional data sources. Investigating the long-term impact of CMD-1 on patient outcomes, such as hospitalization rates or suicide attempts, would further establish its value.
Limitations
The study's limitations include the small training set (721 messages) and the artificial enrichment of crisis messages in the training and retrospective sets to overcome class imbalance, which may have introduced bias and limited the model's generalizability. The crisis terms filter, while broad, may have excluded some crisis messages, and the false-negative analysis was limited in scope. Differences in patient demographics between the retrospective and prospective datasets might also have influenced measured performance, warranting further investigation. Finally, the moderate PPV leaves room for improvement in reducing false positives.