
Predicting recurrent chat contact in a psychological intervention for the youth using natural language processing
S. Hornstein, J. Scharfenberger, et al.
This study by Silvan Hornstein, Jonas Scharfenberger, Ulrike Lueken, Richard Wundrack, and Kevin Hilbert examines whether Natural Language Processing can predict recurrent chat contact in a German youth crisis service. An XGBoost classifier reached an AUROC of 0.68 on a held-out test set, and the analysis identifies demographic and clinical cues associated with recontact, pointing to NLP's potential for tailored care in chat-based hotlines.
Introduction
Youth mental health needs are high, yet many adolescents and young adults face barriers to accessing care. Chat-based counseling hotlines offer a low-threshold, first point of contact and can act as gateways into further treatment within a stepped-care framework. Identifying individuals likely to recontact after an initial session is operationally important, as recurrent users can strain limited hotline capacity and may indicate unmet needs better addressed by additional or alternative services. Natural Language Processing (NLP) can extract clinically relevant information from conversational text and has shown utility across mental health applications, but its use in real-world chat counseling settings is underexplored, particularly for predicting downstream outcomes such as recontact. This study evaluates whether text from first-time chat consultations can predict recurrent contact within six months, hypothesizing significant predictive performance on a temporally separated test set to inform redirection to appropriate care.
Literature Review
Prior NLP applications in mental health have leveraged clinical notes and non-clinical text for tasks such as readmission prediction, suicide risk detection, symptom extraction, patient stratification, and assessment of therapeutic processes. In chat-based counseling specifically, earlier work analyzed counseling strategies and distress classification, reporting high AUCs (~0.9) though often on small test sets, and developed methods to detect disclosure of suicidal ideation using deep learning with external knowledge. Additional work explored chatbot development for mental health support. Despite promising results, there is limited research on deploying NLP to support the role of chat counseling as a first access point into care by predicting outcomes relevant to service flow, such as recurrent contact, which is a known pain point due to capacity constraints and the higher severity often observed among frequent users. This study addresses that gap by assessing the feasibility of predicting recontact to enable personalized, stepped interventions.
Methodology
Design and preregistration: Secondary data analysis preregistered on OSF (https://osf.io/xa4pn). Due to privacy constraints, the planned use of transformer-based embeddings was replaced by word-stem features; earlier data (pre-October 2021) were excluded to align with prior ethics approval (Leipzig University, 372/21-ek). All users consented to research use of anonymized data.
Setting: Routine-care data from krisenchat, a free 24/7 messenger-based counseling service in Germany for individuals up to age 24, staffed primarily by trained volunteers with supervision and crisis escalation procedures.
Sample: All first-time contacts between Oct 2021 and Dec 2022 were considered. Consultations were defined as sessions with at least 3 counselor messages and at least 10 total exchanged messages. After excluding 8,419 first contacts without a qualifying consultation, the final sample comprised 18,871 unique chatters. Mean messages per first consultation: chatter 25.4 (SD 20.8), counselor 17.7 (SD 12.1); mean word count 456.4 (SD 283.9).
Outcome: Recurrent contact within 188 days (roughly six months) after the first consultation. A consultation was considered completed after 6 hours without messages; messages within 6 hours did not count as recontact. Of 18,871 chatters, 8,141 (43.1%) recontacted within 188 days; more than half of these recontacted within one week. A sensitivity analysis varying the window from 150 to 210 days yielded minor AUC variation (SD < 0.01).
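The outcome definition can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, it assumes that a gap of exactly 6 hours closes a consultation and that the 188-day window is measured from the last message of the first consultation, and it ignores the minimum-message rule for a qualifying consultation.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(hours=6)   # silence that closes a consultation (from the text)
FOLLOW_UP = timedelta(days=188)    # recontact window (from the text)


def split_sessions(timestamps):
    """Group message timestamps into sessions separated by >= 6 h of silence."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < SESSION_GAP:
            sessions[-1].append(ts)   # still the same consultation
        else:
            sessions.append([ts])     # a new consultation starts
    return sessions


def recontacted(timestamps):
    """True if a second session starts within 188 days of the end of the first.

    Assumption: the window starts at the last message of the first session.
    """
    sessions = split_sessions(timestamps)
    if len(sessions) < 2:
        return False
    first_end = sessions[0][-1]
    return sessions[1][0] - first_end <= FOLLOW_UP
```

For example, a chatter who writes again three days after the first session would be labelled as a recontact, while one who returns after 200 days would not.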
Data anonymization and preprocessing: Extensive privacy-preserving processing included counselor marking of PII for deletion; replacement of all names with [NAME] and city names with [CITY]; stemming; random shuffling of word order separately for chatter and counselor text; deletion of stems appearing in fewer than 5 chats. Resulting data were randomized word stems for chatter and counselor per consultation without identifiable sequences or rare unique tokens.
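The last two anonymization steps (shuffling and rare-stem removal) might look roughly like the sketch below. It assumes the text is already stemmed and that names and cities are already masked. The `anonymize` function and the dictionary layout are hypothetical, and counting document frequency across both roles together is an assumption; the summary does not say whether the threshold was applied per role.

```python
import random
from collections import Counter

MIN_CHATS = 5  # stems appearing in fewer chats than this are deleted (from the text)


def anonymize(chats, seed=0):
    """Shuffle stems per role and drop rare stems.

    chats: list of {"chatter": [stems], "counselor": [stems]} (hypothetical layout).
    """
    rng = random.Random(seed)

    # Document frequency: in how many chats does each stem occur (either role)?
    doc_freq = Counter()
    for chat in chats:
        doc_freq.update(set(chat["chatter"]) | set(chat["counselor"]))

    anonymized = []
    for chat in chats:
        anon = {}
        for role in ("chatter", "counselor"):
            kept = [s for s in chat[role] if doc_freq[s] >= MIN_CHATS]
            rng.shuffle(kept)  # destroy word order, separately for each role
            anon[role] = kept
        anonymized.append(anon)
    return anonymized
```

Because the order is destroyed before modeling, only bag-of-stems features remain usable downstream.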
Feature engineering and models: Text features were generated using TF-IDF on word stems with separate vectors for chatter and counselor messages, concatenated for modeling. Due to randomized word order, no n-grams were used. The primary classifier was XGBoost. A baseline model used metadata: counts of word stems (by role) and time-of-day of first contact.
Training/validation: A time-based split assigned 14,929 first consultations (pre-September 2022) to training and 3,942 consultations (September–December 2022) to a held-out test set used once for final evaluation. Within the training set, 5× repeated 5-fold cross-validation with Bayesian hyperparameter optimization (250 iterations) maximized ROC AUC. The comparison with the baseline used a two-sided 5×2 paired t-test with a corrected p-value. Significance on the test set was assessed with permutation testing.
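The temporal split and the permutation test on the held-out set can be illustrated with a small pure-Python sketch. The helper names and record layout are hypothetical, the brute-force AUROC is only practical for small samples, and the one-sided p-value with a +1 correction is one common convention, not necessarily the one the authors used.

```python
import random
from datetime import date


def time_split(records, cutoff):
    """Train on consultations before the cutoff date, test on the rest."""
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(labels, scores, n_perm=1000, seed=0):
    """Share of label shufflings whose AUROC reaches the observed AUROC."""
    rng = random.Random(seed)
    observed = auroc(labels, scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between labels and scores
        if auroc(shuffled, scores) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # +1 keeps the estimate away from exactly 0
```

With the cutoff set to `date(2022, 9, 1)`, `time_split` mirrors the split described above: everything before September 2022 goes to training, the remainder to the test set.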
Best hyperparameters (text model): max_df_chatter=0.8; min_df_chatter=150; max_df_couns=0.3; min_df_couns=75; colsample_bytree=0.9; eta=0.05; gamma=1.5; max_depth=8; min_child_weight=20; subsample=0.6; use_idf=False. Best baseline hyperparameters: colsample_bytree=0.8; eta=0.01; gamma=0.25; max_depth=6; min_child_weight=10; subsample=0.8.
Explainability: SHAP (tree explainer) quantified feature contributions at the word-stem level for chatter and counselor texts. To provide context without access to sequences, co-occurrence analyses identified stems frequently appearing with top SHAP-ranked stems. For dimensionality reduction, Word2Vec embeddings (pretrained on German Wikipedia) were used to cluster stems via k-means; 1,350 stems matched the embedding vocabulary; 20 clusters were selected using silhouette score and usability considerations.
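The co-occurrence step, which supplies context for a high-importance stem when word order is unavailable, can be sketched as a simple count. The function name, data layout, and toy chats are assumptions for illustration; the SHAP and clustering steps are not reproduced here.

```python
from collections import Counter


def cooccurring_stems(chats, target, target_role="counselor",
                      other_role="chatter", top=3):
    """Find stems of the other role that appear alongside a target stem.

    Returns (number of chats containing the target in target_role,
             [(stem, share of those chats), ...] for the most frequent stems).
    """
    counts = Counter()
    n_chats = 0
    for chat in chats:
        if target in chat[target_role]:
            n_chats += 1
            counts.update(set(chat[other_role]))  # count each stem once per chat
    if n_chats == 0:
        return 0, []
    return n_chats, [(stem, c / n_chats) for stem, c in counts.most_common(top)]


# Made-up toy data, not taken from the study.
toy_chats = [
    {"chatter": ["harm", "thought"], "counselor": ["harm", "safe"]},
    {"chatter": ["harm"], "counselor": ["harm"]},
    {"chatter": ["school"], "counselor": ["harm"]},
    {"chatter": ["harm"], "counselor": ["advic"]},
]
n_chats, top_stems = cooccurring_stems(toy_chats, "harm", top=1)
```

Counting each stem once per chat makes the result a share of conversations rather than a raw token frequency.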
Key Findings
- Dataset: 18,871 first-time chatters; ~813,000 messages (479,782 chatter; 333,454 counselor) and ~8.6M words. Recontact within 188 days occurred in 8,141 (43.1%).
- Cross-validation (train set): Best text-based model AUROC 0.67. Chatter-only and counselor-only models each reached AUROC ~0.65. Combined text without role information reached AUROC 0.66. The metadata baseline (word-stem counts by role and time of day) reached AUROC 0.57; the text model was significantly better than the baseline (5×2 test, p < 0.01).
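- Independent test set (N=3,942; newest consultations): AUROC 0.68 (p < 0.01). At a 0.5 threshold: accuracy 0.65 (baseline 0.58), balanced accuracy 0.62, precision 0.62, recall (sensitivity) 0.44. The separately reported sensitivity of 0.62 and specificity of 0.66 do not match the confusion matrix, which implies sensitivity 731/1,671 ≈ 0.44 and specificity 1,819/2,271 ≈ 0.80; those figures may refer to a different operating point. Confusion matrix counts: TN=1,819; FP=452; FN=940; TP=731 (2,550/3,942 correct).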
- Feature importance (SHAP): Indicators of younger age (e.g., numbers “12”, “13”), female gender terms, and self-harm/suicide-related stems were associated with higher recontact likelihood; male-related stems predicted no recontact. Time-related stems (daytime, night) had high importance. Work/job-related terms and counselor stems like “professional,” “internet care,” and “advice” predicted no recontact, possibly reflecting redirection during the first session.
- Co-occurrence insights: Counselor “suicide” often co-occurred with chatter “thoughts,” suggesting counselor reflections of expressed ideation and reframing/clarification. Counselor “harm” co-occurred with chatter “harm” in ~2/3 of ~5,000 conversations where counselor used the stem. Work/job stems co-occurred with everyday stress-related stems (e.g., “situation,” “flat,” “time,” “stress”).
- Clustering: 1,350 stems clustered into 20 groups. Top clusters included: (1) emotionally valenced adjectives (e.g., alone, weak, wrong, best); (2) numbers (supporting age relevance); (3) life-context nouns (work, university, city, house). Predictive signal was distributed broadly: the top 50 stems contributed under half of explained importance, and even the least-important clusters contributed nontrivially.
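The threshold metrics in the independent-test bullet above can be recomputed directly from its confusion matrix counts. The short check below uses only the four counts given there; the variable names are arbitrary.

```python
# Confusion matrix counts at the 0.5 threshold, as listed in the findings.
tn, fp, fn, tp = 1819, 452, 940, 731

total = tn + fp + fn + tp
accuracy = (tp + tn) / total                    # share of correct predictions
recall = tp / (tp + fn)                         # sensitivity: recontacts that were flagged
specificity = tn / (tn + fp)                    # non-recontacts correctly left unflagged
precision = tp / (tp + fp)                      # flagged chatters who did recontact
balanced_accuracy = (recall + specificity) / 2  # mean of the two class-wise rates

for name, value in [
    ("accuracy", accuracy),
    ("recall", recall),
    ("specificity", specificity),
    ("precision", precision),
    ("balanced accuracy", balanced_accuracy),
]:
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, this gives accuracy 0.65, recall 0.44, specificity 0.80, precision 0.62, and balanced accuracy 0.62. At this threshold the model misses more than half of the chatters who later recontact, so a lower threshold might be preferable if the goal is to catch likely returners.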
Discussion
The study demonstrates that text-only features from first consultations can predict recurrent contact with meaningful accuracy in a real-world, privacy-constrained setting, offering a basis for stepped-care decisions such as proactive redirection to routine mental healthcare. Performance compares favorably to NLP-based outcome prediction benchmarks in digital mental health and aligns with meta-analytic estimates for treatment outcome prediction, despite using only text stems and strict anonymization. Importantly, unlike prior chat-based studies focusing on contemporaneous state (e.g., distress classification), this work targets a prospective outcome, which is generally more challenging and clinically relevant for precision care.
Clinical utility depends on more than predictive metrics; rigorous trials should test whether integrating model outputs into workflows improves client outcomes and system efficiency. Human-vs-model comparisons (e.g., counselor predictions versus algorithm) could clarify complementary strengths and inform hybrid decision support. The approach underscores benefits of simple, privacy-preserving models (lower computation, reduced PII handling, feasibility of extensive CV) while acknowledging potential gains from advanced language models in less restricted contexts. Explainability highlighted plausible predictors (age cues, self-harm/suicide, gender indications, time-of-day), and counselor-language signals suggesting in-session redirection effects, supporting interpretability and potential intervention design (e.g., targeted follow-up pathways).
Conclusion
NLP applied to anonymized first-session chat text can predict recurrent contact among youth seeking help, supporting the feasibility of AI-driven triage within chat counseling services. The study contributes a privacy-conscious, role-aware TF-IDF + XGBoost pipeline validated on a temporally held-out cohort and interpretable via SHAP and clustering analyses. Future work should: (1) combine text with non-text features (e.g., demographics, usage patterns) to enhance performance; (2) evaluate transformer-based approaches where legally/ethically permissible; (3) benchmark against counselor judgments; (4) conduct prospective trials to test clinical impact; and (5) design and assess tailored interventions (e.g., automated referrals, differentiated follow-ups) triggered by model risk estimates.
Limitations
- Counselor influence on recontact: Counselors were not instructed to limit recontacts and sometimes encouraged follow-up (e.g., asking a chatter to message again the next morning), potentially conflating need with counselor behavior; such night-time cases are likely rare but do occur.
- Recontact is an imperfect proxy for need: Factors such as gender-specific satisfaction or the inability to reconnect with the same counselor (explicitly communicated) may affect recontact independent of need; the finding that male-associated stems predict no recontact should not be interpreted as indicating lower need.
- Non-returners with unmet need: Some dissatisfied chatters with unmet need may not recontact; prior evaluations suggest high satisfaction (≈ two-thirds report high satisfaction; ≈90% would recommend), implying that this effect is limited but not negligible.
- Severity linkage: Recontact may be weakly tied to absolute symptom severity; severely affected chatters who accessed external care may not recontact, limiting severity inference. Additional severity outcomes could broaden utility.
- Anonymization constraints: Randomized word order and aggressive preprocessing precluded context inspection, n-grams, and sequence models, reducing interpretability for some high-importance stems.
- Bias in pretrained resources: Word2Vec used for clustering can embed societal biases, potentially affecting interpretability.
- Validation design: Time-based split improves ecological validity but may be affected by seasonality and temporal drifts, influencing performance estimates.