
Medicine and Health
A machine learning approach predicts future risk to suicidal ideation from social media data
A. Roy, K. Nikolitch, et al.
This study introduces SAIPH, a machine learning algorithm designed to predict suicidal ideation risk from Twitter data. Led by a team including Arunima Roy and Zachary A. Kaminsky, the research examines suicide risk behaviors and proposes a potential clinical decision tool for screening and monitoring.
~3 min • Beginner • English
Introduction
The study addresses the challenge of prospectively identifying individuals at risk for suicidal ideation (SI) and, by extension, suicidal behaviors. Traditional clinical screening often misses many at-risk individuals, and the low base rate of suicidal behavior complicates prospective research. Social media, particularly Twitter, presents a public, longitudinal data source where users may disclose distress and suicide-relevant risk factors that are not revealed in clinical settings. The authors aim to develop a machine learning model (SAIPH) that predicts who is at risk for SI and when risk is elevated, using text signals that precede explicit mentions of SI. The approach is grounded in psychological theories linked to suicide risk (interpersonal psychological theory for suicide, hopelessness model) and related constructs (depression, anxiety, insomnia). The study also tests whether aggregated algorithmic risk scores can reflect population-level suicide mortality patterns.
Literature Review
Prior work shows youths often disclose suicidality and related distress online and that online expressions can correlate with psychometric assessments of suicide risk. Existing social media analyses frequently focus on detecting tweets explicitly mentioning suicide, using methods like LIWC or deep learning to identify suicidal content, which limits prediction before SI is expressed. Reported accuracies for identifying suicidal tweets or users vary (e.g., LIWC-based approaches, CNN classifiers), but few studies aim to forecast future SI or model temporal risk without prior SI mentions. One study modeled temporal suicide topic patterns but required users’ previous suicidal content, limiting applicability to first-time ideators. Reviews indicate a small fraction of suicide prevention studies use social networks and even fewer apply machine learning, highlighting a gap for theory-guided, temporally sensitive models. Research also links social media suicide-related content to regional suicide rates, suggesting feasibility for population-level surveillance.
Methodology
Data collection: Using the Twitter API via Tweepy, the authors performed weekly queries from September 2016 to June 2019 to identify users expressing SI with the query "I suicide thinking OR planning." User timelines were downloaded; identifiers and mentions were anonymized or scrubbed. Control tweets were gathered using the search term "I". Tweets from the UK were excluded per local copyright laws. A total of 7,223,922 tweets were collected (6,385,079 controls; 838,843 SI). Demographics (age group and sex) were inferred where possible using M3inference.
Event labeling: A rule-based scanner flagged potentially suicidal tweets using ordered keyword patterns. A psychiatrist reviewed flagged tweets ≥3 months post-collection to confirm SI events. For controls, a random tweet day was marked as an "event". Tweets flagged but deemed non-SI (e.g., prevention or celebrity-suicide discussions) were retained as control events. Past suicide attempt or plan (SAP) was identified using a similar flagging approach followed by psychiatrist confirmation; only users active in January 2020 were included in SAP analysis. For decedent analysis, public profiles of 9 celebrities who died by suicide were collected; date of death served as event date.
Regional sampling: For 92 US counties with 2017 suicide death rates (CDC WONDER), 50 tweets per hour were sampled using the query "I" with coordinates per county during August 2019 and again in September/October 2019. Aggregated SI scores per county were compared with county suicide death rates overall and by age groups.
Neural networks for psychological constructs: Training data were gathered using construct-specific queries (burden, loneliness, stress, anxiety, insomnia, depression; three depression models). Tweets were filtered by TextBlob subjectivity ≥0.5; construct-positive cases had polarity <0, controls polarity >0 to emphasize negative vs positive sentiment. Text was represented with a bag-of-words and used to train single-hidden-layer neural networks (20 neurons; logistic SGD; alpha=0.001; 100,000 epochs; batch size=1) using sklearn MLPClassifier (Python 3.6.8, sklearn 0.21.3). Hyperparameters were tuned via k-fold cross-validation; 20–40 neurons performed similarly and 20 was chosen. Linear SVMs with a linear kernel were also trained for comparison.
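A minimal sketch of one construct classifier, using a hypothetical toy corpus in place of the construct-specific Twitter queries and sklearn's CountVectorizer as the bag-of-words step; the MLP hyperparameters mirror those reported (a single hidden layer of 20 neurons, logistic activation, SGD, alpha=0.001, batch size 1). The TextBlob subjectivity/polarity filtering is noted in comments only.

```python
# Sketch of one psychological-construct scorer. The tweets below are
# hypothetical stand-ins; in the study, training tweets were first filtered
# by TextBlob subjectivity (>=0.5) and split by polarity sign.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

# Hypothetical construct-positive (negative-polarity) and control tweets.
positive = ["i feel so alone tonight", "nobody would miss me", "so tired of everything"]
controls = ["great day at the beach", "loved that movie", "excited for the weekend"]

texts = positive + controls
labels = [1] * len(positive) + [0] * len(controls)

vectorizer = CountVectorizer()  # bag-of-words representation
X = vectorizer.fit_transform(texts)

clf = MLPClassifier(hidden_layer_sizes=(20,),  # single hidden layer, 20 neurons
                    activation="logistic",
                    solver="sgd",
                    alpha=0.001,
                    batch_size=1,
                    max_iter=1000,
                    random_state=0)
clf.fit(X, labels)

# Score a new tweet on the 0-1 construct scale.
score = clf.predict_proba(vectorizer.transform(["i feel alone"]))[0, 1]
```

Nine such networks (one per construct, three for depression) produce the per-tweet feature vector used downstream.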
Neural network validation: Networks were validated against binary-coded items adapted from psychometric scales (e.g., perceived stress scale), computing AUCs to assess construct capture. Sentiment polarity was also evaluated as a comparator metric.
SI model training and validation: Two validation strategies were used. The primary split balanced cases and controls by month of most recent event into training (N cases=283; 512,526 tweets; N controls=2655; 3,518,494 tweets) and test (N cases=277; 326,317 tweets; N controls=2691; 2,866,585 tweets) sets. A second split trained on data with events prior to August 2018 (N cases=265; N controls=2422) and tested on August 2018–May 2019 (N cases=278; N controls=2911). Random forest classifiers were trained with bootstrap aggregating to balance classes, yielding 10 models whose outputs were averaged per tweet. Inputs were the 9 neural network construct scores plus sentiment subjectivity and polarity (scenario 1) or the neural network scores alone (scenario 2). Hyperparameters were tuned via GridSearchCV; the optimal settings were unlimited tree depth (no max depth), 10 estimators, and random state 0. Cross-validation on the training set yielded mean accuracy 0.83 ± 0.040.
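The class-balancing ensemble can be sketched as follows, assuming synthetic feature vectors (9 construct scores plus subjectivity and polarity, 11 columns): each of 10 forests is fit on all cases plus an equally sized bootstrap sample of controls, and per-tweet probabilities are averaged.

```python
# Sketch of bootstrap aggregating to balance classes. Feature matrices are
# synthetic; in the study each row holds a tweet's 9 construct scores plus
# sentiment subjectivity and polarity.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_cases = rng.uniform(0.5, 1.0, size=(50, 11))      # hypothetical SI-event tweets
X_controls = rng.uniform(0.0, 0.5, size=(500, 11))  # hypothetical control tweets

models = []
for i in range(10):
    # Downsample controls to match the case count via a bootstrap draw.
    idx = rng.choice(len(X_controls), size=len(X_cases), replace=True)
    X_bag = np.vstack([X_cases, X_controls[idx]])
    y_bag = np.concatenate([np.ones(len(X_cases)), np.zeros(len(X_cases))])
    rf = RandomForestClassifier(n_estimators=10, max_depth=None, random_state=0)
    models.append(rf.fit(X_bag, y_bag))

def si_score(X):
    """Average the 10 forests' SI-class probabilities per tweet."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

scores = si_score(np.vstack([X_cases[:5], X_controls[:5]]))
```

Averaging across the 10 balanced forests keeps the full control pool in play without letting the 10:1 class imbalance dominate any single model.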
Model assessment: For individual-level prediction, mean model score was computed from tweets at least 1 day before the event and at least 7 days after any previous SI event. Performance was summarized by AUC for all SI events, single-event ideators, and recurrent ideators; sex- and age-stratified analyses were performed using inferred demographics. Threshold analyses yielded sensitivity, specificity, and PPV across cutoffs; permutations (10,000) assessed robustness.
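The permutation approach to AUC significance can be sketched with synthetic per-user mean model scores; labels are shuffled (10,000 times in the study, fewer here for speed) and the observed AUC is compared against the resulting null distribution.

```python
# Sketch of a permutation test for AUC significance on synthetic scores.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(0.7, 0.1, 40),    # hypothetical ideators
                         rng.normal(0.4, 0.1, 160)])  # hypothetical controls
labels = np.concatenate([np.ones(40), np.zeros(160)])

observed = roc_auc_score(labels, scores)

# Null distribution: AUCs under randomly permuted labels.
null = np.array([roc_auc_score(rng.permutation(labels), scores)
                 for _ in range(1000)])
p_value = (np.sum(null >= observed) + 1) / (len(null) + 1)
```

The "+1" correction keeps the permutation p-value strictly positive, a standard convention when the observed statistic exceeds every permuted one.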
Temporal analysis: For individuals with ≥120 days of data pre-event, an individual-specific baseline threshold (mean model score) was computed. Frequency of scores above the threshold (frequency score) was calculated within sliding windows of 4, 7, 14, and 21 days, normalized by tweeting frequency. Logistic regression assessed whether peaks (frequency score exceeding baseline) occurred more often closer to the SI event across lags from 5 to 120 days. Similar temporal analyses were applied to suicide decedents, examining mean scores and risk peaks relative to death.
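The frequency-score computation can be sketched for one user, assuming a synthetic daily series of mean model scores: the baseline is the user's own mean score, and the frequency score in each trailing window is the share of days above that baseline (normalizing by tweeting frequency).

```python
# Sketch of the individual-specific baseline and sliding-window frequency
# score, on a hypothetical 120-day history with elevated scores pre-event.
import numpy as np

rng = np.random.default_rng(2)
daily_scores = rng.uniform(0.2, 0.6, size=120)  # hypothetical daily mean scores
daily_scores[-14:] += 0.3                        # elevated scores near the event

baseline = daily_scores.mean()  # individual-specific threshold

def frequency_score(series, window):
    """Proportion of days above baseline within each trailing window."""
    above = (series > baseline).astype(float)
    return np.array([above[max(0, i - window + 1):i + 1].mean()
                     for i in range(len(series))])

freq21 = frequency_score(daily_scores, window=21)  # 21-day window
```

Windows of 4, 7, and 14 days follow the same recipe; a peak is then a frequency score exceeding the baseline rate, and logistic regression relates peak timing to event proximity.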
Regional analyses: Kendall’s tau correlations, Poisson or hierarchical models (reporting IRR) examined associations of county-level mean SI scores with suicide death rates overall and by age group, including interaction with population size and assessing minimal sampling duration (days) needed for stable associations.
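The county-level rank association can be sketched with synthetic data for 92 counties; Kendall's tau, as reported in the study, is robust to the non-normal distribution of both quantities. The simulated link between scores and rates is an assumption for illustration only.

```python
# Sketch of the county-level Kendall's tau association on synthetic data.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(3)
mean_si_score = rng.uniform(0.3, 0.7, size=92)  # hypothetical county SI scores
# Hypothetical death rates with a weak positive dependence on SI score.
death_rate = 10 + 12 * mean_si_score + rng.normal(0, 2, size=92)

tau, p = kendalltau(mean_si_score, death_rate)
```

The Poisson and hierarchical models (reporting IRRs) would regress death counts on mean SI score with population offsets; that step is omitted here for brevity.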
Statistics: Conducted in R; non-normal distributions assessed using Anderson-Darling and non-parametric tests as needed; AUC significance via permutation (10,000). Significance defined as two-tailed p<0.05.
Key Findings
- Neural network construct scorers achieved AUCs often exceeding 0.70 against binary-adapted psychometric items. Sentiment polarity best captured depression, anxiety, stress, and sleep loss, while neural networks outperformed SVMs overall (mean NN AUC 0.68 ± 0.11 vs SVM 0.63 ± 0.12; p=0.014).
- Individual SI prediction: Using 9 NN scores plus sentiment metrics in random forests with bagging, the model predicted 830 SI events (from 277 ideators) vs 3159 control events with AUC=0.88 (95% CI 0.86–0.90). Single-event ideators AUC=0.80 (95% CI 0.75–0.84); recurrent ideators AUC=0.90 (95% CI 0.88–0.92). Sentiment polarity alone AUC=0.74 (95% CI 0.71–0.76). Using NN scores without sentiment AUC=0.85 (95% CI 0.83–0.87). Year-split validation (train pre-Aug 2018; test Aug 2018–May 2019) AUC=0.83 (95% CI 0.81–0.86).
- Threshold operating point: Score threshold 0.683 maximized sensitivity and specificity (sensitivity 80%, specificity 78.9%, posterior probability 42.4% for increased SI risk when above threshold).
- Demographics: Performance similar by sex (males AUC=0.88; females AUC=0.85). AUC negatively correlated with age (rho = −0.90, p = 1.42×10⁻⁶) indicating stronger performance in younger users.
- SAP identification among ideators: Model distinguished SI events from those with past suicide attempt/plan vs non-SAP ideators with AUC=0.75 (95% CI 0.64–0.85). Threshold 0.731 yielded sensitivity 80%, specificity 78.3%, posterior probability 75.2% for SAP among SI when above threshold. Number of SI tweets was higher in SAP (mean 6.81 ± 9.83) vs non-SAP (2.38 ± 3.12), p=2.8×10⁻¹⁰; AUC=0.69 using count. SAP prediction AUC decreased with age (rho=−0.86, p=8.25×10⁻⁸).
- Temporal risk prediction: Mean scores over sliding windows had AUCs highest near the event (e.g., 21-day window at 1 day before event AUC=0.84) and remained ≥0.75 across start positions; random start permutations average AUC 0.805 ± 0.013. Frequency of above-threshold peaks predicted near-term SI: strongest association at 6 days using 21-day window (OR=6.1 ± 1.1, p=6.9×10⁻¹¹⁶), with highest OR at 1 day (OR=6.7 ± 1.1, p=9×10⁻⁷¹).
- Suicide decedents: Across 9 celebrity cases, mean model scores increased approaching death (rho=−0.70, p=8.7×10⁻¹⁹). Peaks in frequency score were most consistently associated ~20 days before death with 21-day window (OR=7.6 ± 1.2, p=8.8×10⁻²⁰).
- Regional validation: In 92 US counties, mean SI scores over 16 days in August 2019 correlated with county suicide death rates (Kendall’s tau=0.16, p=0.021); Poisson regression IRR=1.12 (95% CI 1.02–1.23, p=0.0201). Population size by SI score interaction significant (tau=0.38, p=6.67×10⁻¹⁰). In September/October 2019, mean SI scores correlated with death rates (tau=0.31, p=1.16×10⁻⁵); hierarchical model IRR=1.33 (95% CI 1.19–1.50, p=4.48×10⁻⁷). Minimal sampling required: August data suggested 16 days (IRR=1.91, 95% CI 1.30–2.82, p=0.0013); September/October data indicated as few as 8 days (multiple subsets, all p<0.05). Associations strongest in younger age groups: significant for ages 15–24 (tau=0.15, p=0.041), 25–34 (tau=0.18, p=0.013), 35–44 (tau=0.15, p=0.038), consistent with higher Twitter usage among younger populations.
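The posterior probabilities reported above follow from Bayes' rule applied to sensitivity, specificity, and the base rate of events among scored tweets. A sketch of that calculation, with illustrative base rates that are assumptions rather than values from the study:

```python
# Sketch of how a posterior probability follows from sensitivity,
# specificity, and base rate via Bayes' rule. Base rates are illustrative.
def posterior(sensitivity, specificity, base_rate):
    """P(SI event | score above threshold)."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

# With sensitivity 0.80, specificity 0.789, and an assumed base rate of 0.16,
# the posterior is roughly 0.42; at a 1% base rate it drops sharply, which is
# why low base rates imply many false positives in deployment.
p_high = posterior(0.80, 0.789, 0.16)
p_low = posterior(0.80, 0.789, 0.01)
```

This sensitivity to the base rate is the quantitative core of the false-positive concern raised in the Limitations section.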
Discussion
The study demonstrates that theory-informed machine learning applied to public Twitter text can identify individuals at elevated risk for future SI without relying on explicit prior SI mentions. Combining neural network-derived psychological construct scores with random forest classification achieved strong discrimination, particularly in recurrent ideators and younger users. Temporal analyses indicate that surges in above-threshold model outputs provide actionable windows of elevated near-term risk, with odds of SI increasing approximately sevenfold within about 10 days of a peak. Validation against suicide decedents showed consistent pre-death risk peaks, and regional aggregation of SI scores tracked county-level suicide mortality, particularly in age groups with higher Twitter engagement. These findings support a two-step strategy for potential clinical decision support: use average risk scores for identifying at-risk individuals followed by personalized temporal monitoring of peak frequencies to anticipate risk windows. The results suggest applicability for both individual-level screening and population-level surveillance, with potential integration into clinical workflows and public health monitoring, while highlighting the need for careful consideration of false positives and ethical deployment.
Conclusion
The authors introduce SAIPH, a machine learning framework that predicts individual-level risk for suicidal ideation from Twitter content before explicit SI is expressed and provides temporal signals of imminent risk. The model also scales to population-level surveillance, where aggregated scores correlate with county suicide death rates, especially among younger age groups. Future work should include prospective validation against psychometrically assessed SI and behavior, application across diverse demographic groups and platforms, refinement of temporal models (e.g., anomaly detection, sequence models), incorporation of additional risk domains (e.g., substance use), and evaluation of clinical and public health implementation to optimize screening, monitoring, and intervention timing.
Limitations
- Outcome limitation: The primary outcome is expression of SI on social media, not clinically assessed SI or attempts; correspondence to real-world behavior cannot be fully quantified. Some controls may have experienced SI without expressing it online.
- Platform and data scope: The algorithm is trained and evaluated on Twitter only; generalizability to other platforms or offline contexts is untested.
- Construct coverage: Emphasis on depressive and stress-related constructs may omit important pathways (e.g., substance use) due to weaker construct classifier performance.
- Demographics: Limited demographic and socioeconomic detail (e.g., race/ethnicity/nationality) restricts assessment of equity and generalizability; performance decreased with age, reflecting Twitter usage biases.
- Controls and temporal context: Control selection used the term "I" and random event days; alternative control strategies (e.g., matching "I think" on the same day) might better control for contemporaneous events and mood fluctuations.
- Regional analyses: County-level comparisons used prior-year death rates versus current Twitter data; temporal mismatch may affect inference. Sensitivity to national events (e.g., mass shootings) influences required sampling duration.
- False positives and base rates: Even with good sensitivity/specificity, low base rates imply many false positives, particularly for suicide attempts, affecting practical deployment.
- Methodological choices: Sequence models (RNN/LSTM) were not used; temporal modeling relies on sliding windows and thresholds, which may miss complex dynamics or require sufficient per-user data.
- Data/code availability: Due to policy and sensitivity, underlying data and code are not publicly available, limiting external replication.