Introduction
Suicide rates in the United States are high and have risen markedly over the past two decades. Geographic variations in these rates are substantial, with higher rates observed in Western states and in rural areas. Current suicide surveillance systems, however, suffer from reporting lags of several months or more, which hinders timely and effective public health interventions. Timely, localized data are therefore critical for targeted suicide prevention efforts. While researchers have explored near real-time data sources such as social media and online searches for individual-level suicide risk prediction, their application to geographic-level suicide surveillance is less developed. Some existing work has used social media data for national-level suicide prediction, but these studies often lack the state-level granularity required for effective local interventions. This study addresses this gap by investigating the use of multiple near real-time data sources within a machine learning framework to generate state-specific weekly estimates of suicide fatalities. The researchers acknowledge potential biases in the real-time data sources, including convenience sampling and varying access to and utilization of online platforms and health services across geographic regions. The goal is a more comprehensive and timely method for estimating suicide fatalities that mitigates some of these inherent biases.
Literature Review
The introduction extensively reviews existing literature on suicide rates in the US, highlighting the geographic disparities and the limitations of current surveillance systems. It discusses the potential of using complementary near real-time data sources, such as social media, online search trends, and health administrative data, to improve the timeliness of suicide trend estimations. The review points to prior research using Twitter data to indicate geography-specific suicide rates and other studies using various online data for individual-level suicide risk prediction. However, it emphasizes the lack of research on leveraging such data sources to enhance geographically focused suicide surveillance at the state level. The review mentions a study by Choi et al. (2020) which used real-time ensemble data to estimate US national suicide deaths, but notes that this work did not address state-level prediction. Finally, the review acknowledges the potential biases associated with using real-time online data sources due to convenience sampling and differing patterns of online data usage across geographic locations.
Methodology
The study employed six data sources: Google search trends (suicide-related terms), Twitter posts (suicide-related keywords), YouTube search trends (suicide-related terms), emergency department (ED) visits for suicidal ideation or attempts (from the National Syndromic Surveillance Program), weekly averages of self-reported PHQ-9 assessments (from Mental Health America), and historical state-level weekly suicide fatality data (from CDC's National Vital Statistics System). The study focused on four states: Colorado, Louisiana, New York, and Utah. Data preprocessing involved geolocating Twitter posts using APIs and aggregating all data at the state level. The primary modeling approach used long short-term memory (LSTM) recurrent neural networks, which are well suited to time series data and mitigate the vanishing gradient problem that affects traditional RNNs. Separate LSTM models were built using individual data sources and combinations thereof. The models used a two-week sliding window to predict weekly suicide fatalities. Hyperparameters (number of hidden layers, hidden dimensions, and epochs) were tuned with a grid search to minimize root mean squared error (RMSE). Baseline models were constructed using only lagged historical suicide fatality data. Additional comparative analyses included a two-phase ensemble approach similar to Choi et al. (2020) and a single-phase approach that used principal component analysis (PCA) to fuse features from all data sources before feeding them into various machine learning models. Model performance was evaluated using RMSE, Pearson correlation coefficient, and mean absolute difference (MAD).
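The two-week sliding window can be illustrated with a minimal sketch. The summary does not say exactly how the windows were aligned with the target week, so the code below assumes that the two preceding weeks of features are paired with the following week's fatality count. The function name `make_windows`, the two feature columns, and all numbers are hypothetical and serve only to show the shape of the data an LSTM would receive.

```python
from typing import List, Sequence, Tuple


def make_windows(
    features: Sequence[Sequence[float]],
    targets: Sequence[float],
    window: int = 2,
) -> Tuple[List[List[List[float]]], List[float]]:
    """Pair each run of `window` consecutive weekly rows with the next week's target.

    Assumption: the window covers the weeks immediately before the target week.
    """
    if len(features) != len(targets):
        raise ValueError("features and targets must cover the same weeks")
    inputs: List[List[List[float]]] = []
    labels: List[float] = []
    for end in range(window, len(targets)):
        # Rows end - window .. end - 1 form one input sequence.
        inputs.append([list(row) for row in features[end - window:end]])
        # The target is the fatality count for the week right after the window.
        labels.append(targets[end])
    return inputs, labels


# Hypothetical toy data: one row per week, columns = [search index, ED visits].
weekly_features = [[0.2, 11.0], [0.4, 9.0], [0.3, 12.0], [0.5, 10.0], [0.6, 13.0]]
weekly_deaths = [20.0, 22.0, 19.0, 23.0, 24.0]

X, y = make_windows(weekly_features, weekly_deaths, window=2)
print(len(X), "samples; first window:", X[0], "-> target", y[0])
```

Each element of `X` is a sequence of two weekly feature vectors, which matches the (time steps, features) layout that recurrent models typically expect. Five weeks of data yield three training samples because the first two weeks have no complete preceding window.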
Key Findings
The LSTM model using all data sources produced accurate estimates of state-specific suicide rates in all four states, with negative percentage errors of at most 5.323% in magnitude (-2.768% for Utah, -2.823% for Louisiana, -3.449% for New York, and -5.323% for Colorado). The LSTM model consistently outperformed the baseline autoregressive model, which used only historical suicide fatality data. Analysis of individual data sources revealed that, in New York, online data sources (Google Trends, YouTube Trends, and Twitter) generally had lower error rates and higher week-to-week Pearson correlations than health services data sources. Sensitivity analysis using alternative methods for combining data sources showed generally higher error rates than the LSTM models. For the ‘All Sources’ model, the RMSE ranged from 3.765 in Utah to 7.414 in New York, the Pearson correlation was highest in New York (0.475) and lowest in Louisiana (0.061), and the lowest MAD was in Utah (3). The study found that models using all data sources generally performed better than those using online data, health services data, or historical fatality data alone. However, performance varied across states, and the model consistently underestimated suicide deaths, potentially because national suicide rates were increasing during the study period.
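The four reported metrics can be sketched as follows. The summary does not give their exact formulas, so this sketch assumes the standard definitions of RMSE and Pearson correlation, takes MAD as the mean of the absolute weekly differences, and takes percentage error as the signed difference between total estimated and total observed deaths relative to the observed total. Under that last assumption, a negative value indicates underestimation, consistent with the negative errors reported above. The weekly counts are hypothetical.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error over paired weekly values."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mean_absolute_difference(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean of absolute weekly differences (assumed definition of MAD)."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def pearson(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation between observed and estimated weekly values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def percentage_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Signed error of the estimated total relative to the observed total (assumed definition)."""
    total_actual = sum(actual)
    return 100.0 * (sum(predicted) - total_actual) / total_actual


# Hypothetical weekly counts, not taken from the study.
observed = [20.0, 22.0, 19.0, 23.0]
estimated = [19.0, 21.0, 20.0, 21.0]

print("RMSE:", round(rmse(observed, estimated), 3))
print("MAD:", mean_absolute_difference(observed, estimated))
print("Pearson r:", round(pearson(observed, estimated), 3))
print("Percentage error:", round(percentage_error(observed, estimated), 3))
```

Because the toy estimates sum to less than the observed total, the percentage error is negative, the same direction of bias the study reports for all four states.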
Discussion
The study's findings demonstrate the potential of combining multiple real-time data sources with deep learning to produce accurate and timely estimates of state-level suicide rates. This approach offers significant advantages over traditional methods reliant on delayed mortality data, enabling more effective and targeted suicide prevention efforts. The superior performance of the LSTM model compared to the baseline model highlights the value of incorporating real-time data. However, the study also reveals state-wise variations in model performance, suggesting the need for further investigation into factors that influence the predictive power of different data types in diverse geographic settings. The consistent underestimation of suicide deaths may be due to several factors, including the increasing trend in national suicide rates, limitations in data sources (e.g., reliance on public social media accounts), and potential misclassification of deaths. Further exploration is needed to explain the observed variability in model performance and to refine the model for improved accuracy.
Conclusion
This study successfully demonstrates a deep learning approach for predicting weekly state-level suicide fatalities using multiple real-time data sources. The LSTM model's accuracy and its outperformance of baseline models highlight the potential for timely, localized suicide surveillance and improved prevention efforts. Future research should investigate factors explaining the variability in model performance across states, explore additional data sources (including environmental factors), and refine the model to address consistent underestimation of suicide rates. Further research into the potential for using these models in real-world public health settings is also crucial.
Limitations
The study's limitations include its focus on a limited number of states, which may affect the generalizability of findings. The use of public social media data introduced potential biases related to self-reported location information and the representativeness of public Twitter accounts. The consistent underestimation of suicide deaths warrants further investigation. Additional data sources could potentially improve the model's accuracy. The gold-standard data (mortality figures) might also be affected by challenges in postmortem reviews and misclassification of deaths, potentially influencing model performance.