Introduction
The rising suicide rate in the United States, which increased by 36% between 2000 and 2021, underscores the critical need to understand the circumstances surrounding suicide in order to inform effective interventions and policymaking. The National Violent Death Reporting System (NVDRS) is a crucial data source that meticulously documents suicide incidents, including demographics and social determinants of health (SDoH). The NVDRS contains free-text death investigation notes describing the circumstances contributing to each suicide, along with coded suicide circumstance variables manually annotated by human abstractors. Although a standardized coding manual and training are provided, only 5% of annotations are verified by two independent annotators, leaving a substantial risk of annotation inconsistencies both across and within states. Previous research by the authors highlighted performance disparities across states for NLP methods that extract suicide circumstances from NVDRS narratives, suggesting underlying annotation inconsistencies. Although various methods exist for addressing data annotation errors in NLP (probabilistic approaches, machine learning models, and generative models via active learning), they are difficult to apply to NVDRS data because of the low rate of peer-reviewed annotations. This study addresses these limitations by proposing an empirical NLP approach to detect and rectify potential annotation inconsistencies in NVDRS death investigation notes, focusing on three prevalent suicide crises: Family Relationship Crisis, Mental Health Crisis, and Physical Health Crisis.
Literature Review
Several studies have explored methods for addressing data annotation errors in NLP. Conventional probabilistic approaches struggle with infrequent events and with distinguishing events of similar probability. Supervised training paradigms require high-quality annotated data, which the NVDRS dataset's low verification rate cannot guarantee. Previous attempts focused on general NLP tasks such as part-of-speech (POS) tagging and named entity recognition (NER), which are unsuitable for identifying mislabelings in free-text death investigation notes. This study builds on the authors' prior work using NLP to extract suicide circumstances from NVDRS narratives and addresses the limitations of previous methods with a novel approach that detects and corrects label inconsistencies within the dataset itself.
Methodology
This study uses data from the NVDRS covering 267,804 suicide death incidents between 2003 and 2020. Access to the NVDRS data required meeting eligibility requirements and ensuring confidentiality and data security, with approval obtained through a Restricted Access Database (RAD) proposal and Weill Cornell Medicine's Institutional Review Board.

The study comprises three tasks: validating inter-state annotation inconsistencies, identifying the specific data instances that cause these inconsistencies, and verifying improved annotation consistency after removing problematic instances. Three crises (Family Relationship, Mental Health, and Physical Health) were selected because of their prevalence and poor classification scores in previous work. Data preprocessing addressed class imbalance by excluding states with fewer than 10 positive instances and by downsampling negative instances to create balanced class distributions. Annotation inconsistency validation rested on the assumption that consistently annotated labels in two datasets yield equivalent predictive capability in models trained separately on each dataset; inconsistency was quantified by comparing the F-1 scores of models trained only on data from other states (PureOthers) with those trained on mixed data (Others+Target, Target+Others). A k-fold cross-validation approach identified problematic data instances by repeatedly training classifiers and counting prediction errors for each instance, with the threshold for flagging a potential mistake set at five errors (sketched below). To verify improved annotation consistency, flagged instances were removed from the training dataset and the model was retrained; a random-removal baseline was used for comparison.

Flagged instances were manually corrected by two annotators, with an inter-annotator agreement (IAA) of 0.893 (kappa). An incremental training paradigm with four training sets, advanced in steps of size T, was employed to analyze the impact of corrected data on model performance at a finer grain. Logistic regression models were used for the risk-of-bias analysis, examining the relationship between suicide circumstances and demographic variables (race, age, sex) and comparing odds ratios (ORs) across the original annotations, the annotations after removing flagged mistakes, and the annotations after random dropping. BioBERT served as the text classifier, with data instances longer than 512 tokens truncated.
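The following is a minimal sketch of the repeated cross-validation mistake-detection step described above. It uses a TF-IDF plus logistic-regression classifier as a lightweight stand-in for the BioBERT model used in the study; the number of repetitions, the fold count, and the column names are assumptions, while the flagging threshold of five prediction errors comes from the summary.

```python
"""Sketch: flag potentially mislabeled instances via repeated k-fold error counting."""
from collections import Counter

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

N_REPEATS = 10       # repeated k-fold runs (assumed)
N_SPLITS = 5         # folds per run (assumed)
ERROR_THRESHOLD = 5  # flag instances misclassified at least this many times (from the study)

def flag_potential_mistakes(df: pd.DataFrame,
                            text_col: str = "narrative",
                            label_col: str = "crisis_label") -> pd.Index:
    """Return indices of instances whose held-out predictions disagree with
    their annotated label at least ERROR_THRESHOLD times across all runs."""
    error_counts: Counter = Counter()
    texts, labels = df[text_col], df[label_col]

    for seed in range(N_REPEATS):
        skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=seed)
        for train_idx, test_idx in skf.split(texts, labels):
            # Train on the in-fold data, predict on the held-out fold.
            model = make_pipeline(TfidfVectorizer(max_features=20000),
                                  LogisticRegression(max_iter=1000))
            model.fit(texts.iloc[train_idx], labels.iloc[train_idx])
            preds = model.predict(texts.iloc[test_idx])
            # Count each disagreement between prediction and annotated label.
            for i, pred in zip(test_idx, preds):
                if pred != labels.iloc[i]:
                    error_counts[df.index[i]] += 1

    return pd.Index([idx for idx, n in error_counts.items() if n >= ERROR_THRESHOLD])
```

In this setup, instances returned by `flag_potential_mistakes` would be the candidates for removal or manual correction before retraining.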
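A sketch of the risk-of-bias comparison is shown below: a logistic regression of a coded suicide circumstance on demographic variables, with exponentiated coefficients read as odds ratios, fit once on the original annotations, once after removing flagged instances, and once after dropping the same number of instances at random. The column names, category codings, and variable `flagged` are assumptions for illustration.

```python
"""Sketch: compare odds ratios before and after removing flagged instances."""
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def odds_ratios(df: pd.DataFrame, outcome: str = "mental_health_crisis") -> pd.Series:
    """Fit outcome ~ race + age_group + sex and return exponentiated coefficients (ORs)."""
    model = smf.logit(f"{outcome} ~ C(race) + C(age_group) + C(sex)", data=df).fit(disp=False)
    return np.exp(model.params)

# Hypothetical usage, assuming `df` holds the annotated incidents and `flagged`
# holds the indices returned by the mistake-detection step:
# or_original = odds_ratios(df)
# or_cleaned  = odds_ratios(df.drop(index=flagged))
# or_random   = odds_ratios(df.drop(index=df.sample(len(flagged), random_state=0).index))
```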
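For the classification backbone, the sketch below shows how narratives might be prepared for a BioBERT sequence classifier with inputs truncated at 512 tokens, as described above. The specific checkpoint name, label count, and example text are assumptions, not details confirmed by the study.

```python
"""Sketch: BioBERT input preparation with 512-token truncation."""
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed BioBERT checkpoint; the classification head is newly initialized here.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = AutoModelForSequenceClassification.from_pretrained(
    "dmis-lab/biobert-base-cased-v1.1", num_labels=2)

encoded = tokenizer(
    ["Hypothetical narrative text describing the circumstances of the incident."],
    truncation=True,       # drop tokens beyond the 512-token limit
    max_length=512,
    padding="max_length",
    return_tensors="pt")

logits = model(**encoded).logits  # one binary circumstance classifier per crisis
```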
Key Findings
Analysis of annotation inconsistencies showed that incorporating target-state data into training improved performance on the target state's test set in approximately 83.7% of states for Physical Health Crisis, 32.5% for Family Relationship Crisis, and 33.3% for Mental Health Crisis. Conversely, performance decreased on other states' test sets in 69.8%, 40%, and 43.6% of states, respectively. Problematic instance discovery revealed a substantial number of potential mistakes in Ohio and Colorado for all three crises (Table 2). Removing these instances improved average micro F-1 scores across all three crises and in both states, with larger gains than the random-removal baseline (Figure 5, Table 3). Manual correction confirmed that many of the flagged instances were actual mislabelings. Incremental training showed significant improvement in model performance on other states' test sets when corrected data were used (Figure 6). The risk-of-bias analysis revealed differences in ORs for youths versus adults, Black versus White individuals, and females versus males after correcting the inconsistencies, underscoring the importance of data accuracy for data-driven suicide analysis (Table 4).
Discussion
The findings demonstrate the effectiveness of the proposed NLP approach in identifying and rectifying annotation inconsistencies in unstructured death investigation notes. The consistent improvements observed across states and crises highlight the impact of addressing these inconsistencies on model performance and generalizability. The significant differences in ORs before and after correction underscore the importance of data quality for accurate, unbiased analysis of suicide circumstances and their associations with demographic variables. The results support the view that rectifying label inconsistencies can improve the accuracy of suicide circumstance detection models and, in turn, the reliability of data-driven suicide prevention strategies. The study demonstrates a novel methodology for improving the quality of a large, complex healthcare dataset, a methodology with broad applicability to similar datasets.
Conclusion
This study presents a novel NLP framework for detecting and rectifying annotation inconsistencies in NVDRS death investigation notes. The results demonstrate the effectiveness of the proposed method in improving the accuracy and consistency of suicide circumstance attributions. Future research should explore the use of large language models (LLMs), address potential biases in reporting and data curation, and develop automated methods for label correction. Expanding this framework to other state-based reporting systems could improve data quality and enhance suicide prevention efforts.
Limitations
The study's limitations include the computational cost of the cross-validation-like approach for large datasets, the use of only BERT-based models, the potential for information loss due to token truncation in BERT, and the use of subjective manual corrections. Further refinement of parameter tuning and exploration of automatic correction methods are warranted. The potential impact of biases in incident reporting and data curation on the results is also a topic for future investigation.