A natural language processing approach to detect inconsistencies in death investigation notes attributing suicide circumstances

Computer Science

S. Wang, Y. Zhou, et al.

Song Wang, Yiliang Zhou, Ziqiang Han, Cui Tao, Yunyu Xiao, Ying Ding, Joydeep Ghosh, and Yifan Peng apply a natural language processing approach to detect annotation inconsistencies in the suicide-circumstance variables of the National Violent Death Reporting System, showing how identifying and correcting mislabeled death investigation notes improves data quality for research and policy.

Introduction
Suicide deaths have increased in the United States, with a 36% rise in the suicide rate from 2000 to 2021, underscoring the importance of understanding suicide circumstances for effective prevention. The National Violent Death Reporting System (NVDRS) aggregates detailed data from all U.S. states, D.C., and Puerto Rico, including death investigation notes and manually coded suicide circumstance variables (e.g., Family Relationship Crisis, Mental Health Crisis, Physical Health Crisis). Only about 5% of incidents are double-annotated, raising concerns about annotation inconsistencies due to limited peer verification, potential gaps in expertise, and human error. Prior work using NLP to extract suicide circumstances from NVDRS narratives found performance disparities across states, suggesting inconsistent annotations. This study aims to detect and quantify annotation inconsistencies, identify specific mislabeled instances, and assess how correcting them affects model performance and potential bias analyses, using transformer-based models and an evaluation framework based on changes in F-1 scores and odds ratios across demographic subgroups.
Literature Review
The paper situates its contribution within prior efforts to detect annotation errors in NLP using probabilistic methods, SVMs, active learning, and pre-trained language models. Traditional probabilistic approaches struggle with infrequent events or events whose probabilities are close, and supervised paradigms require high-quality labels; both conditions are difficult to meet in NVDRS, where only about 5% of incidents are double-annotated. Prior inconsistency-detection work has focused on general-domain tasks such as POS tagging and NER, making those methods ill-suited for free-text death investigation notes. The study builds on previous work applying BERT-based models to NVDRS suicide crises and adapts ideas from label consistency validation in NER and discrepancy-based error detection to large-scale healthcare mortality narratives.
Methodology
Data source: NVDRS records of 267,804 suicide incidents (2003–2020) across all states, D.C., and Puerto Rico, each with two death investigation notes (Coroner/Medical Examiner and Law Enforcement). Suicide crises considered: Family Relationship, Mental Health, Physical Health. Annotation policy: if either note evidences a crisis, the abstractor codes it as present.
Preprocessing: For each crisis, exclude states with fewer than 10 positive instances; for each state, retain all positives and down-sample negatives to balance classes.
Validate annotation inconsistency (Step 1): For a target state s with dataset Ds (size x), sample m=4 disjoint subsets from other states, Dother (each of size x, with Ds ∩ Dother = ∅). Split Ds and Dother into train/validation/test sets (8:1:1). Create training sets of equal size: PureOthers (only other states), Others+Target (others plus target), and Target+Others (target plus others). Train transformer-based binary classifiers (BERT family; BioBERT backbone) for each crisis. Quantify inconsistency via ΔF-1, defined as the difference between the mean F-1 from mixed training (Others+Target and Target+Others) and the F-1 from PureOthers on the respective test sets; a compact formula follows this section. A larger positive ΔF-1 on the target state's test set and a more negative ΔF-1 on the other states' test sets indicate stronger inconsistency.
Identify problematic instances (Step 2): Concatenate Ds and Dother, shuffle, and perform k=5-fold cross-validation; repeat n=5 times with different random seeds. For each instance, count how often the model's held-out predictions disagree with the ground truth across the repetitions; denote this count ci (0≤ci≤n). Flag instances in Ds as potential mistakes (PMs) by thresholding the error count (threshold=5) to minimize false positives; see the sketch below.
Verify annotation consistency (Step 3): Remove the identified PMs from the training data and retrain; compare performance against a random baseline that removes the same number of instances at random. Additionally, recruit two trained annotators to manually review the PMs and correct actual mislabelings, resolving disagreements by discussion. Use an incremental training paradigm with step size T to feed four data combinations (Others+Target, Others+CorrectedTarget, Target+Others, CorrectedTarget+Others), analyzing F-1 on the target and other states' test sets as more data are added.
Risk of bias analysis: Fit logistic regression models per crisis to compute odds ratios (ORs) and 95% confidence intervals for comparisons across demographic subgroups (youth vs adults, Black vs white, female vs male), using original annotations, annotations after PM removal, and annotations after random dropping; demonstrated for Ohio and Colorado.
Implementation and training: BioBERT (max 512 tokens; 768-dimensional representations); sequences exceeding 512 tokens were truncated (5.1% affected). Binary cross-entropy loss; Adam optimizer; 30 epochs; model selection by validation performance; PyTorch implementation; experiments repeated with n=5 seeds. Hardware: Intel Xeon 6226R CPU, Nvidia RTX A6000 GPUs.
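The ΔF-1 criterion from Step 1 can be written compactly; the superscript labels below simply name the three training-set compositions described above and restate the prose definition rather than reproduce the authors' notation:

\Delta F_1 = \tfrac{1}{2}\left(F_1^{\text{Others+Target}} + F_1^{\text{Target+Others}}\right) - F_1^{\text{PureOthers}}

evaluated separately on the target state's test set and on the other states' test sets.

For Step 2, the sketch below illustrates the repeated k-fold disagreement counting in Python. It assumes a user-supplied train_and_predict helper that fine-tunes the BioBERT classifier on the training fold and returns binary predictions for the held-out texts; the helper name and interface are hypothetical, not the authors' code.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def flag_potential_mistakes(texts, labels, train_and_predict,
                            k=5, n_repeats=5, threshold=5, seed=0):
    """Count how often held-out predictions disagree with each instance's
    ground-truth label across n_repeats runs of k-fold cross-validation,
    and flag instances whose disagreement count reaches the threshold."""
    labels = np.asarray(labels)
    disagreements = np.zeros(len(labels), dtype=int)
    for rep in range(n_repeats):
        skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed + rep)
        for train_idx, test_idx in skf.split(np.zeros((len(labels), 1)), labels):
            # Hypothetical helper: fine-tune the classifier on the training fold
            # and return binary predictions for the held-out fold.
            preds = train_and_predict([texts[i] for i in train_idx], labels[train_idx],
                                      [texts[i] for i in test_idx])
            disagreements[test_idx] += (np.asarray(preds) != labels[test_idx]).astype(int)
    return np.where(disagreements >= threshold)[0]  # indices of potential mistakes (PMs)

With k=5 folds and n=5 repetitions, each instance receives exactly five held-out predictions, so threshold=5 flags only instances that the model contradicts in every repetition, matching the paper's goal of minimizing false positives.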
Key Findings
- Incorporating the target state's data into training increased the F-1 on the target state's test set by 5.4% on average and decreased F-1 on other states' test sets by 1.1%, indicating cross-state annotation inconsistencies.
- State-level ΔF-1 patterns: Physical Health Crisis: 83.7% (36/43) of states showed a positive ΔF-1 on the target state's test set and 69.8% (30/43) showed a negative ΔF-1 on other states' test sets. Family Relationship Crisis: 32.5% (13/40) improved on the target state; 40% (16/40) decreased on other states. Mental Health Crisis: 33.3% (13/39) improved on the target state; 43.6% (17/39) decreased on other states.
- Problematic instances identified: Ohio: Family Relationship 159/1077 (14.8%), Physical Health 324/2328 (13.9%), Mental Health 143/9654 (1.5%). Colorado: Family Relationship 254/3315 (7.7%), Physical Health 294/6019 (4.9%), Mental Health 168/8534 (2.0%).
- Removing PMs improved average micro F-1 on other states' test sets beyond random baselines: Ohio: Family Relationship 0.695→0.713 (random 0.701), Physical Health 0.645→0.664 (random 0.654), Mental Health 0.571→0.600 (random 0.585). Colorado: Family Relationship 0.705→0.726 (random 0.714), Physical Health 0.684→0.694 (random 0.690), Mental Health 0.574→0.607 (random 0.587).
- Manual correction (Ohio, Family Relationship): inter-annotator agreement κ=0.893. Of 159 PMs, 89 were confirmed mislabels (87 false negatives, 2 false positives). Incremental training with corrected labels improved the final average micro F-1: other states' test set 0.691→0.733; Ohio test set 0.679→0.714.
- Correcting errors yielded an average F-1 improvement of 3.85% across settings.
- Risk of bias analysis showed changes in ORs after removing/correcting mistakes, e.g., Colorado Mental Health: Youth OR from 0.89 [0.59–1.33] (original) to 0.65 [0.31–1.36] (after PM removal); Black OR from 0.68 [0.49–0.93] to 0.51 [0.07–3.70]; Female OR from 0.20 [0.03–1.42] to 0.48 [0.27–0.84].
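The odds ratios and confidence intervals above come from per-crisis logistic regressions over demographic indicators. The sketch below shows how such estimates can be obtained in Python with statsmodels; the synthetic data frame, column names, and the choice to fit all three indicators in one model are illustrative assumptions, not the authors' exact specification.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical frame: one row per suicide incident in a state, with a binary
# crisis annotation and binary demographic indicators.
df = pd.DataFrame({
    "mental_health_crisis": rng.integers(0, 2, 500),
    "youth":  rng.integers(0, 2, 500),   # 1 = youth, 0 = adult
    "black":  rng.integers(0, 2, 500),   # 1 = Black, 0 = white
    "female": rng.integers(0, 2, 500),   # 1 = female, 0 = male
})

# Logistic regression of the crisis label on the demographic indicators;
# exponentiating the coefficients gives odds ratios and their 95% CIs.
X = sm.add_constant(df[["youth", "black", "female"]])
fit = sm.Logit(df["mental_health_crisis"], X).fit(disp=0)

summary = pd.concat([np.exp(fit.params).rename("OR"),
                     np.exp(fit.conf_int()).rename(columns={0: "2.5%", 1: "97.5%"})],
                    axis=1)
print(summary)

Refitting the same model with the original labels, the labels after PM removal, and the labels after random dropping reproduces the kind of before/after OR comparison reported for Ohio and Colorado.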
Discussion
Findings indicate substantial annotation inconsistencies in NVDRS death investigation notes across states. Including target state data improves performance on that state while often hurting generalization to other states, consistent with labeling differences. The cross-validation-based identification and removal of likely mislabeled instances consistently increased F-1 scores on other states’ test sets and, after manual correction, also improved performance on both target and other states. These improvements suggest that aligning annotations reduces noise and enhances model generalizability. The odds ratio analyses demonstrated that correcting annotation errors can meaningfully alter estimated associations between suicide circumstances and demographic subgroups, highlighting the importance of label accuracy for epidemiologic inference and policy. The proposed framework offers a practical pathway to detect, verify, and rectify label inconsistencies in large-scale mortality datasets and could be extended to other state-based reporting systems.
Conclusion
The study introduces an empirical NLP framework leveraging transformer-based models to detect cross-state annotation inconsistencies in NVDRS death investigation notes, identify likely mislabeled instances, and improve data quality through removal and manual correction. Results demonstrate improved classification performance and altered subgroup association estimates after addressing inconsistencies, underscoring the necessity of accurate labels for reliable research and policy. Future work includes exploring large language models, optimizing parameters (e.g., folds, thresholds) via grid search, developing scalable automatic correction methods, mitigating subgroup biases, and applying the framework to other reporting systems such as FARS.
Limitations
- Computational cost: the cross-validation-like discovery of problematic instances becomes demanding as dataset size grows.
- Model scope: demonstrations are limited to BERT-based models, though the framework is model-agnostic; newer LLMs were not evaluated.
- Input composition: CME and LE reports were concatenated despite potential conflicts between the two sources.
- Sequence length: 5.1% of inputs exceeded BioBERT's 512-token limit and were truncated, potentially omitting relevant information.
- Parameter choices: the number of folds and the error threshold were fixed (k=5, threshold=5); further tuning could yield better results.
- Potential subgroup biases remain and warrant deeper investigation and mitigation strategies.
- Manual correction, while effective, is not scalable; automated approaches are needed.