
Generation and evaluation of artificial mental health records for Natural Language Processing
J. Ive, N. Viani, et al.
Limited access to mental health records has stifled NLP innovations in clinical settings. Researchers, including Julia Ive and Natalia Viani from Imperial College London, unveil a promising method for generating artificial clinical documents, showing that such data can match original records for training NLP models while safeguarding sensitive information.
Introduction
The study addresses the challenge of limited access to clinical free-text data, especially in mental health records, due to privacy and governance constraints. The authors hypothesize that (1) statistically and clinically valid artificial electronic health record (EHR) text can be generated using key phrase–guided neural text generation; (2) such artificial data can be useful for downstream NLP tasks; and (3) the generation process can be controlled by varying the amount of input key phrases to mitigate leakage of rare, potentially sensitive information. The work is motivated by gaps in available de-identified mental health datasets and the need for shareable proxies that enable development and benchmarking of NLP models while minimizing disclosure risks.
Literature Review
Prior work has explored medical text generation and EHR note prediction but rarely as a replacement for genuine training data in NLP pipelines. Jing et al. generated brief medical imaging reports using hierarchical RNNs; Liu forecasted EHR note content conditioned on prior patient data; and Lee generated short chief complaint texts (<20 tokens) with an encoder–decoder, conditioning on diagnosis and admission information. However, utility of synthetic text for downstream NLP tasks and risks of retaining rare information have been understudied. No prior work has focused on generating full EHR notes for NLP purposes in mental health, a domain with complex narratives and limited structured coding.
Methodology
Datasets: Two corpora were used. (1) CRIS MHR dataset: de-identified discharge summaries from the South London and Maudsley (SLaM) NHS Foundation Trust, accessed via the CRIS platform and filtered to common mental health ICD-10 diagnoses. Splits: train-gen-mhr (24,273 patients; 537K lines; 12M tokens), val-gen-mhr (1,348 patients; 30K lines; 653K tokens), and test-gen-mhr (1,349 patients; 29K lines; 659K tokens). For ICD-10 diagnosis classification, test-gen-mhr was repartitioned into train-class-mhr, dev-class-mhr, and test-class-mhr, with 10% held out for development and 20% for testing. (2) MIMIC-III dataset: discharge summaries linked to a phenotyping task covering 13 phenotypes (~1,000 patients); additional generation data were extracted by matching the first three characters of the ICD-9 diagnosis codes, excluding patients in the phenotyping cohort. Splits: train-gen-mimic (9,767 patients; 10,926 admissions; 1.2M lines; 20M tokens), val-gen-mimic (126 patients; 132 admissions; 13K lines; 224K tokens), and test-gen-mimic (the phenotyping set: 1,045 patients; 1,560 admissions; 183K lines; 3M tokens). For phenotype classification, test-gen-mimic was repartitioned into train-class-mimic, dev-class-mimic, and test-class-mimic, again with 10% for development and 20% for testing.
Generation approach: A Transformer encoder–decoder (OpenNMT-py v0.2.1) is trained for gap-filling: the input is a concatenation of clinical metadata and key phrases, and the output is the original paragraph, generated sentence by sentence. Clinical metadata include diagnosis code/description (ICD-10/ICD-9), gender, age, a death indicator, timestamp relative to admission, record section type, and sentence order within the section. Key phrases are extracted from each paragraph with RAKE and aligned at the sentence level. Three input configurations are compared: (a) all: all extracted key phrases; (b) top+meta: the top-scored key phrases (~3 per sentence) plus clinical metadata; (c) one+meta: the single best key phrase per sentence plus clinical metadata. A baseline, key, simply echoes the input key phrases without generation. Inputs are lowercased, dates are replaced with placeholders, sentences shorter than 5 words are removed, rare words (frequency = 1) are mapped to UNK, and the vocabulary is ~50K. Models are trained for 30K steps; decoding uses beam search with beam size 5.
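A minimal sketch of how one gap-filling training pair might be assembled, assuming the rake-nltk package for RAKE extraction; the separator tokens, metadata field names, per-sentence extraction, and example sentence are illustrative rather than the exact format used in the paper.

```python
# Sketch: building a (source, target) pair for the Transformer gap-filling task.
# rake-nltk is an assumed implementation of RAKE; token markers are hypothetical.
from rake_nltk import Rake

def build_example(sentence, metadata, top_k=3):
    """Return (source, target) strings for one sentence (top+meta style input)."""
    rake = Rake()
    rake.extract_keywords_from_text(sentence)
    # Keep the top-scored key phrases (roughly 3 per sentence in the top+meta setup).
    key_phrases = rake.get_ranked_phrases()[:top_k]

    # Concatenate clinical metadata and key phrases into a single source sequence.
    meta_part = " ".join(f"<{name}> {value}" for name, value in metadata.items())
    kp_part = " ".join(f"<kp> {phrase}" for phrase in key_phrases)
    source = f"{meta_part} {kp_part}".lower()
    target = sentence.lower()   # the model learns to reconstruct the full sentence
    return source, target

src, tgt = build_example(
    "Patient was admitted with low mood and started on sertraline.",
    {"icd10": "F32", "gender": "female", "age": "45", "section": "history", "sent_id": "2"},
)
```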
Intrinsic evaluation: Compared generated vs original text with perplexity (PPL), ROUGE-L (recall), BLEU (precision), and TER (edit rate); also average sentence length. Analysed cumulative TER distributions across setups. Focused further analyses on top+meta and one+meta (which use fewer original words).
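As a rough illustration of the intrinsic comparison, the sketch below scores generated sentences against their originals with BLEU, TER, and ROUGE-L recall; the choice of sacrebleu and rouge-score is an assumption, and perplexity (which requires a trained language model) is omitted.

```python
# Sketch: intrinsic comparison of generated vs. original sentences.
import sacrebleu
from sacrebleu.metrics import TER
from rouge_score import rouge_scorer

originals = ["he was discharged home with community follow up ."]
generated = ["he was discharged home with follow up in the community ."]

# BLEU (precision-oriented) and TER (edit rate) over the whole corpus.
bleu = sacrebleu.corpus_bleu(generated, [originals])
ter = TER().corpus_score(generated, [originals])

# ROUGE-L recall, averaged over sentence pairs.
scorer = rouge_scorer.RougeScorer(["rougeL"])
rouge_l = sum(scorer.score(orig, gen)["rougeL"].recall
              for orig, gen in zip(originals, generated)) / len(originals)

print(f"BLEU={bleu.score:.1f}  TER={ter.score:.1f}  ROUGE-L recall={rouge_l:.2f}")
```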
Memorization assessment: Following a Carlini-style analysis, for the top+meta model on train-gen-mhr, 1K sentences containing rare n-grams (lowest frequency quartile) and 1K containing frequent n-grams (highest quartile) were selected for each n = 2, 3, 5. The analysis measured the proportion of target n-grams present in the inputs (%, in) and in the outputs (%, out), as well as model confidence (PPL) when regenerating those sentences.
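The %, in and %, out statistics amount to n-gram containment checks; the sketch below shows one way to compute them, with the record layout ("ngram", "input", "output") being a hypothetical stand-in for the selected sentences.

```python
# Sketch: n-gram containment used for the %, in and %, out statistics.
def ngrams(tokens, n):
    """Set of n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def containment(selected, field, n):
    """Percentage of selected target n-grams that also occur in the model's
    `field` text ('input' or 'output') for the same example."""
    found = sum(ex["ngram"] in ngrams(ex[field].split(), n) for ex in selected)
    return 100.0 * found / len(selected)

# Toy example with two records (n = 2); real buckets hold ~1K sentences each.
selected = [
    {"ngram": ("community", "follow"), "input": "community follow up", "output": "follow up arranged"},
    {"ngram": ("low", "mood"),         "input": "started sertraline",  "output": "presented with low mood"},
]
print(containment(selected, "input", 2), containment(selected, "output", 2))  # 50.0 50.0
```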
Human evaluation: Three trained medical student annotators independently labeled ~1,000 sentences per setup (top+meta and one+meta) from test-gen-mhr using seven fine-grained categories (1–7), grouped into SAME, GOOD, BAD/IRRELEVANT, and NO SENSE. Inter-annotator agreement was measured by accuracy and Cohen's kappa on the grouped categories over a doubly annotated subset of 120 documents.
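Agreement on the grouped categories can be computed as in the sketch below; the mapping from the seven fine-grained labels to the four groups is illustrative (not the paper's exact grouping), and scikit-learn's metrics are one of several possible implementations.

```python
# Sketch: inter-annotator agreement on grouped categories.
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical grouping of the seven fine-grained labels into the four groups.
GROUPS = {1: "SAME", 2: "GOOD", 3: "GOOD", 4: "BAD/IRRELEVANT",
          5: "BAD/IRRELEVANT", 6: "NO SENSE", 7: "NO SENSE"}

def agreement(labels_a, labels_b):
    """Accuracy and Cohen's kappa between two annotators after grouping."""
    grouped_a = [GROUPS[x] for x in labels_a]
    grouped_b = [GROUPS[x] for x in labels_b]
    return accuracy_score(grouped_a, grouped_b), cohen_kappa_score(grouped_a, grouped_b)

acc, kappa = agreement([1, 2, 4, 7, 2], [1, 3, 4, 6, 2])
```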
Extrinsic evaluation: Downstream text classification with three models: (a) LDA (150 topics) + Random Forest (RF), (b) bag-of-words with n-grams up to length 5 + RF, and (c) a Kim-style CNN with Word2Vec embeddings trained on each corpus. For CRIS (ICD-10 diagnosis classification), five-fold cross-validation was used and F1 was averaged per class (F20, F32, F60, F31, F25, F10). Models trained on genuine data were compared against those trained on artificial datasets (all, top+meta, one+meta, key), using two-sample Kolmogorov–Smirnov (2S-KS) tests for (i) between-model comparisons within a data type and (ii) within-model comparisons across data types. Ablation: common key phrases were removed from both genuine and artificial training data in the top+meta setup to assess reliance on lexical overlap.
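The 2S-KS comparisons reduce to scipy's two-sample test over the per-fold (or per-class) F1 distributions, as in the sketch below; the score values shown are placeholders, not results from the paper.

```python
# Sketch: two-sample Kolmogorov-Smirnov test between F1 score distributions
# of classifiers trained on genuine vs. artificial text.
from scipy.stats import ks_2samp

# Per-fold F1 scores from five-fold CV (placeholder values).
f1_genuine = [0.51, 0.47, 0.49, 0.46, 0.48]
f1_topmeta = [0.44, 0.42, 0.45, 0.41, 0.43]

stat, p_value = ks_2samp(f1_genuine, f1_topmeta)
print(f"KS statistic={stat:.3f}, p={p_value:.3f}")
# A small p-value indicates the two score distributions differ significantly;
# the same test is applied between models within a single data type.
```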
Implementation details: Preprocessing with spaCy 2.0.18; LDA via Gensim 3.6.0; RF via scikit-learn 0.20.2 (balanced subsample, max_depth=2); CNN with Keras 2.2.4 (embedding dimension 500; convolutional filter sizes 3 and 8 with 10 filters each; max-pooling size 2; dropout 0.5/0.8; dense layer with 50 units; L2 regularization); record length fixed to the upper quartile; five random restarts per experiment. Training on GPU (Azure Tesla K80 for CRIS; GTX 1070 for MIMIC). Ethics approvals and governance are described for both CRIS and MIMIC.
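A rough Keras sketch of a Kim-style CNN with the hyperparameters listed above, written against tensorflow.keras as a modern stand-in for Keras 2.2.4; the exact layer ordering, where each dropout rate is applied, the pooling arrangement, the L2 strength, and the vocabulary/sequence-length placeholders are assumptions.

```python
# Sketch: Kim-style text CNN approximating the described hyperparameters.
from tensorflow.keras import layers, models, regularizers

VOCAB_SIZE, SEQ_LEN, EMB_DIM = 50_000, 1_000, 500   # placeholders except EMB_DIM
NUM_CLASSES = 6                                      # six ICD-10 classes in the CRIS task

inputs = layers.Input(shape=(SEQ_LEN,))
emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)  # initialised from Word2Vec in the paper

branches = []
for kernel_size in (3, 8):                           # two filter sizes, 10 filters each
    conv = layers.Conv1D(10, kernel_size, activation="relu")(emb)
    pool = layers.MaxPooling1D(pool_size=2)(conv)
    branches.append(layers.GlobalMaxPooling1D()(pool))

merged = layers.concatenate(branches)
merged = layers.Dropout(0.5)(merged)
dense = layers.Dense(50, activation="relu",
                     kernel_regularizer=regularizers.l2(0.01))(merged)  # assumed L2 strength
dense = layers.Dropout(0.8)(dense)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(dense)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```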
Key Findings
Intrinsic evaluation: Increasing the amount of input information brings the generated text closer to the originals. The all setup produces text most similar to the source (ROUGE-L ~0.8 across datasets) with low TER; the key baseline (input echo) yields the shortest sentences and poor content coverage. The top+meta model provides the best balance of metrics, with longer sentences and more even TER distributions, indicating semantic preservation with rephrasing; one+meta shows high TER (≥0.5) for ~85% of sentences, reflecting greater divergence.
Memorization: In train-gen-mhr with top+meta, low-frequency n-grams appear more often among the input key phrases than high-frequency ones (e.g., for 2-grams, %, in is 40% for low vs 16% for high). In the outputs, low- and high-frequency 2-grams appear at the same rate (%, out = 48% for both), implying that only ~8% of low-frequency 2-grams, versus ~32% of high-frequency 2-grams, are reproduced beyond what was already provided in the inputs. Higher PPL values for sentences containing low-frequency n-grams indicate the model is less confident reproducing rare sequences, suggesting limited memorization risk.
Human evaluation: Inter-annotator agreement indicates acceptable reliability: accuracy 0.78 (κ=0.54) for top+meta and accuracy 0.87 (κ=0.49) for one+meta on grouped categories. Most frequent labels were “Modified, does not contradict the diagnosis” (49% top+meta; 66% one+meta) and “Preserved, details omitted” (24% both). Contradictions to diagnosis were rare (~1%). Disagreements mainly between GOOD and NO SENSE.
Extrinsic evaluation (CRIS diagnosis classification): CNN performed best overall; genuine average F1: CNN 0.48, LDA 0.39, BoW 0.28. Training on artificial data yielded comparable trends and preserved model ranking. CNN avg F1: all 0.45, top+meta 0.43, one+meta 0.39, key 0.34. LDA avg F1: all 0.38, top+meta 0.38, one+meta 0.37, key 0.39 (≈genuine 0.39). BoW avg F1: all 0.28, top+meta 0.29, one+meta 0.29, key 0.24. KS tests showed significant differences in some within-model comparisons (e.g., CNN genuine vs top+meta p=0.004), but artificial data maintained relative performance gaps between models (CNN>LDA>BoW). Error analysis of CNN showed similar behavior between genuine and top+meta models with overlapping error cases; top+meta had more high-confidence FPs but slightly fewer FNs.
Ablation (CRIS): Removing common key phrases from both genuine and artificial training data still yielded comparable performance between genuine and top+meta for LDA (avg F1 0.30 vs 0.26; KS p=4.4e-04) and CNN (avg F1 0.25 vs 0.18; KS p=2.7e-04), supporting that artificial text adds useful context beyond exact key phrase overlap.
Extrinsic evaluation (MIMIC phenotyping): Genuine average F1: CNN 0.46, BoW 0.34, LDA 0.23. Artificial data preserved model ranking and produced comparable performance: CNN all 0.45, top+meta 0.40 (KS p=0.13 vs genuine), one+meta 0.36; BoW all 0.32, top+meta 0.31 (p=0.28), one+meta 0.27; LDA all/top+meta/one+meta ~0.21. key performed poorly (e.g., CNN 0.24). Overall, synthetic data—especially top+meta—can effectively train downstream classifiers with performance close to genuine data.
Discussion
The findings support the hypotheses that key phrase–guided Transformer models can generate clinically coherent artificial discharge summaries that are statistically close to authentic text and useful for training downstream classifiers. The top+meta configuration, which uses relatively little original text (≈3 key phrases per sentence plus metadata), strikes a balance between preserving clinical meaning and reducing lexical overlap, thereby lowering the risk of leaking rare or identifiable information. Memorization analyses suggest limited reproduction of rare n-grams beyond those explicitly provided in inputs, and human evaluation confirms that generated sentences generally remain clinically valid with very few contradictions to diagnoses. Extrinsic evaluations on two datasets demonstrate that models trained on artificial data reproduce the relative performance ordering of standard NLP classifiers and achieve F1-scores comparable to models trained on genuine text, indicating that synthetic corpora can serve as viable proxies for method development under data governance constraints. These outcomes are significant for enabling external development and benchmarking of clinical NLP models when direct access to real EHR text is restricted, potentially accelerating innovation while mitigating privacy risks.
Conclusion
This work introduces a practical framework for generating artificial clinical discharge summaries using a key phrase–guided Transformer, evaluated intrinsically (text similarity, TER distributions, memorization risk, human clinical validity) and extrinsically (downstream diagnosis and phenotype classification). Key contributions include: (1) demonstrating that artificial text conditioned on minimal key phrases and metadata can be both clinically valid and statistically similar to source text; (2) showing limited memorization of rare n-grams; and (3) establishing that synthetic data can train classifiers with performance comparable to genuine data while preserving model rankings. Future directions include exploring generation methods that further minimize reliance on genuine inputs (e.g., adversarial or privacy-preserving learning), developing quantitative metrics and thresholds for identifiability risk in free text, extending to additional NLP tasks (information extraction, temporal modeling), and addressing longitudinal coherence across multi-document patient histories.
Limitations
- The study focuses on discharge summaries and two downstream tasks (ICD-10 diagnosis classification for mental health and phenotyping), which may not generalize to other clinical text types or NLP tasks (e.g., information extraction, temporal reasoning).
- While data were already PHI-de-identified and rare tokens filtered, there is no universally accepted metric for residual identifiability in free text; memorization analyses rely on n-gram frequency heuristics and perplexity proxies.
- Some within-model performance differences between genuine and synthetic training data were statistically significant (e.g., CNN on CRIS), indicating residual gaps.
- The approach relies on key phrases extracted from real data; although ablations suggest added value beyond overlap, dependency on key phrase quality and selection remains.
- Longitudinal consistency across multiple documents per patient was not modeled; the method generates sentences at paragraph level without ensuring cross-note coherence.
- Trained models and protected datasets cannot be publicly released due to governance constraints, limiting reproducibility to re-implementation on authorized data.