Introduction
Natural Language Processing (NLP) offers significant potential for improving healthcare through the analysis of unstructured clinical text. However, data accessibility is a major obstacle, primarily due to ethical concerns surrounding the sharing of personally identifiable information (PII) contained in electronic health records (EHRs). While efforts exist to make de-identified EHR data available, strict regulations and a scarcity of mental health-specific resources remain significant challenges. The machine learning community often addresses similar issues with artificially generated data, a strategy well established in image processing. Text generation is a growing area of NLP research with applications in dialogue, machine translation, summarization, and story generation, and while there is work on generating medical data to assist clinicians, the replacement of real training data with artificial counterparts remains relatively understudied. Previous attempts, such as Lee's work generating short chief complaint documents, lacked analysis of downstream NLP task utility and often overlooked the risk of models memorizing rare, potentially sensitive information.
This study addresses these gaps by focusing on the generation of full mental health records (MHRs), a domain characterized by complex narratives and limited structured coding, for use in NLP applications. The proposed approach generates EHRs with state-of-the-art text generation models guided by key phrases extracted from real text, maintaining semantic integrity and relevance. The study tests three hypotheses: that statistically and clinically valid data can be generated, that this data is useful for downstream NLP tasks, and that the generation process can be controlled to mitigate the risk of retaining rare and sensitive information from the original data. The data used is already de-identified; the focus is on quantifying whether unusual or rare information leaks into the synthetic data and on the implications for releasing such data to the research community under less stringent governance.
Literature Review
The paper reviews existing literature on artificial data generation for NLP in healthcare, highlighting the limitations of previous work. It notes a lack of studies generating full EHR notes, particularly for mental health, and the absence of thorough evaluations of downstream NLP task performance and of the risk of memorizing rare, potentially sensitive information from the original data. The closest previous work, by Lee, generated only short chief complaint documents. The current research addresses these limitations, aiming to generate full-length MHRs while carefully assessing both the utility of the generated data for downstream NLP tasks and the risk of retaining sensitive information from the original data.
Methodology
The study uses EHRs from two databases: a mental health record (MHR) database (CRIS) and the MIMIC-III database, and explores two text classification tasks, diagnosis code prediction and phenotype prediction. The CRIS MHR dataset, already de-identified, comprises discharge summaries with common mental health ICD-10 diagnoses; the MIMIC-III dataset, also de-identified, is used for the phenotyping task. Data is split into training, validation, and test sets.
Artificial data generation employs a Transformer model guided by key phrases extracted from the original text with the RAKE (Rapid Automatic Keyword Extraction) algorithm. These key phrases, along with clinical information (ICD-10 code, gender, age, etc.), serve as input for generating artificial paragraphs sentence by sentence. Three experimental setups are compared: (a) using all extracted key phrases ('all'), (b) using top-scored key phrases plus clinical information ('top+meta'), and (c) using one top-scored key phrase per sentence plus clinical information ('one+meta'). A baseline ('key') simply uses all extracted key phrases without generation.
Intrinsic evaluation assesses text preservation (perplexity, ROUGE-L, BLEU, TER), memorization (the presence of rare and frequent n-grams from the original text in the generated text), and clinical validity (human evaluation by medical students using a seven-category annotation scheme, with inter-rater agreement assessed). Extrinsic evaluation uses the generated data in a text classification task, comparing models trained on artificial data against those trained on genuine data. Three text classification models are used: Latent Dirichlet Allocation (LDA), an n-gram bag-of-words (BoW) model, and a Convolutional Neural Network (CNN). Ablation experiments investigate the influence of the key phrases on classifier performance by removing them from both genuine and artificial data, and the Kolmogorov-Smirnov (KS) test is used to compare distributions of F1-scores. Sketches of the conditioning-input construction, the memorization check, and the extrinsic comparison follow below.
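To make the conditioning setups concrete, the following is a minimal Python sketch of how the per-sentence generator inputs could be assembled, assuming the `rake_nltk` package as a stand-in for the paper's RAKE step; the prompt format, metadata fields, the three-phrase cut-off for 'top+meta', and the function names are illustrative assumptions rather than the authors' exact pipeline.

```python
# Illustrative sketch only: rake_nltk stands in for the paper's RAKE step, and
# the prompt format / cut-offs below are assumptions, not the authors' code.
from rake_nltk import Rake  # requires the NLTK 'stopwords' and 'punkt' data


def extract_key_phrases(sentence: str) -> list[str]:
    """Return RAKE key phrases for one sentence, highest-scored first."""
    rake = Rake()
    rake.extract_keywords_from_text(sentence)
    return rake.get_ranked_phrases()


def build_condition(sentence: str, meta: dict, setup: str) -> str:
    """Assemble the generator input for one sentence under a given setup."""
    phrases = extract_key_phrases(sentence)
    meta_str = " ".join(f"{k}={v}" for k, v in meta.items())  # e.g. ICD-10, gender, age
    if setup == "all":        # every extracted key phrase, no clinical metadata
        return " ; ".join(phrases)
    if setup == "top+meta":   # top-scored phrases plus clinical metadata (cut-off assumed)
        return " ; ".join(phrases[:3]) + " | " + meta_str
    if setup == "one+meta":   # a single top phrase per sentence plus clinical metadata
        return (phrases[0] + " | " + meta_str) if phrases else meta_str
    raise ValueError(f"unknown setup: {setup}")


# Example: conditioning input for one sentence of a fictional discharge summary.
meta = {"icd10": "F20", "gender": "F", "age": "25-34"}
print(build_condition(
    "Patient reports persistent low mood and poor sleep.", meta, "top+meta"))
```

The generator then produces one artificial sentence per conditioning string, so an artificial paragraph is built sentence by sentence from these inputs.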
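The memorization analysis can likewise be approximated with a simple n-gram overlap check. The sketch below, with an assumed whitespace tokenisation and rarity threshold, measures what fraction of n-grams that are rare in the original corpus reappear verbatim in the generated text.

```python
# Simple approximation of the memorization check: which n-grams that occur
# rarely in the original records also appear in the generated records?
# Tokenisation (whitespace) and the rarity threshold are assumptions.
from collections import Counter


def ngrams(tokens: list[str], n: int):
    """Yield successive n-grams as tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def rare_ngram_leakage(original_docs, generated_docs, n=3, rare_max_count=1):
    """Fraction of rare original n-grams that also occur in the generated text."""
    orig_counts = Counter()
    for doc in original_docs:
        orig_counts.update(ngrams(doc.lower().split(), n))
    rare = {g for g, c in orig_counts.items() if c <= rare_max_count}

    generated = set()
    for doc in generated_docs:
        generated.update(ngrams(doc.lower().split(), n))

    leaked = rare & generated
    return len(leaked) / len(rare) if rare else 0.0
```

A low leakage fraction is what the intrinsic evaluation reports as low memorization of rare n-grams for the constrained setups.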
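For the extrinsic evaluation, the comparison amounts to training the same classifier once on genuine and once on artificial data and scoring both on the same held-out genuine test set. The sketch below uses scikit-learn's CountVectorizer and LogisticRegression as a simple stand-in for the paper's n-gram BoW classifier; the authors' exact models and hyper-parameters are not reproduced here.

```python
# Stand-in for the n-gram BoW classifier used in the extrinsic evaluation;
# the actual classifiers (LDA, BoW, CNN) and their settings are in the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline


def train_and_score(train_texts, train_labels, test_texts, test_labels) -> float:
    """Train an n-gram BoW classifier and return macro F1 on the test set."""
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2), min_df=2),  # uni- and bi-gram counts
        LogisticRegression(max_iter=1000),
    )
    model.fit(train_texts, train_labels)
    return f1_score(test_labels, model.predict(test_texts), average="macro")


# Same genuine test split for both conditions:
# f1_genuine    = train_and_score(real_train, real_y, test_texts, test_y)
# f1_artificial = train_and_score(generated_train, generated_y, test_texts, test_y)
```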
Key Findings
Intrinsic evaluation revealed that the 'top+meta' model, using minimal information from the original text, demonstrated balanced scores across metrics and showed low memorization of rare n-grams, minimizing the risk of retaining sensitive information. Human evaluation indicated sufficient inter-annotator agreement, with most generated sentences categorized as clinically valid and not contradicting the diagnosis. Extrinsic evaluation demonstrated comparable classification results when using artificial data ('top+meta' and 'one+meta') for training compared to genuine data across all three models (LDA, BoW, CNN). The 'top+meta' setup, using approximately 31% of the original words per sentence, consistently performed well. Ablation experiments confirmed that the inclusion of real key phrases in artificial data significantly influences classifier performance. Analysis of errors in the best-performing CNN model (genuine vs. top+meta) revealed a similar error pattern, indicating that the model trained on artificial data closely mirrors the behavior of the model trained on real data. Results on the MIMIC-III dataset showed similar trends, with the 'top+meta' setup yielding optimal performance for the BoW and CNN models.
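The "comparable performance" claim rests on the Kolmogorov-Smirnov comparison of F1-score distributions described in the Methodology; a minimal sketch is shown below, assuming F1-scores collected over repeated runs (the numbers are placeholders, not the paper's results).

```python
# Two-sample KS test on per-run F1-scores; the values are placeholders only.
from scipy.stats import ks_2samp

f1_genuine  = [0.71, 0.69, 0.72, 0.70, 0.73]   # hypothetical runs, genuine training data
f1_top_meta = [0.70, 0.68, 0.71, 0.69, 0.72]   # hypothetical runs, 'top+meta' training data

stat, p_value = ks_2samp(f1_genuine, f1_top_meta)
# A high p-value means the two F1 distributions cannot be distinguished,
# i.e. training on the artificial data behaves like training on the genuine data.
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
```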
Discussion
The study successfully demonstrates a method for generating artificial clinical documents that are both statistically and clinically valid for use in downstream NLP tasks. The 'top+meta' model provides a balance between data fidelity and information security, utilizing a small amount of information from the original text while minimizing the risk of memorization of sensitive details. The comparable performance of models trained on artificial data vs. genuine data highlights the potential of this approach to accelerate NLP development in the clinical domain. This is particularly relevant for institutions lacking in-house NLP expertise, enabling them to outsource the development of NLP algorithms using protected artificial data while adhering to strict governance procedures. The findings emphasize the importance of balancing the need for realistic data with the need to protect sensitive information. The methodology is relatively lightweight and adaptable, making it readily applicable to various clinical data settings.
Conclusion
This study presents a novel approach to generate high-quality artificial mental health records for NLP research. The method utilizes a transformer model guided by key phrases extracted from real data and demonstrates comparable performance to models trained on real data in downstream classification tasks. The results underscore the potential for this method to facilitate NLP research while minimizing privacy risks, thus accelerating progress in this crucial field. Future directions include exploring alternative modelling approaches such as adversarial learning and developing more robust methods for assessing the risk of revealing identifiable information. Additionally, expanding this method's applicability to other clinical NLP tasks and investigating strategies for handling longitudinal data coherence are vital for broader impact.
Limitations
The study focused on two specific NLP tasks (diagnosis code prediction and phenotype prediction), which limits generalizability to other NLP tasks. The human evaluation, while demonstrating sufficient inter-rater agreement, was conducted by medical students and may not fully capture the perspectives of experienced clinicians. The data used was already de-identified, which might not fully reflect the challenges of generating synthetic data from raw clinical records. Furthermore, the absence of universally accepted metrics and thresholds for assessing the risk of revealing sensitive information from free-text data limits how definitively the safety of the generated data can be established.