Medicine and Health
Large language models for preventing medication direction errors in online pharmacies
C. Pais, J. Liu, et al.
Medication direction errors in pharmacies can be dangerous. Cristobal Pais, Jianfeng Liu, Robert Voigt, Vin Gupta, Elizabeth Wade, and Mohsen Bayati address this problem with MEDIC, a system that pairs a fine-tuned language model for entity extraction with pharmacy domain knowledge and deterministic safety guardrails to generate standardized directions, improving prescription accuracy and significantly reducing near-miss events in a production online pharmacy.
Introduction
The study addresses medication direction errors (particularly incorrect dosage and frequency) that contribute substantially to preventable adverse drug events and costs in the pharmacy workflow. Errors commonly arise during data entry, when prescriber free-text directions are transcribed into standardized patient-facing instructions, a process complicated by abbreviations, typos, ambiguous or incomplete entries, and heterogeneous style guidelines across systems and countries. Electronic health records (EHRs), while structured, still permit free text, perpetuating inconsistencies. High-risk examples include methotrexate weekly dosing misentered as daily, which can cause severe harm. The research question is whether integrating domain knowledge and safety guardrails with LLM-based NLP can reliably interpret prescriber directions and generate standardized, pharmacist-quality directions that reduce near-miss events in online pharmacy operations. The purpose is to improve accuracy and safety in the data entry (DE) phase, thereby reducing pharmacist workload and near-misses, with a human-in-the-loop design to ensure safety.
Literature Review
The paper situates its work within literature on medication errors and patient safety, noting millions of preventable adverse drug events annually and significant dispensing error rates in community pharmacies. It references challenges introduced by EHR free-text fields and variability in prescribing standards. Within NLP/LLM literature, the authors note LLMs’ strong text capabilities and methods for adaptation (fine-tuning and prompting), but emphasize risks of hallucination and overconfidence in high-stakes clinical contexts. Prior work on clinical decision support and pharmacy-direction simplification (e.g., neural machine translation approaches) informs the benchmarks. Recent surveys highlight limitations of standard NLP metrics in capturing clinical severity, motivating human evaluation and safety guardrails.
Methodology
Design: MEDIC (Medication Direction Copilot) is a three-stage system with human-in-the-loop oversight aimed at the pharmacy DE and pharmacist verification (PV) stages. It both suggests standardized directions and flags discrepancies between prescriber directions and technician-entered directions.
Data: Approximately 1.6M single-line historical Amazon Pharmacy directions were processed and split into subsets: D_H (1,000) for human labeling; D_Train (~1.58M); D_Test (20,000); D_Eval (1,200). A medication catalog (D_MedCat) was constructed from RxNorm, OpenFDA, and Amazon’s drug catalog (~99% coverage), including medication attributes and default/required components (e.g., verb, route) to power deterministic guardrails.
MEDIC pipeline:
- Stage 1: Pharmalexical normalization. A rule-based preprocessing module applies hundreds of pharmacist-derived transformation rules to standardize raw prescriber text (normalizing abbreviations, correcting typos, standardizing phrasing), producing clean inputs for the next stage (a normalization sketch follows this list).
- Stage 2: AI-powered extraction. A DistilBERT-based named entity recognition model is fine-tuned to extract core components (verb, dose, route, frequency) and auxiliary entities (indication, action, max dose, time, period). Training used D_H (1,000 human-labeled samples, D_HL), augmented synthetically to D_HLA (10,000); a separate synthetic test set, D_HLA^T (10,000), assessed extraction quality. Hyperparameters included batch size 16, learning rate 1e-5, 3 epochs, and weight decay 1e-5, with final settings selected by Bayesian optimization (a fine-tuning sketch follows this list). On D_HLA^T the model achieved precision/recall/F1 > 0.99, with only six misclassifications across 160,484 entities. Sensitivity analyses showed augmentation was critical: F1 ≈ 0.70 without augmentation, ≈0.90 with 5,000 augmented samples, and no significant gains beyond 10,000.
- Stage 3: Semantic assembly and safety enforcement. Extracted components are assembled into standardized directions, filling missing required elements from D_MedCat when available and ordering components as verb, dose, route, frequency, then auxiliary subcomponents. Safety guardrails halt suggestion generation when: GR1) extracted components conflict with D_MedCat (e.g., a verb incompatible with the dosage form); GR2) any core component has multiple values (suggesting multi-line complexity); GR3) a dose is present without a verb (where the catalog requires one); GR4) the frequency is missing; GR5) the dose is missing when the form is a tablet or capsule. If any guardrail is triggered, MEDIC abstains from suggesting a direction (a guardrail sketch follows this list).
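To make Stage 1 concrete, here is a minimal sketch of rule-based pharmalexical normalization, assuming a small table of regex substitutions. The production module applies hundreds of pharmacist-derived rules; the names below (NORMALIZATION_RULES, normalize) and the specific rules are illustrative only.

```python
import re

# Hypothetical excerpt of pharmacist-derived rewrite rules: each regex maps
# a common abbreviation or shorthand to standardized phrasing. The real
# module applies hundreds of such rules; these few are illustrative.
NORMALIZATION_RULES = [
    (r"\bpo\b", "by mouth"),       # route abbreviation
    (r"\bbid\b", "twice daily"),   # frequency abbreviation
    (r"\bqd\b", "once daily"),
    (r"\bprn\b", "as needed"),
    (r"\btabs?\b", "tablet"),      # "tab"/"tabs" -> dose form
    (r"\s+", " "),                 # collapse repeated whitespace
]

def normalize(raw_direction: str) -> str:
    """Apply rule-based pharmalexical normalization to raw prescriber text."""
    text = raw_direction.lower().strip()
    for pattern, replacement in NORMALIZATION_RULES:
        text = re.sub(pattern, replacement, text)
    return text

print(normalize("Take 1 tab PO BID PRN"))
# -> "take 1 tablet by mouth twice daily as needed"
```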
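The Stage 2 extractor can be sketched with the Hugging Face transformers Trainer, using the hyperparameters reported above. The label scheme (plain tags rather than BIO), the toy sample, and the dataset class are assumptions standing in for the paper's augmented D_HLA training set.

```python
import torch
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Core and auxiliary components named in the paper; the exact label
# scheme used in the original work is an assumption of this sketch.
LABELS = ["O", "VERB", "DOSE", "ROUTE", "FREQUENCY",
          "INDICATION", "ACTION", "MAX_DOSE", "TIME", "PERIOD"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(LABELS))

# One toy sample standing in for the 10,000-sample augmented set D_HLA.
enc = tokenizer("take 1 tablet by mouth twice daily", return_tensors="pt")
# Label ids aligned to [CLS] take 1 tablet by mouth twice daily [SEP];
# -100 masks special tokens from the loss (the usual convention).
label_ids = torch.tensor([[-100, 1, 2, 2, 3, 3, 4, 4, -100]])

class ToyDirections(torch.utils.data.Dataset):
    def __len__(self):
        return 1
    def __getitem__(self, i):
        return {"input_ids": enc["input_ids"][0],
                "attention_mask": enc["attention_mask"][0],
                "labels": label_ids[0]}

# Hyperparameters as reported: batch size 16, lr 1e-5, 3 epochs,
# weight decay 1e-5 (final settings chosen by Bayesian optimization).
args = TrainingArguments(output_dir="medic-ner",
                         per_device_train_batch_size=16,
                         learning_rate=1e-5,
                         num_train_epochs=3,
                         weight_decay=1e-5,
                         report_to="none")

Trainer(model=model, args=args, train_dataset=ToyDirections()).train()
```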
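And a minimal sketch of the Stage 3 guardrails GR1–GR5, assuming `components` holds the per-entity values extracted in Stage 2 and `catalog` is the medication's D_MedCat entry; the field names are hypothetical, not the production interface.

```python
def guardrails_pass(components: dict, catalog: dict) -> bool:
    """Return True only if it is safe to emit a suggested direction."""
    # GR1: extracted components must not conflict with the catalog,
    # e.g. the verb must be compatible with the dosage form.
    verb = components.get("verb", [])
    if verb and verb[0] not in catalog.get("allowed_verbs", []):
        return False
    # GR2: multiple values for any core component suggest a complex,
    # possibly multi-line direction; abstain rather than guess.
    for core in ("verb", "dose", "route", "frequency"):
        if len(components.get(core, [])) > 1:
            return False
    # GR3: a dose without a verb is ambiguous when the catalog says
    # the verb is a required component.
    if components.get("dose") and not verb and catalog.get("verb_required", True):
        return False
    # GR4: a direction without a frequency is unsafe to standardize.
    if not components.get("frequency"):
        return False
    # GR5: tablets and capsules must carry an explicit dose.
    if catalog.get("dose_form") in ("tablet", "capsule") and not components.get("dose"):
        return False
    return True

components = {"verb": ["take"], "dose": ["1 tablet"],
              "route": ["by mouth"], "frequency": ["twice daily"]}
catalog = {"allowed_verbs": ["take"], "dose_form": "tablet", "verb_required": True}
print(guardrails_pass(components, catalog))  # True: safe to suggest
```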
Flagging function: Both the prescriber direction and the DE-entered direction undergo normalization and extraction; their components are then semantically compared (standardized against D_MedCat) to flag discrepancies in real time (a comparison sketch follows).
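A minimal sketch of the flagging comparison, assuming both directions have already passed through Stages 1 and 2. The real system compares components semantically after standardization against D_MedCat; this sketch uses exact matching for brevity.

```python
def flag_discrepancies(prescriber: dict, entered: dict) -> list[str]:
    """Compare per-component values and return the components that differ."""
    flags = []
    for component in ("verb", "dose", "route", "frequency"):
        if prescriber.get(component) != entered.get(component):
            flags.append(component)
    return flags

# Echoes the methotrexate example: a weekly direction entered as daily.
prescriber = {"verb": "take", "dose": "1 tablet",
              "route": "by mouth", "frequency": "once weekly"}
entered = {"verb": "take", "dose": "1 tablet",
           "route": "by mouth", "frequency": "once daily"}
print(flag_discrepancies(prescriber, entered))  # ['frequency'] -> near-miss caught
```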
Benchmarks:
- T5-FineTuned: a T5-base model fine-tuned on n pairs of (raw prescriber direction → pharmacist-verified direction) from D_Train, with n ∈ {100, 1k, 10k, 100k, 1.5M}; feeding normalized inputs was also evaluated. Latency ≈ 1 s.
- Claude (Anthropic v2.1): zero- and few-shot prompting (0, 5, or 10 examples). Latency ≈ 7.6–8.2 s (a prompting sketch follows this list).
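As a sketch of the few-shot benchmark, the following uses the Anthropic Python SDK's Messages API. The model id, examples, and prompt wording are illustrative, not the paper's exact prompt.

```python
import anthropic

# Hypothetical few-shot exemplars of (raw direction, standardized direction).
FEW_SHOT = [
    ("1 tab po qd", "Take 1 tablet by mouth once daily."),
    ("ii gtts ou bid", "Instill 2 drops in both eyes twice daily."),
]

def suggest_direction(raw: str) -> str:
    """Prompt Claude to rewrite a raw direction in standardized form."""
    examples = "\n".join(f"Input: {i}\nOutput: {o}" for i, o in FEW_SHOT)
    prompt = ("Rewrite the prescriber direction as a standardized, "
              "patient-facing pharmacy direction.\n\n"
              f"{examples}\n\nInput: {raw}\nOutput:")
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-2.1",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text.strip()
```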
Evaluations:
- Retrospective NLP metrics: BLEU and METEOR on D_Eval (1,200) and D_Test (20,000), plus a rule-based baseline (Stage 1 only); a metric-computation sketch follows this list.
- Retrospective human review: Pharmacists identified suggestions that would constitute near-miss events if used; also assessed clinical severity of near-misses.
- Prospective deployment: Before-after comparison in production against the then-active baseline system. Primary metric: near-miss rate on directions. Secondary: suggestion coverage, adoption rate by DE techs, post-adoption edit ratio. Latency and cost were monitored. Human-in-the-loop mechanisms routed problematic cases to a review queue for catalog and data updates and periodic retraining.
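A minimal sketch of the retrospective text-similarity metrics using NLTK. The paper's exact BLEU/METEOR configuration (n-gram weights, smoothing) is not specified here, so the settings below are assumptions.

```python
import nltk
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet for synonym matching

# One (reference, hypothesis) pair standing in for D_Eval/D_Test; references
# are pharmacist-verified directions, hypotheses are system suggestions.
references = [["take 1 tablet by mouth twice daily".split()]]
hypotheses = ["take one tablet by mouth twice daily".split()]

bleu = corpus_bleu(references, hypotheses,
                   smoothing_function=SmoothingFunction().method1)
meteor = sum(meteor_score(refs, hyp)
             for refs, hyp in zip(references, hypotheses)) / len(hypotheses)
print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}")
```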
Runtime/Cost: MEDIC average latency ~200 ms on CPU; T5-FineTuned ~1 s on CPU; Claude/ChatGPT/Gemini ~7.6–8.2 s with usage costs. MEDIC and T5 incur no per-inference usage cost and run on low-cost CPUs.
Key Findings
Retrospective NLP metrics:
- On D_Eval (1,200), T5-FineTuned (1.5M) slightly outperformed MEDIC in BLEU/METEOR; Claude variants lagged behind both. On D_Test (20,000), larger fine-tuning datasets improved T5 performance; with only 100 samples, T5 underperformed the rule-based baseline; with 1.5M samples it slightly surpassed MEDIC on NLP metrics.
Retrospective human evaluation (near-miss rates):
- Claude (best, ten-shot) produced 4.38× as many near-misses as MEDIC (95% CI 3.13, 6.64) on 1,200 prescriptions.
- T5-FineTuned (1.5M) produced 1.51× as many near-misses as MEDIC overall (95% CI 1.03, 2.31).
- Considering only cases where MEDIC's guardrails allowed generation (MEDIC Active, ≈80% of cases), T5-FineTuned (1.5M) outperformed MEDIC with a near-miss ratio of 0.58 (95% CI 0.33, 0.92); outside MEDIC Active, T5 suggestions were overconfident and error-prone.
- Dose/frequency near-miss ratios paralleled the overall findings with wider CIs; differences between MEDIC and T5-FineTuned (1.5M) were not statistically significant for this subset, while Claude again showed higher risk. (A sketch of how such rate ratios and intervals can be bootstrapped follows this list.)
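The figures above are rate ratios with 95% CIs. The paper's exact statistical procedure is not detailed in this summary; a common way to obtain such intervals is bootstrap resampling over the evaluated prescriptions, as in this generic sketch.

```python
import random

def bootstrap_rate_ratio(a, b, n_boot=2_000, seed=0):
    """Point estimate and percentile 95% CI for mean(a)/mean(b), where a and
    b are 0/1 near-miss indicators for the same prescriptions under two
    systems (e.g., a candidate model vs. MEDIC)."""
    rng = random.Random(seed)
    n = len(a)
    ratios = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]  # resample prescriptions
        rate_a = sum(a[i] for i in sample) / n
        rate_b = sum(b[i] for i in sample) / n
        if rate_b > 0:                                 # skip degenerate draws
            ratios.append(rate_a / rate_b)
    ratios.sort()
    point = (sum(a) / n) / (sum(b) / n)
    return point, ratios[int(0.025 * len(ratios))], ratios[int(0.975 * len(ratios))]

# Toy example: system A errs on 40 of 1,200 directions, system B on 20.
a = [1] * 40 + [0] * 1160
b = [1] * 20 + [0] * 1180
print(bootstrap_rate_ratio(a, b))  # ratio ≈ 2.0 with its bootstrap 95% CI
```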
Clinical severity of near-misses:
- No statistically significant difference in clinically severe near-misses between MEDIC and T5-FineTuned (P = 0.58).
- Claude (ten-shot) produced 5.87× as many clinically severe near-misses as MEDIC overall (95% CI 2.1, 19.0) and 4.36× as many in MEDIC Active cases (95% CI 1.44, 14.0). Errors included dangerous dosing/frequency changes (e.g., added insulin doses), route errors, and omission of critical timing.
Guardrails and error types:
- Among blocked suggestions, common guardrail triggers included incorrect verb (19.5%), multiple doses (19.1%), multiple frequencies (11.8%), incorrect route (11.8%), missing verb/frequency (10.7%), incorrect dose form (7.4%), and missing dose (5.9%), among others. Guardrail activations were distributed across GR1 (38.7%), GR2 (42.9%), GR3/GR4 (10.7%), and GR5 (7.7%).
Flagging performance:
- On 795 historically validated near-miss cases, the flagging module detected 95.1% of errors overall and was especially effective for dose quantity, route, frequency, and auxiliary discrepancies; less effective for verb and dose form due to catalog incompleteness.
Prospective deployment outcomes:
- 33% reduction in direction-related near-miss events (95% CI 26%, 40%).
- Suggestion coverage increased by 18.3% (95% CI 17.8%, 18.9%).
- Adoption by DE technicians increased by 28.5% (95% CI 28.1%, 29.0%).
- Post-adoption edits decreased by 44.3% (95% CI 43.2%, 45.4%).
Latency/Cost:
- MEDIC (≈200 ms) and T5 (≈1 s) met production SLA on CPUs; Claude/ChatGPT/Gemini incurred higher latency (~8 s) and usage costs.
Discussion
Findings show that combining an entity-extraction LLM with domain knowledge and deterministic guardrails can reduce clinically meaningful errors relative to generative LLM approaches, which are prone to hallucination and overconfidence. Although a massively fine-tuned generative model (T5-FineTuned 1.5M) slightly outperformed MEDIC on text-similarity metrics and performed well on MEDIC-Active cases, its inability to recognize difficult cases and abstain produced more near-misses overall. This highlights the limitation of BLEU/METEOR in safety-critical tasks and underscores the value of safety-aware system design. Prospective results demonstrate operational impact: fewer near-misses, higher suggestion coverage and adoption, and fewer edits, implying reduced pharmacist workload and improved throughput. The guardrail strategy, anchored in a curated medication catalog, mitigates hallucinations by enforcing self-consistency and halting unsafe generations. The system's CPU efficiency and low cost facilitate scalable deployment, and its data-efficient extraction approach (1,000 labeled samples plus augmentation) supports portability to other settings. Overall, the work addresses the core problem of unsafe direction transcription by aligning LLM capabilities with pharmacy logic and human oversight.
Conclusion
The study introduces MEDIC, a three-stage AI system that integrates pharmalexical normalization, DistilBERT-based entity extraction, and guardrail-enforced semantic assembly to generate safe, standardized medication directions and flag transcription errors. Compared with few-shot and large-scale fine-tuned generative baselines, MEDIC reduced near-miss events in retrospective review and achieved a 33% reduction during prospective deployment, while maintaining low latency and cost. Contributions include a domain-informed architecture, a curated medication catalog for safety enforcement, and a human-in-the-loop process for continuous improvement. Future work includes expanding to multi-line and more complex directions, incorporating modalities such as OCR and speech-to-text for non-electronic prescriptions, integrating patient feedback and outcomes, enhancing the catalog to improve detection of verb/dose-form errors, exploring reinforcement learning from human feedback, and leveraging modern LLMs as controlled overlays (e.g., for assembly assistance or user-facing chat interfaces) without compromising safety guardrails.
Limitations
Key limitations include: (1) Absence of direct patient feedback on clarity and outcomes; (2) Focus on electronic single-line directions (>98% of cases), with limited evaluation of multi-line or more complex instructions that carry higher risk; (3) Limited coverage of non-electronic media (fax, scans, oral prescriptions), which may be more error-prone; (4) Real-world performance variability due to human factors, system constraints, and data quality, leading to discrepancies between retrospective and prospective flagging accuracy; (5) Dependence on the completeness and accuracy of the medication catalog, which affected detection of verb and dose form errors; (6) Confidentiality constraints limited reporting of some absolute rates; (7) Generalizability beyond the studied pharmacy requires catalog adaptation and validation, though design aims for portability.