Medicine and Health
Large language models for preventing medication direction errors in online pharmacies
C. Pais, J. Liu, et al.
Medication direction errors in pharmacies can be dangerous. Cristobal Pais, Jianfeng Liu, Robert Voigt, Vin Gupta, Elizabeth Wade, and Mohsen Bayati address this problem with MEDIC, a system that pairs a fine-tuned language model for entity extraction with pharmacy domain knowledge and deterministic safety guardrails to generate standardized directions, improving prescription accuracy and significantly reducing near-miss events in a production online pharmacy.
Introduction
The study addresses medication direction errors (particularly incorrect dosage and frequency) that contribute substantially to preventable adverse drug events and costs in the pharmacy workflow. Errors commonly arise during data entry, when prescriber free-text directions are transcribed into standardized patient-facing instructions, a process complicated by abbreviations, typos, ambiguous or incomplete entries, and heterogeneous style guidelines across systems and countries. Electronic health records (EHRs), while structured, still permit free text, perpetuating inconsistencies. High-risk examples include methotrexate weekly dosing misentered as daily, which can cause severe harm. The research question is whether integrating domain knowledge and safety guardrails with LLM-based NLP can reliably interpret prescriber directions and generate standardized, pharmacist-quality directions that reduce near-miss events in online pharmacy operations. The purpose is to improve accuracy and safety in the data entry (DE) phase, thereby reducing pharmacist workload and near-misses, with a human-in-the-loop design to ensure safety.
Literature Review
The paper situates its work within literature on medication errors and patient safety, noting millions of preventable adverse drug events annually and significant dispensing error rates in community pharmacies. It references challenges introduced by EHR free-text fields and variability in prescribing standards. Within NLP/LLM literature, the authors note LLMs’ strong text capabilities and methods for adaptation (fine-tuning and prompting), but emphasize risks of hallucination and overconfidence in high-stakes clinical contexts. Prior work on clinical decision support and pharmacy-direction simplification (e.g., neural machine translation approaches) informs the benchmarks. Recent surveys highlight limitations of standard NLP metrics in capturing clinical severity, motivating human evaluation and safety guardrails.
Methodology
Design: MEDIC (Medication Direction Copilot) is a three-stage system with human-in-the-loop oversight aimed at the pharmacy DE and pharmacist verification (PV) stages. It both suggests standardized directions and flags discrepancies between prescriber directions and technician-entered directions.
Data: Approximately 1.6M single-line historical Amazon Pharmacy directions were processed and split into subsets: D_H (1,000) for human labeling; D_Train (~1.58M); D_Test (20,000); D_Eval (1,200). A medication catalog (D_MedCat) was constructed from RxNorm, OpenFDA, and Amazon’s drug catalog (~99% coverage), including medication attributes and default/required components (e.g., verb, route) to power deterministic guardrails.
MEDIC pipeline:
- Stage 1: Pharmalexical normalization. A rule-based preprocessing module applies hundreds of pharmacist-derived transformation rules to standardize raw prescriber text (normalizing abbreviations, correcting typos, standardizing phrasing), producing clean inputs for the next stage (a normalization sketch follows this list).
- Stage 2: AI-powered extraction. A DistilBERT-based named entity recognition model is fine-tuned to extract core components (verb, dose, route, frequency) and auxiliary entities (indication, action, max dose, time, period). Training used D_H (1,000 human-labeled samples, D_HL), augmented synthetically to D_HLA (10,000); a separate synthetic test set, D_HLA^T (10,000), assessed extraction quality. Hyperparameters included batch size 16, learning rate 1e-5, 3 epochs, and weight decay 1e-5, with final settings selected by Bayesian optimization (a fine-tuning sketch follows this list). On D_HLA^T the model achieved precision/recall/F1 > 0.99, with only six misclassifications across 160,484 entities. Sensitivity analyses showed augmentation was critical: F1 ≈ 0.70 without augmentation, ≈0.90 with 5,000 augmented samples, and no significant gains beyond 10,000.
- Stage 3: Semantic assembly and safety enforcement. Extracted components are assembled into standardized directions, filling missing required elements from D_MedCat when available and ordering components as verb, dose, route, frequency, then auxiliary subcomponents. Safety guardrails halt suggestion generation when: GR1) extracted components conflict with D_MedCat (e.g., a verb incompatible with the dosage form); GR2) any core component has multiple values (suggesting multi-line complexity); GR3) a dose is present without a verb (where the catalog requires one); GR4) the frequency is missing; GR5) the dose is missing when the form is a tablet or capsule. If any guardrail is triggered, MEDIC abstains from suggesting a direction (a guardrail sketch follows this list).
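To make Stage 1 concrete, here is a minimal sketch of rule-based pharmalexical normalization, assuming a small table of regex substitutions. The production module applies hundreds of pharmacist-derived rules; the names below (NORMALIZATION_RULES, normalize) and the specific rules are illustrative only.

```python
import re

# Hypothetical excerpt of pharmacist-derived rewrite rules: each regex maps
# a common abbreviation or shorthand to standardized phrasing. The real
# module applies hundreds of such rules; these few are illustrative.
NORMALIZATION_RULES = [
    (r"\bpo\b", "by mouth"),       # route abbreviation
    (r"\bbid\b", "twice daily"),   # frequency abbreviation
    (r"\bqd\b", "once daily"),
    (r"\bprn\b", "as needed"),
    (r"\btabs?\b", "tablet"),      # "tab"/"tabs" -> dose form
    (r"\s+", " "),                 # collapse repeated whitespace
]

def normalize(raw_direction: str) -> str:
    """Apply rule-based pharmalexical normalization to raw prescriber text."""
    text = raw_direction.lower().strip()
    for pattern, replacement in NORMALIZATION_RULES:
        text = re.sub(pattern, replacement, text)
    return text

print(normalize("Take 1 tab PO BID PRN"))
# -> "take 1 tablet by mouth twice daily as needed"
```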
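The Stage 2 extractor can be sketched with the Hugging Face transformers Trainer, using the hyperparameters reported above. The label scheme (plain tags rather than BIO), the toy sample, and the dataset class are assumptions standing in for the paper's augmented D_HLA training set.

```python
import torch
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Core and auxiliary components named in the paper; the exact label
# scheme used in the original work is an assumption of this sketch.
LABELS = ["O", "VERB", "DOSE", "ROUTE", "FREQUENCY",
          "INDICATION", "ACTION", "MAX_DOSE", "TIME", "PERIOD"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(LABELS))

# One toy sample standing in for the 10,000-sample augmented set D_HLA.
enc = tokenizer("take 1 tablet by mouth twice daily", return_tensors="pt")
# Label ids aligned to [CLS] take 1 tablet by mouth twice daily [SEP];
# -100 masks special tokens from the loss (the usual convention).
label_ids = torch.tensor([[-100, 1, 2, 2, 3, 3, 4, 4, -100]])

class ToyDirections(torch.utils.data.Dataset):
    def __len__(self):
        return 1
    def __getitem__(self, i):
        return {"input_ids": enc["input_ids"][0],
                "attention_mask": enc["attention_mask"][0],
                "labels": label_ids[0]}

# Hyperparameters as reported: batch size 16, lr 1e-5, 3 epochs,
# weight decay 1e-5 (final settings chosen by Bayesian optimization).
args = TrainingArguments(output_dir="medic-ner",
                         per_device_train_batch_size=16,
                         learning_rate=1e-5,
                         num_train_epochs=3,
                         weight_decay=1e-5,
                         report_to="none")

Trainer(model=model, args=args, train_dataset=ToyDirections()).train()
```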
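And a minimal sketch of the Stage 3 guardrails GR1–GR5, assuming `components` holds the per-entity values extracted in Stage 2 and `catalog` is the medication's D_MedCat entry; the field names are hypothetical, not the production interface.

```python
def guardrails_pass(components: dict, catalog: dict) -> bool:
    """Return True only if it is safe to emit a suggested direction."""
    # GR1: extracted components must not conflict with the catalog,
    # e.g. the verb must be compatible with the dosage form.
    verb = components.get("verb", [])
    if verb and verb[0] not in catalog.get("allowed_verbs", []):
        return False
    # GR2: multiple values for any core component suggest a complex,
    # possibly multi-line direction; abstain rather than guess.
    for core in ("verb", "dose", "route", "frequency"):
        if len(components.get(core, [])) > 1:
            return False
    # GR3: a dose without a verb is ambiguous when the catalog says
    # the verb is a required component.
    if components.get("dose") and not verb and catalog.get("verb_required", True):
        return False
    # GR4: a direction without a frequency is unsafe to standardize.
    if not components.get("frequency"):
        return False
    # GR5: tablets and capsules must carry an explicit dose.
    if catalog.get("dose_form") in ("tablet", "capsule") and not components.get("dose"):
        return False
    return True

components = {"verb": ["take"], "dose": ["1 tablet"],
              "route": ["by mouth"], "frequency": ["twice daily"]}
catalog = {"allowed_verbs": ["take"], "dose_form": "tablet", "verb_required": True}
print(guardrails_pass(components, catalog))  # True: safe to suggest
```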
Flagging function: Both the prescriber direction and the DE-entered direction undergo normalization and extraction; their components are then semantically compared (standardized against D_MedCat) to flag discrepancies in real time (a comparison sketch follows).
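A minimal sketch of the flagging comparison, assuming both directions have already passed through Stages 1 and 2. The real system compares components semantically after standardization against D_MedCat; this sketch uses exact matching for brevity.

```python
def flag_discrepancies(prescriber: dict, entered: dict) -> list[str]:
    """Compare per-component values and return the components that differ."""
    flags = []
    for component in ("verb", "dose", "route", "frequency"):
        if prescriber.get(component) != entered.get(component):
            flags.append(component)
    return flags

# Echoes the methotrexate example: a weekly direction entered as daily.
prescriber = {"verb": "take", "dose": "1 tablet",
              "route": "by mouth", "frequency": "once weekly"}
entered = {"verb": "take", "dose": "1 tablet",
           "route": "by mouth", "frequency": "once daily"}
print(flag_discrepancies(prescriber, entered))  # ['frequency'] -> near-miss caught
```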
Benchmarks:
- T5-FineTuned: a T5-base model fine-tuned on n pairs of (raw prescriber direction → pharmacist-verified direction) from D_Train, with n ∈ {100, 1k, 10k, 100k, 1.5M}; feeding normalized inputs was also evaluated. Latency ≈ 1 s.
- Claude (Anthropic v2.1): zero- and few-shot prompting (0, 5, or 10 examples). Latency ≈ 7.6–8.2 s (a prompting sketch follows this list).
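As a sketch of the few-shot benchmark, the following uses the Anthropic Python SDK's Messages API. The model id, examples, and prompt wording are illustrative, not the paper's exact prompt.

```python
import anthropic

# Hypothetical few-shot exemplars of (raw direction, standardized direction).
FEW_SHOT = [
    ("1 tab po qd", "Take 1 tablet by mouth once daily."),
    ("ii gtts ou bid", "Instill 2 drops in both eyes twice daily."),
]

def suggest_direction(raw: str) -> str:
    """Prompt Claude to rewrite a raw direction in standardized form."""
    examples = "\n".join(f"Input: {i}\nOutput: {o}" for i, o in FEW_SHOT)
    prompt = ("Rewrite the prescriber direction as a standardized, "
              "patient-facing pharmacy direction.\n\n"
              f"{examples}\n\nInput: {raw}\nOutput:")
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-2.1",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text.strip()
```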
Evaluations:
- Retrospective NLP metrics: BLEU and METEOR on D_Eval (1,200) and D_Test (20,000), plus a rule-based baseline (Stage 1 only); a metric-computation sketch follows this list.
- Retrospective human review: Pharmacists identified suggestions that would constitute near-miss events if used; also assessed clinical severity of near-misses.
- Prospective deployment: Before-after comparison in production against the then-active baseline system. Primary metric: near-miss rate on directions. Secondary: suggestion coverage, adoption rate by DE techs, post-adoption edit ratio. Latency and cost were monitored. Human-in-the-loop mechanisms routed problematic cases to a review queue for catalog and data updates and periodic retraining.
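A minimal sketch of the retrospective text-similarity metrics using NLTK. The paper's exact BLEU/METEOR configuration (n-gram weights, smoothing) is not specified here, so the settings below are assumptions.

```python
import nltk
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet for synonym matching

# One (reference, hypothesis) pair standing in for D_Eval/D_Test; references
# are pharmacist-verified directions, hypotheses are system suggestions.
references = [["take 1 tablet by mouth twice daily".split()]]
hypotheses = ["take one tablet by mouth twice daily".split()]

bleu = corpus_bleu(references, hypotheses,
                   smoothing_function=SmoothingFunction().method1)
meteor = sum(meteor_score(refs, hyp)
             for refs, hyp in zip(references, hypotheses)) / len(hypotheses)
print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}")
```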
Runtime/Cost: MEDIC average latency ~200 ms on CPU; T5-FineTuned ~1 s on CPU; Claude/ChatGPT/Gemini ~7.6–8.2 s with usage costs. MEDIC and T5 incur no per-inference usage cost and run on low-cost CPUs.
Key Findings
Retrospective NLP metrics:
- On D_Eval (1,200), T5-FineTuned (1.5M) slightly outperformed MEDIC in BLEU/METEOR; Claude variants lagged behind both. On D_Test (20,000), larger fine-tuning datasets improved T5 performance; with only 100 samples, T5 underperformed the rule-based baseline; with 1.5M samples it slightly surpassed MEDIC on NLP metrics.
Retrospective human evaluation (near-miss rates):
- Claude (best, ten-shot) produced 4.38× as many near-misses as MEDIC (95% CI 3.13, 6.64) on 1,200 prescriptions.
- T5-FineTuned (1.5M) produced 1.51× as many near-misses as MEDIC overall (95% CI 1.03, 2.31).
- Considering only cases where MEDIC's guardrails allowed generation (MEDIC Active, ≈80% of cases), T5-FineTuned (1.5M) outperformed MEDIC with a near-miss ratio of 0.58 (95% CI 0.33, 0.92); outside MEDIC Active, T5 suggestions were overconfident and error-prone.
- Dose/frequency near-miss ratios paralleled the overall findings with wider CIs; differences between MEDIC and T5-FineTuned (1.5M) were not statistically significant for this subset, while Claude again showed higher risk. (A sketch of how such rate ratios and intervals can be bootstrapped follows this list.)
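The figures above are rate ratios with 95% CIs. The paper's exact statistical procedure is not detailed in this summary; a common way to obtain such intervals is bootstrap resampling over the evaluated prescriptions, as in this generic sketch.

```python
import random

def bootstrap_rate_ratio(a, b, n_boot=2_000, seed=0):
    """Point estimate and percentile 95% CI for mean(a)/mean(b), where a and
    b are 0/1 near-miss indicators for the same prescriptions under two
    systems (e.g., a candidate model vs. MEDIC)."""
    rng = random.Random(seed)
    n = len(a)
    ratios = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]  # resample prescriptions
        rate_a = sum(a[i] for i in sample) / n
        rate_b = sum(b[i] for i in sample) / n
        if rate_b > 0:                                 # skip degenerate draws
            ratios.append(rate_a / rate_b)
    ratios.sort()
    point = (sum(a) / n) / (sum(b) / n)
    return point, ratios[int(0.025 * len(ratios))], ratios[int(0.975 * len(ratios))]

# Toy example: system A errs on 40 of 1,200 directions, system B on 20.
a = [1] * 40 + [0] * 1160
b = [1] * 20 + [0] * 1180
print(bootstrap_rate_ratio(a, b))  # ratio ≈ 2.0 with its bootstrap 95% CI
```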
Clinical severity of near-misses:
- No statistically significant difference in clinically severe near-misses between MEDIC and T5-FineTuned (P = 0.58).
- Claude (ten-shot) produced 5.87× as many clinically severe near-misses as MEDIC overall (95% CI 2.1, 19.0) and 4.36× as many in MEDIC Active cases (95% CI 1.44, 14.0). Errors included dangerous dosing/frequency changes (e.g., added insulin doses), route errors, and omission of critical timing.
Guardrails and error types:
- Among blocked suggestions, common guardrail triggers included incorrect verb (19.5%), multiple doses (19.1%), multiple frequencies (11.8%), incorrect route (11.8%), missing verb/frequency (10.7%), incorrect dose form (7.4%), and missing dose (5.9%), among others. Guardrail activations were distributed across GR1 (38.7%), GR2 (42.9%), GR3/GR4 (10.7%), and GR5 (7.7%).
Flagging performance:
- On 795 historically validated near-miss cases, the flagging module detected 95.1% of errors overall and was especially effective for dose quantity, route, frequency, and auxiliary discrepancies; less effective for verb and dose form due to catalog incompleteness.
Prospective deployment outcomes:
- 33% reduction in direction-related near-miss events (95% CI 26%, 40%).
- Suggestion coverage increased by 18.3% (95% CI 17.8%, 18.9%).
- Adoption by DE technicians increased by 28.5% (95% CI 28.1%, 29.0%).
- Post-adoption edits decreased by 44.3% (95% CI 43.2%, 45.4%).
Latency/Cost:
- MEDIC (≈200 ms) and T5 (≈1 s) met production SLA on CPUs; Claude/ChatGPT/Gemini incurred higher latency (~8 s) and usage costs.
Discussion
Findings show that combining an entity-extraction LLM with domain knowledge and deterministic guardrails can reduce clinically meaningful errors relative to generative LLM approaches, which are prone to hallucination and overconfidence. Although a massively fine-tuned generative model (T5-FineTuned 1.5M) slightly outperformed MEDIC on text-similarity metrics and performed well on MEDIC-Active cases, its inability to recognize difficult cases and abstain produced more near-misses overall. This highlights the limitation of BLEU/METEOR in safety-critical tasks and underscores the value of safety-aware system design. Prospective results demonstrate operational impact: fewer near-misses, higher suggestion coverage and adoption, and fewer edits, implying reduced pharmacist workload and improved throughput. The guardrail strategy, anchored in a curated medication catalog, mitigates hallucinations by enforcing self-consistency and halting unsafe generations. The system's CPU efficiency and low cost facilitate scalable deployment, and its data-efficient extraction approach (1,000 labeled samples plus augmentation) supports portability to other settings. Overall, the work addresses the core problem of unsafe direction transcription by aligning LLM capabilities with pharmacy logic and human oversight.
Conclusion
The study introduces MEDIC, a three-stage AI system that integrates pharmalexical normalization, DistilBERT-based entity extraction, and guardrail-enforced semantic assembly to generate safe, standardized medication directions and flag transcription errors. Compared with few-shot and large-scale fine-tuned generative baselines, MEDIC reduced near-miss events in retrospective review and achieved a 33% reduction during prospective deployment, while maintaining low latency and cost. Contributions include a domain-informed architecture, a curated medication catalog for safety enforcement, and a human-in-the-loop process for continuous improvement. Future work includes expanding to multi-line and more complex directions, incorporating modalities such as OCR and speech-to-text for non-electronic prescriptions, integrating patient feedback and outcomes, enhancing the catalog to improve detection of verb/dose-form errors, exploring reinforcement learning from human feedback, and leveraging modern LLMs as controlled overlays (e.g., for assembly assistance or user-facing chat interfaces) without compromising safety guardrails.
Limitations
Key limitations include: (1) Absence of direct patient feedback on clarity and outcomes; (2) Focus on electronic single-line directions (>98% of cases), with limited evaluation of multi-line or more complex instructions that carry higher risk; (3) Limited coverage of non-electronic media (fax, scans, oral prescriptions), which may be more error-prone; (4) Real-world performance variability due to human factors, system constraints, and data quality, leading to discrepancies between retrospective and prospective flagging accuracy; (5) Dependence on the completeness and accuracy of the medication catalog, which affected detection of verb and dose form errors; (6) Confidentiality constraints limited reporting of some absolute rates; (7) Generalizability beyond the studied pharmacy requires catalog adaptation and validation, though design aims for portability.