Psychology
Applying speech technologies to assess verbal memory in patients with serious mental illness
T. B. Holmlund, C. Chandler, et al.
The study addresses the challenge of efficiently and frequently assessing verbal episodic memory, a domain disproportionately impaired in schizophrenia and other serious mental illnesses (SMI). Traditional tests require trained personnel, have limited alternate forms, and are usually administered cross-sectionally in controlled settings, limiting longitudinal monitoring and personalization. The authors propose leveraging speech technologies to administer story recall tasks via smart devices, automatically transcribe responses, and algorithmically score recall by comparing the retelling to the original story. The aim is to create a transparent, interpretable, and scalable framework enabling frequent, remote monitoring of verbal memory and its relationship to clinical states, thereby supporting digital phenotyping and personalized medicine.
Prior work has established robust verbal memory impairments in schizophrenia and SMI and highlighted the potential of digital phenotyping in psychiatry. Traditional scoring emphasizes item/theme counts, which may not capture semantic fidelity. Earlier NLP methods (e.g., Latent Semantic Analysis) have been used to assess narrative memory and language coherence in psychiatric populations. Advances in word embeddings and distance metrics (e.g., Word Mover’s Distance) offer improved semantic comparison between texts. Automated assessment in related domains (e.g., essay grading, team performance modeling) and prior automated neuropsychological tools suggest feasibility. The study builds on this literature by applying modern semantic methods and ASR within an ambulatory memory assessment framework.
Design and participants: The study included 104 adults: 25 patients meeting U.S. federal definitions of SMI (mean age 49.7, SD 10.4; 52.2% female; diagnoses: schizophrenia N=16, major depressive disorder N=8, bipolar disorder N=1) and 79 presumed healthy undergraduates (mean age 21.7, SD 1.4; 62% female). Patients received SCID and BPRS assessments; BPRS factor scores (Affective, Agitation, Positive, Negative) indicated overall low-to-moderate symptom severity with variability. The study was IRB-approved, and all participants gave informed consent.

Task and administration: An iOS app presented 10 spoken passages (5 narratives, 5 instructional), each ~62–87 words (avg ~74), structurally similar to Wechsler Logical Memory. Each story had characters, a setting, a problem, and a resolution; instructional texts described procedural steps. Participants performed immediate and delayed recall for each passage, speaking into the device; audio was recorded. The protocol supports frequent, self-administered, ambulatory use.

Human ratings: 3–7 trained expert raters (backgrounds in psychology, cognitive neuroscience, or medicine) rated each response on a 0–6 scale for recall of concepts/themes. The average across raters per response served as ground truth. Inter-rater reliability was assessed via correlations to the mean rating and pairwise correlations.

Transcription: Each response was transcribed independently by two human transcribers; discrepancies were resolved, yielding a human word error rate (WER) of 7.2%. Automated transcriptions were produced via (1) a generic, cloud-based ASR (Google Speech-to-Text) and (2) a custom ASR built with the Kaldi toolkit, whose language model was tuned to the domain vocabulary (e.g., balloons, skateboards). All recordings were screened to avoid uploading personally identifiable/sensitive content to cloud services.

NLP features and modeling: Preprocessing included lowercasing and removal of punctuation and hesitation markers.
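The preprocessing step described above can be sketched in a few lines. The specific hesitation-marker list below is an assumption for illustration; the paper does not enumerate its markers.

```python
import re

# Hypothetical hesitation markers; the paper's exact list is not specified.
HESITATIONS = {"uh", "um", "er", "mm", "hmm"}

def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, drop hesitation markers; return tokens."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return [t for t in text.split() if t not in HESITATIONS]

print(preprocess("Um, the balloon... uh, it flew away!"))
# -> ['the', 'balloon', 'it', 'flew', 'away']
```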
Two features quantified recall–prompt similarity: (a) common word types (the count of unique words shared between retelling and prompt) and (b) semantic similarity via Word Mover's Distance (WMD) computed from 300-dimensional Word2Vec embeddings trained on Google News (Gensim). An ordinary least squares (OLS) linear regression combined these features to predict human ratings. Performance was evaluated as the Pearson correlation between predicted and average human ratings under 5-fold cross-validation. Group comparisons used two-sided Welch's t-tests with Holm correction.

Outcomes: Feasibility metrics included the proportion of usable recordings and response length. Primary accuracy metrics were correlations (predicted vs. human ratings), R^2, and ASR WER (overall and by group).
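The two features and the cross-validated OLS model can be sketched as follows. The WMD helper shows gensim's `KeyedVectors.wmdistance` call but is not executed here, since it requires the pretrained `word2vec-google-news-300` vectors; the feature values and ratings in the demo are synthetic, loosely seeded with the paper's reported coefficients (+0.15, -0.54), purely so the 5-fold loop runs end to end.

```python
import numpy as np

def common_word_types(prompt_tokens, recall_tokens):
    """Feature (a): count of unique word types shared with the prompt."""
    return len(set(prompt_tokens) & set(recall_tokens))

def wmd_to_prompt(prompt_tokens, recall_tokens, kv):
    """Feature (b): Word Mover's Distance via a gensim KeyedVectors model
    (e.g., pretrained 'word2vec-google-news-300'); lower = more similar."""
    return kv.wmdistance(prompt_tokens, recall_tokens)

def pearson(a, b):
    a = np.asarray(a, float) - np.mean(a)
    b = np.asarray(b, float) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Synthetic stand-ins for the two features and the mean human rating.
rng = np.random.default_rng(0)
n = 200
types = rng.uniform(5, 35, n)   # common word types per response
wmd = rng.uniform(0.8, 2.2, n)  # WMD per response
rating = np.clip(0.15 * types - 0.54 * wmd + rng.normal(0, 0.4, n), 0, 6)

X = np.column_stack([np.ones(n), types, wmd])  # intercept + two features
preds = np.empty(n)
order = rng.permutation(n)
for k in range(5):              # 5-fold cross-validation
    test = order[k::5]
    train = np.setdiff1d(order, test)
    beta, *_ = np.linalg.lstsq(X[train], rating[train], rcond=None)
    preds[test] = X[test] @ beta

print(f"cross-validated Pearson R = {pearson(preds, rating):.2f}")
```

The out-of-fold predictions are pooled before computing a single Pearson R, mirroring the paper's evaluation of predicted vs. average human ratings.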
Feasibility and response characteristics:
- Of 1035 recordings, 92% were processable (patients: 86%; healthy: 96%).
- Mean retelling length: overall ~61 words; healthy 62.2 (SD 21.4) vs. patients 48.7 (SD 22.4); Cohen's d = 0.8; t = -9.1; p < 0.001. Short responses (<10 words): healthy 5.4% vs. patients 19.7%.

Human ratings and reliability:
- Mean human rating (0–6): healthy 4.6 (SD 1.1) vs. patients 3.3 (SD 1.3); d = 1.1; t = 13.4; p < 0.001.
- Individual rater correlation to the gold standard: mean R = 0.83 (range 0.73–0.89). Average pairwise inter-rater correlation: R ≈ 0.73.

Automated scoring using human transcripts (hybrid):
- Common word types vs. human ratings: R = 0.82.
- WMD vs. human ratings: R = -0.81 (lower distance indicates greater similarity/better recall).
- Combined OLS model: R = 0.83 (range 0.74–0.90 across 5 folds), R^2 ≈ 0.69; coefficients: +0.15 (common word types), -0.54 (WMD). Predicted scores: healthy 4.6 (SD 0.9) vs. patients 3.4 (SD 0.9); d = 1.3; Welch's t significant, p < 0.001.

ASR performance and fully automated scoring:
- Generic ASR (Google) WER overall 23.3% (healthy 17.1%; patients 43.7%).
- Custom ASR (Kaldi, domain-tuned) WER overall 10.5% (healthy 6.2%; patients 24.8%), approaching human transcription error (7.2%).
- Prediction using generic ASR transcripts: correlation with human ratings R = 0.80 (range 0.74–0.88 across 5 folds).
- Prediction using custom ASR transcripts: correlation with human ratings R = 0.82 (range 0.74–0.88).
- Predicted scores from automated pipelines vs. those from human transcripts: R = 0.96–0.99. Variance explained decreased modestly from human transcripts (R^2 = 0.69) to ASR transcripts (R^2 = 0.67), a drop of two percentage points.

Group differences in feature measures (all p < 0.001):
- Common word types (human transcription): healthy 26.7 (SD 8.1) vs. patients 16.4 (SD 6.8); d = 1.4; t = 17.8.
- WMD (human transcription): healthy 1.3 (SD 0.4) vs. patients 1.7 (SD 0.5); d = -1.0; t = -12.0.
- Similar patterns held with generic and custom ASR transcripts.

Overall, automated scoring reached human-rater-level performance and was robust to ASR errors, supporting viability for clinical monitoring.
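The WER figures reported above follow the standard definition: word-level edit distance (substitutions + insertions + deletions) divided by the reference length. A minimal sketch via dynamic programming:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the balloon flew away", "the balloon blew away"))  # -> 0.25
```

One substituted word out of four reference words gives WER = 0.25; the paper's 7.2% human and 10.5% custom-ASR rates are corpus-level aggregates of the same quantity.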
The findings demonstrate that an end-to-end, speech-technology–based verbal memory assessment can be reliably administered via smart devices, including in outpatient and potentially remote contexts. Patients tolerated the task, and high-quality speech was captured in most trials. Automated scoring that combines word overlap with semantic similarity closely approximated expert human ratings, achieving correlations on par with individual raters. Importantly, performance remained strong even when using ASR transcripts, despite higher error rates, likely because ASR errors often affect non-essential words and semantic similarity metrics mitigate lexical variation. This supports the approach’s practicality for frequent, longitudinal monitoring of cognitive function and its relationship to clinical state in SMI. The work also underscores the value of modern semantic techniques (e.g., WMD) that capture gist-level recall beyond verbatim reproduction, aligning better with how humans evaluate narrative memory. The framework offers a foundation for scalable, objective, and explainable cognitive monitoring in psychiatry.
This study introduces and validates a practical, interpretable, and scalable pipeline for frequent assessment of verbal episodic memory using mobile administration, automated transcription, and NLP-based scoring. Automated metrics matched human-rater performance and were robust to ASR imperfections, enabling remote, repeated measurements suitable for clinical monitoring and digital phenotyping in SMI. Future work should: (1) establish sensitivity to within-person change and clinical meaningfulness over time; (2) expand feature sets (e.g., syntax, discourse structure, speech graph metrics, acoustic affect) to enhance validity and clinical relevance; (3) further tailor ASR to clinical speech characteristics; (4) evaluate generalizability across languages, tasks, and populations; and (5) implement privacy-preserving, compliant data architectures and clinical oversight mechanisms for safe deployment.
Limitations:
- Group differences: Patients and controls were not matched on age and education; thus group effects should not be interpreted as specific memory deficits. Future matched-control designs are needed.
- Generalizability: The task used a limited set of English passages and embeddings; transfer to other languages, cultural contexts, or test formats may require adaptation, especially for syntax-dependent features.
- ASR variability: Higher WER in patients (particularly with generic ASR) may vary by disorder characteristics, symptoms, or accents; while scoring was robust, specific subgroups may need ASR customization.
- Clinical validation: The study did not establish longitudinal sensitivity, test–retest reliability, or clinical meaningfulness at the individual level for ambulatory monitoring; further psychometric validation is required.
- Privacy/data governance: Cloud ASR usage raises privacy concerns; although recordings were screened, broader deployment requires compliant, secure data pipelines and monitoring.
- Setting and feasibility: While ambulatory administration is feasible, practical, technical, and legal challenges (e.g., supervision, device access, crisis monitoring) must be addressed for real-world use.