Using Mobile Data and Deep Models to Assess Auditory Verbal Hallucinations

Psychology

S. Mirjafari, A. T. Campbell, et al.

This research by Shayan Mirjafari, Andrew T. Campbell, Subigya Nepal, and Weichen Wang investigates the intersection of mobile data and deep learning to assess auditory verbal hallucinations. Through ecological momentary assessments and deep neural networks, the study demonstrates the potential of mobile technology for real-time AVH evaluation.

Introduction
The paper investigates whether multimodal mobile data and deep learning can assess the valence of Auditory Verbal Hallucinations (AVH) in daily life. AVH, common in schizophrenia-spectrum disorders but also present across other conditions and even in non-clinical populations, can produce significant distress. Prior work highlights AVH as part of a continuum of psychotic experiences and emphasizes the need to identify individuals requiring clinical care. Concurrently, mobile sensing and ecological momentary assessment (EMA) enable in-situ, real-time measurement of behavior and experiences, while advances in NLP suggest linguistic markers may indicate psychosis risk. This study aims to predict the momentary valence of hallucinated voices (from “Not at all” to “Extremely/A lot”) using linguistic features from audio diaries and contextual features from passive phone sensing, leveraging transfer learning and data fusion to address a complex, multimodal prediction problem.
Literature Review
The authors situate their work within several strands of research: (1) Mobile sensing and EMA for mental health have shown promise in detecting or monitoring anxiety, depression, bipolar disorder, and schizophrenia, with EMA reducing recall bias compared to retrospective measures. (2) AVH research under the RDoC framework emphasizes neurobiological and cognitive aspects, and EMA-based studies have linked hallucinations to time of day, worry, and emotional state. (3) NLP and speech analyses identify language markers predictive of psychosis and schizophrenia-spectrum disorders, with deep language models (e.g., BERT) capturing clinically relevant linguistic phenotypes. (4) Deep learning and transfer learning increasingly support mobile-sensing applications, outperforming hand-crafted features in various domains. The paper extends this literature by integrating textual and auditory embeddings from audio diaries with transformed passive-sensing features (via VGGish and ROCKET) to assess AVH valence in situ.
Methodology
Study design and participants: N=435 U.S.-based participants who hear voices were remotely recruited via a web-based process (contextual Google ads), screened on clinical criteria, required to use Android, and consented under IRB oversight. The study ran for 30 days, with compensation upon completion. Demographics: 232 women, 189 men, and 14 participants of other genders; age: 27 under 25, 307 between 25 and 50, 101 over 50. Participants reported multiple clinical diagnoses, with depression most frequent.

Data collection: Participants completed EMAs four times daily at randomized windows (9–11 am, 12–2 pm, 3–5 pm, 6–8 pm), plus optional self-initiated entries. The EMA first asked whether the participant was experiencing voices "right now." If yes, four follow-up items were shown, each with four response categories:
(1) How NEGATIVE is the content?
(2) How LOUD are the voices?
(3) How much CONTROL do you have?
(4) How much POWER do the voices have?
Scale: Not at all, A little, Moderate/Moderately, Extremely/A lot. These items serve as valence-related labels. In total, participants submitted 3,838 self-reports. When voices were reported, participants could optionally record an audio diary of up to 3 minutes describing the hallucination content.

Passive sensing: A background Android app (based on StudentLife) collected phone usage (unlock duration, number of unlocks), conversational proxies (audio amplitude, conversation duration, number of conversations via an on-device speech-presence classifier, without recording raw audio), and GPS-based mobility (significant places via DBSCAN; number of places and distance traveled). These features were computed hourly.

Supervision and preprocessing: EMA categories were one-hot encoded per question. For sensing features, the 24 hours prior to each EMA response were used to construct features, reflecting proximity to the momentary AVH experience.
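The one-hot label encoding described above can be sketched in a few lines of Python. This is a minimal illustration: the category names follow the EMA scales reported in the paper, while the function and constant names are our own.

```python
# EMA response categories, per the four-point scale described in the study
# ("Moderate/Moderately" and "Extremely/A lot" abbreviated for clarity)
CATEGORIES = ["Not at all", "A little", "Moderate", "Extremely"]

def one_hot(response):
    """Encode a single EMA answer as a one-hot vector over the four categories."""
    vec = [0] * len(CATEGORIES)
    vec[CATEGORIES.index(response)] = 1
    return vec

print(one_hot("A little"))  # [0, 1, 0, 0]
```

Each of the four EMA questions is encoded independently this way, yielding the four-class targets used in training.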
Feature extraction:
- Audio diary (auditory): VGGish (pretrained on large-scale audio) produced a 128-D embedding per 0.96 s log-mel patch; embeddings were averaged across patches to yield one 128-D feature per diary.
- Audio diary (textual): Diaries were transcribed via speech-to-text. BERT generated contextual embeddings; the 12×768 encoder outputs per token were summed to a 768-D sentence representation, then averaged across sentences to a 768-D transcript vector.
- Mobile sensing (two approaches): (1) VGGish-based transformation: each 24-hour hourly time series per sensing stream was zero-meaned, min-max scaled to [-1, 1], then expanded into a 24-second, 44.1 kHz time-domain signal by sampling Gaussian noise around each hourly value (44,100 samples per hourly point). Spectrograms of these signals were fed into VGGish to obtain a 128-D embedding per sensing stream. (2) ROCKET: 64 random 1-D convolution kernels per stream generated 128 features (the maximum and the proportion of positive values per kernel), chosen for parity with the VGGish dimensionality.

Modeling and training:
- Data split: for each participant, the earliest 60% of data was used for training, the next 20% for validation, and the final 20% for testing; splits were held constant across experiments.
- Auditory-textual model: input size 896 (128 VGGish + 768 BERT); fully connected network with batch normalization, dropout, and ReLU activations; cross-entropy loss; Adam optimizer; batch size 64; 80 epochs; output layer of size 4 with sigmoid for multi-class probabilities.
- Sensing model: input size 896 (7 sensing streams × 128 features each, from VGGish or ROCKET); fully connected network with tanh activations, batch normalization, and dropout; cross-entropy; Adam; batch size 42; 120 epochs.
- Fusion (hybrid) model: transferred features from the trained auditory-textual and sensing models (the outputs of their 32-unit layers) were concatenated into a 64-D vector and fed to a small fully connected network (tanh/ReLU) with softmax output; Adam; cross-entropy; batch size 32; 50 epochs.
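The VGGish-based sensing transformation can be sketched as follows with NumPy. This is a hedged illustration of the preprocessing steps as described (zero-mean, min-max scale, Gaussian noise around each hourly value); the noise standard deviation and the function name are our assumptions, not values reported by the paper.

```python
import numpy as np

def sensing_to_waveform(hourly_values, sample_rate=44100, noise_std=0.05, seed=0):
    """Expand a 24-point hourly sensing series into a 24 s, 44.1 kHz signal.

    Follows the preprocessing described in the paper: zero-mean, min-max
    scale to [-1, 1], then sample Gaussian noise around each hourly value
    (44,100 samples per point). noise_std is an illustrative assumption.
    """
    x = np.asarray(hourly_values, dtype=float)
    x = x - x.mean()                      # zero-mean
    span = x.max() - x.min()
    if span > 0:
        x = 2 * (x - x.min()) / span - 1  # min-max scale to [-1, 1]
    rng = np.random.default_rng(seed)
    # one second of noise samples centred on each hourly value
    return np.concatenate([rng.normal(v, noise_std, sample_rate) for v in x])

wave = sensing_to_waveform(np.arange(24))
print(wave.shape)  # (1058400,) -> 24 hours become 24 seconds of audio samples
```

The resulting waveform is then converted to a log-mel spectrogram and passed through VGGish, reusing an audio model as a generic time-series feature extractor.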
- Overall model: trained from scratch on all raw features concatenated (1792-D); a deeper fully connected network with batch normalization and dropout; cross-entropy; Adam; batch size 64; 150 epochs.

Baselines: XGBoost models trained on (a) all combined raw features (overall), (b) transferred features from the trained auditory-textual and sensing models, and (c) transferred features from the 3rd hidden layer of the overall model.

Evaluation: F1 score (top-1 and top-2) on the held-out test set for each EMA question separately (negativeness, loudness, control, power). Chance baselines were computed by always predicting the most frequent class in the test set.
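The paper reports top-1 and top-2 F1 scores; the top-k notion itself can be illustrated with a simple hit-rate sketch. This is our own helper for exposition, not the authors' evaluation code: a prediction counts as correct if the true class is among the model's k highest-probability classes.

```python
import numpy as np

def top_k_hit_rate(probs, labels, k=2):
    """Fraction of samples whose true class is among the k most probable classes."""
    topk = np.argsort(probs, axis=1)[:, -k:]  # indices of the k largest probabilities
    return float(np.mean([y in row for y, row in zip(labels, topk)]))

probs = np.array([[0.10, 0.50, 0.30, 0.10],   # top two classes: 1 and 2
                  [0.60, 0.20, 0.10, 0.10]])  # top two classes: 0 and 1
labels = np.array([2, 0])
print(top_k_hit_rate(probs, labels, k=2))  # 1.0
print(top_k_hit_rate(probs, labels, k=1))  # 0.5
```

Top-2 metrics are naturally more forgiving than top-1, which is why the reported top-2 F1 scores (68–74%) sit well above the top-1 scores (47–54%).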
Key Findings
- The hybrid fusion model outperformed all other models (auditory-textual only, sensing only, the overall from-scratch model, and the XGBoost baselines) across EMA questions.
- Predictive performance (hybrid model):
  • Negativeness: top-1 F1 = 54%, top-2 F1 = 72%
  • Loudness: top-1 F1 = 51%, top-2 F1 = 74%
  • Control: top-1 F1 = 48%, top-2 F1 = 68%
  • Power: top-1 F1 = 47%, top-2 F1 = 70%
- Chance top-1 F1 (most-frequent-class predictor) on the test set: 29% (negativeness), 38% (loudness), 41% (control), 28% (power).
- Sensing feature transformations: models trained on VGGish-transformed sensing features outperformed those using ROCKET features on both top-1 and top-2 F1. Training on untransformed sensing features yielded weaker performance (best top-1 F1 around 41% across questions), even with fusion.
- Data characteristics: 3,838 EMA self-reports from 435 participants; label distributions are imbalanced across categories.
Discussion
The study demonstrates that multimodal mobile data—linguistic content from audio diaries and contextual patterns from passive sensing—can be fused via transfer learning to meaningfully assess AVH valence in situ. Using 24 hours of sensing data prior to each EMA is justified by the momentary nature of the AVH query, capturing proximal behavioral signals. Transforming low-dimensional, hourly sensing time series into higher-dimensional representations (via spectrograms processed by VGGish) improves predictive power, likely because these representations capture nuanced patterns that simple aggregates miss. VGGish-based transformations outperform ROCKET in this setting, possibly because VGGish’s learned filters provide stronger representations when the number of transformed features is modest, whereas ROCKET may require more kernels to match performance. Model regularization (notably high dropout in early layers) was beneficial given the relatively small training sample versus high feature dimensionality. The authors discuss clinical implications: a deployed mobile system could continuously assess AVH valence and alert providers, enabling timely intervention. They also outline privacy-preserving deployment by performing feature extraction on-device (running frozen portions of pretrained models) so that raw data need not leave the phone.
Conclusion
The authors studied 435 individuals who report hearing voices and collected multimodal data via EMAs, optional audio diaries, and passive sensing. They framed AVH assessment as a four-class prediction problem across four EMA dimensions (negativeness, loudness, control, power). Leveraging transfer learning (VGGish for audio and sensing spectrograms, BERT for transcripts) and data fusion, their hybrid neural model achieved 54% top-1 and 72% top-2 F1 on the negativeness dimension and comparable performance on the other dimensions, surpassing chance and alternative models. They introduced a novel transformation of mobile sensing time series into spectrograms to extract higher-dimensional features with VGGish, which improved performance over ROCKET-based and untransformed sensing features. As a proof of concept, this is the first demonstration of using mobile data to assess AVH valence, suggesting a path toward real-time, privacy-preserving clinical monitoring and intervention systems.
Limitations
- The training sample is relatively small compared with the high dimensionality of the multimodal features, necessitating strong regularization (e.g., high dropout) to prevent issues such as exploding gradients.
- Label distributions across EMA categories are imbalanced, affecting chance baselines and potentially model calibration.
- The comparison with ROCKET used a limited number of random kernels/features (128) to match the VGGish dimensionality; ROCKET may perform differently with many more kernels.
- The work is presented as a proof of concept; broader generalizability and clinical validation beyond the reported cohort are not established within this study.