Using Mobile Data and Deep Models to Assess Auditory Verbal Hallucinations

Psychology

S. Mirjafari, A. T. Campbell, et al.

This research by Shayan Mirjafari, Andrew T. Campbell, Subigya Nepal, and Weichen Wang investigates the intersection of mobile data and deep learning for assessing auditory verbal hallucinations (AVH). Combining ecological momentary assessments with deep neural networks, the study demonstrates the potential of mobile technology for real-time AVH evaluation.

Introduction
Auditory verbal hallucinations (AVH), the perception of hearing voices that are not real, significantly impact individuals, causing distress and impairment. While prevalent in mental illnesses such as schizophrenia and bipolar disorder, AVH also occur in mentally healthy individuals. The valence (positivity or negativity) of these voices is a crucial indicator of the severity of the condition. Traditional AVH research often relies on retrospective measures, which suffer from recall bias. This study investigates the feasibility of using mobile sensing, which offers both passive data collection (location, phone usage, audio cues) and active data collection (ecological momentary assessments, or EMA, and audio diaries), to address this limitation. The researchers aim to develop a model that predicts AVH valence from these multimodal data sources in real time, potentially enabling timely interventions. Mobile sensing has proven useful in detecting early signs of various mental health conditions, and linguistic analysis has shown promise in identifying subtle markers of mental illness in everyday language. This study leverages both advances to create a novel approach to AVH assessment. The researchers hypothesized that combining linguistic cues from audio diaries (analyzed using natural language processing techniques) with contextual cues from mobile sensing data could accurately predict AVH valence.
Literature Review
Existing research on AVH often employs retrospective methods, which can be unreliable due to recall bias. Ecological momentary assessments (EMA) provide a more real-time approach, and studies have shown correlations between time of day, worry, and emotional states and the occurrence of hallucinations. Integrating mobile sensing adds the potential to capture behavioral data for a more comprehensive understanding. Several studies have demonstrated the effectiveness of mobile sensing in detecting early signs of mental illnesses including anxiety, depression, bipolar disorder, and schizophrenia. Machine learning, especially deep learning, offers the ability to identify complex patterns in these data without manual feature engineering. In natural language processing (NLP), methods such as BERT have shown potential to characterize language phenotypes in psychosis. Previous work has demonstrated success in predicting psychosis onset from linguistic features and in using pre-trained BERT models to improve performance on clinical NLP tasks. This research builds on these methodologies, integrating real-time data collection with advanced machine learning techniques.
Methodology
This study enrolled, through online advertising, 435 participants who self-reported experiencing AVH. Participants were screened against clinical criteria and were required to own an Android phone. The study lasted 30 days, and participants were compensated $125. Ethical safeguards were in place, including IRB approval and informed consent procedures. Participants used a custom Android application for data collection. Ecological momentary assessments (EMA) prompted participants four times a day to rate their AVH experience on four-point scales for the negativity, loudness, control, and power of the voices. Participants who reported experiencing AVH could optionally record a short audio diary describing the content. The app passively collected data on location (GPS), phone usage (unlocks, duration), and audio (speech amplitude, conversation duration, conversation count). The audio diaries were transcribed using speech-to-text, and features were extracted using pre-trained models: VGGish for auditory features (a 128-D embedding) and BERT for textual features (a 768-D embedding). The mobile sensing data was processed in two ways: 1) transformed into banded spectrograms and passed through VGGish for a 128-D embedding, and 2) fed to ROCKET for feature extraction (128 features). Three neural network models were trained: one on the audio and text features, one on the sensing features, and a hybrid model combining both. Separate models were trained to predict each of the four EMA questions. The dataset was split into 60% training, 20% validation, and 20% testing sets. Model performance was evaluated using top-1 and top-2 F1 scores, with XGBoost models as a baseline.
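To make the feature-extraction step concrete, here is a minimal sketch of how the two diary embeddings could be computed, assuming the Hugging Face bert-base-uncased checkpoint and the community torchvggish port; the checkpoint choices, mean-pooling strategy, and function names are illustrative assumptions rather than the authors' exact pipeline.

    import torch
    from transformers import BertModel, BertTokenizer

    # Assumed pre-trained checkpoints; the paper specifies BERT and VGGish
    # but not these exact distributions.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased").eval()
    vggish = torch.hub.load("harritaylor/torchvggish", "vggish").eval()

    def text_embedding(transcript: str) -> torch.Tensor:
        """768-D BERT embedding of one diary transcript (mean-pooled tokens)."""
        inputs = tokenizer(transcript, return_tensors="pt",
                           truncation=True, max_length=512)
        with torch.no_grad():
            hidden = bert(**inputs).last_hidden_state  # (1, seq_len, 768)
        return hidden.mean(dim=1).squeeze(0)           # (768,)

    def audio_embedding(wav_path: str) -> torch.Tensor:
        """128-D VGGish embedding of one diary recording, averaged over frames."""
        with torch.no_grad():
            frames = vggish.forward(wav_path)          # (n_frames, 128)
        return frames.mean(dim=0)                      # (128,)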
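The ROCKET branch can be sketched just as briefly. The reported 128 ROCKET features are consistent with 64 random convolutional kernels, since ROCKET emits two features per kernel (the proportion of positive values and the maximum); this sketch assumes the sktime implementation and an illustrative window shape for the sensing streams.

    import numpy as np
    from sktime.transformations.panel.rocket import Rocket

    # Illustrative sensing windows: (n_samples, n_channels, n_timesteps),
    # e.g. six sensor streams sampled once per minute over a day.
    X = np.random.randn(100, 6, 1440)

    rocket = Rocket(num_kernels=64, random_state=0)  # 2 features/kernel -> 128
    rocket.fit(X)
    features = rocket.transform(X)                   # shape (100, 128)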
Key Findings
The hybrid model consistently outperformed the individual models (auditory-textual and sensing) as well as an overall model trained on all features from scratch. The hybrid model, which fused features transferred from the pre-trained auditory-textual and sensing models, achieved the best performance. For the "negativeness" EMA question, it obtained a 54% top-1 and 72% top-2 F1 score; for "loudness," 51% top-1 and 74% top-2; for "control," 48% top-1 and 68% top-2; and for "power," 47% top-1 and 70% top-2. These scores are well above chance given the imbalanced distribution of responses. The VGGish-based transformation of the mobile sensing data clearly outperformed the ROCKET-based transformation, suggesting it is more effective at extracting informative features from the time-series data. A higher dropout rate in the earlier layers of the models was crucial for addressing the exploding-gradient problem posed by the high dimensionality of the features relative to the small sample size. The XGBoost baseline models performed worse than the neural network models across all scenarios.
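A minimal PyTorch sketch of the hybrid fusion idea and the graded dropout described above follows. The layer widths and dropout rates are illustrative assumptions (the study reports heavier dropout in earlier layers, not these exact values); the 896-D auditory-textual input corresponds to the concatenated 128-D VGGish and 768-D BERT embeddings.

    import torch
    import torch.nn as nn

    class HybridAVHModel(nn.Module):
        """Fuses auditory-textual and sensing features to predict one EMA item."""

        def __init__(self, at_dim=896, sensing_dim=128, n_classes=4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Dropout(0.6),                  # heavier dropout up front
                nn.Linear(at_dim + sensing_dim, 256),
                nn.ReLU(),
                nn.Dropout(0.4),
                nn.Linear(256, 64),
                nn.ReLU(),
                nn.Dropout(0.2),                  # lighter dropout near the head
                nn.Linear(64, n_classes),         # four-point EMA scale
            )

        def forward(self, at_feats, sensing_feats):
            return self.net(torch.cat([at_feats, sensing_feats], dim=-1))

Per the methodology, one such model would be trained independently for each of the four EMA questions.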
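The top-2 F1 scores can be read as giving the model credit whenever the true response is among its two highest-scoring classes. The helper below is one plausible way to compute such a metric, not necessarily the authors' exact definition.

    import numpy as np
    from sklearn.metrics import f1_score

    def top_k_f1(y_true, scores, k=2, average="weighted"):
        """F1 where a sample counts as correct if the true class is in the top k.

        y_true: (n,) integer labels; scores: (n, n_classes) class scores.
        """
        y_true = np.asarray(y_true)
        topk = np.argsort(scores, axis=1)[:, -k:]        # k best classes per sample
        in_topk = np.any(topk == y_true[:, None], axis=1)
        # Credit the true label when it is in the top k, else keep the argmax.
        y_pred = np.where(in_topk, y_true, scores.argmax(axis=1))
        return f1_score(y_true, y_pred, average=average)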
Discussion
This research demonstrates the effectiveness of using mobile data and deep learning models for the real-time assessment of AVH valence. The superior performance of the hybrid model underscores the importance of integrating multimodal data sources (audio, text, and sensing). The novel approach of using VGGish to transform mobile sensing data proved beneficial, outperforming the ROCKET method. The findings highlight the potential of this approach for clinical applications, enabling timely interventions for individuals experiencing AVH. The results support the hypothesis that linguistic and contextual cues provide valuable insights into the nature of AVH experiences. The study provides a proof of concept for a mobile system that could be deployed to assist healthcare professionals in monitoring and managing AVH in patients. Future research could explore ways to refine the model, increase sample size, and assess the effectiveness of the system in real-world clinical settings.
Conclusion
This study presents a novel method for assessing auditory verbal hallucinations using mobile data and deep learning. A hybrid model integrating audio diary transcriptions, mobile sensing data, and data fusion techniques outperformed individual models, achieving promising predictive accuracy. The novel VGGish-based transformation of sensing data improved model performance significantly. This research demonstrates the feasibility of real-time AVH monitoring using mobile technology, offering a potential tool for timely interventions and personalized treatment strategies. Future research could focus on improving model accuracy, incorporating additional data modalities, and conducting longitudinal studies to evaluate long-term effectiveness.
Limitations
The study's sample size, while large for this type of research, could be expanded further to enhance generalizability. The reliance on self-reported data introduces potential biases. The study focused on a specific demographic (US residents with Android phones), which limits generalizability to other populations. The passive sensing data, while rich, lacks the granularity of physiological measures, which could provide further insights. Finally, the reliance on pre-trained models introduces constraints imposed by those models' architectures and training data.