Automating the analysis of facial emotion expression dynamics: A computational framework and application in psychotic disorders

Psychology


N. T. Hall, M. N. Hallquist, et al.

We introduce a machine-learning and network-modeling method to quantify the dynamics of brief facial emotion expressions using video-recorded clinical interviews. Applied to 96 people with psychotic disorders and 116 never-psychotic adults, the approach reveals distinct expression trajectories—schizophrenia toward uncommon emotions, other psychoses toward sadness—and offers broad applications including telemedicine. This research was conducted by Nathan T. Hall, Michael N. Hallquist, Elizabeth A. Martin, Wenxuan Lian, Katherine G. Jonas, and Roman Kotov.

Introduction
Human social communication relies heavily on facial emotion expressions (EEs), yet methods to quantify rapid, second-to-second changes in EEs are limited. Self-report is prone to bias and is typically cross-sectional, while traditional objective approaches have practical drawbacks: the Facial Action Coding System (FACS) is labor-intensive, and facial electromyography (EMG) can be obtrusive. Advances in computer vision enable automated facial emotion recognition algorithms (FERAs) to quantify EEs from video, but most studies have used static summaries rather than dynamics. Emotion dynamics can be characterized by inertia (the tendency to persist in an emotion) and transitions (changes from one emotion to another). This study introduces a computational framework that combines a pretrained FERA with dynamic network modeling (CS-GIMME) to quantify fast EE dynamics (~500 ms resolution) during clinical interviews. Psychotic disorders provide a compelling testbed given known abnormalities in emotional expression; the study examines whether inertia and cross-emotion transitions differ across schizophrenia (SZ), other psychoses (OP), and never-psychotic (NP) adults, and how these dynamics relate to symptom dimensions.
Literature Review
The gold-standard Facial Action Coding System (FACS, and its Emotion FACS extension) enabled standardized facial coding but requires extensive human labor; EMG detects subtle facial movements but is intrusive and can compromise ecological validity. FERAs such as FaceReader offer scalable, accurate decoding of seven EE categories (six basic emotions plus neutral). Prior research has mostly used static indices (means, variability) rather than temporal dynamics, despite theory emphasizing inertia and inter-emotion transitions in affective processes. Group iterative multiple model estimation (GIMME) and its confirmatory subgrouping variant (CS-GIMME), originally developed for neuroimaging time series, model directed, lagged relationships in heterogeneous samples, matching the need to capture individual- and subgroup-level EE dynamics. In psychosis, studies using human-coded facial expressions and EMG link reduced pleasant expressions and reduced variability to negative and disorganized symptoms; preliminary FERA work likewise shows fewer pleasant expressions and reduced variability in schizophrenia. However, prior work has not examined dynamic transitions among EEs. The present study addresses this gap by focusing on macro-expressions (≈500 ms to 4 s) and leveraging high-frequency video-derived time series downsampled to 500 ms.
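To make the modeling target concrete, GIMME-family models are typically expressed as a unified structural equation model over the multivariate time series; a generic lag-1 sketch (the notation below is a standard form from that literature, not taken verbatim from this paper) is

\eta_t = A\,\eta_t + \Phi_1\,\eta_{t-1} + \zeta_t,

where \eta_t is the vector of the seven EE intensities at time t, A contains contemporaneous (lag-0) paths, \Phi_1 contains lagged (lag-1) paths, and \zeta_t is a disturbance term. The diagonal of \Phi_1 corresponds to inertia (AR1), and its off-diagonal elements correspond to cross-emotion transitions.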
Methodology
Sample and procedure: Participants were drawn from the Suffolk County Mental Health Project (a first-admission psychosis cohort recruited in 1989–1995) and an age- and sex-matched never-psychotic (NP) comparison group. For the present analyses, after video consent and quality screening, the final sample included SZ (n=43), OP (n=53), and NP (n=116). Clinical interviews comprised the SCID and the Quality of Life Scale; symptom dimensions were assessed via SAPS/SANS factor scores (reality distortion, disorganization, inexpressivity, avolition). Tardive dyskinesia was assessed with the AIMS.

Video and FERA processing: FaceReader v8.0 produced 30 Hz time series of seven facial EEs (angry, disgusted, happy, neutral, sad, scared, surprised) as continuous probabilities (0–1). For participants with multiple video files, outputs were concatenated, with file boundaries padded with missing values to avoid spurious transitions. Short unusable segments (≤1 s, i.e., 30 frames) were filled using Stineman interpolation. Participants with ≤5 minutes of total usable data or ≥90% missing data were excluded. Data were downsampled from 30 Hz to 2 Hz (500 ms) by averaging non-missing values within 15-frame bins and then standardized within participant (mean 0, SD 1).

Prewhitening and modeling: To preserve the focus on lag-1 dynamics, each EE time series underwent ARMA(8,2) prewhitening to remove higher-order autoregressive/moving-average structure while preserving the AR1 component to be modeled within GIMME. CS-GIMME (within the euSEM framework) estimated contemporaneous (lag-0) and cross-lagged (lag-1) paths among EEs at the group level and at the subgroup level (confirmatory subgroups: SZ, OP, NP), followed by individual-level tailoring. Group-level paths were retained if they improved fit for ≥75% of participants; subgroup-level paths required ≥65% within subgroup.

Primary measures: EE inertia (AR1 coefficients) and cross-lagged transition coefficients were extracted per participant.

Statistical analysis: Mixed-effects regression models tested group differences in AR1 and in the common cross-lagged paths, including video-quality covariates; pairwise contrasts used Kenward–Roger degrees of freedom and Tukey corrections. Partial correlations (controlling for group, sex, age, and race) tested associations between EE dynamics and symptom dimensions. Post hoc models within SZ examined associations between SZ-specific edges and symptom dimensions.

Data and code availability: Anonymized variables and computed signals are available on OSF (https://osf.io/8gsye/); an R package with analysis functions is available via the GitHub repository linked on OSF.
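To make the preprocessing concrete, the sketch below illustrates the downsampling and within-person standardization steps described above in Python (the authors' own tooling is an R package; all function, variable, and file names here are illustrative assumptions, not the published code). It averages non-missing 30 Hz FaceReader probabilities within 15-frame (500 ms) bins and then z-scores each expression channel within a participant.

import numpy as np
import pandas as pd

EXPRESSIONS = ["angry", "disgusted", "happy", "neutral", "sad", "scared", "surprised"]

def downsample_to_2hz(frames: pd.DataFrame, fps: int = 30, bin_ms: int = 500) -> pd.DataFrame:
    # Average non-missing frame-level probabilities within each 500 ms bin
    # (15 frames at 30 fps); bins that are entirely missing remain NaN.
    frames_per_bin = int(fps * bin_ms / 1000)
    bin_index = np.arange(len(frames)) // frames_per_bin
    return frames[EXPRESSIONS].groupby(bin_index).mean()  # mean() skips NaNs

def standardize_within_person(series: pd.DataFrame) -> pd.DataFrame:
    # Z-score each expression time series within a participant (mean 0, SD 1).
    return (series - series.mean()) / series.std(ddof=0)

# Hypothetical usage for one participant's concatenated FaceReader output:
# raw = pd.read_csv("participant_facereader_output.csv")
# ts_2hz = standardize_within_person(downsample_to_2hz(raw))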
Key Findings
- Descriptive and modeling overview: FaceReader-derived time series of the seven EEs from clinical interviews were modeled with CS-GIMME to estimate inertia (AR1) and cross-lagged dynamics at 500 ms resolution.
- Inertia (AR1): All EEs showed positive AR1 on average. Group differences were minimal; NP had lower Angry inertia than OP (d=0.49, P=0.009). Other AR1 group comparisons were nonsignificant.
- Common cross-lagged transitions (present across groups): Neutral→Happy, Neutral→Sad, and Neutral→Disgusted were significant positive paths at the group level, indicating movement away from neutral.
- Subgroup-specific dynamics:
  • NP: Disgusted→Happy, Sad→Happy, and Surprised→Neutral.
  • SZ: Disgusted→Happy; Neutral→Surprised; Neutral→Scared; Scared→Sad.
  • OP: Happy→Sad; Surprised→Sad; Disgusted→Sad; and a negative Sad→Happy path (reduced likelihood of moving from Sad to Happy).
- Magnitude differences among common paths (mixed-effects models):
  • Neutral→Happy dynamics were weaker in SZ than in NP (d=−0.38) and especially weaker in OP (OP vs NP d=−1.95; OP vs SZ d=−1.41).
  • Neutral→Sad dynamics were stronger in OP than in NP (d=1.28) and SZ (d=1.18).
  • Sad→Happy transitions were less likely in OP than in NP (d=−2.27).
  • No group differences emerged for Neutral→Disgusted or Disgusted→Happy.
- Associations with symptom dimensions (partial correlations controlling for group, sex, age, and race):
  • Disorganization was negatively associated with inertia (AR1) for Neutral, Happy, Surprised, and Scared (i.e., greater volatility).
  • Tardive dyskinesia was associated with weaker Neutral→Sad dynamics and lower Angry and Disgusted inertia.
  • Reality distortion was associated with higher Sad inertia (AR1).
- SZ-specific post hoc associations: Within SZ, disorganization was positively associated with Neutral→Scared and Neutral→Surprised; inexpressivity was negatively associated with Neutral→Surprised and positively associated with Disgusted→Happy.
- Clinical interpretation: OP showed convergence of dynamics toward Sad and reduced recovery from Sad (negative Sad→Happy), consistent with mood pathology. SZ showed increased transitions from Neutral to uncommon expressions (Scared, Surprised), consistent with disorganization and affective incongruence. Mean levels of these EEs did not differ by diagnosis, highlighting the added value of dynamic modeling.
Discussion
The study demonstrates that combining automated FERA output with dynamic network modeling captures clinically meaningful facial emotion expression dynamics during interviews. Results align with theoretical expectations: all groups tend to depart from neutral, but diagnostic subgroups exhibit distinctive dynamic signatures. OP participants showed a dynamic pull toward sadness and reduced transitions from sadness to happiness, consistent with mood disorder phenomenology. SZ participants displayed more transitions from neutral to uncommon expressions (surprised, scared), potentially reflecting disorganization or environment-incongruent affect; these dynamics related specifically to clinician-rated disorganization. The approach goes beyond static mean levels, which did not differentiate the groups for uncommon expressions, by revealing directed temporal relationships. Associations between inertia/dynamics and symptom dimensions (e.g., disorganization, tardive dyskinesia) suggest utility for dimensional clinical characterization, with possible neuromotor contributions to dyskinesia-related volatility. Given the scalability of video capture and FERAs, the framework has strong potential for integration into telehealth and other clinical settings to augment assessment, monitoring, and treatment personalization.
Conclusion
This work introduces a scalable computational framework that integrates pretrained FERAs with dynamic network modeling to quantify second-to-second facial emotion expression dynamics. Applied to psychotic disorders, the method revealed shared and diagnosis-specific dynamic patterns that align with symptom dimensions, particularly disorganization and mood-related sadness dynamics. The framework can leverage ubiquitous video data (e.g., telemedicine) to provide objective socio-emotional indices that complement clinician judgment. Future research should expand multimodal modeling by incorporating linguistic content, test generalizability in more diverse samples, and examine slower-timescale dynamics, ultimately moving toward clinically actionable, personalized markers of socio-emotional functioning.
Limitations
- Modality limitation: Inferences concern facial expressions, not internal emotional experience; integrating speech and language content could improve validity.
- Sample diversity: The sample was largely White; although FERAs have shown cross-racial performance, potential differential misclassification (e.g., in Black participants) warrants replication in more diverse cohorts.
- Timescale: Analyses targeted rapid dynamics (~500 ms); slower dynamics over seconds to minutes were not captured.
- Algorithmic constraints: Performance and classification accuracy depend on the FERA (FaceReader) and video quality; residual biases or errors may influence estimated dynamics.
- Ecological factors: The clinical interview context may shape expression patterns, and causal interpretations are limited by the observational design.