Towards the automatic detection of social biomarkers in autism spectrum disorder: introducing the simulated interaction task (SIT)

Medicine and Health

H. Drimalla, T. Scheffer, et al.

Explore the Simulated Interaction Task (SIT), a standardized assessment tool developed by Hanna Drimalla and colleagues at Charité – Universitätsmedizin Berlin. The study shows how the SIT can quantify social interaction deficits in autism spectrum disorder, reaching 73% classification accuracy with machine learning.

Introduction
The study addresses the challenge of objectively assessing social interaction deficits, which are core features of autism spectrum disorder (ASD) and present in other psychiatric conditions. Current gold-standard assessments (e.g., ADOS, ADI-R) require expert training and subjective judgment, making them time-consuming, difficult to standardize, and not easily scalable. The research question is whether a fully standardized, computer-based simulated interaction task (SIT) combined with automated analysis of non-verbal behaviors (facial expressions, gaze, and voice) can validly quantify social communication biomarkers and aid in detecting ASD, particularly in adults without intellectual disability who are often diagnosed late due to camouflaging. The purpose is to create a reproducible, accessible, and cost-effective tool that captures naturalistic social behavior while enabling objective, quantitative assessment and machine-learning-based classification.
Literature Review
Prior work shows that individuals with ASD have difficulties encoding and signaling emotions and often display reduced spontaneous facial mimicry, atypical gaze (including avoidance of eye contact), and aberrant prosody. While objective cognitive assessments are well established (e.g., MoCA, MMSE), standardized tools for measuring social interaction are scarce; clinicians often rely on the "clinical gaze." Some studies have applied automated analyses of speech, facial expressions, and gaze, and machine-learning approaches have shown promise in predicting ASD, but prior work largely focuses on children or non-interactive paradigms and lacks standardized, reactive interaction partners. There is a need for adult-focused, standardized, naturalistic tasks with automated analysis to detect subtle differences in social communication.
Methodology
Design: A two-part research program comprising (1) a preparatory facial electromyography (EMG) study in healthy controls (HC) to identify facial regions and features responsive to the SIT, and (2) a clinical main study with adults with ASD and neurotypical controls (NT) using automated video/audio analysis and machine learning.
Participants: The main study included 37 adults with ASD (18 females; mean age 36.89 years; range 22–62) and 43 healthy controls (22 females; mean age 33.14 years; range 18–49). ASD diagnoses (Asperger syndrome, atypical ASD, or childhood ASD per ICD-10) were previously established; ADOS (n = 35) and ADI-R (n = 21) scores were used when available. Exclusions: current antipsychotic/anticonvulsant use, neurological disorders, age >55 years, inadequate language competence; controls were excluded for any psychiatric history. The preparatory EMG study comprised 40 healthy male participants.
Setting and procedure: Laboratory with constant lighting. Participants sat in front of a laptop and engaged in a 7-minute simulated video dialog with a pre-recorded actress discussing dinner routines and food preferences (neutral, positive, and negative valence segments). Two cameras recorded synchronized video and audio. The actress's performance was pre-recorded to ensure identical timing and behavior across participants; she exhibited empathic cues (smiling, nodding). A warm-up segment was not analyzed. Participants were instructed to respond naturally.
Simulated Interaction Task (SIT): Standardized conversational segments: (1) neutral (dinner preparation), (2) positive (liked food), (3) negative (disliked food), each followed by a participant response window (~30 seconds). The actress displayed affect appropriate to each segment, and her non-verbal behavior was validated via automated AU analysis (e.g., more AU6/AU12 during positive content, more negative/disgust-related AUs during negative content).
Automated data acquisition and preprocessing: Face and gaze were analyzed with OpenFace. Frames with unsuccessful tracking or with confidence below thresholds of 0.5–0.75 (depending on the analysis) were excluded; participants with <90% successful tracking were excluded. Voice was analyzed with Python libraries (librosa) and Praat scripts. Extracted features included facial action units (AUs), gaze angles (horizontal/vertical), and prosodic voice features (fundamental frequency F0, jitter, shimmer, RMS energy, MFCCs). For facial expressions, analysis focused on AUs associated with happiness (AU6, AU12) and with negative affect/disgust/frowning (AU4, AU9/8), controlling for a neutral baseline. Social smiling was assessed across the dialog, and mimicry was quantified via time-shifted correlations between actress and participant AU time series within 10-second windows (illustrated in the sketch below). Gaze features included centered gaze angles and derived measures (median, absolute deviation, speed, acceleration), plus gaze mimicry via gaze-angle correlations. Voice features were computed over the entire videos; summary statistics per participant were used for group comparison and classification.
Feature engineering and machine learning: Secondary features (mean, SD, min/max, time of maxima, skewness, kurtosis) were derived per conversation segment for the facial and gaze data, yielding 215 facial and 85 gaze features. The voice feature set included MFCCs, energy, mean/SD of F0, jitter, and shimmer (58 primary features). Gender was included as a feature. A single feature vector per participant (aggregated across segments) was constructed.
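The time-shifted mimicry measure can be illustrated with a minimal sketch. This is not the authors' code: it assumes per-frame OpenFace AU intensity series for the actress and the participant at a known frame rate, scans lags of up to two seconds within each 10-second window, and averages the resulting maximum correlations; the function names, the lag range, and the 30 fps default are illustrative assumptions.

```python
import numpy as np

def lagged_correlation(actress: np.ndarray, participant: np.ndarray,
                       fps: int = 30, max_lag_s: float = 2.0) -> float:
    """Maximum Pearson correlation between the actress's AU series and the
    participant's series shifted by 0..max_lag_s seconds (participant reacting
    after the actress). Illustrative sketch, not the published pipeline."""
    best = -1.0
    for lag in range(int(max_lag_s * fps) + 1):
        a = actress[:len(actress) - lag] if lag else actress
        p = participant[lag:]
        if len(a) > 1 and a.std() > 0 and p.std() > 0:
            best = max(best, float(np.corrcoef(a, p)[0, 1]))
    return best

def windowed_mimicry(actress: np.ndarray, participant: np.ndarray,
                     fps: int = 30, window_s: int = 10) -> float:
    """Mean lagged correlation over consecutive non-overlapping 10-s windows."""
    win = window_s * fps
    n = min(len(actress), len(participant))
    scores = [lagged_correlation(actress[s:s + win], participant[s:s + win], fps)
              for s in range(0, n - win + 1, win)]
    return float(np.mean(scores)) if scores else float("nan")

# Example (hypothetical arrays): mimicry of smiling based on AU12 intensity
# score = windowed_mimicry(actress_au12, participant_au12, fps=30)
```

A higher score indicates that the participant's facial activity more closely tracks the actress's, consistent with the mimicry differences reported in the findings.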
Classifier: Random forest (1000 trees) with tuned hyperparameters (maximum depth ∈ {1, 2, 4, 8, 16, 32, 64}; minimum samples per leaf ∈ {1, 2, 4, 8, 16, 32, 64}).
Evaluation: Leave-one-out cross-validation with nested 3-fold CV for hyperparameter tuning; the AUC was computed by training on all other participants and testing on the held-out participant (see the sketch below).
Expert ratings: Eight experienced clinicians (psychologists/psychiatrists) each rated 10 distinct, non-overlapping videos, focusing solely on non-verbal behavior. Ratings were given on a continuous scale from neurotypical to ASD, with a diagnostic threshold marked at the midpoint; each video was rated once. Accuracy per rater was computed, and associations with rater experience were analyzed.
Statistical analysis: Because of skewness and non-normality, non-parametric tests were used for group comparisons (e.g., Wilcoxon, Mann–Whitney U). For group effects on HNR, ANOVA results were reported. Corrections for multiple comparisons were applied where noted.
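A minimal sketch of the described evaluation scheme, assuming a scikit-learn implementation (the paper does not specify the library): X is taken to be the per-participant feature matrix and y the diagnosis labels (ASD = 1, NT = 0); the grids mirror the hyperparameter ranges above, while the function name, variable names, and random seed are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, LeaveOneOut

def evaluate_loocv(X: np.ndarray, y: np.ndarray, seed: int = 0) -> float:
    """Leave-one-out evaluation with a nested 3-fold grid search per fold:
    the forest (1000 trees) is tuned and refit on all other participants,
    then scored on the single held-out participant. Returns the AUC over
    the collected held-out class probabilities. Computationally heavy;
    written for clarity rather than speed."""
    grid = {"max_depth": [1, 2, 4, 8, 16, 32, 64],
            "min_samples_leaf": [1, 2, 4, 8, 16, 32, 64]}
    probs = np.empty(len(y), dtype=float)
    for train_idx, test_idx in LeaveOneOut().split(X):
        search = GridSearchCV(
            RandomForestClassifier(n_estimators=1000, random_state=seed),
            param_grid=grid, cv=3, scoring="roc_auc")
        search.fit(X[train_idx], y[train_idx])
        probs[test_idx] = search.predict_proba(X[test_idx])[:, 1]
    return roc_auc_score(y, probs)
```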
Key Findings
- SIT successfully evoked target emotions: Participants showed increased positive AUs (AU12, AU6) in positive segments compared to neutral, and a trend towards higher AU4 intensity in negative vs neutral segments.
- Group differences in facial expressions and mimicry:
  - Positive segment: AU12 occurrence was lower in ASD vs NT (Mdn 0.10 vs 0.49; U = 5780.0, p = 0.227; trend-level); no AU6 differences (p > 0.05).
  - Negative segment: ASD showed more negative facial activity than NT (Mdn 0.08 vs 0.03; U = 568, p = 0.014, r = 0.29).
  - Social mimicry: ASD exhibited less mimicry and social smiling, e.g., reduced AU5 occurrence/intensity and reduced AU6 occurrence in ASD vs NT (AU6 occurrence ASD Mdn 0.03 vs NT 0.12; U = 598.5, p = 0.028, r = 0.25). Time-shift correlations indicated stronger micromimicry in NT (ASD Mdn 0.08 vs NT 0.19; Z = 1.130, p = 0.004, r = 0.34).
- Gaze behavior: Descriptive metrics showed small group differences; no significant group differences across gaze features remained after correction. Gaze-based classification performed modestly.
- Voice characteristics:
  - Fundamental frequency (F0) was higher in ASD: females 218.91 Hz (SD 16.16) vs NT 209.08 Hz (SD 17.92); males 139.67 Hz (SD 16.22) vs NT 121.68 Hz (SD 11.68).
  - Harmonics-to-noise ratio (HNR): main effect of group F(1,76) = 7.97, p = 0.0061, η² = 0.055; main effect of gender F(1,76) = 60.78, p < 0.001, η² = 0.422 (table values: female median HNR NT 12.30 vs ASD 6.91; male median HNR NT 10.84 vs ASD 8.50).
- Machine-learning performance:
  - Facial features: AUC = 0.65 using selected facial expression features; AUC = 0.74 using all 17 OpenFace AU variables. Exploratory analysis showed better performance in females.
  - Gaze features: AUC = 0.63; no significant gender differences in accuracy.
  - Combined features (face + gaze + voice): AUC = 0.78; at a 0.5 probability threshold, accuracy = 73%, sensitivity = 67%, specificity = 79% (see the sketch after this list). Detection rates were slightly higher in females (70%) than in males (67%). Classifier class probability correlated with ADOS scores (r = 0.48, p < 0.0001) and with age (r = 0.35, p = 0.001).
- Expert ratings: Individual expert accuracies ranged from 0.56 to 1.00 (mean 0.71). There was no significant gender bias in accuracy (χ² = 6.00, p = 0.111). Accuracy increased with rater experience (η² = 0.72, p = 0.014).
- Comparative performance: The combined-feature ML classifier outperformed a majority-vote baseline (McNemar's χ² = 9, p = 0.014) and was not significantly worse than the clinical experts (p > 0.05).
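For reference, the headline metrics reported for the combined classifier (AUC, plus accuracy, sensitivity, and specificity at a 0.5 probability threshold) can be computed from per-participant class probabilities as in the sketch below; the 0/1 coding (ASD = 1) and the function name are assumptions, and the probabilities would come from a leave-one-out procedure such as the one sketched in the Methodology section.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def summarize_classifier(y_true: np.ndarray, probs: np.ndarray,
                         threshold: float = 0.5) -> dict:
    """Threshold the held-out class probabilities and report the headline
    metrics (ASD coded as 1, NT as 0). Illustrative sketch only."""
    preds = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, preds, labels=[0, 1]).ravel()
    return {
        "auc": roc_auc_score(y_true, probs),
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # proportion of ASD participants detected
        "specificity": tn / (tn + fp),  # proportion of NT participants correctly rejected
    }
```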
Discussion
The SIT provides a standardized yet naturalistic paradigm to elicit and quantify non-verbal social behaviors relevant to ASD. Both the preparatory EMG study and automated analyses confirmed that the SIT reliably evokes positive and negative affective expressions, supporting its validity. In the ASD sample, reduced social smiling and decreased mimicry of positive affect, alongside differences in vocal prosody (higher F0 and group differences in HNR), characterize ASD phenotypes. These findings align with literature on diminished spontaneous mimicry and atypical prosody in ASD and indicate that automated, multi-domain behavioral features can capture ASD-related social communication differences. Although group differences were not observed in gaze metrics using appearance-based gaze estimation, classification results and methodological considerations suggest these measures may lack sensitivity relative to eye-tracking, warranting improved gaze measurement. Multivariate machine-learning approaches effectively leverage integrated behavior to avoid multiple-testing pitfalls and yield diagnostic information generalizing to unseen individuals. The combined-feature classifier’s accuracy (73%) approaches expert performance and exceeds naive baselines, indicating potential clinical utility as a supplement to traditional assessments. Associations between classifier probability and ADOS severity, as well as age effects, suggest the model captures meaningful variance related to symptom burden, while highlighting the need to monitor confounders and potential biases (e.g., age-related variability). The SIT is positioned as an adjunct for screening and monitoring, potentially scalable for remote use, with implications beyond ASD to other psychiatric conditions where social cognition differs.
Conclusion
This work introduces the Simulated Interaction Task (SIT), a fully standardized, computer-based, and time- and cost-efficient measure to quantify social communication behavior. The SIT evokes naturalistic non-verbal responses and, via automated analyses of facial expressions, gaze, and voice, identifies social biomarkers associated with ASD. A machine-learning classifier trained on these features achieves an AUC of 0.78 and 73% accuracy, comparable to clinical expert ratings and better than a majority vote. Key phenotypic markers include reduced social smiling and facial mimicry and altered vocal prosody (higher F0, HNR differences). The SIT can supplement clinical assessment, facilitate scalable screening and monitoring, and may extend to other psychiatric conditions. Future work should: (1) improve gaze measurement precision (e.g., eye tracking), (2) validate SIT in home and face-to-face settings, (3) explore adaptive/matched interaction partners, (4) test larger and more diverse clinical samples, and (5) evaluate benefits, costs, and risks of broad deployment.
Limitations
- Gaze estimation relied on appearance-based methods (OpenFace), which may lack the sensitivity of dedicated eye trackers; null gaze group differences may reflect measurement limitations.
- Standardization via a pre-recorded actress trades off against natural adaptability; her responses are not contingent on the length or content of participant behavior, which may affect ecological validity.
- Use in home settings was not fully validated; self-administration can reduce accuracy without automated quality checks (camera/microphone calibration).
- The sample focused on adults without intellectual disability; generalizability to broader ASD populations and other demographics requires testing.
- Potential confounders (e.g., age) influenced model outputs; older participants were more often flagged as outliers, underscoring the risk of bias in ML.
- Some inconsistencies/ambiguities in voice metric labeling (HNR/NHR) and directionality across text and tables warrant cautious interpretation.
- Expert rating variability was high; while ML approached expert accuracy, neither should be considered a stand-alone diagnostic replacement.