
Computer Science
Detection of eye contact with deep neural networks is as accurate as human experts
E. Chong, E. Clark-Whitney, et al.
Discover a groundbreaking deep neural network model that automatically detects eye contact in egocentric video, achieving accuracy on par with human experts. This innovative research, conducted by Eunji Chong, Elysha Clark-Whitney, Audrey Southerland, Elizabeth Stubbs, Chanel Miller, Eliana L. Ajodan, Melanie R. Silverman, Catherine Lord, Agata Rozga, Rebecca M. Jones, and James M. Rehg, showcases precision and recall rates that could transform gaze behavior analysis in clinical and research contexts.
~3 min • Beginner • English
Introduction
The study addresses the challenge of objectively and scalably measuring eye contact, an essential element of social communication and a key marker in developmental conditions such as autism spectrum disorder (ASD). Traditional eye-tracking approaches are costly, burdensome, and difficult to deploy in naturalistic, face-to-face interactions, particularly with infants, young children, and individuals with high support needs. The authors propose using wearable glasses with an embedded point-of-view (PoV) camera worn by the interaction partner so that, during genuine eye contact, the subject’s gaze is directed toward the camera. The research questions are whether a deep learning model trained on large-scale egocentric data can detect eye contact with accuracy on par with expert human raters, whether transfer learning can compensate for limited subject diversity relative to the number of images, and whether automated measures replicate established findings on eye contact from prior developmental and clinical studies. The purpose is to provide a scalable, objective alternative to labor-intensive manual coding, with relevance to both basic social behavior analysis and clinical screening and assessment for ASD and related conditions.
Literature Review
The paper situates its contribution within prior work in automated social behavior analysis and medical AI. Deep learning has achieved expert-level performance across several biomedical domains (e.g., diabetic retinopathy, skin cancer, mammography, fractures, atrial fibrillation). In contrast, fewer works have targeted automated analysis of social behaviors in clinical contexts, and those that have rarely achieved or benchmarked expert-level performance. Prior efforts examined social responses such as “response to name,” robot-child interactions, and related behaviors, but did not address eye contact specifically, nor did they reach expert-level parity in naturalistic face-to-face settings. Previous multi-task learning approaches for eye contact/gaze estimation did not match the performance obtained here with transfer learning from 3D pose/gaze tasks. The present work advances the field by demonstrating expert-equivalent performance for eye contact detection in egocentric video and by empirically supporting the superiority of transfer learning over multi-task learning for this problem.
Methodology
Study design and data acquisition: The dataset was collected between 2015 and 2018 at two sites: typically developing (TD) subjects at Georgia Tech (GTL) and ASD subjects at the Center for Autism and the Developing Brain (CAB), Weill Cornell Medicine. Interactions were recorded with lightweight glasses worn by the interaction partner, with a small outward-facing PoV camera embedded in the bridge of the frame. During eye contact, the subject looks directly at the camera, enabling detection from the egocentric viewpoint. The lenses were removed so that the subject had an unobstructed view of the partner’s eyes.
Participants and protocols: The overall modeling dataset comprised 103 unique subjects with diverse demographics, including 57 with ASD, spanning young children and adolescents in the TD and ASD groups. In a separate sample of young children, 66 children (35 with suspected ASD) were recruited at CAB and 18 TD children at GTL (3 TD children were later excluded due to technical issues). All subjects completed the Early Social Communication Scales (ESCS) and the Brief Observation of Social Communication Change (BOSCC) protocols in randomized order; subsets completed follow-up sessions and a parent-administered BOSCC. The validation set contained 18 sessions and was representative of the overall sample in diagnostic group, gender, age, race, ethnicity, and severity of ASD social impairment.
Annotation and reliability: Video was decoded at 30 fps. In each frame, the subject’s face was detected and cropped; frames were labeled 1 for eye contact and 0 otherwise. Manual coding followed established protocols using INTERACT software, with inter-rater reliability established at the outset (mean ICC ~0.886–0.903). For the validation analysis, ten expert human raters produced frame-level eye contact labels; for each comparison, the consensus of nine raters served as ground truth. The overall dataset included 4,339,879 annotated images. Event-level post-processing applied temporal smoothing to merge short segments and remove outliers via a sliding window, with hyperparameters selected by grid search on held-out training data to maximize detection accuracy while minimizing event fragmentation at matched recall.
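As a concrete illustration of this post-processing step, the Python sketch below applies sliding-window majority smoothing to per-frame labels and drops very short events. The window length, minimum event duration, and function name are our own illustrative choices, not the grid-searched hyperparameters used in the study.

```python
import numpy as np

def smooth_eye_contact(frame_labels, window=15, min_event_frames=5):
    """Sliding-window smoothing of per-frame eye contact labels (0/1 at 30 fps).

    `window` and `min_event_frames` are illustrative placeholders, not the
    grid-searched hyperparameters reported in the study.
    """
    labels = np.asarray(frame_labels, dtype=int)
    half = window // 2
    padded = np.pad(labels, half, mode="edge")
    # Majority vote inside each window removes isolated spurious frames
    # and fills short gaps within otherwise continuous eye contact events.
    smoothed = np.array([int(padded[i:i + window].sum() > half)
                         for i in range(len(labels))])
    # Drop events shorter than min_event_frames to reduce fragmentation.
    cleaned, start = smoothed.copy(), None
    for i, v in enumerate(np.append(smoothed, 0)):
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start < min_event_frames:
                cleaned[start:i] = 0
            start = None
    return cleaned
```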
Model architecture and training: A deep convolutional neural network with a ResNet-50 backbone processed 224×224 face crops and output per-frame eye contact scores via a softmax layer. Training followed a two-stage transfer learning scheme: Stage 1 pretrained the network on public datasets to learn head pose and gaze direction (3D pose/gaze estimation); Stage 2 fine-tuned it on the eye contact dataset. Baselines included a multi-task learning approach (adapted from prior work) and ablations without transfer learning. The impact of face detection on accuracy was analyzed separately (Supplementary Table 1).
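For readers who want a concrete picture of this setup, here is a minimal PyTorch sketch of a ResNet-50 backbone over 224×224 face crops with a two-way softmax head. The checkpoint path for gaze pretraining, the class and parameter names, and the strict=False weight loading are assumptions for illustration, not the authors’ released implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class EyeContactNet(nn.Module):
    """ResNet-50 backbone over 224x224 face crops with a two-way softmax head."""

    def __init__(self, gaze_pretrained_ckpt=None):
        super().__init__()
        self.backbone = models.resnet50(weights=None)
        feat_dim = self.backbone.fc.in_features  # 2048 for ResNet-50
        self.backbone.fc = nn.Identity()
        # Stage 1 (transfer learning): in the paper the backbone is first
        # trained on public head-pose/gaze-direction data; here we only load
        # such weights if a checkpoint path is given (hypothetical file).
        if gaze_pretrained_ckpt is not None:
            state = torch.load(gaze_pretrained_ckpt, map_location="cpu")
            self.backbone.load_state_dict(state, strict=False)
        # Stage 2: a fresh classification head fine-tuned on eye contact labels.
        self.classifier = nn.Linear(feat_dim, 2)

    def forward(self, face_crops):               # (N, 3, 224, 224)
        features = self.backbone(face_crops)     # (N, 2048)
        logits = self.classifier(features)       # (N, 2)
        # Probability of the "eye contact" class as the per-frame score.
        return torch.softmax(logits, dim=1)[:, 1]
```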
Evaluation: Three experiments were conducted. (1) Frame-level detection performance: precision-recall (PR) curves were generated by varying the classification threshold on the model’s scores and summarized by maximum F1 and average precision (AP); the effect of temporal smoothing was also evaluated. (2) Inter-rater reliability: the model was treated as an additional rater, and Cohen’s kappa was computed for all human-human and human-model pairs; equivalence between human and model reliability was assessed with two one-sided tests (TOST), with the equivalence bound Δ set to the SD of the human kappas (0.025). (3) Reproducibility: automated eye contact measures (frequency and duration) were used to replicate statistical findings from two prior studies, with analyses mirroring the original tests (ANOVA, Mann–Whitney U, Wilcoxon signed-rank); subjects used in this evaluation were excluded from model training. Additional correlation analyses examined relationships between automated eye contact measures and ASD symptom severity (ADOS CSS SA and BOSCC SA).
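As an example of the frame-level evaluation described in (1), the scikit-learn sketch below computes the PR curve, the maximum-F1 operating point, and average precision from per-frame scores; the function and variable names are ours and not tied to the study’s code.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

def frame_level_metrics(y_true, scores):
    """Max F1 operating point and average precision from per-frame scores.

    y_true: ground-truth 0/1 eye contact labels per frame.
    scores: model's per-frame eye contact probabilities.
    """
    precision, recall, _ = precision_recall_curve(y_true, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    best = int(np.argmax(f1))
    return {
        "max_f1": float(f1[best]),
        "precision_at_max_f1": float(precision[best]),
        "recall_at_max_f1": float(recall[best]),
        "average_precision": float(average_precision_score(y_true, scores)),
    }
```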
Key Findings
- Frame-level accuracy: On the 18 validation sessions, the deep model achieved precision 0.936 and recall 0.943. With temporal smoothing, the model’s operating point reached F1 = 0.940 (precision = 0.934, recall = 0.938); without smoothing, F1 = 0.916 (precision = 0.924, recall = 0.915). Mean human rater performance was F1 = 0.932 (precision = 0.918, recall = 0.946). The model’s precision-recall operating point lay within one SD of the human raters’ and showed higher precision at matched recall.
- Average precision (area under PR curve) without smoothing: ESCS AP = 0.948; BOSCC AP = 0.959; combined AP = 0.956.
- Transfer learning vs baselines: Removing transfer learning reduced F1; a prior multi-task learning approach yielded F1 = 0.906 (precision 0.920, recall 0.809), underperforming transfer learning.
- Inter-rater reliability: Mean human–human Cohen’s kappa was m_h = 0.888 and mean human–detector kappa was m_hd = 0.891. TOST equivalence testing at the 0.05 significance level demonstrated equivalence with Δ as small as 0.025 (the SD of the human kappas), indicating the detector is as reliable as human annotators (see the sketch after this list).
- Reproducibility of prior studies: Automated coding replicated the key statistical findings reported in two prior studies, with automated results matching manual coding outcomes for effects of context and within- and between-group comparisons (see Tables 3 and 4 and Fig. 4).
- Correlation with ASD severity: In ASD participants, automated eye contact frequency and duration correlated negatively with symptom severity. ESCS (n = 45): frequency r = -0.41 (p < 0.01), duration r = -0.36 (p < 0.05). BOSCC (n = 58): frequency r = -0.26 (p < 0.05), duration r = -0.29 (p < 0.05). BOSCC SA (n = 25): frequency r = -0.75 (p < 0.001), duration r = -0.78 (p < 0.001).
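The inter-rater reliability comparison summarized above can be sketched as follows: pairwise Cohen’s kappa among human raters, kappa between each human and the detector, and a t-based TOST check of whether the two means differ by more than Δ. This is a simplified illustration under assumed names and a Welch-style test statistic, not the paper’s exact statistical procedure.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

def pairwise_kappas(human_labels, detector_labels):
    """Cohen's kappa for all human-human pairs and each human-detector pair.

    human_labels: list of per-frame 0/1 label arrays, one per human rater.
    detector_labels: the detector's per-frame 0/1 labels (same length).
    """
    human_human = np.array([cohen_kappa_score(human_labels[i], human_labels[j])
                            for i in range(len(human_labels))
                            for j in range(i + 1, len(human_labels))])
    human_detector = np.array([cohen_kappa_score(r, detector_labels)
                               for r in human_labels])
    return human_human, human_detector

def tost_equivalence(human_human, human_detector, delta=0.025):
    """Two one-sided tests: is the mean human-detector kappa within +/- delta
    of the mean human-human kappa? A simplified t-based TOST, not the paper's
    exact procedure."""
    diff = human_detector.mean() - human_human.mean()
    se = np.sqrt(human_detector.var(ddof=1) / len(human_detector)
                 + human_human.var(ddof=1) / len(human_human))
    df = len(human_detector) + len(human_human) - 2
    p_lower = 1 - stats.t.cdf((diff + delta) / se, df)  # H0: diff <= -delta
    p_upper = stats.t.cdf((diff - delta) / se, df)      # H0: diff >= +delta
    return max(p_lower, p_upper)  # equivalence if below the 0.05 level
```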
Discussion
The findings demonstrate that automatic detection of eye contact from egocentric PoV video can reach expert human performance. The model’s precision-recall characteristics and F1 score match or exceed mean human rater performance at comparable recall, supporting the primary hypothesis of expert-level equivalence. Reliability analyses further confirm that treating the detector as another rater preserves inter-rater reliability (human–detector kappa comparable to human–human). The equivalence tests reinforce that any differences are within a practically negligible bound defined by the variability among human raters. Negative correlations between automated eye contact measures and ASD symptom severity support construct validity, as more severe social affect symptoms are associated with lower frequency and duration of eye contact. Replication of prior studies’ results using fully automated measures indicates that automated coding can substitute for manual coding in developmental and clinical research, enabling scalable, objective assessments. Methodologically, pretraining on 3D pose/gaze tasks (transfer learning) outperformed multi-task learning baselines for eye contact detection from PoV data, suggesting value in leveraging related tasks to learn robust angular and pose representations even when subject diversity is limited relative to total frame count.
Conclusion
This work introduces an automated, egocentric-video-based eye contact detector that achieves expert-level accuracy and reliability, enabling scalable, objective analysis of gaze behavior in natural face-to-face interactions. By validating parity with human raters, demonstrating strong reliability, reproducing key findings from prior studies, and showing expected correlations with ASD severity, the method supports replacing labor-intensive manual coding in research and potentially in clinical workflows. The study also shows the advantage of transfer learning from 3D pose/gaze tasks over multi-task learning baselines. Future directions include expanding subject and context diversity to further test generalization, integrating the tool into clinical and educational settings, evaluating performance across broader age ranges and diagnostic groups, optimizing robustness to face detection failures and motion blur, and exploring real-time deployment on mobile platforms.
Limitations
- Dataset subject diversity: Although the dataset contains 4.34M annotated frames, it includes approximately 103 unique subjects, which may limit generalization across broader populations and settings.
- Training overlap in correlations: Some subjects included in correlation analyses overlapped with the training set to avoid excessive reduction of training data, potentially biasing correlation estimates.
- Annotation constraints: Certain annotation steps (e.g., frame-level labels derived from single raters in some phases) and incomplete manual annotation of all video segments may introduce labeling noise.
- Dependence on face detection and post-processing: Performance can be affected by face detection failures, motion blur, and eye blinks; temporal smoothing and sliding-window heuristics were required to stabilize predictions.
- Data sharing restrictions: IRB constraints prevent releasing the eye contact dataset, which may limit external reproducibility and benchmarking.
- Ecological and hardware constraints: The approach relies on PoV glasses worn by the interactive partner; setup specifics (e.g., lens removal) and controlled protocol contexts (ESCS, BOSCC) may not capture all real-world conditions.