Accelerating eye movement research via accurate and affordable smartphone eye tracking

N. Valliappan, N. Dai, et al.

Discover an innovative smartphone-based eye tracking method from Google Research that rivals high-end mobile eye trackers at a fraction of the cost. The technique delivers high accuracy and demonstrates potential applications in reading comprehension research and healthcare.

Introduction
The study addresses whether accurate, scalable eye tracking can be achieved on smartphones using only the built-in front-facing camera, enabling rigorous eye movement research without specialized, expensive hardware. Eye movements provide a window into attention and cognition, yet most research relies on desktop setups with infrared eye trackers that are costly and hard to scale. With billions of smartphone users and growing mobile content consumption, understanding eye movements on small displays is increasingly important. Prior machine-learning approaches using commodity cameras have achieved limited accuracy (≈2.4–3°), insufficient for many research applications. The purpose here is to develop and validate an accurate, affordable smartphone-based eye tracker and to test whether canonical findings in oculomotor control, saliency during natural image viewing, and reading behavior can be replicated on phones. The importance lies in democratizing eye tracking for research, accessibility, and healthcare by scaling to large, diverse populations.
Literature Review
The paper situates its work within decades of eye movement research spanning attention, scene perception, visual search, and reading, and within applied domains such as usability, driving, gaming, accessibility, and medical research. Traditional eye tracking relies on specialized, infrared-based hardware that offers high spatial and temporal resolution but is costly and hard to scale. Cheaper desktop alternatives exist, but comparable mobile options do not. Recent ML-based gaze estimation using laptop or smartphone cameras has reported ≈2.4–3° accuracy, well short of the ≈0.5–1° typical of specialized systems. The authors build on GazeCapture and the broader appearance-based gaze estimation literature, aiming to improve accuracy through model architecture, calibration and personalization, and optimized use conditions on smartphones.
Methodology
Model and training: A multi-layer feed-forward convolutional neural network (ConvNet) takes RGB selfie-camera images cropped to the eye regions (128×128×3). The two crops are processed by two identical towers (the left eye is flipped horizontally so weights can be shared), each with three convolutional layers (7×7, 5×5, and 3×3 kernels; 32, 64, and 128 channels; strides 2, 2, and 1; ReLU activations), each followed by 2×2 average pooling. The inner and outer eye-corner landmarks (4×2 floats) pass through three fully connected layers and are fused with the tower outputs via two additional fully connected layers; a regression head outputs the x,y gaze location on screen (a minimal sketch of this architecture and the personalization step appears at the end of this section). Face bounding boxes and six facial landmarks are obtained with a MobileNets+SSD detector. The base model is trained on the MIT GazeCapture dataset.

Personalization and calibration: The base model is fine-tuned (all layers trainable) with about 30 s of calibration data per user (~1000 image–target pairs). After fine-tuning, features from the penultimate ReLU layer feed a lightweight support vector regression (SVR) fitted per user to produce the final x,y screen coordinates, minimizing deviation from the calibration targets.

Calibration stimuli: (1) a dot task with a green pulsating dot (18–50 dp, changing every 300 ms) shown at random locations, and (2) a zig-zag smooth-movement task lasting 60 s. The front camera recorded at 30 Hz with synchronized timestamps.

Evaluation: Accuracy is the Euclidean error (in cm) between ground-truth stimulus positions and estimated gaze. Studies primarily used optimal viewing conditions: near-frontal head pose and a short viewing distance (25–40 cm). Data collection used a custom Android app for stimulus presentation and front-camera capture; processing was done offline. Where applicable, fixations and saccades were identified with a velocity threshold of 22°/s and a minimum fixation duration of 100 ms. Participants with a hold-out calibration error above 1 cm were excluded.

Study 1 (comparison with a specialized mobile eye tracker): 30 participants (after data cleaning: 26 for the smartphone model; 13 for the Tobii comparison) used a Pixel 2 XL and Tobii Pro Glasses 2 (50 Hz) under four conditions: with/without the Tobii glasses, and phone on a stand vs. handheld. Tasks were the dot task (41 dots over 1 min) and the zig-zag task (60 s). The Tobii glasses required a one-point calibration; AprilTag markers were used for robust automapping of gaze onto the phone screen. For each dot, the median gaze estimate across frames was used, and out-of-screen estimates were snapped to the nearest valid on-screen location.

Study 2 (oculomotor tasks): 30 participants (after cleaning: 22) completed blocks with periodic recalibration: a prosaccade task (central fixation followed by a peripheral target; 10 trials per sub-block, three sub-blocks), smooth pursuit of a circle and of a box trajectory (three trials each), and visual search with color-intensity and orientation contrasts, in which participants tapped the target. Metrics: saccade latency, pursuit tracking error, and the number and duration of fixations to reach the target.

Study 3 (saliency on natural images): 37 participants (after cleaning: 32) completed calibration, free viewing (700 OSIE images; each participant viewed 350 images for 3 s each with a 1 s blank in between; on average 16 participants per image), and visual search for labeled objects (10 object classes; 10 blocks of six images). Gaze data were smoothed with a bilateral filter (100 ms, 200 px). Fixation maps were created by rounding fixation locations to pixels and applying Gaussian smoothing (24 px) to match the OSIE settings (sketched below), and were compared with EyeLink 1000 desktop data on OSIE using pixel- and object-level heatmap correlations.
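The fixation-map construction used for these comparisons is straightforward to reproduce. Below is a minimal sketch in Python (NumPy/SciPy): only the pixel rounding and the 24 px smoothing width come from the description above; the array shapes, the duration weighting, and the function names are our own assumptions rather than the authors' exact pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map(fixations_px, durations_s, shape, sigma_px=24):
    """Accumulate fixation durations at rounded pixel locations, then blur.

    fixations_px: (N, 2) array of (x, y) fixation positions in image pixels.
    durations_s:  (N,) fixation durations used as weights (an assumption).
    shape:        (height, width) of the image.
    sigma_px:     24 px smoothing width from the text, treated here as the Gaussian sigma.
    """
    fmap = np.zeros(shape, dtype=np.float64)
    for (x, y), dur in zip(np.round(fixations_px).astype(int), durations_s):
        if 0 <= y < shape[0] and 0 <= x < shape[1]:
            fmap[y, x] += dur
    return gaussian_filter(fmap, sigma=sigma_px)

def pixel_level_correlation(map_a, map_b):
    """Pearson correlation between two fixation maps of equal shape
    (e.g., smartphone vs. desktop EyeLink maps of the same image)."""
    return np.corrcoef(map_a.ravel(), map_b.ravel())[0, 1]
```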
Study 4 (reading comprehension): 23 participants (after cleaning: 17) completed ten tasks: five SAT-like English passages and five computer-science passages with code snippets. Each task had two questions: a factual question (answered directly in the passage) and an interpretive question (requiring inference). Scrolling was allowed, and the viewport was synchronized with gaze so estimates could be mapped to page coordinates. Metrics included gaze entropy, the proportion of fixation duration on the relevant excerpt (normalized by excerpt height), time to answer, and number of fixations. Task difficulty was quantified as the percentage of incorrect answers per task.
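To make the model and personalization description concrete, here is a minimal sketch assuming TensorFlow/Keras and scikit-learn. The convolutional stack, eye-corner input, and penultimate-layer SVR follow the description above; the fully connected widths, optimizer, and SVR settings are illustrative guesses, not the authors' exact implementation.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model
from sklearn.svm import SVR

def eye_tower():
    """Shared conv tower applied to each 128x128x3 eye crop (left eye pre-flipped)."""
    inp = layers.Input(shape=(128, 128, 3))
    x = inp
    for filters, kernel, stride in [(32, 7, 2), (64, 5, 2), (128, 3, 1)]:
        x = layers.Conv2D(filters, kernel, strides=stride, padding="same",
                          activation="relu")(x)
        x = layers.AveragePooling2D(pool_size=2)(x)
    return Model(inp, layers.Flatten()(x), name="eye_tower")

def build_gaze_model():
    left = layers.Input(shape=(128, 128, 3), name="left_eye")
    right = layers.Input(shape=(128, 128, 3), name="right_eye")
    corners = layers.Input(shape=(8,), name="eye_corners")   # 4 landmarks x (x, y)

    tower = eye_tower()                                       # weights shared across eyes
    eyes = layers.concatenate([tower(left), tower(right)])

    lm = corners
    for units in (128, 16, 16):                               # three FC layers (widths assumed)
        lm = layers.Dense(units, activation="relu")(lm)

    fused = layers.Dense(128, activation="relu")(layers.concatenate([eyes, lm]))
    penultimate = layers.Dense(16, activation="relu", name="penultimate")(fused)
    out = layers.Dense(2, name="gaze_xy")(penultimate)        # x, y gaze on screen
    return Model([left, right, corners], out)

base = build_gaze_model()
base.compile(optimizer="adam", loss="mse")   # base training on GazeCapture-style data,
                                             # then fine-tuning on ~30 s of per-user calibration

# Personalization: penultimate-layer features feed per-user SVRs (one per coordinate).
feature_net = Model(base.inputs, base.get_layer("penultimate").output)

def personalize(calib_inputs, calib_targets_xy):
    """Fit one SVR per screen coordinate on the user's calibration frames."""
    feats = feature_net.predict(calib_inputs)
    svr_x = SVR(kernel="rbf").fit(feats, calib_targets_xy[:, 0])
    svr_y = SVR(kernel="rbf").fit(feats, calib_targets_xy[:, 1])
    return svr_x, svr_y

def predict_gaze(svr_x, svr_y, inputs):
    feats = feature_net.predict(inputs)
    return np.stack([svr_x.predict(feats), svr_y.predict(feats)], axis=1)
```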
Key Findings
- Accuracy and efficiency: Personalization with ~100 calibration frames (<30 s) reduced error from 1.92 ± 0.20 cm (base model) to 0.46 ± 0.03 cm (t(25)=7.32, p=1.13×10^-7). Per-participant errors ranged from 0.23 to 0.75 cm (5th–95th percentiles: 0.31–0.72 cm). At a 25–40 cm viewing distance this corresponds to ≈0.6–1° accuracy (see the conversion sketch after this list), substantially better than prior smartphone/laptop ML methods (≈2.44–3°). Architecture changes plus fine-tuning/personalization cut the parameter count roughly 50-fold (from ~8M to ~170K) while improving accuracy (from 0.73 cm to 0.46 cm under comparable personalization), enabling on-device use.
- Spatial uniformity: Errors were similar across screen locations, slightly larger at the bottom of the screen, where the eyelids partially occlude the eyes when looking down.
- Device and pose: Accuracy was best with a near-frontal head pose and shorter viewing distances, and degraded with increased pan/tilt/roll or larger distances.
- Comparison with Tobii Pro Glasses 2 (n=13): On a stand, smartphone 0.42 ± 0.03 cm vs. Tobii 0.55 ± 0.06 cm (t(12)=-2.12, p=0.06); handheld, smartphone 0.50 ± 0.03 cm vs. Tobii 0.59 ± 0.03 cm (t(12)=-1.53, p=0.15). No significant difference in accuracy.
- Oculomotor tasks (n=22): Prosaccade mean latency was 210 ms (median 167 ms), matching the known range of ≈200–250 ms. Smooth-pursuit tracking error was 0.39 ± 0.02 cm for the circle and similar for the box. In visual search, the number of fixations to find the target decreased with increasing target saliency (color intensity: F(3,63)=37.36, p<1e-5; orientation: F(3,60)=22.60, p<1e-5). Set-size effects depended on saliency: at low saliency (Δθ=7°) fixations increased linearly with set size (slope 0.17; F(2,40)=3.52, p=0.04); at medium-high saliency (Δθ=15°) there was no significant set-size effect (F(2,40)=0.85, p=0.44); at very high saliency (Δθ=75°) the slope was negative (-0.06; F(2,40)=4.39, p=0.02).
- Natural images (n=32): Gaze entropy was higher for free viewing than for visual search (16.94 ± 0.03 vs. 16.39 ± 0.04; t(119)=11.14, p=1e-23). Time to find targets decreased with target size (r=-0.56, p=1e-11, n=120 images) and with target saliency density (r=-0.30, p=0.0011). A clear center bias was observed during free viewing. Mobile and desktop heatmaps were highly correlated despite the blur on mobile: pixel-level r=0.74, object-level r=0.90; shuffled desktop correlations were much lower (0.11 pixel-level, 0.59 object-level).
- Reading comprehension (n=17): Gaze entropy was higher for interpretive than for factual tasks (8.14 ± 0.16 vs. 7.71 ± 0.15; t(114)=1.97, p=0.05). For factual questions answered correctly, participants spent more fixation time on the relevant excerpts (62.29 ± 3.63% vs. 37.71 ± 3.63%; t(52)=3.38, p=0.001); for wrong answers the trend reversed but was not significant (41.97 ± 6.99% on the relevant excerpt; t(12)=-1.15, p=0.27). With increasing task difficulty, time to answer trended up but not significantly (Spearman r=0.176, p=0.63), the number of fixations increased (r=0.67, p=0.04), and the fraction of gaze time on the relevant excerpt decreased strongly (r=-0.72, p=0.02). Together these results indicate that smartphone gaze can detect reading-comprehension difficulty.
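The ≈0.6–1° figure follows from simple viewing geometry: an on-screen error e viewed from distance d subtends a visual angle of 2·atan(e/(2d)). A quick check in Python using the values reported above (the helper function is ours, not from the paper):

```python
import math

def error_cm_to_deg(err_cm, viewing_distance_cm):
    """Convert on-screen gaze error (cm) to degrees of visual angle
    at a given viewing distance: theta = 2 * atan(err / (2 * d))."""
    return math.degrees(2 * math.atan2(err_cm, 2 * viewing_distance_cm))

# 0.46 cm error at the reported 25-40 cm viewing distances:
for d in (25, 40):
    print(f"{d} cm: {error_cm_to_deg(0.46, d):.2f} deg")
# ~1.05 deg at 25 cm and ~0.66 deg at 40 cm, i.e. the reported ~0.6-1 degree range.
```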
Discussion
The findings demonstrate that accurate gaze estimation on smartphones is feasible using only the front-facing RGB camera with brief per-user calibration and personalization, achieving ≈0.6–1° accuracy, comparable to specialized mobile eye trackers at a small fraction of the cost. By reproducing canonical results on prosaccades, smooth pursuit, visual search, and saliency-driven gaze during natural image viewing, the method meets the standards required for eye movement research on small screens. The reading experiments further show that task demands and comprehension difficulty can be detected from gaze distributions. This enables eye tracking to scale across applications, populations, and study sizes, facilitating remote, large-N research in vision science, UX/usability, accessibility, and health. Although mobile heatmaps are blurrier because of the smaller display and lower precision, they correlate strongly with high-end desktop data, supporting their use in saliency analyses of mobile content. The lightweight model is suitable for on-device deployment, which helps address privacy concerns and enables real-time interactive use cases.
Conclusion
This work introduces an accurate, lightweight, and affordable smartphone-based eye tracker that matches the accuracy of state-of-the-art mobile systems while requiring no specialized hardware. It validates the approach by replicating established eye movement phenomena on oculomotor tasks and natural image viewing, and by demonstrating detection of reading comprehension difficulty. The contributions include a compact ConvNet with per-user personalization, empirical evaluation across multiple studies, and direct comparison to Tobii Pro glasses. Future directions include: improving robustness to head pose and distance variations; extending to fully natural, handheld, remote use; leveraging higher frame-rate cameras for better temporal resolution; broadening device and demographic coverage; deploying fully on-device to enhance privacy; and applying the method to accessibility (gaze-based interaction) and healthcare (screening/monitoring through gaze phenotypes).
Limitations
- Data were collected in a lab setting with the phone mounted on a stand to avoid fatigue and large head-pose changes; more natural handheld, in-the-wild use remains to be validated.
- Temporal resolution was limited by the smartphone camera (30 Hz on the Pixel 2 XL), restricting precise measurement of saccade latency, velocity, and fixation duration compared with 50 Hz (Tobii glasses) or 1000–2000 Hz (desktop EyeLink 1000).
- Best performance requires a near-frontal head pose, a short viewing distance (25–40 cm), good indoor lighting, and participants without glasses (to avoid reflections). Performance degrades with extreme pan/tilt/roll, downward gaze (partial eyelid occlusion), or larger distances (smaller eye region in the image).
- Participants with a calibration error above 1 cm were excluded, which may limit generalizability to all users and devices.
- Results were obtained primarily on the Pixel 2 XL; while the methodology is device-agnostic, cross-device variability can affect accuracy and needs further study.
- Mobile heatmaps are blurrier than desktop ones due to the smaller display and lower precision, which may limit fine-grained spatial analyses.