In-lab versus web-based eye-tracking in decision-making: A systematic comparison on multiple display-size conditions mimicking common electronic devices

S. Muñoz, V. Maksimenko, et al.

Research conducted by Sebastián Muñoz, Vladimir Maksimenko, Bastian Henriquez-Jara, Prateek Bansal, and Omar David Perez systematically compares a lab eye tracker (EyeLink 1000 Plus) and a webcam method (WebGazer) across two discrete-choice experiments. By varying display size (monitor to mobile) and task complexity, they show that WebGazer matches EyeLink on larger screens and simple tasks but loses reliability on small displays and complex choice matrices. The work offers the first systematic evaluation of this kind and practical guidance for online behavioral studies.

Introduction

The paper asks whether webcam-based eye-tracking (WebGazer) can capture attention reliably and support valid behavioral inferences in decision-making tasks, compared with a laboratory-grade tracker (EyeLink 1000 Plus). Online experimentation has grown because it is scalable and cost-effective, and because of constraints such as COVID-19; the authors note prior demonstrations that online behavioral studies are reliable and recent validations of WebGazer in perceptual/cognitive paradigms. However, a systematic evaluation in discrete choice experiments (DCEs) is still lacking. Because attention is central to explaining choices and DCEs provide clearly defined regions of interest (ROIs), the study asks whether gaze metrics and model-based parameters (e.g., willingness to pay, WTP, and subjective value of travel time savings, SVTTS) are comparable between methods. The authors highlight concerns stemming from WebGazer’s lower temporal/spatial resolution, especially with more complex tasks and smaller displays where ROIs are closer together. They manipulate task complexity (simple 2×2 vs. complex 3×3 matrices) and display size (monitor, laptop, tablet, mobile), approximating common device classes via visual angle scaling. They evaluate both raw gaze metrics and parameters from attention-integrated choice models to delineate when web-based eye-tracking is viable for decision research.

Literature Review

The paper situates its contribution within two strands of literature: work showing that online behavioral research is reliable despite concerns about variability (e.g., Crump et al., 2013), and a growing body of studies validating WebGazer in early-childhood research, cognitive-state detection, and a range of other paradigms (Steffan et al., 2024; Hutt et al., 2023; Slim et al., 2024). Prior work indicates that webcam-based tracking can capture broad attentional patterns, but with greater variability than lab systems. In decision-making, the role of attention in preferences and choices is well documented (e.g., Krajbich et al., 2010; Shimojo et al., 2003; Reutskaja et al., 2011), and DCEs are standard in applied domains (Bliemer & Rose, 2024). Yet systematic comparisons of web-based versus in-lab eye-tracking in DCEs, especially under varying display sizes and task complexity, are absent. The study also builds on integrated choice and latent variable models linking attention to utility (Vij & Walker, 2016; Krucien et al., 2017; Train, 2009), extending them to compare parameter stability across eye-tracking methods.

Methodology

Design: Two discrete choice experiments (DCEs) were conducted while recording gaze simultaneously with EyeLink 1000 Plus and WebGazer. Display size was manipulated within subjects to mimic common devices (monitor, laptop, tablet, mobile) by scaling stimuli and setting viewing distance to approximate typical visual angles; actual physical devices were not used.

Choice tasks: Experiment 1 (simple 2×2) presented binary pizza choices with two attributes: Quality (values: 1, 3, 5, 7, 9) and Cost (5, 7, 9, 11, 13, 15 SGD). Experiment 2 (complex 3×3) presented three transport alternatives (Bus, Metro, Grab) with three attributes: Cost (fare), Time (minutes), and Comfort (categorical: low/medium/high).

Experimental conditions: Each participant completed four display-size tasks (monitor, laptop, tablet, mobile). Within each task, 12 trials were presented in random order. Attribute row order (up-down) was randomized per trial; alternative positions (left–right) were fixed within each experiment. The same 12 trials were used across display-size tasks within an experiment. No practice or catch trials were included. Each task lasted 3–5 minutes; total session ~40 minutes including instructions, calibrations, and breaks.

Participants: Exp. 1: N=40 (14 males, 26 females; mean age 23.9, SD 6.66). Exp. 2: N=40 (14 males, 26 females; mean age 22.3, SD 2.35). Recruited from National University of Singapore (NUS) Student Work Scheme. Prescreened to exclude participants wearing glasses causing reflection artifacts. Compensation: S$10. Ethics: NUS-IRB-2023-1055; informed consent obtained.

Apparatus and setup: Stimuli presented on 25-inch 1920×1080 monitors at ~60 cm viewing distance. EyeLink 1000 Plus (SR Research) sampled at 1000 Hz in Remote Mode; no chin rest used to mimic natural webcam conditions. WebGazer ran in the participant’s browser using built-in webcam; no video was sent to a server. Display scaling produced device-like visual angles (monitor, laptop, tablet, mobile) with 80% of screen allocated to the choice matrix. Typical viewing distances and visual angles for each device class are provided (e.g., monitor usable display ~37.6×22.4 deg; mobile ~11.2×6.4 deg). Cell width/height were computed from usable display dimensions given numbers of alternatives (M) and attributes (N); example cell sizes ranged from ~12.5×7.5 deg (monitor, 2×2) to ~2.8×1.6 deg (mobile, 3×3).
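
The cell sizes quoted above can be reproduced with a little arithmetic. The sketch below is an illustration, not the authors' code. It assumes the choice matrix reserves one extra column and one extra row for labels, so the usable display is divided by M + 1 and N + 1; the summary does not state this, but the assumption matches both quoted examples. The function names are hypothetical, and `visual_angle_deg` is the standard visual-angle formula, included only for converting physical sizes to degrees.

```python
import math


def visual_angle_deg(size_cm: float, distance_cm: float) -> float:
    """Visual angle in degrees subtended by an extent viewed from a distance."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))


def cell_size_deg(usable_w_deg: float, usable_h_deg: float,
                  n_alternatives: int, n_attributes: int) -> tuple[float, float]:
    """Approximate width and height of one table cell in degrees.

    Assumption (not stated in the source): alternatives are columns,
    attributes are rows, and one extra column and row hold the labels.
    """
    width = usable_w_deg / (n_alternatives + 1)
    height = usable_h_deg / (n_attributes + 1)
    return width, height


if __name__ == "__main__":
    # Usable display sizes quoted in the text for monitor and mobile.
    print(cell_size_deg(37.6, 22.4, 2, 2))  # monitor, 2x2 matrix
    print(cell_size_deg(11.2, 6.4, 3, 3))   # mobile, 3x3 matrix
```

Under this assumption, a mobile-sized 3×3 cell is less than a quarter of the width of a monitor-sized 2×2 cell, which is the geometric reason ROI resolvability becomes a concern on small displays.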

Calibration procedures: EyeLink used a standard 13-point calibration with verification and drift checks; accuracy threshold <0.5 deg visual angle. WebGazer used an 8-point edge sequence (1 s per point) with a central fixation validation; default success thresholds were applied; all participants passed. EyeLink was calibrated first, then WebGazer; both recalibrated before each display-size task.

Design generation: Exp. 1 used a D-efficient design (Ngene), balancing attribute levels and avoiding dominance across 12 trials. Exp. 2 used an optimal fractional factorial design (R choiceDes) to achieve balanced, low-collinearity 12-trial sets.

Gaze data processing and ROIs: ROIs were rectangles matching visible table cells for alternatives and attributes, defined in stimulus coordinates and scaled with display size. Two ROI families per experiment: alternatives (2 in Exp. 1; 3 in Exp. 2) and attributes (Quality/Cost for Exp. 1; Time/Comfort/Cost for Exp. 2). EyeLink fixations used the device’s native parser; WebGazer streams were parsed offline to yield comparable fixation events.
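
To make the ROI definition concrete, the sketch below shows one way rectangular ROIs defined in stimulus coordinates could be scaled with display size and used to label a fixation. It is a minimal illustration under assumptions: the `ROI` type, the function names, the scaling about the origin, and the pixel values are all hypothetical and are not taken from the paper.

```python
from typing import NamedTuple, Optional, Sequence


class ROI(NamedTuple):
    """Axis-aligned rectangle in stimulus coordinates (hypothetical layout)."""
    name: str
    left: float
    top: float
    width: float
    height: float


def scale_roi(roi: ROI, factor: float) -> ROI:
    """Scale an ROI about the stimulus origin (0, 0) by a display-size factor."""
    return ROI(roi.name, roi.left * factor, roi.top * factor,
               roi.width * factor, roi.height * factor)


def assign_roi(x: float, y: float, rois: Sequence[ROI]) -> Optional[str]:
    """Return the name of the first ROI containing the point, or None."""
    for roi in rois:
        inside_x = roi.left <= x < roi.left + roi.width
        inside_y = roi.top <= y < roi.top + roi.height
        if inside_x and inside_y:
            return roi.name
    return None


if __name__ == "__main__":
    # Two side-by-side alternatives, as in the 2x2 task (coordinates made up).
    rois = [ROI("pizza1", 0, 0, 100, 200), ROI("pizza2", 100, 0, 100, 200)]
    half_size = [scale_roi(r, 0.5) for r in rois]
    # The same gaze point lands in a different ROI once the layout shrinks.
    print(assign_roi(60, 50, rois), assign_roi(60, 50, half_size))
```

The example also hints at why small displays are harder: when ROIs shrink, a fixed amount of gaze error is more likely to cross an ROI boundary.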

Gaze metrics: For each display condition, two outcomes were averaged across trials: (i) fixation count per ROI, reported as a proportion of all fixations in the trial (count_ROI/count_trial), and (ii) mean relative fixation duration per ROI (the ROI's mean fixation duration divided by the trial's mean fixation duration). Trials with zero fixations to an ROI were treated as missing for that ROI in the duration analyses.
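
The two per-trial metrics can be sketched as follows. This is an illustrative reading of the definitions above, not the authors' pipeline. It assumes that fixations falling outside every ROI still count toward the trial totals, and it represents a missing duration as `None`. The function name and the fixation values are hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence


def trial_metrics(
    fixations: Sequence[tuple[Optional[str], float]],
    roi_names: Sequence[str],
) -> dict[str, tuple[Optional[float], Optional[float]]]:
    """Compute (fixation proportion, relative fixation duration) per ROI.

    `fixations` holds (roi_name_or_None, duration) pairs for one trial.
    An ROI with zero fixations gets a relative duration of None (missing).
    """
    n_total = len(fixations)
    trial_mean = mean(d for _, d in fixations) if fixations else None
    results = {}
    for name in roi_names:
        durations = [d for roi, d in fixations if roi == name]
        proportion = len(durations) / n_total if n_total else None
        relative = mean(durations) / trial_mean if durations else None
        results[name] = (proportion, relative)
    return results


if __name__ == "__main__":
    # One made-up trial: two fixations on pizza1, one on pizza2, one off-ROI.
    trial = [("pizza1", 0.5), ("pizza1", 1.0), ("pizza2", 0.25), (None, 0.25)]
    for roi, values in trial_metrics(trial, ["pizza1", "pizza2"]).items():
        print(roi, values)
```

Dividing by the trial's own mean makes durations comparable across trials and participants, so a relative duration near 1 means fixations on that ROI were about as long as the trial average.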

Inclusion criteria for analyses: All completed trials with valid gaze were retained. For fixation-duration ANOVAs, participants with no fixations to a given ROI in any display condition were excluded from that specific analysis (Exp. 1: excluded 3 participants for alternatives and 3 for attributes; Exp. 2: excluded 10 for alternatives and 5 for attributes). As a robustness check, trial-level linear mixed-effects models (lmer) treating missing ROI-trials as missing reproduced ANOVA results.

Statistical analyses: For each experiment, two repeated-measures ANOVAs were run (one for alternatives, one for attributes), with within-subject factors: method (EyeLink, WebGazer), display (monitor, laptop, tablet, mobile), and ROI identity (levels depending on experiment). Significant effects were followed by Holm-corrected paired t-tests on marginal means; direction of effects was reported. Bootstrapped within-subject CIs (1,000 iterations) are shown in figures.
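
For readers unfamiliar with the Holm correction mentioned above, the sketch below implements the standard step-down adjustment in plain Python. The summary does not say which software the authors used for this step, so treat it as a generic illustration; the function name and the example p-values are hypothetical.

```python
from typing import Sequence


def holm_adjust(p_values: Sequence[float]) -> list[float]:
    """Return Holm-adjusted p-values in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    adjusted values are capped at 1 and forced to be non-decreasing.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Three made-up pairwise comparisons.
    print(holm_adjust([0.01, 0.04, 0.03]))
```

A comparison is declared significant when its adjusted p-value falls below the chosen alpha level, which controls the family-wise error rate across the set of post hoc tests.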

Behavioral modeling: The authors estimated preferences using an Integrated Choice and Latent Variable (ICLV) approach, specifically a Latent Information Processing (LIP) model that links visual attention to attribute weights. Baseline choice followed a Random Utility Maximization (MNL) structure with utility as a linear function of attributes. In LIP, attribute coefficients are adjusted by a latent attention factor: tilde_beta_k = beta_k + alpha_k * IP_n, where IP_n is a latent variable linked to fixation-based indicators via measurement equations. Estimation used maximum simulated likelihood with the Apollo package in R. Posterior distributions for WTP (Exp. 1: beta_quality/beta_cost) and SVTTS (Exp. 2: beta_time/beta_cost) were compared across methods and display sizes using t-tests, Kolmogorov–Smirnov tests, and distributional overlap metrics. Posterior distributions for the latent contribution (alpha_k * IP_n) were also compared.
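
The structure of the LIP adjustment can be illustrated with a few lines of Python. This is only a sketch of the formulas as summarized above, not the authors' Apollo estimation code: it treats the latent variable IP_n as a known number rather than estimating it, and every coefficient value is made up. The ratio follows the summary's definition (attribute coefficient divided by cost coefficient); the sign convention is not specified in the summary.

```python
import math
from typing import Sequence


def adjusted_coefficient(beta_k: float, alpha_k: float, ip_n: float) -> float:
    """Attention-adjusted coefficient: tilde_beta_k = beta_k + alpha_k * IP_n."""
    return beta_k + alpha_k * ip_n


def mnl_probabilities(utilities: Sequence[float]) -> list[float]:
    """Multinomial logit choice probabilities (softmax over utilities)."""
    highest = max(utilities)  # subtract the maximum for numerical stability
    exp_u = [math.exp(u - highest) for u in utilities]
    total = sum(exp_u)
    return [e / total for e in exp_u]


def coefficient_ratio(beta_attribute: float, beta_cost: float) -> float:
    """Ratio such as beta_quality / beta_cost (WTP) or beta_time / beta_cost (SVTTS)."""
    return beta_attribute / beta_cost


if __name__ == "__main__":
    ip_n = 0.8  # hypothetical latent attention score for one participant
    b_quality = adjusted_coefficient(0.60, 0.10, ip_n)
    b_cost = adjusted_coefficient(-0.25, -0.05, ip_n)
    # Two hypothetical pizzas described by (quality, cost in SGD).
    pizzas = [(7, 11), (5, 7)]
    utilities = [b_quality * q + b_cost * c for q, c in pizzas]
    print(mnl_probabilities(utilities))
    print(coefficient_ratio(b_quality, b_cost))
```

Because WTP and SVTTS are ratios of coefficients, noise in the gaze indicators that shifts either adjusted coefficient propagates directly into these derived quantities.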

Key Findings

Experiment 1 (pizza; 2 alternatives × 2 attributes):

  • Fixation counts, alternatives: Significant main effect of alternative, F(1,39)=175.30, p<.001; participants fixated more on pizza1 (M=0.299, SE=0.005) than pizza2 (M=0.186, SE=0.004). Interaction alternative×display, F(3,117)=18.41, p<.001. Post hocs: EyeLink showed consistent pizza1>pizza2 across all displays; WebGazer showed no significant difference in the mobile condition.
  • Fixation counts, attributes: No main effect of attribute, F(1,39)=0.03, p=.873; no interactions with method or display.
  • Fixation durations, alternatives: Main effect of method only, F(1,36)=39.62, p<.001; WebGazer slightly longer (M=1.002 s, SE=0.002) than EyeLink (M=0.968 s, SE=0.004); no other significant effects.
  • Fixation durations, attributes: Main effect of method, F(1,36)=40.31, p<.001; WebGazer longer; no other significant effects (all ps>.1).

Model-based results (Exp. 1):

  • Posterior distributions of willingness to pay for quality (QWTP) overlapped substantially between methods across all display sizes (overlap ≥ .59; monitor overlap .90), with non-significant KS tests (all ps>.12) and non-significant mean comparisons (all ps>.19). Example posterior means (95% CI), WebGazer vs EyeLink: monitor 2.82 (2.55,3.10) vs 3.19 (2.50,3.87); laptop 3.34 (2.70,3.97) vs 2.68 (1.94,3.42); tablet 2.35 (2.19,2.51) vs 2.30 (2.19,2.41); mobile 2.31 (2.24,2.37) vs 2.31 (2.19,2.43).
  • Latent contribution (alpha_k * IP_n) distributions were largely consistent across displays (overlaps > .64), except for the cost attribute on tablet (KS p=.019; overlap .507).
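
The two distributional comparisons reported here can be sketched in plain Python. The summary does not say how the overlap metric was computed, so the histogram-based overlap coefficient below is an assumption chosen for illustration, as are the function names, the bin count, and the sample values. The KS function returns only the two-sample statistic (the maximum gap between empirical CDFs), not a p-value.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    largest_gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def histogram_overlap(a: Sequence[float], b: Sequence[float],
                      n_bins: int = 20) -> float:
    """Overlap coefficient from shared-bin histograms (0 = disjoint, 1 = identical)."""
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    if hi == lo:
        return 1.0  # all values identical, so the samples coincide
    width = (hi - lo) / n_bins

    def proportions(sample: Sequence[float]) -> list[float]:
        counts = [0] * n_bins
        for value in sample:
            index = min(int((value - lo) / width), n_bins - 1)
            counts[index] += 1
        return [c / len(sample) for c in counts]

    return sum(min(pa, pb) for pa, pb in zip(proportions(a), proportions(b)))


if __name__ == "__main__":
    # Made-up posterior draws for two methods.
    webgazer = [2.2, 2.3, 2.3, 2.4, 2.5]
    eyelink = [2.1, 2.3, 2.4, 2.4, 2.6]
    print(ks_statistic(webgazer, eyelink), histogram_overlap(webgazer, eyelink))
```

Read together, a small KS statistic and an overlap near 1 indicate that the two methods yield nearly the same posterior distribution, while a large KS statistic and an overlap near 0 indicate divergence.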

Experiment 2 (transport; 3 alternatives × 3 attributes):

  • Fixation counts, alternatives: Main effect of alternative, F(2,78)=111.167, p<.001 (Metro > Bus > Grab overall); interaction alternative×display, F(6,234)=4.217, p<.001; three-way method×alternative×display, F(6,234)=8.747, p<.001. Metro attracted more fixations than Bus and Grab on monitor; differences flattened on smaller displays, with WebGazer reliability reduced especially on mobile.
  • Fixation counts, attributes: Main effect of attribute, F(2,78)=8.25, p<.001; Time (M=0.113, SE=0.002) > Comfort (M=0.097, SE=0.002; t(39)=3.80, p<.001) and > Cost (M=0.106, SE=0.002; t(39)=2.80, p=.008); Comfort vs Cost not significant (t(39)=1.41, p=.168). No significant method or display interactions.
  • Fixation durations, alternatives: Main effects of alternative, F(2,58)=10.51, p<.001, and method, F(1,29)=21.76, p<.001; interaction alternative×method, F(2,58)=10.38, p<.001. EyeLink: Metro > Bus/Grab (e.g., Metro M=1.006 s vs Bus M=0.960 s, t(39)=5.301, p<.001; vs Grab M=0.978 s, t(39)=3.487, p=.001). WebGazer did not show these differences. No display effects.
  • Fixation durations, attributes: Main effect of method only, F(1,34)=32.08, p<.001; WebGazer longer; no other significant effects.

Model-based results (Exp. 2):

  • SVTTS consistent only on monitor: WebGazer 0.50 (0.40,0.60) vs EyeLink 0.46 (0.36,0.56); t=0.53, p=.60; KS=.17, p=.71; overlap .94.
  • Divergence on smaller displays: laptop 0.47 (0.36,0.59) vs 0.69 (0.65,0.73), t=-3.56, p=.001; KS=.67, p<.001; overlap .34. Tablet 0.73 (0.70,0.76) vs 0.59 (0.40,0.78), t=1.39, p=.17; KS=.78, p<.001; overlap .18. Mobile 0.78 (0.77,0.80) vs 0.47 (0.38,0.55), t=7.51, p<.001; KS=.86, p<.001; overlap .14.
  • Latent contribution distributions showed significant differences for several attributes on laptop/tablet/mobile, with lowest overlaps for cost (e.g., tablet cost overlap .043; mobile cost .064; KS p<.001), while monitor condition showed consistency for cost and time (non-significant KS; overlaps .855 and .726) but not comfort (KS p=.001; overlap .25).

Overall patterns:

  • WebGazer matched EyeLink on larger displays (especially monitor; also laptop/tablet for simple tasks) for alternative-based fixation counts and for model-based parameters in the simple task. Reliability degraded on smaller displays and with more complex matrices, particularly for SVTTS and latent-variable contributions.
  • Across experiments, WebGazer produced slightly longer fixation durations than EyeLink, suggesting systematic method-related temporal differences.

Discussion

Findings indicate that webcam-based eye-tracking can reproduce key patterns of visual attention in discrete choice contexts when stimuli subtend larger visual angles (monitor/laptop/tablet) and the task is simple (few ROIs, wider spacing). Alternative-based attention effects (e.g., more fixations and longer durations to salient options) were robustly detected by EyeLink across displays and by WebGazer on larger displays, but weakened for WebGazer on mobile-sized displays where ROIs are smaller and closer. Attribute-based fixation counts showed uneven attention in the complex task (time > comfort/cost), captured similarly by both methods, whereas fixation durations were less diagnostic overall and more sensitive to method differences, with WebGazer yielding slightly longer durations.

At the modeling level, willingness to pay for pizza quality (QWTP) agreed across methods for all displays, suggesting that for simple DCEs, WebGazer-based attention measures are sufficient to support stable parameter inference. In contrast, subjective value of travel time savings (SVTTS) converged only in the monitor condition for the complex task; for smaller displays, posterior distributions diverged and overlaps dropped sharply, implying that web-based gaze quality and ROI resolvability significantly affect identification of attention-modulated parameters when tasks are complex.

These results extend prior validations of WebGazer in perceptual/cognitive paradigms by demonstrating boundary conditions in decision-making: device size and task complexity jointly determine the fidelity of attention measures and downstream behavioral inferences. The loss of precision with small ROIs and lower temporal/spatial resolution likely leads to fixation misassignments and merged short fixations, particularly harming latent-variable estimation in complex layouts. Overall, fixation count emerged as a more reliable indicator of decision salience than fixation duration across methods.

The work provides practical guidance: use larger displays and simpler matrices for web-based eye-tracking in DCEs; anticipate reduced reliability on mobile-like displays, especially for complex designs and parameter-rich models linking attention to choice.

Conclusion

The study provides the first systematic comparison of in-lab (EyeLink 1000 Plus) and web-based (WebGazer) eye-tracking for decision-making across multiple display-size conditions and task complexities. WebGazer closely matches EyeLink on larger displays for simple DCEs at both the gaze-metric and model levels (QWTP), supporting its use for scalable online behavioral studies. Reliability decreases with smaller displays and higher task complexity, where parameter estimates (e.g., SVTTS) and latent-variable contributions diverge between methods. The authors recommend employing larger screens, adequately sized and spaced ROIs, and attention to calibration to ensure data quality. Future research should test enhanced calibration routines, individualized gaze correction, and improved fixation parsing to boost spatial and temporal precision on small displays, and extend validation to additional decision paradigms and diverse populations.

Limitations

  • Device manipulation used scaled stimuli and viewing distances to mimic monitors/laptops/tablets/mobiles; actual physical devices were not tested.
  • No chin rest was used to mimic natural webcam conditions; increased head movement may introduce noise, particularly affecting WebGazer.
  • ROI proximity on small/mobile-sized displays and the complex 3×3 layout likely increased fixation misassignment risk for webcam data due to lower spatial precision and coarse temporal sampling.
  • Some participants were excluded from fixation-duration analyses due to lack of fixations in specific ROIs, potentially reducing power for those tests.
  • The sample comprised NUS students/staff, which may limit generalizability.
  • No practice or catch trials were included, which could affect participant acclimation/attention.
  • Left–right alternative positioning was fixed within experiments; potential positional or reading-direction biases may have influenced alternative-based attention.