Psychology
Eye movements track prioritized auditory features in selective attention to natural speech
Q. Gehmacher, J. Schubert, et al.
The study addresses how eye movements interact with auditory selective attention during naturalistic speech listening. Prior work shows an overlapping neural network for attention and oculomotor control, and interactions between eye movements and auditory cognition in localization, spatial discrimination, and attention. Yet, it is unknown whether eye movements continuously track acoustic features of ongoing speech in natural scenes, whether such ocular tracking relates to adaptive behavior (e.g., intelligibility and effort), and whether eye movements contribute to neural markers of selective speech processing. The authors hypothesized that eye movements would track prioritized (attended) acoustic features of speech (envelope and acoustic onsets), that this tracking would be sensitive to selective attention even without spatial cues, that it would relate to intelligibility, and that oculomotor activity would share variance with neural speech-tracking responses.
Background work supports a joint attention–oculomotor network and cross-modal interactions: midbrain circuits involving inferior and superior colliculus integrate auditory and visual information; human studies link eye movements with auditory spatial discrimination and attention; gaze direction can enhance auditory object tracking and cocktail-party performance. Blink-related eye activity aligns with higher-order structures in artificial speech, suggesting global neural entrainment engaging motor systems in speech perception. Prior work (Popov et al.) indicates goal-driven oculomotor contributions to neural effects in auditory attention even after ocular artifact removal, motivating examination of eye movements’ contribution to neural speech tracking.
Participants: N=30 healthy German native speakers (19 female; mean age 26.27, SD 9.08), normal hearing and (corrected-to-)normal vision; ethics approved.
Design and task: Three conditions modulated selective attention. Condition 1: attend visual Gabor rotation sequence while speech (male voice) is a distractor. Condition 2: attend a single target speech stream (male voice) with visual Gabor as distractor. Condition 3: attend a target speech stream (female) in a dual-speaker mixture (female target + male distractor) with concurrent visual Gabor. Each trial: 4 s silent prestimulus with central Gabor fixation; concurrent sentence presentation and 3 Hz Gabor tilts (100 ms each) in ordered/random sequences; 1 s poststimulus. After each trial: behavioral questions on Gabor sequence type (Cond 1) or word-by-word intelligibility for attended speech (Conds 2–3). Block-wise ratings of effort and task engagement (5-point Likert). All auditory stimuli presented binaurally at phantom center.
Stimuli: 200 total sentences based on the German Oldenburger Satztest (OLSA) structure (Name–Verb–Number–Adjective–Noun). 100 male-voice sentences (original OLSA words) and 100 female-voice ‘surrogate’ sentences with different word lists to prevent memorization across conditions. Text-to-speech synthesis (IBM Watson, 44.1 kHz): male 'de-DE_DieterV3Voice' with −10% prosody rate to match female 'de-DE_ErikaV3Voice'. Mean durations: male 2.02 s (SD 0.16), female 2.22 s (SD 0.13). Analysis cropped to equalize durations in multispeaker trials. After excluding occasional noisy trials and random subsampling for consistency, 98 trials per condition were analyzed.
Data acquisition: MEG (Elekta Neuromag Triux): 102 magnetometers, 204 planar gradiometers; 10 kHz sampling; hardware filters 0.1–3300 Hz; MaxFilter SSS and head position realignment.
Eye tracking: TrackPixx3 binocular system, 2 kHz; left/right averaged; pixels converted to degrees of visual angle; blinks and saccades detected and removed by device, gaps interpolated with shape-preserving cubic Hermite interpolation.
Preprocessing: MEG resampled to 1 kHz; bandpass 0.1–40 Hz; ICA (runica) per block with 50 components; ocular and cardiac components identified and removed; epoched −1 to 4 s around speech onset; 16 ms sound delivery delay corrected. Eye data bandpass 0.1–40 Hz, resampled to 1 kHz; epochs −1 to 4 s; 16 ms correction. Gaze density maps computed with 2D Gaussian-smoothed histograms. Mean blink rate was 5.00% (SD 4.27%) and mean saccade rate 1.13% (SD 2.81%); both were removed from the eye traces and the gaps interpolated.
Predictors for TRF: Control predictors included impulse trains for visual (Gabor) onsets at 3 Hz and for trial/speech onsets. Acoustic predictors were derived from gammatone spectrograms (256 ERB-spaced bands, 20–5000 Hz; resampled to 1 kHz; magnitude^0.6): a 1-band broadband Envelope (sum across bands) and 1-band Acoustic Onsets (neural edge detection via TRF-Tools, c=30; summed across bands).
TRF modeling for ocular tracking: Forward/encoding TRFs estimated with ridge regression (mTRF-Toolbox) over lags −100 to 550 ms. Data downsampled to 50 Hz after 20 Hz anti-aliasing; control impulse trains restored at original onsets post-filtering. Leave-one-trial-out cross-validation; predictors and responses L1-normalized. Prediction accuracy: Spearman correlation between predicted and measured ocular responses, Fisher z-transformed and averaged over trials. Ocular speech tracking quantified as Δz′ = (combined model including Envelope or Onsets + controls) − (control-only model). Nested CV determined a global ridge λ=10^3 minimizing mean absolute error across subjects/conditions/channels. Horizontal and vertical channels were averaged for statistics.
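The Δz′ measure described above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic (Spearman correlation per trial, Fisher z-transform, average over trials, combined minus control), not the authors' pipeline: the function names and the toy traces are hypothetical, and the predicted traces would in practice come from cross-validated TRF models.

```python
import math

def ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def mean_fisher_z(predicted_trials, measured_trials):
    """Fisher z-transform each trial's correlation, then average over trials."""
    zs = [math.atanh(spearman(p, m))
          for p, m in zip(predicted_trials, measured_trials)]
    return sum(zs) / len(zs)

def delta_z(measured, combined_pred, control_pred):
    """Combined-model accuracy minus control-only accuracy (both in z units)."""
    return (mean_fisher_z(combined_pred, measured)
            - mean_fisher_z(control_pred, measured))

# Toy data (hypothetical): two trials of a measured eye trace and the
# predictions of a combined (speech + controls) and a control-only model.
measured = [[0.1, 0.4, 0.2, 0.8, 0.5, 0.9], [2, 1, 4, 3, 6, 5]]
combined = [[0.0, 0.3, 0.4, 0.7, 0.6, 1.0], [1, 2, 3, 4, 5, 6]]
control = [[0.5, 0.1, 0.6, 0.2, 0.9, 0.3], [6, 5, 4, 3, 2, 1]]

print(f"delta z' = {delta_z(measured, combined, control):.3f}")
```

A positive value means the speech predictor improves prediction of the eye trace beyond the control predictors. Note that `math.atanh` is undefined for a correlation of exactly ±1, so real data would need no special handling but perfectly correlated toy data would.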
Behavioral measures: Intelligibility per condition computed as proportion correct on word-by-word probes (0–100%); effort averaged per block (1–5). Engagement scores computed similarly.
Gaze behavior analysis: Cluster-based permutation tests (10,000 permutations; cluster alpha 0.05; Bonferroni-corrected) contrasted gaze density maps of the attend-speech conditions (2 and 3) with the attend-visual condition (1).
Statistics for ocular tracking and behavior: Bayesian multilevel regression (Bambi/PyMC3), Student-T likelihood, weak/non-informative priors. Dependent variables: Δz′ for Envelope or Onsets modeled as 0 + condition + (1|subject), testing each condition vs zero. A post hoc comparison contrasted target vs distractor within the multispeaker condition. Behavioral models related intelligibility (logit-transformed) or effort to centered Δz′ with condition interaction and (1|subject). Convergence was checked with diagnostics.
Mediation analysis linking eyes and neural tracking: Encoding models estimated with boosting (Eelbrain) over −100 to 550 ms, using continuous data created by concatenating trials per condition after 700 ms zero-padding; 4-fold CV with L2 error minimization. Plain effect model: neural response = TRF(c)·Envelope. Direct (residual) effect model: neural response = TRF(c′)·Envelope + TRF(b)·Eye movements. Absolute TRF weights compared via cluster-based permutation tests across 102 magnetometers (10,000 permutations; time window −50 to 500 ms; one-sided contrasts c′<c for target single, target dual, distractor dual), reporting p-values and Cohen’s d. A control analysis with time-shuffled eye data confirmed the results.
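Written out as time-lagged sums, the two encoding models of the mediation analysis might look as follows. The symbols r (neural response at one sensor), s (speech envelope), e (eye-movement signal), τ (lag, here −100 to 550 ms) and ε (residual) are notation assumed for this sketch; only c, c′ and b come from the description above.

```latex
\begin{align}
  % Plain effect model: envelope only
  r(t) &= \sum_{\tau} c(\tau)\, s(t-\tau) + \varepsilon(t) \\
  % Direct (residual) effect model: envelope plus eye movements
  r(t) &= \sum_{\tau} c'(\tau)\, s(t-\tau)
        + \sum_{\tau} b(\tau)\, e(t-\tau) + \varepsilon'(t)
\end{align}
```

The one-sided contrast then asks whether |c′(τ)| < |c(τ)|, that is, whether envelope weights shrink once eye movements are allowed to explain part of the neural response.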
- Gaze position: Participants maintained fixation on the central visual stimulus in all conditions. When attending speech (Conditions 2 and 3), gaze density shifted slightly top-right compared to attending the visual task (Condition 1). Condition 2 vs 1: positive cluster t(29)=7.72, p<0.001, d=1.41; negative cluster t(29)=-11.34, p<0.001, d=-2.07. Condition 3 vs 1: positive cluster t(29)=6.26, p<0.001, d=1.14; negative cluster t(29)=-8.69, p<0.001, d=-1.59.
- Ocular speech tracking depends on attention: Bayesian multilevel models showed evidence for tracking only for attended speech. Single-speaker target: Envelope β=0.032, 94%HDI [0.022, 0.041]; Acoustic Onsets β=0.040, 94%HDI [0.030, 0.050]. Single-speaker distractor (speech ignored in the attend-visual condition): Envelope β=0.006, 94%HDI [-0.004, 0.015]; Onsets β=0.004, 94%HDI [-0.006, 0.015]; both intervals include zero, so there is no evidence of tracking. In multi-speaker trials, tracking was stronger for the target than for the distractor: Envelope difference β=-0.007, 94%HDI [-0.013, -0.001]; Onsets difference β=-0.012, 94%HDI [-0.020, -0.005] (negative values indicate weaker tracking of the distractor).
- Temporal profiles: Envelope TRFs showed positive peaks around 0 ms followed by a rapid decrease after ~200 ms; Onset TRFs showed broader positive peaks around ~200 ms. Patterns were similar for the horizontal and vertical channels. TRF weights for ignored speech features were not meaningful.
- Behavioral relevance: Stronger ocular speech tracking predicted higher intelligibility for attended speech. Envelope Δz′ effect on intelligibility β=19.113, 94%HDI [8.859, 29.223]; Acoustic Onsets β=11.695, 94%HDI [0.5705, 23.110]. Interaction (multi- vs single-speaker) reduced the envelope–intelligibility link β=-13.224, 94%HDI [-25.212, -0.586]; no substantial interaction for Onsets β=-5.380, 94%HDI [-17.822, 6.210]. No evidence linking ocular tracking to subjective effort (Envelope β=-8.048, 94%HDI [-19.252, 3.663]; Onsets β=-4.227, 94%HDI [-15.643, 7.834]).
- Shared eye–brain contributions: Mediation-style analysis revealed that adding eye movements reduced speech-envelope encoding weights over left temporoparietal sensors: single-speaker target t(29)=-4.40, p<0.001, d=-0.80; multi-speaker target t(29)=-4.90, p<0.001, d=-0.90; multi-speaker distractor t(29)=-4.68, p<0.001, d=-0.85. A control analysis with time-shuffled eye data confirmed these results.
Overall, the results establish ocular speech tracking for attended speech, show prioritization of a target over a distractor in multi-speaker scenes, relate ocular tracking to intelligibility, and demonstrate shared contributions of oculomotor and neural activity to speech processing.
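As a quick consistency check on the effect sizes reported above: for a paired or one-sample t-test, Cohen's d is commonly obtained as d = t/√N. The summary does not state how d was computed, so that conversion is an assumption here. With N = 30, the short sketch below reproduces the reported values to within rounding.

```python
import math

N_SUBJECTS = 30  # sample size stated in the text

def cohens_d_from_t(t_value, n_subjects):
    """Paired/one-sample conversion d = t / sqrt(N) (assumed, not stated in the source)."""
    return t_value / math.sqrt(n_subjects)

# (label, reported t(29), reported d), copied from the results above
REPORTED = [
    ("gaze, condition 2 vs 1, positive cluster", 7.72, 1.41),
    ("gaze, condition 2 vs 1, negative cluster", -11.34, -2.07),
    ("gaze, condition 3 vs 1, positive cluster", 6.26, 1.14),
    ("gaze, condition 3 vs 1, negative cluster", -8.69, -1.59),
    ("mediation, single-speaker target", -4.40, -0.80),
    ("mediation, multi-speaker target", -4.90, -0.90),
    ("mediation, multi-speaker distractor", -4.68, -0.85),
]

def max_abs_deviation(rows, n_subjects):
    """Largest gap between the converted and the reported effect size."""
    return max(abs(cohens_d_from_t(t, n_subjects) - d) for _, t, d in rows)

for label, t, d in REPORTED:
    estimate = cohens_d_from_t(t, N_SUBJECTS)
    print(f"{label}: d from t = {estimate:.2f} (reported {d:.2f})")

print(f"largest deviation: {max_abs_deviation(REPORTED, N_SUBJECTS):.4f}")
```

Small remaining differences are expected because the t-values themselves are rounded to two decimals.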
The findings demonstrate that eye movements track prioritized acoustic features of speech during selective attention. This ocular speech tracking emerges without spatial cues or meaningful visual content and differentiates target from distractor in a cocktail-party-like scenario, indicating attention-specific prioritization. Its positive relationship with intelligibility supports a functional role for oculomotor engagement in optimizing speech perception. The mediation results suggest overlapping mechanisms between oculomotor control and neural speech encoding over left temporoparietal regions, emphasizing that eye movements can contribute to neural signatures of speech tracking. Potential mechanisms include: (1) learned associations linking auditory spectrotemporal patterns with visual/oculomotor systems, aiding multisensory integration and prioritization; (2) predictive processing, with early TRF peaks suggesting rapid oculomotor engagement to support temporal attention; (3) broader motor-system involvement in speech perception, potentially aligning ocular dynamics with speech rhythms; and (4) partial gaze disengagement from irrelevant visual input to free resources, although the timing- and feature-specific TRF patterns suggest that more than mere gaze aversion is involved. These interpretations align with evidence for oculomotor influences on auditory processing from midbrain through cortical levels and suggest an active sampling strategy in complex auditory scenes.
This work identifies a novel phenomenon—ocular speech tracking—where continuous eye movements track attended speech features and distinguish targets from distractors in naturalistic, multi-speaker contexts. Ocular tracking predicts intelligibility and shares variance with neural speech-tracking responses over left temporoparietal sensors. These results extend the conception of a joint attention–oculomotor network to natural speech perception and highlight the need to consider eye movements when interpreting neural auditory attention effects. Future research should use continuous speech without visual confounds to map temporal dynamics and causality, examine additional ocular metrics (blinks, pupil, micro-/saccades), explore cross-species generality, and perform source-level analyses to clarify mechanisms and functional implications in social communication.
- Task design: Short, highly structured 5-word sentences may bias temporal TRF interpretability (e.g., anticipatory effects); limited window (<500 ms) restricts insights into later dynamics where target–distractor separation may evolve.
- Visual confounds: Rotating Gabor stimulus likely confounded pupillometry; hence pupil-based effort measures were not interpretable. No eyes-closed condition to test gaze-aversion hypotheses directly.
- Ocular event scarcity: Few blinks and (micro)saccades within trials limited analyses of specific eye-movement types; eye data relied on interpolation after blink/saccade removal.
- Non-spatial paradigm: Speech presented at phantom center without spatial cues; generalization to spatially distributed sources requires testing (though proposed for future work).
- Mediation analysis is exploratory: Sensor-level effects do not establish causality or directionality; further source-level, continuous-speech studies are needed. Although ICA removed ocular components in MEG, residual shared variance with ocular dynamics remains.
- Minor data handling: Some trials excluded due to hardware buffer noise; durations between voices differed before analysis cropping; single global ridge parameter used for all TRF models.
- Subjective effort measure may lack sensitivity; no relationship with ocular tracking was found, potentially due to scale limitations.