
Rapid learning of a phonemic discrimination in the first hours of life
Y. J. Wu, X. Hou, et al.
This study by Yan Jing Wu and colleagues shows that neonates can rapidly tune their brain responses to discriminate phonemes, with measurable neuroplastic changes after just five hours of postnatal exposure to vowel sounds.
Introduction
Human neonates are highly sensitive to speech from birth, preferring speech to non-speech and their mother’s voice to other voices. A central question concerns phoneme discrimination, a foundational ability for speech perception development. Although neonates can often discriminate phonemes across languages at birth, likely reflecting prenatal auditory learning, the extent of immediate postnatal plasticity and the neural dynamics supporting rapid phonological learning remain unclear. Prior work suggests that prenatal exposure shapes vowel perception, but whether newborns can rapidly acquire sensitivity to subtle phonological contrasts that are unlikely to be learned in utero is not well understood. This study asks whether, within hours after birth, neonates can learn to discriminate natural (forward) from temporally reversed (backward) vowels, and uses functional near-infrared spectroscopy (fNIRS) to map the neuroanatomical dynamics and connectivity changes underpinning such rapid learning. The authors hypothesized that all groups might initially differentiate forward from backward vowels because of prenatal exposure to natural vowels, but that only neonates trained on this specific forward–backward contrast would show enhanced, vowel-specific sensitivity after training, engaging superior temporal and inferior frontal regions in particular, possibly with consolidation effects and increased resting-state functional connectivity.
Literature Review
The literature indicates that neonates prefer speech over complex non-linguistic sounds and can discriminate phonemes at birth, consistent with a fetal auditory system that functions from around 24 weeks of gestation and with prenatal learning shaping speech representations. Newborns show increased sucking in response to non-native vowels, and electrophysiological indices such as the mismatch negativity (MMN) reflect phonetic processing. Cheour et al. demonstrated that 2.5–5 h of exposure can enhance MMN responses to vowel contrasts, with effects that persisted and generalized, suggesting that short-term learning ex utero is possible. Neuroimaging work in neonates and infants implicates superior temporal (ST) regions in phonological processing and inferior frontal (IF) regions (Broca’s area) in speech-related functions. Sleep is known to facilitate memory consolidation, including phonetic learning. However, the neural mechanisms and temporal dynamics of ultra-rapid phonological learning immediately after birth remain insufficiently characterized, particularly for subtle contrasts such as forward vs backward vowels that are unlikely to be learned in utero.
Methodology
Design and participants: Seventy-five healthy full-term neonates (38 boys; gestational age 38–41 weeks, mean 39.0 ± 0.7) were randomly assigned within 1–3 h of birth (mean 2.1 ± 0.4 h) to an experimental group (n=25), an active control group (n=25), or a passive control group (n=25). Inclusion criteria were normal birth weight, no clinical symptoms, no sedation, normal hearing (otoacoustic emissions), Apgar scores >8 at 1 and 5 min, and no neurological abnormalities within the first 6 months. Neonates who cried for more than 2 min during recording were excluded. The final datasets comprised experimental n=22 (11 boys), active control n=23 (12 boys), and passive control n=21 (10 boys). Ethical approval was obtained, and parents gave consent.
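The group sizes above can be cross-checked with a small Python sketch. The summary does not say how randomization was implemented, so `assign_balanced` is a hypothetical helper that shuffles equal-sized group labels; the retention proportions are computed directly from the reported enrolled and analyzed counts.

```python
import random
from collections import Counter

GROUPS = ("experimental", "active_control", "passive_control")

def assign_balanced(n_infants, groups=GROUPS, seed=0):
    """Hypothetical balanced randomization: equal-sized groups in shuffled order."""
    if n_infants % len(groups):
        raise ValueError("n_infants must divide evenly across groups")
    labels = [g for g in groups for _ in range(n_infants // len(groups))]
    random.Random(seed).shuffle(labels)
    return labels

# Enrolled vs. analyzed sample sizes as reported.
ENROLLED = {"experimental": 25, "active_control": 25, "passive_control": 25}
ANALYZED = {"experimental": 22, "active_control": 23, "passive_control": 21}

def retention(enrolled, analyzed):
    """Proportion of each group retained after exclusions."""
    return {g: analyzed[g] / enrolled[g] for g in enrolled}

if __name__ == "__main__":
    print(Counter(assign_balanced(75)))  # 25 labels per group
    print(retention(ENROLLED, ANALYZED))
    print(sum(ANALYZED.values()), "of", sum(ENROLLED.values()), "retained")
```

The retained fractions work out to 88%, 92%, and 84% for the experimental, active control, and passive control groups (66 of 75 overall).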
Stimuli: Six Mandarin Chinese vowels common across languages (/a:/, /ɔ:/, /i:/, /u:/, /ə:/, /æ/) were recorded by an adult female native speaker (Peking dialect). A single token of each vowel was edited to 1 s without compression and had minimal prosodic variation. Backward versions were created by temporal reversal. Adult raters recognized forward vowels far more accurately (98.3%) than backward vowels (73.2%) and gave uniformly low prosody ratings (<1.2), confirming minimal prosody.
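A minimal Python sketch of how forward and backward vowel strings could be assembled from 1-s tokens. The recordings are not available here, so `placeholder_token` synthesizes a harmonic tone with a decaying envelope as a hypothetical stand-in, and the audio sampling rate `FS_AUDIO` is an assumed value. The summary also does not say whether backward strings reverse each vowel or the whole 6-s string; this sketch assumes each token is reversed individually.

```python
import numpy as np

FS_AUDIO = 22_050  # hypothetical audio sampling rate; not reported

def placeholder_token(f0, dur_s=1.0, fs=FS_AUDIO):
    """Stand-in for a recorded 1-s vowel: three harmonics under a decaying envelope.

    The real stimuli were natural recordings; this synthetic token only makes the
    effect of time reversal visible (a decaying envelope becomes a rising one).
    """
    t = np.arange(int(round(dur_s * fs))) / fs
    envelope = np.exp(-3.0 * t)
    return envelope * sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 4))

def make_string(tokens, backward=False):
    """Concatenate tokens into one string, time-reversing each token if backward=True."""
    return np.concatenate([tok[::-1] if backward else tok for tok in tokens])

# Six hypothetical 1-s tokens -> one 6-s forward string and one 6-s backward string.
tokens = [placeholder_token(f0) for f0 in (200, 210, 220, 230, 240, 250)]
fwd = make_string(tokens)
bwd = make_string(tokens, backward=True)
print(fwd.size / FS_AUDIO, bwd.size / FS_AUDIO)  # 6.0 6.0
```

Time reversal leaves each token's overall magnitude spectrum unchanged but flips its temporal envelope, which is what makes forward and backward strings spectrally matched yet temporally distinct.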
Procedure and training: fNIRS data were recorded at baseline (T0, 8 min), immediately after a 5 h training or rest period (T1, 8 min), and again after a 2 h consolidation period (T2, 8 min). During baseline and tests, 12 forward vowel strings and 12 backward vowel strings (each string = 6 concatenated vowels, 6 s) were presented in random order with 12–16 s silent inter-trial intervals. Experimental group training: alternating 10-min blocks of forward and backward strings using three vowels (/a:/, /ɔ:/, /i:/), repeated to total 5 h (15 forward + 15 backward blocks, 2 s ITI within blocks, 24 s inter-block interval). Active controls received the same training structure but with different vowels (/u:/, /ə:/, /æ/) than those used at test. Passive controls received no training but were kept in the same environment. After T1, all neonates had a 2 h consolidation period (polysomnography indicated >90% sleep; no group differences).
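The reported timing can be checked with a short Python sketch. `build_session_schedule` is a hypothetical helper that shuffles 12 forward and 12 backward strings and draws each silent interval uniformly from 12–16 s (the summary gives the range, not the distribution). The bounds show that 24 trials of 6 s plus one interval each span 7.2–8.8 min, about 8 min on average, and that 30 ten-minute blocks amount to 5 h, with 75 strings per block if every 6-s string is followed by a 2-s interval.

```python
import random

def build_session_schedule(n_per_type=12, string_s=6.0, iti_s=(12.0, 16.0), seed=0):
    """Randomly ordered forward/backward trials with jittered silent intervals.

    Returns a list of (onset_s, stimulus_type) and the total session length in s.
    Uniform jitter over the reported 12-16 s range is an assumption.
    """
    rng = random.Random(seed)
    order = ["forward"] * n_per_type + ["backward"] * n_per_type
    rng.shuffle(order)
    schedule, t = [], 0.0
    for kind in order:
        schedule.append((t, kind))
        t += string_s + rng.uniform(*iti_s)
    return schedule, t

# Bounds on one 24-trial test session: each 6-s string plus one interval.
N_TRIALS = 24
shortest = N_TRIALS * (6.0 + 12.0)  # 432 s = 7.2 min
longest = N_TRIALS * (6.0 + 16.0)   # 528 s = 8.8 min
expected = N_TRIALS * (6.0 + 14.0)  # 480 s = 8.0 min at the mean interval

# Training: 15 forward + 15 backward ten-minute blocks.
N_BLOCKS = 15 + 15
training_s = N_BLOCKS * 10 * 60             # 18,000 s = 5 h of block time
strings_per_block = (10 * 60) // (6 + 2)    # 6-s string + 2-s interval -> 75

schedule, total_s = build_session_schedule()
print(f"{len(schedule)} trials, session length {total_s / 60:.1f} min")
```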
fNIRS acquisition and preprocessing: A NirSmart system (20 lasers, 16 detectors; 760 and 850 nm) with a 34-cm cap following the 10/5 system yielded 52 channels (mean source–detector distance 2.3 cm), sampled at 10 Hz. Optodes covered frontal and temporal regions bilaterally. Data were checked for saturation (none found) and for artifacts; segments with changes exceeding 20% of the dynamic range were removed (17.8% ± 10.2% of data on average). Spike jumps (>6 SD) were interpolated, and intensities were converted to optical density, band-pass filtered (0.01–0.2 Hz), and converted to Δ[HbO] and Δ[Hb]. Analyses focused on Δ[HbO]. Epochs spanned −2 to 20 s relative to stimulus onset and were baseline-corrected by subtracting the pre-stimulus mean.
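A minimal Python sketch of this preprocessing chain for one channel, using NumPy and SciPy. Several details are assumptions not given in the summary: spikes are defined on the sample-to-sample difference, the band-pass is a zero-phase third-order Butterworth filter, and both the dynamic-range segment rejection and the modified Beer-Lambert conversion to Δ[HbO]/Δ[Hb] are omitted.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 10.0  # fNIRS sampling rate (Hz), as reported

def interpolate_spikes(x, n_sd=6.0):
    """Linearly interpolate over samples that follow a jump larger than n_sd SDs.

    Defining a 'jump' on the first difference of the signal is an assumption.
    """
    x = np.asarray(x, dtype=float).copy()
    d = np.diff(x)
    bad = np.zeros(x.size, dtype=bool)
    bad[1:] = np.abs(d - d.mean()) > n_sd * d.std()
    if bad.any():
        good = ~bad
        x[bad] = np.interp(np.flatnonzero(bad), np.flatnonzero(good), x[good])
    return x

def to_optical_density(intensity):
    """Optical density change relative to the channel's mean intensity."""
    intensity = np.asarray(intensity, dtype=float)
    return -np.log(intensity / intensity.mean())

def bandpass(x, low=0.01, high=0.2, fs=FS, order=3):
    """Zero-phase Butterworth band-pass (filter family and order are assumptions)."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def preprocess(intensity):
    """Raw intensity -> spike-corrected, band-passed optical density.

    The Beer-Lambert step to Δ[HbO]/Δ[Hb] is left out: it needs extinction
    coefficients and path-length factors that the summary does not give.
    """
    return bandpass(to_optical_density(interpolate_spikes(intensity)))

def extract_epochs(signal, onsets_s, fs=FS, tmin=-2.0, tmax=20.0):
    """Cut tmin..tmax s around each onset and subtract the pre-stimulus mean.

    Assumes every epoch lies fully inside the recording.
    """
    n_pre = int(round(-tmin * fs))   # 20 samples before onset
    n_post = int(round(tmax * fs))   # 200 samples after onset
    epochs = []
    for onset in onsets_s:
        i = int(round(onset * fs))
        seg = np.asarray(signal[i - n_pre : i + n_post + 1], dtype=float)
        epochs.append(seg - seg[:n_pre].mean())
    return np.vstack(epochs)
```

Each epoch covers −2 to 20 s inclusive, that is 221 samples at 10 Hz.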
Analysis: Because haemodynamic response functions (HRFs) can vary in infants, analyses targeted (1) the mean Δ[HbO] amplitude 6–16 s after stimulus onset and (2) the peak Δ[HbO] latency on each trial. Linear mixed-effects regressions (lme4/lmerTest) modeled single-trial amplitudes and latencies with fixed effects of stimulus type (forward vs backward), participant group (Helmert contrasts: passive vs mean(active, experimental); active vs experimental), phase (Helmert contrasts: T0 vs mean(T1, T2); T1 vs T2), and their interactions, plus maximal random effects by participant and by channel (dropping random-effect correlations where needed). Best linear unbiased predictors (BLUPs) were computed for each channel.
Resting-state functional connectivity: 3-min pre-test Δ[HbO] time series (10 Hz), filtered at 0.01–0.2 Hz, were Pearson-correlated between 7 seed channels and all other channels (336 unique pairs). Seeds were selected at FDR q<0.15 from the amplitude and latency effects (channels 7, 10, 45, 2, 6, 43, 44). Correlations were Fisher z-transformed and analyzed with reduced LME models containing the group and phase contrasts but no stimulus factor.
Spatial registration: A 3D digitizer was used to map channels onto a neonatal MNI cortical atlas and AAL regions.
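The per-trial measures, contrast codes, and seed-based connectivity described above can be sketched in Python as follows. This is not the authors' code: the peak search window (the whole 0–20 s post-onset period) and the signs and scaling of the Helmert-style codes are illustrative assumptions, and the mixed-effects fits themselves (run in R with lme4/lmerTest) are not reproduced. The pair enumeration does confirm the reported count: 52 channels give 1,326 unordered pairs, of which 990 involve no seed, leaving 336 that include at least one of the 7 seeds.

```python
from itertools import combinations

import numpy as np

FS = 10.0  # Hz

def trial_metrics(epoch, fs=FS, tmin=-2.0, window=(6.0, 16.0)):
    """Mean amplitude in the 6-16 s window and latency (s) of the post-onset peak.

    `epoch` is one baseline-corrected Δ[HbO] trace starting at tmin seconds.
    Searching for the peak over the whole post-onset epoch is an assumption.
    """
    t = tmin + np.arange(epoch.size) / fs
    in_window = (t >= window[0]) & (t <= window[1])
    post = t >= 0
    peak_latency = t[post][np.argmax(epoch[post])]
    return epoch[in_window].mean(), peak_latency

# Helmert-style codes; signs and scaling are illustrative choices.
GROUP_CODES = {  # (passive vs mean(active, experimental), active vs experimental)
    "passive": (-2 / 3, 0.0),
    "active": (1 / 3, -0.5),
    "experimental": (1 / 3, 0.5),
}
PHASE_CODES = {  # (T0 vs mean(T1, T2), T1 vs T2)
    "T0": (-2 / 3, 0.0),
    "T1": (1 / 3, -0.5),
    "T2": (1 / 3, 0.5),
}
STIM_CODES = {"forward": 0.5, "backward": -0.5}

def seed_pairs(n_channels, seeds):
    """Unordered channel pairs (1-based) in which at least one channel is a seed."""
    s = set(seeds)
    return [(a, b) for a, b in combinations(range(1, n_channels + 1), 2)
            if a in s or b in s]

def fisher_z_connectivity(ts, pairs):
    """Pearson r between channel time series (columns of ts), Fisher z-transformed."""
    r = np.corrcoef(ts, rowvar=False)
    return {(a, b): np.arctanh(r[a - 1, b - 1]) for a, b in pairs}

SEEDS = (7, 10, 45, 2, 6, 43, 44)
PAIRS = seed_pairs(52, SEEDS)  # 336 pairs
```

With these codes, a positive stimulus × (active vs experimental) × (T1 vs T2) coefficient means the forward–backward advantage grows more from T1 to T2 in the experimental group than in active controls, which matches the direction of the reported amplitude effect.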
Key Findings
- Δ[HbO] mean amplitude: A significant super-additive three-way interaction among stimulus type (forward vs backward), group (active control vs experimental), and phase (T1 vs T2) showed that the forward–backward difference was larger in the experimental group than in active controls, and that this group difference was larger after consolidation (T2) than immediately after training (T1) (β=0.125 μmol l⁻¹, s.e.m.=0.058, t(86.7)=2.15, P=0.034). BLUP topography was maximal bilaterally over superior temporal (ST) and supramarginal (SM) regions and over the left inferior parietal (IP) region. Significant channels included left ST (chan. 7: β=0.679, P=0.003; chan. 10: β=0.519, P=0.008), left IP (chan. 25: β=0.554, P=0.012), left SM (chan. 19: β=0.409, P=0.032), right SM (chan. 37: β=0.452, P=0.041), and right ST (chan. 45: β=0.653, P=0.005).
- Peak Δ[HbO] latency: A significant three-way interaction among stimulus type, group (active control vs experimental), and phase (T0 vs mean(T1, T2)) indicated that, relative to active controls, the experimental group's forward–backward peak-latency difference was greater after training than at baseline (β=−0.569, s.e.m.=0.209, t(95.0)=−2.72, P=0.008), reflecting shorter latencies for forward than backward vowels post-training. Effects were maximal over inferior frontal (IF) regions bilaterally: left IF (chan. 2: β=−2.60, P=0.003; chan. 6: β=−2.96, P<0.001; chan. 16: β=−2.15, P=0.016) and right IF (chan. 43: β=−3.79, P<0.001; chan. 44: β=−2.87, P<0.001).
- Functional connectivity at rest: An interaction between group (passive vs mean(active, experimental)) and phase (T1 vs T2) revealed larger post-sleep increases in connectivity in the trained groups than in passive controls (β=0.217, s.e.m.=0.062, t(50.2)=3.48, P=0.001). Many connections over left IF, left ST, right IF, and right ST regions reached significance at an uncorrected α=0.05, including left IF–left ST links (e.g., chan. 6–7: β=0.886, P=0.031; chan. 2–7: β=1.078, P=0.007), left IF–left IP (chan. 6–25: β=1.078, P=0.004), and interhemispheric ST (chan. 7–45: β=0.905, P=0.041).
- Baseline (T0): The forward–backward contrast elicited minimal differences in all groups, suggesting that these subtle stimuli were not discriminated before exposure. Post-training (T1): the experimental group showed shorter peak latencies for forward than backward vowels. Post-consolidation (T2): the experimental group showed larger mean Δ[HbO] amplitudes for forward than backward vowels, particularly over bilateral ST/SM and left IP.
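As a consistency check on the statistics above, a short Python sketch recomputes β/s.e.m. and the two-sided P implied by each reported t and its degrees of freedom (fractional, so presumably Satterthwaite-approximated, the lmerTest default), using SciPy's t distribution. Small discrepancies are expected because β and s.e.m. are reported rounded.

```python
from scipy import stats

# Reported effects: (label, beta, s.e.m., reported t, df, reported P)
EFFECTS = [
    ("amplitude: stim x (active vs exp) x (T1 vs T2)", 0.125, 0.058, 2.15, 86.7, 0.034),
    ("latency: stim x (active vs exp) x (T0 vs T1/T2)", -0.569, 0.209, -2.72, 95.0, 0.008),
    ("connectivity: (passive vs trained) x (T1 vs T2)", 0.217, 0.062, 3.48, 50.2, 0.001),
]

def check(beta, se, t_rep, df):
    """Return beta/se and the two-sided P implied by the reported t and df."""
    t_ratio = beta / se
    p_two_sided = 2 * stats.t.sf(abs(t_rep), df)
    return t_ratio, p_two_sided

for label, beta, se, t_rep, df, p_rep in EFFECTS:
    t_ratio, p = check(beta, se, t_rep, df)
    print(f"{label}: beta/se = {t_ratio:.2f} (reported {t_rep}), "
          f"P = {p:.4f} (reported {p_rep})")
```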
Discussion
The study demonstrates that, within hours after birth, neonates can rapidly acquire sensitivity to a subtle phonological contrast (natural, forward vowels versus reversed, backward vowels) when exposed to that specific contrast. The key three-way interactions indicate that this learning was specific to the trained vowel set: although both the experimental and active control groups received extensive vowel exposure, only the experimental group (trained on the same vowels later tested) showed faster haemodynamic responses (shorter peak latencies) to forward than to backward vowels immediately after training, and larger response amplitudes after a consolidation period. Topographically, amplitude effects localized to bilateral superior temporal and supramarginal regions and left inferior parietal cortex, areas implicated in phonological processing, speech segmentation, and aspects of language comprehension. Latency effects localized to bilateral inferior frontal regions, consistent with engagement of a nascent dorsal speech stream and sensorimotor speech circuitry (Broca’s area is involved in speech production and monitoring). Minimal forward–backward differences at baseline suggest that this contrast was not available prenatally, which fits both the low-prosody, single-token stimuli and the improbability of learning temporally reversed speech in utero. That latency changes appeared immediately while amplitude gains emerged only after the rest period suggests rapid gains in processing efficiency followed by consolidation-dependent increases in response magnitude, consistent with sleep-mediated memory consolidation. Post-sleep increases in resting-state connectivity between IF and ST/IP regions in the trained groups further support consolidation-driven strengthening of a speech-related sensorimotor loop. Overall, the findings delineate an early-developing network, encompassing IF, ST, SM, and left IP regions, that supports ultra-rapid phonological tuning and may lay the foundation for later imitation-based language learning.
Conclusion
This work shows that human neonates exhibit ultra-fast tuning to natural phonemes within the first day of life: after 5 h of targeted exposure, they discriminate forward vs backward vowels faster and, after a subsequent 2 h rest, with greater neural response amplitudes. The learning is network-specific, engaging inferior frontal, superior temporal, supramarginal, and left inferior parietal regions, and is accompanied by consolidation-related increases in resting-state functional connectivity between key speech regions. The study provides direct evidence that a putative sensorimotor speech network is operational at birth and can be rapidly shaped by experience. Future research should track how this early network develops (e.g., lateralization trajectories), supports sensorimotor learning and perceptual narrowing, and whether its dynamics can serve as early biomarkers for neurodevelopmental risk (e.g., delayed cooing/babbling in ASD or ADHD).
Limitations
- Stimuli used a single token per vowel, potentially limiting generalizability to naturalistic variability; minimal prosodic variation may reduce comparability to studies using richer speech.
- Baseline discrimination was minimal, possibly reflecting the subtlety of the forward–backward contrast; results may depend on stimulus set and acoustic properties.
- Data distribution was assumed normal but not formally tested; because infant HRFs vary, amplitude and latency metrics were used instead of a GLM based on an assumed HRF shape, which may limit comparability with standard fNIRS analyses.
- Functional connectivity analyses were restricted to seed channels selected by an FDR threshold (q<0.15), and multiple comparisons for connectivity pairs were not fully corrected when reporting illustrative pairs.
- No a priori power analysis; sample sizes were based on precedent studies.
- The active control group trained on different vowels, so generalization across vowel categories was not assessed.
- fNIRS spatial resolution and channel-to-region mapping in neonates, while standardized, may involve localization uncertainty.
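The FDR step used for seed selection, and the fuller correction that the connectivity pairs did not receive, can be illustrated with the Benjamini-Hochberg procedure. The summary does not name the FDR method, so treating it as Benjamini-Hochberg is an assumption, and the P values below are hypothetical rather than the study's data. Applying the same function across all 336 seed pairs would be one way to correct the connectivity tests fully.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.15):
    """Benjamini-Hochberg step-up procedure; returns a boolean 'discovery' mask."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passes = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.flatnonzero(passes).max()  # largest rank that meets its threshold
        reject[order[: k + 1]] = True     # every hypothesis up to that rank is a discovery
    return reject

# Hypothetical P values (not from the study).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example_p, q=0.15).sum())  # 7
print(benjamini_hochberg(example_p, q=0.05).sum())  # 2
```

The lenient q<0.15 admits seven of these ten hypothetical tests, versus two at the conventional q<0.05, which shows how strongly the threshold choice shapes seed selection.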