Human cortical encoding of pitch in tonal and non-tonal languages
Y. Li, C. Tang, et al.
The study investigates how the human auditory cortex encodes pitch information that serves different linguistic roles across languages. In tonal languages like Mandarin, pitch contours (lexical tones) distinguish word meanings, whereas in non-tonal languages like English, pitch conveys prosodic intonation. Prior work localized tone processing to the superior temporal gyrus (STG), but the nature of the encoded representation remained unclear: it could be absolute fundamental frequency (F0), speaker-normalized pitch (relative height and change), or abstract lexical tone categories. The authors hypothesize that STG encodes high-order, speaker-normalized pitch cues that are general to auditory processing, with potential language-specific tuning shaped by linguistic experience. Using a cross-linguistic electrocorticography (ECoG) paradigm in which native Mandarin and native English monolinguals listened to both Mandarin and English speech, they aim to disentangle language-general pitch encoding from language-specific categorical representations.
Neuroimaging consistently implicates bilateral non-primary auditory cortex (STG) in lexical tone processing and phonological feature encoding. Behavioral and electrophysiological studies show that listeners normalize pitch across speakers and rely on relative pitch height and pitch change for tone perception. Debate persists over whether higher-order auditory cortex encodes absolute pitch, speaker-normalized pitch, or categorical tone identity. Hemispheric asymmetries have also been reported: tone processing tends to be left-lateralized in tonal language speakers, whereas prosody in non-tonal languages shows rightward biases and often engages right-hemisphere networks. However, the precise stimulus–response mapping in STG for tones, and how language experience shapes that encoding, remain unresolved.
Participants: 15 monolingual adults (7 male, 8 female; 31–55 years; all right-handed): 11 native Mandarin speakers (brain tumor patients undergoing awake language mapping; 7 left, 4 right hemisphere coverage) and 4 native English speakers (epilepsy patients with left-hemisphere grids). All provided informed consent under UCSF and Fudan IRB approvals.
Neural recording: High-density ECoG grids (Integra or PMT; 4 mm spacing; 1.17 mm contacts), 128 or 256 channels as clinically indicated. Signals recorded at 3052 Hz (TDT), notch filtered (50/100/150 Hz in Mandarin group; 60/120/180 Hz in English group), re-referenced to common average, artifacts removed. High-gamma (70–150 Hz) analytic amplitudes (Hilbert across 8 Gaussian bands), downsampled to 100 Hz, z-scored per ~5 min block.
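The high-gamma step can be illustrated with a minimal sketch. It assumes eight Gaussian band-pass filters applied in the frequency domain, with log-spaced centers between 70 and 150 Hz and a bandwidth proportional to the center frequency; the summary does not state the spacing or the bandwidths, so `centers_hz` and `rel_bw` are illustrative choices, and notch filtering, re-referencing, downsampling, and z-scoring are left out.

```python
import numpy as np

def high_gamma_amplitude(x, fs, centers_hz=None, rel_bw=0.1):
    """Mean analytic amplitude across Gaussian bands (a sketch, not the authors' code)."""
    n = len(x)
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    spectrum = np.fft.fft(x)
    if centers_hz is None:
        # Assumption: 8 log-spaced band centers spanning 70-150 Hz.
        centers_hz = np.logspace(np.log10(70.0), np.log10(150.0), 8)
    # Hilbert-style weighting: double positive frequencies, zero the rest,
    # so the inverse FFT yields the analytic signal of the band-passed data.
    analytic_weight = np.zeros(n)
    analytic_weight[freqs > 0] = 2.0
    amplitudes = []
    for fc in centers_hz:
        sd = rel_bw * fc  # assumption: bandwidth scales with center frequency
        gauss = np.exp(-0.5 * ((freqs - fc) / sd) ** 2)
        analytic = np.fft.ifft(spectrum * gauss * analytic_weight)
        amplitudes.append(np.abs(analytic))
    return np.mean(amplitudes, axis=0)
```

Applied to a 100 Hz sinusoid, this returns a clearly nonzero envelope, while a 10 Hz sinusoid yields an envelope near zero because it falls outside every band.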
Stimuli: Natural, continuous speech. Mandarin: 68 ASCCD passages from 10 speakers (5F/5M), totaling 4711 syllable tokens (tones T1–T4), passages 10–60 s, 0.5 s silences. English: TIMIT (499 sentences, 402 speakers, 0.4 s inter-sentence gaps) and BURSC (75 news passages, 6 speakers, 10–60 s passages, 0.7 s gaps). All 15 participants completed ≥4 Mandarin blocks; a subset also completed English blocks.
Acoustic features: 30 mel-band spectrogram (0–8 kHz), intensity. Pitch extraction via Praat autocorrelation (log-F0 with correction for halving/doubling). Absolute pitch = log F0. Relative pitch = within-speaker z-scored log F0. Pitch change = first derivative of log F0 (octaves/s). Features discretized into bins for modeling.
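The three pitch features defined above can be sketched in a few lines. This is an illustration only: it assumes a voiced F0 track in Hz sampled at a fixed frame rate, uses log base 2 so that differences are in octaves (the summary says only "log F0"), and the function names and the 100 Hz default frame rate are hypothetical.

```python
import math
from statistics import mean, pstdev

def absolute_pitch(f0_hz):
    """Absolute pitch as log2 F0, so that differences are measured in octaves."""
    return [math.log2(f) for f in f0_hz]

def relative_pitch(f0_hz_by_speaker):
    """Relative pitch: log F0 z-scored within each speaker."""
    out = {}
    for speaker, f0 in f0_hz_by_speaker.items():
        log_f0 = absolute_pitch(f0)
        mu, sd = mean(log_f0), pstdev(log_f0)
        out[speaker] = [(x - mu) / sd for x in log_f0]
    return out

def pitch_change(f0_hz, frame_rate_hz=100.0):
    """Pitch change: first difference of log2 F0, scaled to octaves per second."""
    log_f0 = absolute_pitch(f0_hz)
    return [(b - a) * frame_rate_hz for a, b in zip(log_f0, log_f0[1:])]
```

Two speakers who produce the same contour one octave apart (for example 100, 110, 120 Hz and 200, 220, 240 Hz) differ in absolute pitch but receive identical relative pitch values, which is the sense in which the feature is speaker-normalized.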
Electrode selection: Speech-responsive electrodes identified by comparing HG responses around onsets/offsets (paired t-tests, Bonferroni corrected). Tone-discriminant electrodes: F-tests across tones per time point (200–1000 ms after syllable onset), requiring ≥3 consecutive significant points (Bonferroni corrected).
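The "≥3 consecutive significant points" criterion can be sketched as a run-length check over per-time-point p-values. The significance level `alpha` and the number of comparisons `n_tests` used for the Bonferroni correction are not given in this summary, so both are placeholder parameters, and the function name is hypothetical.

```python
def is_tone_discriminant(p_values, n_tests, alpha=0.05, min_run=3):
    """Return True if at least `min_run` consecutive time points pass Bonferroni.

    p_values: one p-value per time point (e.g., from an F-test across tones).
    n_tests:  number of comparisons corrected for (an assumption of this sketch).
    """
    threshold = alpha / n_tests
    run = 0
    for p in p_values:
        # Extend the current run of significant points, or reset it.
        run = run + 1 if p < threshold else 0
        if run >= min_run:
            return True
    return False
```

Requiring a run of significant points, rather than any single one, guards against isolated time points that cross threshold by chance.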
Encoding models (TRF): Time-lagged linear models predicting HG from features with L2-regularization and cross-validation. Full model included spectrum, intensity, absolute pitch, relative pitch, pitch change, and (for Mandarin) tone-category regressors. Unique variance (ΔR²) computed by removing feature groups. Significance via phrase-level permutation (n=200; 99th percentile threshold). Relationship to tone discriminability assessed by correlating ΔR² with max F-statistic.
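A minimal sketch of the unique-variance idea follows: fit a time-lagged ridge model with and without a feature group and take the difference in held-out R². It simplifies the described procedure by using a single train/test split instead of cross-validation and a fixed ridge penalty; the function names, lag count, and `alpha` are illustrative assumptions.

```python
import numpy as np

def lagged_design(features, n_lags):
    """Stack copies of `features` (time x n_features) delayed by 0..n_lags-1 samples."""
    n_t, n_f = features.shape
    X = np.zeros((n_t, n_f * n_lags))
    for lag in range(n_lags):
        X[lag:, lag * n_f:(lag + 1) * n_f] = features[:n_t - lag]
    return X

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: (X'X + alpha*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def heldout_r2(features, y, n_lags=5, alpha=1.0, train_frac=0.8):
    """R^2 on the final (1 - train_frac) portion of the data."""
    X = lagged_design(features, n_lags)
    split = int(train_frac * len(y))
    w = ridge_fit(X[:split], y[:split], alpha)
    y_test = y[split:]
    resid = y_test - X[split:] @ w
    return 1.0 - np.sum(resid ** 2) / np.sum((y_test - y_test.mean()) ** 2)

def unique_r2(features, y, drop_cols, **kwargs):
    """Delta R^2: full-model R^2 minus R^2 of the model without `drop_cols`."""
    keep = [c for c in range(features.shape[1]) if c not in drop_cols]
    return heldout_r2(features, y, **kwargs) - heldout_r2(features[:, keep], y, **kwargs)
```

On synthetic data where the response depends on only one feature column, dropping that column removes nearly all explained variance, whereas dropping an irrelevant column leaves ΔR² close to zero.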
Cross-language generalization: For subjects with both languages, fit TRF models on Mandarin or English data using identical feature sets (excluding tone labels), then predict Mandarin neural responses. Performance evaluated by correlation with actual HG (bootstrapped over trials; K–S test). Similarity of speaker-normalized pitch tuning quantified by correlating blurred 2D TRF weight matrices between Mandarin- and English-fitted models within electrodes (Mann–Whitney U vs baseline across-electrode correlations).
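The cross-language comparison can be sketched as fitting one model per language and scoring both on the same Mandarin responses. This is a simplified illustration: the design matrices are assumed to be already time-lagged, the ridge penalty is fixed, and the bootstrapping and weight-matrix blurring described above are omitted; all names are hypothetical.

```python
import numpy as np

def ridge_weights(X, y, alpha=1.0):
    """Closed-form ridge regression weights."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def compare_language_models(man_train, eng_train, man_test, alpha=1.0):
    """Each argument is an (X, y) pair: X is time x features, y is the HG response."""
    w_man = ridge_weights(*man_train, alpha)
    w_eng = ridge_weights(*eng_train, alpha)
    X_test, y_test = man_test
    return {
        # Prediction accuracy on held-out Mandarin responses for each model.
        "r_mandarin_model": np.corrcoef(X_test @ w_man, y_test)[0, 1],
        "r_english_model": np.corrcoef(X_test @ w_eng, y_test)[0, 1],
        # Similarity of the two fitted weight vectors within the electrode.
        "r_weights": np.corrcoef(w_man, w_eng)[0, 1],
    }
```

If an electrode's tuning is truly language-independent, the English-fitted model should predict Mandarin responses about as well as the Mandarin-fitted one, and the two weight vectors should correlate strongly.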
Tuning analyses: Electrode-wise tuning curves for relative pitch height and pitch change (20-bin discretization over middle 95%). Metrics: modulation depth (max–min HG) and linearity (Fisher-transformed slope). Temporal integration assessed via average absolute TRF beta weights over time.
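A tuning curve of this kind can be sketched by binning the feature over its middle 95% and averaging the response in each bin. The sketch assumes equal-width bins, which the summary does not specify. The `linearity` function is only one possible reading of "Fisher-transformed slope" (the Fisher z-transform of the correlation between bin center and mean response) and should be treated as an assumption, not the authors' definition.

```python
import numpy as np

def tuning_curve(feature, response, n_bins=20, middle=0.95):
    """Mean response per feature bin, restricted to the middle `middle` fraction."""
    tail = (1.0 - middle) / 2.0
    lo, hi = np.quantile(feature, [tail, 1.0 - tail])
    keep = (feature >= lo) & (feature <= hi)
    f, r = feature[keep], response[keep]
    edges = np.linspace(lo, hi, n_bins + 1)  # assumption: equal-width bins
    idx = np.clip(np.digitize(f, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    curve = np.array([r[idx == b].mean() if np.any(idx == b) else np.nan
                      for b in range(n_bins)])
    return centers, curve

def modulation_depth(curve):
    """Max minus min of the binned response, ignoring empty bins."""
    return np.nanmax(curve) - np.nanmin(curve)

def linearity(centers, curve):
    """Fisher z of the bin-center/response correlation (assumed definition)."""
    ok = ~np.isnan(curve)
    r = np.corrcoef(centers[ok], curve[ok])[0, 1]
    return np.arctanh(np.clip(r, -0.999999, 0.999999))
```

Under this reading, the sign of the linearity value separates electrodes that prefer high relative pitch (positive) from those that prefer low relative pitch (negative).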
Population decoding and RSA: Pool speech-responsive STG electrodes across subjects within each language group. Sliding 50 ms windows; features are concatenated HG across electrodes and time points; classifier: logistic regression with group lasso, nested CV. Report peak pairwise tone classification accuracy relative to acoustic baseline (built from 250 ms contours of relative pitch height and pitch change). Representational similarity analysis: Create 16 groups per continuum by partitioning each tone into 4 levels by relative pitch height or pitch change; compute confusion matrices and categorical index (CI = between-tone accuracy − within-tone accuracy). Significance via permutation (n=200).
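The categorical index can be illustrated with a short sketch. It assumes the confusion analysis yields a symmetric matrix of pairwise discrimination accuracies between the 16 groups, and that CI is the mean accuracy over group pairs from different tones minus the mean over pairs from the same tone; the matrix layout and function name are assumptions of this example.

```python
from itertools import combinations
from statistics import mean

def categorical_index(pair_accuracy, tone_of_group):
    """CI = mean between-tone pairwise accuracy - mean within-tone pairwise accuracy.

    pair_accuracy: square matrix (list of lists); entry [i][j] is the accuracy
                   of discriminating group i from group j.
    tone_of_group: tone label of each group (e.g., 4 tones x 4 levels = 16 groups).
    """
    between, within = [], []
    for i, j in combinations(range(len(tone_of_group)), 2):
        same_tone = tone_of_group[i] == tone_of_group[j]
        (within if same_tone else between).append(pair_accuracy[i][j])
    return mean(between) - mean(within)
```

A purely acoustic code, in which discriminability depends only on distance along the continuum, would give a CI near zero; a positive CI means groups are easier to tell apart when they straddle a tone boundary.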
Single-electrode encoding favors speaker-normalized pitch:
- In Mandarin listeners, 541 speech-responsive electrodes; speaker-normalized pitch features (relative height, pitch change) uniquely explained significant variance on 112 electrodes (20.7%; unique R² up to 6%). Absolute pitch encoded on 42 electrodes (7.8%) with much smaller variance explained (up to 2%; only one electrode >1%); paired t-test comparing ΔR² speaker-normalized vs absolute pitch: t(129) = -5.84, p = 4×10^-8.
- Discrete tone-category predictors explained little unique variance (≤1.5%), significantly less than speaker-normalized pitch: t(151) = -5.46, p = 2×10^-7.
- Across tone-discriminating electrodes, variance explained by speaker-normalized pitch strongly correlated with tone discriminability (F-statistic): r = 0.85, p = 9×10^-32; absolute pitch showed no significant correlation (r = 0.17, p = 0.3).
- Electrodes often tuned preferentially to either relative pitch height or pitch change, with few encoding both strongly.
Language-general encoding at single electrodes:
- TRF models trained on English speech predicted Mandarin tone-evoked responses comparably to Mandarin-trained models at single electrodes (example: r = 0.58 for Mandarin model vs r = 0.54 for English model with actual responses; predictions between models r = 0.90; all p < 1×10^-10).
- Within electrodes, TRF weights for speaker-normalized pitch were highly similar across languages (example r = 0.88; across electrodes, within-electrode Mandarin–English correlations significantly exceeded across-electrode baseline; Mann–Whitney U test p = 1.9×10^-10), indicating language-independent local pitch tuning.
Language experience shapes tuning distributions and temporal integration:
- Relative pitch height tuning: Mandarin speakers showed a wider, more balanced distribution including strong negative tuning; English speakers biased toward high relative pitch (p < 0.05, permutation test).
- Pitch change tuning: similar between groups (p > 0.5).
- Temporal receptive fields (TRFs): English speakers showed transient pitch-height TRFs peaking at ~100 ms; Mandarin speakers showed longer integration extending to ~300 ms, with significantly higher weights from 180–270 ms (paired t-tests, FDR-corrected, α = 0.05). No group difference was found for pitch-change TRFs.
Population-level categorical representation of tones:
- Pooled STG population decoding: in the Mandarin group (316 electrodes), peak pairwise tone classification accuracy (above chance) was 28% higher than the acoustic baseline (t(198)=7.88, p=2.2×10^-13); the English group (171 electrodes) showed a 4.2% change, not significantly different from baseline (t(198)=1.09, p=0.28).
- Removing strong negative pitch-height electrodes in Mandarin reduced decoding significantly (t(198)=8.67, p=1.6×10^-15) to near-acoustic levels (-2.8% change, t(198)=-0.77, p=0.44), indicating the importance of balanced tuning for categorical representation.
- Representational similarity (categorical index, CI): Acoustic space showed no categorical structure (CI = -0.023 and -0.045; ns). Neural STG space showed significant categorical structure in both groups (Mandarin: CI = 0.11 and 0.16; English: CI = 0.072 and 0.076; all p < 0.005). Mandarin CI exceeded English by 52% and 111%. Removing negative-tuning electrodes reduced Mandarin CI to 0.078 and 0.099 (both p < 0.005).
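The relative differences between the Mandarin and English CI values can be checked directly from the rounded figures reported above. This is only an arithmetic sketch; the helper name is hypothetical.

```python
def percent_increase(value, reference):
    """Relative increase of `value` over `reference`, in percent."""
    return 100.0 * (value - reference) / reference

mandarin_ci = (0.11, 0.16)
english_ci = (0.072, 0.076)
increases = [percent_increase(m, e) for m, e in zip(mandarin_ci, english_ci)]
print([round(x, 1) for x in increases])
```

With the rounded CI values this gives roughly 53% and 111%. The small gap from the reported 52% presumably arises because the original percentages were computed from unrounded CIs, though the summary does not say so.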
Overall, STG encodes speaker-normalized pitch features at local sites in a language-general manner, while language experience shapes the distribution and temporal dynamics of tuning, yielding stronger population-level categorical sensitivity to lexical tones in Mandarin speakers.
The findings address core questions about pitch encoding in speech: local STG activity primarily represents speaker-normalized pitch features (relative height and change), not absolute pitch or discrete tone labels, providing a general auditory mechanism that supports both lexical tones and intonation across languages. Cross-language generalization of TRF tuning within electrodes confirms language-independent local encoding.
Language experience sculpts the distribution and integration times of these local tunings: Mandarin speakers exhibit balanced positive/negative relative pitch height tuning and longer temporal integration consistent with syllable-sized tone contours, which enhances population-level categorical sensitivity to tone boundaries. In contrast, English speakers show biases toward high relative pitch and shorter integration, aligning with prosodic stress patterns and yielding weaker categorical coding for Mandarin tones.
These results support a multi-scale model of speech processing in non-primary auditory cortex: linear, veridical encoding of higher-order acoustic features at local sites and emergent categorical representations at the distributed population level. The shared local encoding suggests overlapping mechanisms for tonal and intonational pitch processing, while language-specific tuning distributions implement adaptation to linguistic statistics, enabling robust tone categorization in native tonal language listeners.
This study demonstrates that human STG implements a general, language-independent code for speaker-normalized pitch features that underlies tone perception, while language experience tunes the distribution and temporal integration of these features, producing stronger categorical representations of lexical tones at the population level in Mandarin speakers. The work reconciles general auditory and speech-specific accounts by revealing multi-scale encoding: local acoustic-feature coding and distributed categorical organization.
Future directions include: expanding coverage to bilateral and extra-temporal language networks to trace how categorical representations emerge; examining developmental and training effects on tuning distributions; comparing diverse tonal systems and prosodic structures; and integrating causal perturbation or connectivity analyses to test top-down contributions to categorical coding.
Limitations:
- Limited ECoG coverage: grids were placed based on clinical needs and were mostly unilateral, precluding comprehensive assessment of hemispheric lateralization and long-range fronto-temporo-parietal interactions.
- Clinical populations and small sample size (11 Mandarin tumor patients; 4 English epilepsy patients) may limit generalizability.
- Inability to determine whether categorical population coding arises from recurrent processing within non-primary auditory cortex versus top-down influences from broader language networks.
- Tone-category regressors applied only to Mandarin stimuli; cross-language comparisons necessarily exclude explicit category features.