Psychology
Humans need auditory experience to produce typical volitional nonverbal vocalizations
K. Pisanski, D. Reby, et al.
The study investigates whether human volitional nonverbal vocalizations (e.g., roars, cries, screams) require auditory experience to develop typical, stereotyped acoustic forms that map onto their communicative functions. While some spontaneous vocalizations (e.g., infant cries) appear innate, many adult emotional vocalizations are produced voluntarily and may engage neural pathways shared with speech, implying a need for auditory input and feedback. The authors hypothesize that auditory deprivation will impair the typical maturation of volitional nonverbal vocalizations, weakening form–function mapping, and that the severity of deprivation (from limited to zero auditory experience) will be associated with more atypical vocal production. They test this by comparing acoustics and listener perceptions of intentionally produced aggression, pain, and fear vocalizations from profoundly deaf adults and matched typically hearing controls.
Background research shows nonverbal vocalizations are evolutionarily ancient, with form–function mappings homologous to affective calls in other mammals. Some human vocalizations may be universal across cultures. Unlike speech, certain spontaneous vocalizations can emerge without sensory input; infant cries at birth are a canonical example of innate signaling with harsh nonlinear phenomena serving attention-capturing functions. However, many adult emotional vocalizations are volitional and likely depend on higher cortical pathways overlapping with speech production, suggesting a role for auditory learning. Comparative work demonstrates that in songbirds, auditory experience is necessary and sufficient for typical song development; evidence in mammals is mixed, with some species (cats, bats) showing atypical vocalizations after early deafening while others (mice) show species-typical calls despite engineered deafness. In primates, vocal production learning appears limited though some plasticity exists (e.g., marmosets). In humans, speech development critically depends on auditory feedback; post-lingual deafness disrupts segmental and suprasegmental features, indicating ongoing regulation by auditory feedback. Prior work on non-linguistic signals in deaf individuals focused mainly on cries and laughter, suggesting neonatal cries are initially typical but may become atypical without feedback, and that spontaneous laughter in adults is broadly typical with some acoustic differences. One prior study of eight hearing-impaired adults producing volitional vocalizations found structural abnormalities and reduced recognition by listeners, especially for anger. Open questions remain regarding how complete auditory deprivation affects volitional nonverbal vocalizations and whether effects scale with the degree of deprivation.
Design: Observational-comparative study combining acoustic analyses of intentionally produced nonverbal vocalizations (aggression, pain, fear) from 120 adults (60 profoundly deaf; 60 typically hearing controls) with four independent psychoacoustic perception experiments involving 444 listeners. The study was approved by the University of Wrocław IRB (IPE0021); informed consent was obtained.
Participants (vocalizers): 60 profoundly deaf adults (30 male, 30 female; mean age ~29–30 years; range 16–53) and 60 age-, sex-, and education-matched hearing controls. Among the deaf group, ~70% (n = 41) were congenitally deaf; the remainder had onset in infancy or childhood (mean onset 3 ± 3.2 years). Hearing support: 25 with no hearing support, 20 with hearing aids, 15 with cochlear implants. A subset of 20 were identified as having zero auditory experience (congenitally deaf with no history of aids or implants).
Hearing screening: All vocalizers completed speech audiometry (word identification at variable SPLs) and a speech-in-noise digit triplet test (variable SNR), administered via a calibrated web-based platform using Sennheiser HD-280 headphones. Nearly 90% of deaf participants scored 0% intelligibility on both tests in both ears. Deaf participants using aids completed testing with their device on. All deaf participants declared hearing thresholds of at least 90 dB.
Voice recordings: Conducted in private rooms using a Tascam DR05 recorder (48 kHz/24-bit) with the microphone 150 cm from the mouth; input levels and distance were standardized. Instructions were provided in writing and, for deaf participants, also via pre-recorded signed video and an on-site interpreter. Each participant intentionally produced vocalizations in three contexts: aggression, pain, and fear.
Acoustic analysis: Performed in Praat v6.1.21 and the R package soundgen. Fourteen core parameters were extracted via custom Praat scripts with verification: fundamental frequency metrics (mean, min, max f0; f0 coefficient of variation), frequency modulation metrics (inflex2: major modulations; inflex25: minor/vibrato-like), amplitude metrics (mean/max intensity; intensity CV), noise/perturbation metrics (HNR, jitter, shimmer), and temporal metrics (total duration; proportion voiced). Settings: f0 search range 60–2000 Hz; window length 0.05 s; timestep 0.01 s; smoothed contours; manual inspection for tracking errors. Nonlinear phenomena (NLP: sidebands, subharmonics, deterministic chaos) were annotated manually by two independent raters from spectrograms (0–5 kHz; window length 0.05 s) and waveforms; %NLP was computed as the proportion of voiced segments containing NLP. For formant analyses, in a subset of 140 calls with mean f0 < 400 Hz (primarily male vocalizers), F1–F4 were measured in soundgen with manual LPC smoothing adjustments; calls were defined as voiced segments separated by ≥150 ms of aspiration or silence, and nasalized closed-mouth calls were excluded. Apparent vocal tract length (VTL) and average formant spacing (AF) were computed from F1–F4 using validated regression methods.
Stimulus preparation: Each vocal stimulus was padded with 500 ms of silence. Amplitude normalization was not applied (the standardized microphone distance preserved amplitude variance).
Listeners and perception experiments: 444 adults (16–60 years; ~59% female) with self-reported normal hearing were randomly assigned to one of four lab experiments. In each session, a computer randomly selected 20 deaf and 20 control vocalizers (balanced by sex); all 3 vocalizations per selected vocalizer were presented, totaling 120 stimuli per listener.
Stimuli presented through Sennheiser HD-280 headphones at individually set comfortable levels; order randomized; vocalizer sex was blocked within sessions.
- Experiment 1 (n = 139): Forced-choice emotion classification (aggression, pain, fear) using the same contextual scenarios provided to vocalizers.
- Experiment 2 (n = 51): Open-ended one-word emotion labeling without provided categories; responses later coded for valence (negative/neutral/positive) by two blind raters (Cohen’s kappa = 0.82).
- Experiment 3 (n = 117): Authenticity judgments on 1–7 scale, with intended emotion provided.
- Experiment 4 (n = 137): Two-alternative forced-choice deafness detection (hearing-impaired vs typically hearing) from nonverbal vocalizations.
Statistical analysis: Conducted in R 4.3.2 and SPSS 25. Linear mixed models (LMMs; REML) tested group differences (deaf vs control) in acoustic parameters and perception outcomes, typically split by vocalizer sex and emotion; vocalizer ID and listener ID (for perception) were included as random intercepts, with Šidák-corrected pairwise comparisons and alpha = .05. PCA with varimax rotation on 15 acoustic variables yielded 5 components (pitch, amplitude, noise/perturbation with some NLP influence, duration, frequency modulation) explaining 83% of variance; component scores were entered into a leave-one-out DFA to assess emotion distinctiveness. Perception accuracy was also analyzed with GLMMs (logistic link) on binary correct/incorrect data; unbiased hit rates (Hu) were computed to control for response biases, with Wilcoxon signed-rank tests on arcsine-transformed Hu scores. Equivalence tests (TOST) were applied where relevant.
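To make this modeling pipeline concrete, here is a minimal R sketch under assumed data layouts: an `acoustics` data frame with one row per vocalization and a `perception` data frame with one row per listener trial, with hypothetical column names throughout. The lme4, emmeans, psych, and MASS packages are used as generic stand-ins for the analyses described above; the authors' exact model specifications and software settings may differ.

```r
library(lme4)     # linear and generalized linear mixed models
library(emmeans)  # Sidak-adjusted pairwise comparisons
library(psych)    # PCA with varimax rotation
library(MASS)     # discriminant function analysis (lda)

## LMM (REML): group difference in an acoustic parameter (here HNR),
## with a random intercept per vocalizer (three calls per person).
m_hnr <- lmer(hnr ~ group * emotion + (1 | vocalizer_id),
              data = subset(acoustics, sex == "male"), REML = TRUE)
emmeans(m_hnr, pairwise ~ group | emotion, adjust = "sidak")

## PCA with varimax rotation on the 15 acoustic variables
## (acoustic_vars is a character vector of their column names).
pca <- principal(acoustics[, acoustic_vars], nfactors = 5,
                 rotate = "varimax", scores = TRUE)
acoustics <- cbind(acoustics, pca$scores)   # adds rotated component scores RC1..RC5

## Leave-one-out DFA: how well do the component scores separate
## the three intended emotions? CV = TRUE gives leave-one-out predictions.
dfa <- lda(emotion ~ RC1 + RC2 + RC3 + RC4 + RC5, data = acoustics, CV = TRUE)
mean(dfa$class == acoustics$emotion)        # overall classification rate

## GLMM (logistic link) on listener-level correct/incorrect responses,
## with crossed random intercepts for listener and vocalizer.
m_acc <- glmer(correct ~ group * emotion + (1 | listener_id) + (1 | vocalizer_id),
               data = perception, family = binomial)
```

In practice, the group comparisons would be run separately for each acoustic parameter, and the accuracy models separately for each perception experiment.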
Acoustic structure:
- Across aggression, pain, and fear, deaf vocalizations overlapped more in acoustic space than controls; DFA classified aggressive calls at 37% in deaf vs 58% in controls, indicating greater homogeneity and weaker form–function mapping.
- Aggression: Deaf vocalizers produced unusually tonal calls with higher HNR and significantly fewer nonlinear phenomena (NLP). Mean %NLP: deaf males 7.5 ± 4.3%, deaf females 9 ± 4.7% vs control males 45.3 ± 4.3%, control females 40.5 ± 4.7% (LMM F1,58 = 38.6, p < 0.001). Less jitter (F1,58 ≈ 7.8, p = 0.007) and higher HNR (F1,58 ≈ 24.3, p < 0.001). Deaf males’ aggressive f0 was ~100 Hz higher (mean 451 ± 30 Hz) than control males (349 ± 33 Hz).
- Pain: Similar pattern with near-absence of harshness in deaf calls. %NLP: control males 36 ± 3.7%, control females 15 ± 3.1% vs deaf males 4.9 ± 3.7%, deaf females 3.3 ± 3.1% (males F1,58 ≈ 36.4, p < 0.001; females F1,58 ≈ 7.2, p = 0.01). Deaf males showed higher HNR (F1,58 ≈ 5.8, p = 0.019) and slightly lower amplitude (trend; p = 0.057).
- Fear: Acoustic structure comparatively similar between groups in %NLP (4–11%; no significant group differences for %NLP). Deaf males’ fear calls were more tonal (lower jitter F1,58 = 4.7, p = 0.034; higher HNR F1,58 = 8.9, p = 0.004) and higher pitched (mean f0 446 ± 34 Hz vs 323 ± 37 Hz; F1,58 = 6.0, p = 0.018).
- Temporal/segmentation: Deaf vocalizations had more unvoiced breaks (mean 1.44; max 8) vs controls (mean 1.12; max 4) across contexts.
- Formants/VTL (male subset, f0 < 400 Hz): In aggression, deaf males had wider formant spacing (F1,35 = 12.1, p = 0.003) and shorter apparent VTL (F1,34 = 10.7, p = 0.003); controls extended their apparent VTL by nearly 2 cm more than deaf vocalizers when sounding aggressive. Controls lowered their upper formants (F3/F4) in aggression, whereas deaf males raised them (F3: F1,35 = 8.7, p = 0.006; F4: F1,33 = 6.1, p = 0.019). Controls also had consistently lower F2 across contexts (F1,44 = 25, p < 0.001), indicating more articulation (e.g., lip rounding). A sketch of how apparent VTL is derived from formant spacing follows directly below.
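The apparent-VTL estimate used above is derived from average formant spacing. The following is a minimal base-R sketch of one standard way to compute it, assuming a uniform vocal tract closed at the glottis and open at the lips and a speed of sound of ~350 m/s; it is an illustration, not the authors' exact script.

```r
## Estimate average formant spacing (deltaF) and apparent vocal tract length
## from measured F1-F4. For a uniform tube closed at one end, the i-th formant
## sits near (2i - 1)/2 * deltaF, so deltaF is the slope of a regression of
## formant frequency on (2i - 1)/2 through the origin, and VTL = c / (2 * deltaF).
apparent_vtl <- function(formants_hz, speed_of_sound = 350) {
  i <- seq_along(formants_hz)               # formant indices 1..4
  x <- (2 * i - 1) / 2                      # 0.5, 1.5, 2.5, 3.5
  delta_f <- coef(lm(formants_hz ~ 0 + x))  # slope = average formant spacing (Hz)
  vtl_cm  <- speed_of_sound / (2 * delta_f) * 100   # tube length in cm
  c(delta_f = unname(delta_f), vtl_cm = unname(vtl_cm))
}

## Example with hypothetical formants typical of an adult male neutral vowel:
apparent_vtl(c(500, 1500, 2500, 3500))
## delta_f = 1000 Hz, vtl_cm = 17.5 cm
```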
Perception:
- Experiment 1 (forced-choice emotion): Controls’ vocalizations were classified at 60% ± 0.005 SEM vs deaf vocalizations at 43% ± 0.005. The deficit was driven by aggression: control males 70.6% and control females 62.4% correct vs deaf males 20.4% and deaf females 29.1%, both below the 33% chance level. Pain was also harder to recognize from deaf vocalizers (particularly females; F1,276 ≈ 27.0, p < 0.001). Raw fear identification rates were equivalent across groups; however, unbiased hit rates (Hu) controlling for response bias were significantly lower for all emotions from deaf vocalizations (all Z > 5.8, p < 0.001; see the sketch after these results for how Hu is computed). Confusions revealed a strong bias to perceive deaf vocalizations as fear (e.g., aggression was misperceived as fear in 42%/35% of trials for male/female deaf vocalizers).
- Experiment 2 (open-ended): Listeners more frequently applied appropriate labels to control vocalizations than to deaf ones, especially for aggression. The word “anger” was applied to controls’ aggressive calls 293 times vs 91 times for deaf vocalizers’ aggressive calls, which were instead often labeled “fear” (193) or “pain” (133). Deaf vocalizations also received more positive or neutral labels (e.g., “joy,” “excitement,” “relief,” “pleased,” “puzzled”).
- Experiment 3 (authenticity): Deaf vocalizations were rated less authentic across emotions and sexes (all F1,233 ≥ 36.5, p < 0.001), except female fear (marginal; F ≈ 3.8, p = 0.052). Deaf male aggressive calls had mean authenticity 2.7 ± 0.06 vs controls 5.0 ± 0.06 on a 1–7 scale.
- Experiment 4 (deafness detection): Listeners detected deafness at 64% ± 0.004 overall (above the 50% chance level; all t > 11.7, p < 0.001). Detection varied by emotion (males F2,547 = 24.5, p < 0.001; females F2,547 = 6.7, p = 0.001): highest for aggression (males 67.7% ± 0.01; females 65.5% ± 0.01) and lower for fear (~61%).
Additive effects of deprivation:
- Subgroup analysis showed graded effects: aggression identification (raw) was 67% for controls, 28% ± 0.010 for deaf with limited auditory experience, and 18% ± 0.014 for deaf with zero auditory experience (F1,279 = 30.8, p < 0.001). Pain identification: 57% ± 0.012 (limited) vs 45% ± 0.016 (zero experience; F1,278 = 35.4, p < 0.001). Fear identification was equivalent across groups (~51–53%). Deafness detection and lower authenticity were most pronounced for the zero-experience group.
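Unbiased hit rates (Hu) of the kind referenced in Experiment 1 and the statistical analysis are standardly computed following Wagner (1993): the squared number of correct responses in a stimulus–response cell divided by the product of the stimulus (row) total and the response (column) total, which penalizes indiscriminate use of a response label (such as answering “fear” for most deaf vocalizations). A minimal R sketch over a hypothetical per-listener confusion matrix:

```r
## Unbiased hit rate per emotion for one listener: hit rate weighted by the
## proportion of that response label that was actually correct.
unbiased_hit_rate <- function(conf_mat) {
  hits <- diag(conf_mat)
  hu <- hits^2 / (rowSums(conf_mat) * colSums(conf_mat))
  hu[is.nan(hu)] <- 0          # a never-used response label gives 0/0
  hu
}

## Hypothetical listener judging 20 deaf vocalizers x 3 emotions,
## with a bias toward answering "fear":
cm <- matrix(c( 4,  6, 10,    # intended aggression
                3, 10,  7,    # intended pain
                1,  2, 17),   # intended fear
             nrow = 3, byrow = TRUE,
             dimnames = list(intended = c("aggression", "pain", "fear"),
                             response = c("aggression", "pain", "fear")))
round(unbiased_hit_rate(cm), 2)
## Hu scores would then be arcsine-transformed (asin(sqrt(Hu)))
## before the Wilcoxon signed-rank comparisons.
```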
Findings support the hypothesis that typical development of volitional nonverbal vocalizations depends on auditory experience. Deaf adults’ aggressive and pain vocalizations lacked hallmark low, harsh profiles (few nonlinear phenomena, higher pitch, reduced articulation and VTL extension), leading listeners to misidentify emotion (especially perceiving fear) and judge calls as less authentic, and enabling above-chance detection of deafness from nonverbal vocalizations alone. In contrast, fear vocalizations were comparatively typical, potentially because (a) fear cries are more hard-wired and emerge earlier in ontogeny, (b) the default production without auditory-motor learning yields high-pitched, tonal, minimally articulated calls resembling fear/distress, or (c) fear calls are mechanistically easier to produce volitionally than aggressive roars, which require active pitch lowering, vocal tract extension, and inducing aperiodic vocal fold vibration. The graded atypicality among deaf subgroups (most severe in those with zero auditory experience) indicates additive effects of auditory deprivation. Multiple mechanisms may contribute: reduced external auditory input (hearing others) and reduced internal auditory feedback (hearing oneself) likely both affect vocal control; additional factors include limited vocal motor practice, physiological changes in laryngeal/articulatory control, and social inhibitions surrounding vocalization. Perceptual results align with known listener biases (“low/harsh is large/threatening”), explaining systematic misclassification. Overall, results suggest that vocal learning extends beyond speech to encompass volitional non-linguistic emotional vocalizations.
Volitional human nonverbal vocalizations, particularly those expressing aggression and pain, require auditory experience to develop conventional form–function mappings. Profoundly deaf adults produced homogeneous, high-pitched, tonal, and under-articulated calls across emotions, which listeners misidentified and rated as less authentic; fear calls were relatively preserved. The degree of auditory deprivation scaled with atypicality, supporting additive effects of lack of auditory experience. These findings imply that vocal learning and auditory-motor feedback are important not only for speech but also for non-linguistic emotional vocalizations, with implications for theories on the evolution of vocal control and for clinical and pedagogical interventions to aid communication between deaf and hearing communities. Future research should: (1) examine spontaneous (reflexive) emotional vocalizations in real-life contexts to disentangle effects of volition vs reflex; (2) investigate positive affective vocalizations; (3) parse contributions of auditory input versus feedback and quantify vocal motor experience; (4) explore multimodal encoding/decoding and include deaf-blind populations; and (5) broaden formant analyses across sexes by acquiring suitable low-f0 samples.
The study focused on negatively valenced, posed (volitional) vocalizations; results may not generalize to spontaneous emotional vocalizations in real-world contexts. Although privacy minimized inhibition, producing emotions on demand can affect naturalness. The design cannot disentangle effects of external auditory input from internal auditory feedback; measures of individual vocal motor experience and physiological control were not included. Formant analyses were limited to a male subset with f0 < 400 Hz due to measurement constraints. Multimodal (audiovisual) contributions were not directly tested, and cross-modal compensation was not assessed.