A practical guide to calculating vocal tract length and scale-invariant formant patterns

Linguistics and Languages


A. Anikin, S. Barreda, et al.

Explore the fascinating world of vocal tract length calculation and formant analysis with insights from Andrey Anikin, Santiago Barreda, and David Reby. This guide provides essential tools and theoretical frameworks that will transform your understanding of speech and non-speech vocalizations through robust statistical methods and practical software solutions.

~3 min • Beginner • English
Introduction
The paper addresses how to measure and normalize formant frequencies to estimate apparent vocal tract length (VTL) and extract scale-invariant formant patterns across speakers and species. Traditional phonetic research treats inter-speaker VTL differences as a confound; however, in many behavioral and comparative contexts VTL itself is of interest (e.g., acoustic size exaggeration, cross-species articulation). The authors note that existing normalization literature is technical and human-centric, with limited practical guidance and tools for broader applications. The goal is to provide a clear framework and open-source tools for: (1) reliable formant measurement and verification, (2) VTL estimation under a single-tube approximation, and (3) scale-invariant normalization of formant patterns, including scenarios where phonetic content cannot be tightly controlled.
Literature Review
The authors summarize longstanding findings that vowel identity is largely determined by the relative positions of lower formants (F1–F2), while absolute formant frequencies scale with vocal tract length and hence speaker size. They review LPC-based formant tracking and its limitations (bias toward harmonics, especially at high F0, and sensitivity to noise). Classical and modern normalization approaches are discussed: dispersion-based methods (now largely obsolete), regression-based VTL estimation from formant spacing (Reby & McComb), log-mean (Nearey) normalization and sliding-template models, and recent multilevel modeling approaches (Barreda & Nearey). They emphasize that upper formants are more stable and informative for VTL, the uniform scaling hypothesis on logarithmic scales, and the lack of consensus on best practices in non-phonetic applications. Prior work shows modest correlations between acoustic measures and actual body size and strong perceptual effects of formant scaling on perceived size.
Methodology
- Measurement: Use LPC for initial formant extraction (Praat, phonTools::findformants), tuning the analysis parameters carefully. Because automatic tracking is error-prone, manually verify and correct measurements with the interactive soundgen::formant_app tool, which provides spectrogram/spectrum displays, diagnostic plots, and auditory feedback via synthesis. Spectrogram settings (window length, reassigned spectrograms, smoothing) are adjusted to the task and signal type.
- Linear-scale VTL estimation (eVTL): Model the vocal tract as a single uniform tube. Estimate formant spacing dF by linear regression of observed formant frequencies against their expected positions with the intercept fixed at zero (closed–open or closed–closed tube assumptions), then compute eVTL = c / (2·dF). Implemented in soundgen::estimateVTL. The method accepts any combination of measured formants with missing values, provided the formant indices are correct; upper formants drive eVTL, so F1–F2 can be omitted if unreliable.
- Scale-invariant formant pattern (linear): Use dF-normalized residuals (the distance of each formant from the regression line, in units of dF) as a VTL-normalized representation of vowel quality. Implemented in soundgen::schwa(), which also returns the predicted schwa formant positions for the estimated eVTL.
- Logarithmic normalization: Represent formants on a log2 scale and normalize by subtracting the mean log-formant (equivalent to dividing by the geometric mean), which transposes the formant chord to a reference. Implemented in phonTools::normalize and via custom code for geometric-mean normalization.
- Mixed-effects modeling of scale (k): Fit multilevel Bayesian models (e.g., in brms) predicting log-formant frequency from formant index, vowel, and their interaction, with random intercepts per speaker to estimate the scale factor k (speaker-specific transposition). Convert k to kVTL relative to a reference VTL. This approach handles missing formants and provides uncertainty estimates; formants can be normalized in Hz by dividing by 2^k.
- Comparative evaluation: Using the Hillenbrand et al. (1995) vowels, compare intrinsic (single-token) and extrinsic (multi-token) normalization across methods (eVTL, mean log-formant, k) for (a) estimating speaker VTL, (b) predicting perceived size (reanalysis of Barreda 2017 data), and (c) vowel separation via k-means purity and Bayesian multi-logistic classification. Also test robustness when only subsets of formants (F1–F2, F1–F3, F1–F4) are available and when training on children and testing on adult men (non-overlapping VTL ranges).
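The linear-scale procedure can be sketched in a few lines. The following Python example (an illustration of the math, not the soundgen implementation, which is in R) estimates dF by zero-intercept regression under a closed–open tube assumption, where the n-th formant is expected at (n − 0.5)·dF, then derives eVTL = c/(2·dF) and dF-normalized residuals. The speed-of-sound constant and the schwa-like example formants are illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 35400  # cm/s, a typical value assumed for warm, humid air

def estimate_vtl(formants_hz):
    """Estimate apparent VTL from measured formants (Hz), closed-open tube.

    formants_hz: formant values in order F1, F2, ...; use np.nan for
    missing formants (the indices of the remaining ones must be correct).
    Returns (eVTL in cm, dF in Hz, dF-normalized residuals).
    """
    f = np.asarray(formants_hz, dtype=float)
    idx = np.arange(1, len(f) + 1)
    ok = ~np.isnan(f)
    # Expected positions for a uniform closed-open tube: F_n = (n - 0.5) * dF
    x = idx[ok] - 0.5
    y = f[ok]
    # Least-squares slope with the intercept fixed at zero
    dF = np.sum(x * y) / np.sum(x * x)
    evtl = SPEED_OF_SOUND / (2 * dF)
    # Distance of each formant from the regression line, in units of dF:
    # a scale-invariant representation of the formant pattern
    residuals = (y - x * dF) / dF
    return evtl, dF, residuals

# Schwa-like formants of a ~17.7 cm tract: 500, 1500, 2500, 3500 Hz
evtl, dF, res = estimate_vtl([500, 1500, 2500, 3500])  # dF = 1000 Hz

# Missing F1 is handled gracefully, as in soundgen::estimateVTL
evtl2, dF2, _ = estimate_vtl([np.nan, 1500, 2500, 3500])
```

Because the intercept is pinned at zero, the higher formants dominate the slope estimate, which is why eVTL remains stable even when F1–F2 are unmeasurable.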
Key Findings
- Measurement: Automatic LPC is often biased or noisy, especially for high-F0 or noisy signals; manual verification with formant_app substantially improves accuracy and workflow efficiency.
- eVTL properties: With the intercept fixed at zero, eVTL is driven primarily by upper formants; even F3–F4 alone yield VTL estimates highly correlated with those from F1–F4 (reported Pearson r ≈ .97), whereas relying only on F1–F2 is unstable.
- Scale-invariant patterns: dF-normalized residuals provide a robust VTL-normalized vowel space; clusters are more compact than in raw Hz, and the approach generalizes to non-human vocalizations.
- Cross-method agreement (multi-token): Speaker-level scale estimates from multiple tokens using eVTL, mean log-formant, and mixed-model k are very similar across human VTL ranges (Pearson r = .96–.97), indicating practical interchangeability when multiple vowels per speaker are available.
- Vowel effects on apparent VTL: Apparent VTL varies systematically across vowels, and methods differ in how they weight lower vs upper formants (e.g., eVTL vs mean log-formant vs k), producing non-identical vowel-specific residuals (e.g., r ≈ .87 between eVTL-based and kVTL-based vowel effects). This complicates single-token VTL interpretation.
- Perceived size prediction: In the reanalysis of Barreda (2017), single-token eVTL and mean log-formant correlate with perceived height at r ≈ .27, while multi-token estimates improve prediction: multiple eVTL r ≈ .36, multiple mean log-formant r ≈ .41, mixed-model k r ≈ .41 (CIs reported in text). Similar advantages of multi-token eVTL were observed in other datasets (supplements).
- Vowel separation: Extrinsic normalization (using multiple tokens) yields higher clustering purity and classification accuracy and remains effective even with only F1–F2. Intrinsic normalization benefits strongly from including upper formants (≥F3); log-based intrinsic methods can generalize better across disjoint VTL ranges (train on children, test on men) than intrinsic eVTL.
- Practical guidelines: Record multiple tokens per individual; measure at least 3–4 formants; manually verify and correct measurements; treat uncertain values as missing; choose linear (eVTL/schwa) or log-based (geometric-mean, mixed-model k) normalization based on context and data completeness.
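The log-scale normalization behind the mean log-formant and k estimates can be illustrated with a short Python sketch (a demonstration of the principle, not phonTools::normalize itself). Subtracting the mean log2-formant removes a speaker's scale factor, so uniformly scaled formant patterns become identical; the reference VTL, the 1.4 scaling factor, and the formant values are illustrative assumptions.

```python
import numpy as np

def log_mean_normalize(formants_hz):
    """Nearey-style normalization: subtract the mean log2-formant
    (equivalent to dividing by the geometric mean on a linear scale)."""
    logf = np.log2(np.asarray(formants_hz, dtype=float))
    return logf - logf.mean()

adult = np.array([500.0, 1500.0, 2500.0, 3500.0])
child = adult * 1.4  # uniform scaling: same pattern from a shorter tract

# The normalized "formant chords" coincide -> a scale-invariant pattern
assert np.allclose(log_mean_normalize(adult), log_mean_normalize(child))

# Speaker-specific scale factor k (in octaves) relative to the reference,
# converted to an apparent VTL given an assumed reference VTL of 17.7 cm
k = np.mean(np.log2(child)) - np.mean(np.log2(adult))  # = log2(1.4)
kvtl = 17.7 / 2 ** k  # shorter apparent VTL for the scaled-up formants

# Formants can be normalized back to Hz by dividing by 2^k
assert np.allclose(child / 2 ** k, adult)
```

This transposition view is why, with multiple tokens per speaker, mean log-formant and mixed-model k converge on essentially the same scale estimates as eVTL.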
Discussion
The work addresses the dual objectives of estimating apparent VTL and deriving scale-invariant formant patterns for applications beyond traditional phonetics, including animal communication and paralinguistic cues. By combining rigorous manual verification, theoretically motivated linear-scale regression (eVTL), log-scale normalization (geometric mean), and multilevel Bayesian modeling (k), the framework offers flexible, robust tools that accommodate missing data and heterogeneous recordings. Findings show that when multiple tokens are available, different normalization methods converge on similar speaker-specific scale estimates and better predict perceived size, mirroring human listeners’ implicit normalization. Scale-invariant representations (dF residuals or log-ratio chords) enable meaningful cross-speaker/cross-species comparisons of formant patterns. However, apparent VTL is influenced by vowel-specific articulation, and single-token estimates are unreliable proxies for anatomical VTL, underscoring the importance of extrinsic information or standardized phonetic content when VTL is of primary interest.
Conclusion
The paper delivers a practical, open-source toolkit and clear guidance for reliable formant measurement, VTL estimation, and scale-invariant normalization. Key recommendations: verify LPC measurements (use formant_app), measure several tokens and at least one or two upper formants, and model scale using methods suited to the context (eVTL/schwa for linear-scale tube models; log-mean or mixed-model k for log-scale transposition). For speaker-typical VTL, extrinsic normalization with multiple tokens is strongly preferred and better predicts perceived size than single-token estimates. For analyzing formant patterns (vowel quality), normalized representations are essential across wide VTL ranges and are applicable to non-human vocalizations. Future work should pursue anatomically validated datasets and improved intrinsic (single-token) methods, and extend tools to complex resonators (e.g., nasalization) beyond the single-tube assumption.
Limitations
- Apparent VTL (eVTL, kVTL, mean log-formant) is not anatomical VTL; estimates depend on tube-model assumptions and vowel-specific articulation, limiting single-token interpretability.
- The lack of datasets with concurrent anatomical VTL measurements and audio precludes strong validation of the algorithms and of vowel-specific VTL adjustments.
- LPC tracking is prone to errors (especially at high F0 or in noise); results depend on careful parameterization and manual correction.
- Single-tube assumptions break down in strongly nasalized or multi-cavity configurations; the methods are not recommended for comparing nasalized/closed-mouth vs non-nasalized calls without special handling.
- Intrinsic normalization underperforms when only lower formants are available; missing upper formants reduce robustness.
- Cross-method differences in weighting lower vs upper formants lead to divergent vowel-specific VTL effects, complicating interpretation.