Using a forced aligner for prosody research
H. Wu, J. Yun, et al.
The study investigates whether a widely used forced aligner (Montreal Forced Aligner, MFA) can provide accurate and efficient annotations for prosodic research in a tonal language (Mandarin Chinese). Prosody research requires precise boundaries for units such as syllables and phrases to extract duration, pitch, and intensity. Manual annotation is laborious and time-consuming. While forced alignment has been positively evaluated mainly at phone and word levels, little work has assessed its suitability for suprasegmental analysis, especially in tonal languages. The authors pose the research question: How accurate and efficient is MFA for syllable-by-syllable and phrase-by-phrase alignment in Mandarin, including ambiguous sentences whose meanings are disambiguated prosodically? They aim to quantify accuracy against human annotations and evaluate efficiency gains, while analyzing factors influencing performance such as audio quality, gender, tone categories, and functional words.
Forced alignment tools have proliferated (at least 15 open resources per Pettarin, 2018). Prior evaluations report strong performance for phoneme boundaries (e.g., DiCanio et al., 2013; Hosom, 2009; Goldman, 2011; Gorman et al., 2011; Yuan & Liberman, 2015; Yuan et al., 2014, 2018; McAuliffe et al., 2017; Sella, 2018; MacKenzie & Turton, 2020; McAuliffe, 2021; Liu & Sóskuthy, 2022). Specifically for MFA, Mahr et al. (2021) found that it had the highest accuracy among several aligners on children’s English (average 86%), and Liu & Sóskuthy (2022) showed strong agreement in Chinese varieties (median 17 ms onset displacement). However, prior work emphasizes phone- or word-level alignment, with limited attention to suprasegmental and phrase boundaries in tonal languages. Effects of recording quality are also documented (Sanker et al., 2021). This study fills the gap by evaluating MFA’s performance at syllable and phrase levels in Mandarin prosody.
Tool: Montreal Forced Aligner (MFA) v1.0 with pre-trained Mandarin acoustic model and G2P; custom Mandarin pronunciation dictionaries prepared per MFA guide. Inputs: WAV audio, transcripts, and pronunciation dictionaries.
Dataset: 33 adult native Mandarin speakers (18 lab recordings in a sound booth with professional equipment; 15 local recordings via phones/laptops). Stimuli: 56 ambiguous target sentences with wh-words, spanning 4 structural groups (transitive, ditransitive, left-dislocation, conditional), 7–20 syllables. Participants read each sentence aloud for the meanings they found possible or stated that a meaning was not acceptable; they could re-record. After processing, 4096 audio files (1–4 s each; total 216 min) were retained; one speaker was excluded due to incomplete data. Files were labeled for matching target sentences; those labeled “null” or “misread” were excluded from alignment and evaluation. Audio prep: converted to WAV via Praat; MFA/Kaldi handled bit depth/sampling rate normalization; segments were trimmed to target sentences.
Syllable-level alignment procedure: Created a syllable-level Mandarin pinyin dictionary from the 3500 Commonly Used Chinese Characters list; tones encoded as numerals (1–4) appended to nuclear vowels; transcripts in pinyin with spaces between syllables. Ran MFA to generate TextGrid annotations (phones and syllables). Four trained annotators reviewed MFA outputs and adjusted boundaries/labels to create human references. Evaluation subset: 1120 randomly selected audio files (~27.34% of data). Exclusion criterion: files where MFA and human differed in number/labels of intervals (often due to spurious silent intervals) were removed (130 excluded). Final comparison included 10,487 syllable-level boundary pairs. Metrics: absolute time difference per syllable boundary; binary agreement using 25 ms threshold (gold-standard per McAuliffe et al., 2017).
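The evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: it assumes each annotation has already been read from its TextGrid into a list of (label, start, end) tuples in seconds, and that one boundary (the interval end) is compared per syllable. The function names and the toy intervals are hypothetical.

```python
from statistics import mean, stdev

THRESHOLD_S = 0.025  # 25 ms agreement threshold used in the study


def compare_boundaries(human, mfa):
    """Return absolute boundary differences (seconds) for one file.

    Each annotation is a list of (label, start, end) tuples.
    Returns None when the number or labels of intervals differ,
    mirroring the exclusion criterion described above.
    """
    if len(human) != len(mfa):
        return None
    if any(h[0] != m[0] for h, m in zip(human, mfa)):
        return None
    # Assumption: the end time of each interval is the boundary compared.
    return [abs(h[2] - m[2]) for h, m in zip(human, mfa)]


def summarize(diffs, threshold=THRESHOLD_S):
    """Mean and SD in milliseconds, plus percent of pairs within threshold."""
    return {
        "mean_ms": mean(diffs) * 1000,
        "sd_ms": stdev(diffs) * 1000,
        "pct_within": 100 * sum(d <= threshold for d in diffs) / len(diffs),
    }


# Toy intervals for illustration only; these are not data from the study.
human = [("ta1", 0.00, 0.21), ("mai3", 0.21, 0.48), ("le5", 0.48, 0.60)]
mfa = [("ta1", 0.00, 0.22), ("mai3", 0.22, 0.52), ("le5", 0.52, 0.60)]

diffs = compare_boundaries(human, mfa)
stats = summarize(diffs)
print(stats)
```

In a full pipeline the per-file difference lists would be pooled across all retained files before calling `summarize`, and files for which `compare_boundaries` returns None would be counted as exclusions.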
Syllable-level efficiency test: Two expert annotators timed a 30-minute session each. Hardware: Windows Lenovo Legion 5 (i5-10300H, 8GB RAM) and macOS MacBook Pro (1.4 GHz quad-core i5, 8GB RAM). Workflow: copy files, run MFA to generate TextGrids, annotate and log notes. Annotator X: 29 usable files (~100 s total), finalized 14 files in 21 min after setup; Annotator Y: 48 files (131 s total), finalized 14 files in 23 min after setup.
Phrase-level alignment exploration: Investigated whether MFA can directly produce phrase boundaries suited for prosodic analysis. Tested combinations of: (a) phrase-level transcripts with syllable-level dictionaries (large “3500-char” and a smaller transcript-derived dictionary) — both unsatisfactory; (b) phrase-level transcripts with phrase-level dictionaries, trying two formats: phoneme-by-phoneme and syllable-by-syllable pronunciations. Best results came from phrase-level transcripts plus phrase-level dictionary with phoneme-by-phoneme pronunciations.
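One way to picture the best-performing configuration is a dictionary in which each phrase is a single entry whose pronunciation is the concatenated phone sequence of its syllables. The sketch below assumes the usual MFA dictionary layout (an entry label followed by space-separated phones on one line). The phone symbols, the way phrase labels are written, and the helper names are all placeholders for illustration; they are not the phone set of the MFA Mandarin model or the authors' actual dictionary.

```python
# Hypothetical syllable-level entries: pinyin syllable -> phone sequence.
# The phone symbols are placeholders, not the MFA Mandarin phone set.
SYLLABLE_DICT = {
    "shui2": ["sh", "uei2"],
    "ye3": ["ie3"],
    "mei2": ["m", "ei2"],
    "lai2": ["l", "ai2"],
}


def phrase_entry(syllables, syllable_dict):
    """Build one dictionary line: phrase label, then all of its phones."""
    # Assumption: a phrase is written as its syllables joined into one token.
    label = "".join(syllables)
    phones = [p for s in syllables for p in syllable_dict[s]]
    return f"{label} {' '.join(phones)}"


def build_phrase_dictionary(phrases, syllable_dict):
    """Return sorted, de-duplicated dictionary lines for a list of phrases."""
    entries = {phrase_entry(p, syllable_dict) for p in phrases}
    return sorted(entries)


# Toy phrases for illustration; each phrase is a list of pinyin syllables.
phrases = [["shui2"], ["ye3"], ["mei2", "lai2"], ["shui2"]]
dictionary_lines = build_phrase_dictionary(phrases, SYLLABLE_DICT)
print("\n".join(dictionary_lines))
```

The matching phrase-level transcript would then contain the same phrase labels separated by spaces, so that the aligner emits one interval per phrase on its word tier.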
Phrase-level alignment procedure: Using the best combination, ran MFA on the same set of 1120 files. Human annotators adjusted MFA phrase-tier boundaries/labels to create references. Excluded files with mismatched numbers/labels of intervals (354 excluded). Final comparison included 3944 phrase-level boundary pairs. Metrics: absolute boundary-time difference; binary agreement using the 25 ms threshold. Also analyzed effects of recording condition (lab vs local) and speaker gender. Phrase-level efficiency test: same annotators/laptops; 30-minute sessions on lab-recorded subsets (Lab Speaker 6, Groups 3 and 2). Annotator X: 19 usable files (54 s total), finalized 14 in 19 min after setup; Annotator Y: 16 files (50 s), finalized 12 in 22 min after setup.
Syllable-level accuracy: Average human–MFA absolute boundary difference = 15.59 ms (SD 30.41) across 10,487 pairs; per speaker-group averages ranged 2.94–28.58 ms. 73.49% of differences were within 25 ms. Recording condition: local recordings 17.02 ms (SD 31.41), 71.31% within 25 ms; lab recordings 13.80 ms (SD 29.02), 76.20% within 25 ms; difference significant (chi-square, p<.001). Gender: female 15.74 ms (SD 32.29), 73.66% within 25 ms; male 15.48 ms (SD 29.08), 73.38% within 25 ms; no significant effect (p>0.05). Outliers above 100 ms (n=197) were concentrated on wh-words (shuí), the adverb yě, the negation méi, and third-tone and neutral-tone syllables.
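The recording-condition and gender comparisons are chi-square tests on binary agreement (within versus outside 25 ms). A minimal sketch of such a test on a 2×2 table follows. It assumes a Pearson chi-square without continuity correction, which the summary does not specify, and the counts are invented for illustration because per-condition boundary counts are not reported here.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
        [[a, b],
         [c, d]]
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail probability of the chi-square
    # distribution equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts, NOT the study's data:
# rows = recording condition, columns = (within 25 ms, outside 25 ms).
local_within, local_outside = 700, 300
lab_within, lab_outside = 780, 220

stat, p_value = chi_square_2x2(local_within, local_outside, lab_within, lab_outside)
print(f"chi2 = {stat:.2f}, p = {p_value:.2g}")
```

The same function applies to the gender comparison by swapping the row labels; a library routine such as SciPy's contingency-table test would be the more usual choice in practice.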
Syllable-level efficiency: With MFA assistance, annotators completed phone- and syllable-level annotation of about 1 minute of audio in roughly 30–40 minutes. Manual phone-level annotation, by contrast, is reported in the literature to take up to 13 hours per minute of audio (≈800× audio duration). This implies substantial time savings; the authors note that the MFA-assisted workflow is at least 20× faster than previously reported manual workflows.
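The speed-up can be checked with simple arithmetic from the figures quoted above. The short sketch below only restates those numbers; the variable names are illustrative.

```python
MANUAL_HOURS_PER_AUDIO_MIN = 13   # upper bound reported in the literature
ASSISTED_MIN_RANGE = (30, 40)     # MFA-assisted minutes per minute of audio

# Manual effort expressed in minutes of work per minute of audio.
manual_min = MANUAL_HOURS_PER_AUDIO_MIN * 60

# Speed-up at the fast and slow ends of the MFA-assisted range.
speedups = [manual_min / assisted for assisted in ASSISTED_MIN_RANGE]

print(f"manual: {manual_min}x audio duration")
for assisted, speedup in zip(ASSISTED_MIN_RANGE, speedups):
    print(f"assisted at {assisted} min: {speedup:.1f}x faster")
```

The manual figure works out to 780× audio duration, which the summary rounds to about 800×, and the speed-up ranges from 19.5× to 26×, in line with the roughly 20× figure the authors cite.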
Phrase-level accuracy: Average human–MFA absolute boundary difference = 22.49 ms (SD 38.39) across 3944 pairs; per speaker-group averages ranged 1.33–282.11 ms; some extreme errors (e.g., Speaker9_Group4_M, up to 300–700 ms) were attributed to MFA mistakes. 65.57% of differences were within 25 ms. Recording condition: local 25.86 ms (SD 42.48), 57.48% within 25 ms; lab 16.86 ms (SD 38.87), 74.16% within 25 ms; significant difference (chi-square, p<.001). Gender: female 17.64 ms (SD 32.31), 71.95% within 25 ms; male 24.13 ms (SD 41.85), 61.20% within 25 ms; significant gender effect (p<.001), with phrase boundaries in male speech harder to detect.
Phrase-level inputs: Best-performing configuration was phrase-level transcript plus phrase-level dictionary with phoneme-by-phoneme entries; syllable-level dictionaries were unsuitable for phrase alignment.
Error patterns: Lower accuracy around de-stressed/neutral tone syllables, third tone and tone sandhi contexts, functional words (classifiers, negation, adverbs, determiners, prepositions, aspect markers, clause markers), pronunciation variants (e.g., sheí/něi), and er-suffixation (Erhua) causing resyllabification; recording quality strongly influences phrase-level performance.
The findings show that MFA can produce accurate, usable alignments for prosodic research in Mandarin at both syllable and phrase levels. For syllables, agreement within 25 ms matched or exceeded benchmarks from prior phone-level evaluations, and phrase-level agreement (65.57%) is respectable given the greater complexity of suprasegmental boundaries. By enabling automatic boundary proposals that human annotators can refine, MFA substantially reduces annotation time and effort, facilitating larger-scale prosody studies.
Performance varies with factors central to prosody: de-stressed regions (neutral tones, functional words) and complex tonal phenomena (third tone, tone sandhi) yield larger errors, reflecting subtle and context-dependent acoustic realizations. Pronunciation variability (dialectal variants) and Erhua-induced resyllabification further challenge alignment, especially when dictionaries lack variant forms. Recording quality and speaker characteristics (lower pitch in male voices) impact phrase boundary detection, underscoring the role of pitch/intensity cues in prosodic parsing. These insights inform both experimental design (favor professional recording conditions for phrase-boundary studies) and aligner development (robust handling of de-stressed speech, tone sandhi, and variant pronunciations).
This study provides the first detailed evaluation of MFA for prosodic research in Mandarin at syllable and phrase levels. Average human–MFA differences were below the 25 ms gold standard at both levels, with substantial efficiency gains: MFA-assisted workflows reduce annotation time dramatically compared to manual methods. While accuracy declines in non-salient acoustic regions, under pronunciation variation, and with lower-quality audio, overall performance is sufficiently strong to streamline prosody annotation workflows in tonal languages. Recording quality significantly affects accuracy, especially for phrase boundaries, and a gender effect (male speech more difficult at phrase level) was observed. The authors propose a practical workflow: generate phrase-level alignments with phrase-level transcripts and phoneme-by-phoneme dictionaries using MFA, then have annotators adjust boundaries/labels. Anticipated improvements in newer MFA versions (e.g., Mandarin Erhua dictionary including reduced variants) may further enhance performance, especially for Erhua. Future work will extend evaluations to other tonal languages and test updated models/dictionaries.
- Data are read speech from designed ambiguous sentences rather than spontaneous conversational speech, which may limit generalizability to spontaneous prosody.
- Evaluation focused on Mandarin; findings may not directly transfer to other tonal languages without additional testing.
- Reliance on MFA v1.0 models and custom dictionaries; newer versions (e.g., MFA 2.0, Mandarin Erhua dictionary) may perform differently.
- Exclusion of files with mismatched interval counts/labels may bias accuracy estimates by removing more challenging cases (e.g., silent interval discrepancies).
- Phrase-level definitions depend on research aims; results may vary with alternative phrase segmentation schemes.
- Persistent challenges include de-stressed regions (neutral tones, functional words), third tone/tone sandhi, pronunciation variants, and Erhua-induced resyllabification, which can reduce alignment accuracy.
- Gender effect at phrase level suggests variability across speaker populations (e.g., pitch range) that warrants further study.