Exploring the effects of modality and variability on EFL learners’ pronunciation of English diphthongs: a student perspective on HVPT implementation

Linguistics and Languages

A. Almusharraf, A. Aljasser, et al.

This study examines how High Variability Phonetic Training (HVPT) affects the pronunciation accuracy of English diphthongs among Saudi EFL learners. Conducted by Asma Almusharraf, Amal Aljasser, Hassan Saleh Mahdi, Haifa Al-Nofaie, and Elham Ghobain, it finds that both high- and low-variability training significantly improve pronunciation, with low-variability training with captions yielding the strongest performance.

Introduction
The study addresses the persistent challenge EFL learners face in perceiving and producing English vowels—particularly diphthongs—owing to limited exposure, L1 influence, and perceptual biases. Prior work has emphasized High Variability Phonetic Training (HVPT), often for consonants, and shown that explicit training integrating perception and production can improve outcomes. With videos widely used in EFL and tools like YouGlish enabling multimodal, authentic input, questions remain about how modality (captioned vs. non-captioned) and variability (high vs. low talker variability) interact to influence vowel production, especially for Arab learners whose L1 vowel system differs from English. Guided by Cognitive Load Theory, which predicts potential benefits or costs of adding visual text to auditory input, the study investigates whether modality and variability interact to affect pronunciation accuracy of English diphthongs, and explores students’ perceptions of using HVPT via YouGlish. Research questions: (1) Do modality (captions) and variability (HVPT vs. LVPT) interact to affect EFL learners’ pronunciation of English diphthongs? (2) What are students’ perceptions of using HVPT for learning English diphthong pronunciation?
Literature Review
The review synthesizes evidence on speech production training approaches and multimodal input. It contrasts explicit vs. incidental and single- vs. dual-modal training, noting learners’ shifting attention toward production as proficiency increases. Videos can enhance engagement and pronunciation when integrated with clear instructional design and scaffolding; effectiveness depends on video selection, practice duration, and focus (phonetic form vs. meaning). Captioning often aids phonological form retention, word recognition, and pronunciation, though benefits can vary by proficiency and cognitive load: some studies show captions help both low and high proficiency learners, while others caution that captions may be unnecessary or even taxing when attention to phonetic detail is required. HVPT exposes learners to multiple talkers to bolster perception/production robustness, whereas LVPT uses a single talker for controlled exposure; HVPT has shown benefits across ages and settings, though LVPT remains informative. For Arab EFL learners, differences between Arabic and English vowel systems, including fewer vowel phonemes in Arabic and allophony, can make English diphthongs particularly difficult (e.g., fair–fear). Cognitive Load Theory provides a framework to predict how adding captions and varying talkers may increase or reduce extraneous load, shaping learning outcomes.
Methodology
Design: Experimental pretest–posttest study with generalization and delayed tests, plus an attitudes survey. Four conditions crossed variability and modality: (1) low-variability (LV) without captions, (2) high-variability (HV) without captions, (3) HV with captions, and (4) LV with captions. Training used YouGlish videos (American English) targeting English diphthongs.

Participants: 56 Saudi female undergraduates (ages 18–24) majoring in English at a public university in Riyadh (64 initially; attrition reduced the sample to 56). All had STEP scores ≥ 60 and were enrolled in level-two listening/speaking courses (assumed CEFR B2). Participation was voluntary.

Context and grouping: Data collection spanned 13 weeks in Spring 2023. Participants were assigned to the four groups by course section, with random sampling at the section level. LV groups heard one speaker repeat each word multiple times; HV groups heard multiple speakers produce the same words in varied contexts. Training sessions were conducted in class; tests were recorded individually in a soundproof room.

Pronunciation targets and tests: Three English diphthongs were targeted (/əʊ/, /aʊ/, /aɪ/), each represented by five minimal pairs (15 pairs in total). Each student submitted four recordings: a pretest (baseline), a posttest (after training), a generalization test (five novel minimal pairs to assess transfer), and a delayed test (two months later) to assess retention, yielding 56 × 4 = 224 audio files.

Perception (discrimination) test: A native speaker produced one item from each minimal pair; learners selected the word they heard (presented in randomized order) via Google Forms, verifying perceptual discrimination independently of production.

Training procedures: Three YouGlish-based sessions focused on the target diphthongs. Captioned groups viewed highlighted captions during video exposure; non-captioned groups did not. LV groups heard a single speaker multiple times; HV groups heard multiple speakers per word. Learners were instructed to attend to accurate pronunciation and, when applicable, to read the captions.

Rating: Three PhD-level applied linguistics experts independently scored each produced word as correct (1) or incorrect (0). Scores were then combined, and discrepancies were resolved through discussion to reach consensus.

Survey (attitudes): After the intervention, a revised five-point Likert questionnaire adapted from Fu and Yang (2019), reviewed by five EFL experts for content validity, measured perceptions of YouGlish features (accents, multiple speakers, replay, captions, speed). Cronbach's alpha was reported for reliability.

Pilot: A pilot with 15 level-two students validated the word lists and task flow; overly easy words were removed, and list ordering was adjusted to reduce rhyming effects.

Data analysis: Descriptive statistics, independent-samples t-tests, and a two-way repeated-measures ANOVA (SPSS v27) assessed the effects of modality (captions) and variability (HV vs. LV) on pronunciation scores; a one-way ANOVA assessed group differences in attitudes. Partial eta squared was reported for effect sizes.
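The rating procedure above (three independent 0/1 scores per word, combined, with discrepancies resolved by discussion) can be sketched in a few lines. This is a hypothetical reconstruction, not the authors' actual scoring script: the function names and the rule of flagging any word on which the raters split are assumptions.

```python
# Illustrative sketch of combining three raters' binary (0/1) scores per word.
# Hypothetical reconstruction: names and the "flag on any disagreement" rule
# are assumptions, not the study's actual procedure.

def combine_ratings(ratings):
    """ratings: the three 0/1 scores one word received from the three raters.

    Returns (score, needs_discussion): unanimous items keep their score;
    split items are flagged for the consensus discussion described in the study.
    """
    if len(set(ratings)) == 1:      # all three raters agree
        return ratings[0], False
    return None, True               # discrepancy -> resolve by discussion

def score_student(word_ratings):
    """word_ratings: {word: [r1, r2, r3]}. Returns (total score, flagged words)."""
    total, flagged = 0, []
    for word, ratings in word_ratings.items():
        score, needs_discussion = combine_ratings(ratings)
        if needs_discussion:
            flagged.append(word)
        else:
            total += score
    return total, flagged
```

For example, `score_student({"coat": [1, 1, 1], "fear": [1, 0, 1]})` would count "coat" as correct and flag "fear" for discussion.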
Key Findings
- Descriptive pronunciation means: LV no captions = 11.25 (SD 3.99); LV captions = 13.19 (SD 2.10); HV no captions = 12.55 (SD 1.97); HV captions = 9.91 (SD 1.65).
- Repeated-measures ANOVA:
  - Main effect of captions: F(1, 29) = 22.55, p < 0.001, partial η² = 0.159 (significant).
  - Main effect of variability (HV vs. LV): F(1, 29) = 2.073, p = 0.153, partial η² = 0.017 (non-significant).
  - Caption × Variability interaction: F(1, 29) = 110.48, p < 0.001, partial η² = 0.481 (significant), indicating that the effect of captions depended on variability.
- Pairwise comparisons:
  - With captions: significant difference between the LV and HV conditions (mean difference = 0.992, SE = 0.209, p < 0.001).
  - Without captions: significant difference between the HV and LV conditions (mean difference = −0.992, SE = 0.209, p < 0.001).
  - Significant differences between LV with captions vs. HV without captions (p < 0.001) and between HV with captions vs. LV without captions (p < 0.001).
- Attitude survey: Cronbach's alpha = 0.80. Group means: LV no captions = 3.94 (SD 0.379), HV no captions = 3.98 (SD 0.345), HV captions = 4.13 (SD 0.555), LV captions = 4.31 (SD 0.453). One-way ANOVA: F = 2.30, p = 0.08 (no significant group differences). All group means above 3.0 indicate generally positive perceptions of using YouGlish for pronunciation learning.
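The reported reliability coefficient (Cronbach's alpha = 0.80) follows the standard formula α = k/(k−1) · (1 − Σ var_item / var_total). A minimal sketch, assuming made-up item data rather than the study's survey responses:

```python
# Minimal sketch of Cronbach's alpha for a Likert questionnaire.
# The item scores used in any call are illustrative, not the study's data.
from statistics import variance  # sample variance (ddof = 1)

def cronbach_alpha(items):
    """items: list of per-item score lists, one entry per respondent in each.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(respondent totals))
    """
    k = len(items)
    item_var_sum = sum(variance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]  # each respondent's total
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))
```

With two perfectly correlated items, e.g. `cronbach_alpha([[1, 2, 3], [1, 2, 3]])`, the formula returns 1.0; uncorrelated items push the value toward 0.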
Discussion
Both HVPT and LVPT led to pronunciation gains, consistent with prior work showing benefits of video-based and explicit pronunciation training. The significant Caption × Variability interaction indicates that captions do not uniformly aid performance; rather, their impact depends on the variability context. LV with captions yielded the highest mean performance, suggesting that captions combined with controlled, consistent input may reduce extraneous cognitive load and help learners focus on phonetic details, in line with Cognitive Load Theory. HV without captions produced the second-highest performance, supporting HVPT's role in building robust representations across talkers. In contrast, HV with captions produced the lowest mean, implying that combining high talker variability with concurrent caption processing may overload working memory and impede phonetic discrimination and production for these learners. Although many studies report benefits of captioning for pronunciation and lexical processing, the present results suggest that when variability is high, added on-screen text can become a competing demand. Learner perceptions of YouGlish were positive across conditions, indicating broad acceptability and perceived usefulness even where objective benefits varied. These findings underscore the need to align multimodal supports with task demands and learner characteristics to optimize pronunciation training.
Conclusion
The study contributes evidence on how modality (captions) and variability (HV vs. LV) interact in video-based pronunciation training for English diphthongs among Saudi EFL learners using YouGlish. LV with captions was most beneficial, HV without captions also helped, and HV with captions underperformed—highlighting the role of cognitive load in multimodal pronunciation training. Learners reported positive perceptions of YouGlish across all conditions. Implications include tailoring caption use to variability level and learner needs to balance support with cognitive demands. Future research should: (a) examine different proficiency levels and L1 backgrounds; (b) manipulate explicit instruction alongside HVPT; (c) vary caption types (full/partial, phonetic cues) and pacing; (d) test additional vowel contrasts and accents; (e) measure cognitive load directly (e.g., dual-task, subjective load scales) and track retention longitudinally.
Limitations
- The sample was restricted to 56 Saudi female undergraduates from a single university and course level (CEFR B2), limiting generalizability across genders, proficiency levels, institutions, and L1 backgrounds.
- Training targeted only three diphthongs and used American English talkers, narrowing phonetic and accent coverage.
- Grouping was by course section, which may introduce classroom/teacher effects despite random sampling at the section level.
- Measurement constraints: binary 0/1 scoring may mask gradations in pronunciation accuracy; interrater consensus mitigated bias but was not accompanied by formal interrater reliability coefficients.
- Inconsistencies in diphthong notation across materials and sessions could affect replication clarity.
- The precise duration and intensity of exposure per session were not fully quantified, and no direct cognitive load measures were collected.