Deepfake audio as a data augmentation technique for training automatic speech to text transcription models
A. R. Ferreira and C. E. C. Campelo
The study addresses the challenge of obtaining the large, diverse, and high-quality labeled datasets required to train robust speech-to-text models, particularly for languages and accents underrepresented relative to English. Traditional data augmentation methods (e.g., noise addition, pitch changes) improve robustness to background conditions and voice tone but do not address accent variability. The authors propose a data augmentation framework that uses deepfake audio (voice cloning) to generate new utterances preserving speaker characteristics such as accent, hypothesizing that this could improve transcription performance. The framework pairs a voice cloning model, capable of cloning a voice from a few seconds of reference audio, with a small, accent-homogeneous dataset (Indian English). The approach is evaluated by fine-tuning a pre-trained transcription model and measuring performance changes on held-out data using Word Error Rate (WER). Contributions include (1) a ready-to-use framework for deepfake-audio-based augmentation, pluggable with different cloners, and (2) an empirical evaluation across scenarios. Two experiments were conducted: one using the cloner's default models and one using partially retrained cloner components. In both cases, fine-tuning with synthetic data degraded performance versus the pre-trained baseline (an increase of roughly 2 WER percentage points in Experiment 1 and roughly 6 in Experiment 2), likely due to the low quality of the generated audio. The authors posit that higher-quality cloning could yield better results.
Prior augmentation techniques for ASR include speed perturbation (modifying sampling rate) to generate additional variants; SpecAugment, which warps spectrograms and masks frequency/time regions; and SpecSwap, which swaps spectrogram blocks. These methods improve robustness but primarily introduce distortions rather than changing speaker identity or accent. Text-to-speech–based augmentation has also been explored for low-resource settings, showing improvements using synthetic speech and generated text. The present work differs by exploring deepfake audio via voice cloning to preserve specific speaker/accent characteristics while varying linguistic content, a scenario not found in prior literature according to the authors.
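To make the contrast with these prior techniques concrete, the masking step of SpecAugment can be sketched as below. This is a minimal illustration on a NumPy array standing in for a log-mel spectrogram; it omits the time-warping step, and the mask counts and widths are illustrative defaults, not values from any cited paper.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=8,
                 num_time_masks=2, time_mask_width=20, rng=None):
    """SpecAugment-style frequency and time masking (no time warping).

    spec: 2-D array of shape (n_mels, n_frames). Returns a masked copy;
    the input array is left untouched.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    for _ in range(num_freq_masks):
        # Zero out a random horizontal band of mel-frequency channels.
        w = int(rng.integers(0, freq_mask_width + 1))
        f0 = int(rng.integers(0, max(1, n_mels - w)))
        out[f0:f0 + w, :] = 0.0
    for _ in range(num_time_masks):
        # Zero out a random vertical span of time frames.
        w = int(rng.integers(0, time_mask_width + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        out[:, t0:t0 + w] = 0.0
    return out
```

Note how the augmentation distorts the acoustic representation without changing who is speaking, which is exactly the limitation the deepfake-based approach aims to sidestep.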
Theoretical foundation: The voice cloner used is Real-Time Voice Cloning (RTVC), an implementation of the SV2TTS architecture with three independently trained components: (1) a speaker encoder (trained for speaker verification) that produces a fixed-size embedding from a short reference audio; (2) a Tacotron 2-based synthesizer that generates a mel spectrogram from the embedding and input text; and (3) a WaveRNN-based vocoder that converts the mel spectrogram to a waveform. The transcription model is DeepSpeech, an open-source RNN-based ASR system that offers pre-trained models and supports fine-tuning.

Dataset: The NPTEL2020 Indian English Speech Dataset (sourced from YouTube videos) was chosen to control for accent (South Asian). Because the full set is very large and uneven in transcription quality, the experiments use NPTEL's Pure-Set: 1,000 audio segments with manually verified transcriptions.

Preprocessing: A custom script assigns unique sequential IDs, normalizes audio (ffmpeg-normalize) to 16 kHz, applies noise removal and a high-pass filter, and removes items with empty transcriptions. Subsets can be created by random partition without repetition. Additional scripts prepare data for cloner training (placing files into the expected directory structure) and for DeepSpeech (creating training/validation/test CSVs with lowercased, normalized text and numbers mapped to words). For generated audios, a quality filter discards likely-bad samples by comparing their duration against the original audio using two thresholds: gap size percentage (relative overlength) and gap size (absolute overlength). Only audios exceeding both thresholds are discarded.

Cloner training: Since the dataset lacks speaker IDs, only the synthesizer and vocoder can be trained (the encoder requires speaker-labeled data). The RTVC repository's preprocessing and training scripts are used, and multiple combinations of fine-tuning and retraining for the synthesizer/vocoder are explored with different numbers of training steps.
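The duration-based quality filter can be sketched as follows. The two-threshold logic (discard only when both the relative and the absolute overlength are exceeded) is taken from the description above; the threshold values themselves are hypothetical placeholders, since the summary does not report the ones actually used.

```python
def should_discard(original_sec: float, generated_sec: float,
                   max_gap_pct: float = 0.5, max_gap_sec: float = 3.0) -> bool:
    """Duration-based quality filter for generated audios.

    A generated sample is discarded only when its overlength relative to
    the original exceeds BOTH thresholds: the relative gap (fraction of
    the original duration) and the absolute gap (seconds). The default
    thresholds here are illustrative, not the paper's values.
    """
    gap = generated_sec - original_sec      # absolute overlength in seconds
    if gap <= 0:                            # not longer than the original: keep
        return False
    gap_pct = gap / original_sec            # relative overlength
    return gap_pct > max_gap_pct and gap > max_gap_sec
```

Requiring both thresholds keeps short clips from being discarded for a small absolute gap that is large in relative terms, and long clips from being discarded for a moderate gap that is large in absolute terms.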
Audio generation: A main script selects, for each reference audio, up to a configurable limit of other references whose transcriptions are cloned into new audios using the reference's voice (a reference's own transcription is never reused). An auxiliary script (adapted from RTVC's CLI demo) performs inference through the encoder→synthesizer→vocoder pipeline and may skip items on error or when the resulting spectrogram is too small.

Transcription-model training and evaluation: DeepSpeech fine-tuning uses the preprocessed CSVs; validation loss is monitored each epoch and the best-epoch model is retained. Inference is conducted via a dedicated script that loads a model/scorer and runs on a CSV of samples, computing the average Word Error Rate (WER) across items. Multiple training scenarios vary hyperparameters (e.g., dropout, scorer inclusion), typically with 200 training epochs per scenario.

Experiments: Two experiments were conducted. Experiment 1 uses RTVC's default models to generate 10,458 synthetic audios from 498 reference clips (limit=21), trains DeepSpeech in various setups, and evaluates on a separate 500-clip split from the Pure-Set. Experiment 2 retrains RTVC's synthesizer and/or vocoder on 494 audios (from a designated split) under various combinations; a qualitative assessment of generated audio quality (poor/reasonable/good scoring) selects the best cloner combination ("sys zero voc"). Using this combination, 10,400 audios are generated from 200 references (limit=52), DeepSpeech is fine-tuned in scenarios analogous to Experiment 1, and evaluation is performed on a 300-clip split. Long-duration generated audios are preferentially discarded due to their observed poor quality.
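The evaluation metric, average WER over a CSV of samples, can be computed with a standard word-level edit distance. The sketch below is a generic implementation for illustration, not the project's actual evaluation script.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between the ref words seen so far
    # and the first j hypothesis words (rolling single-row DP).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / len(ref)

def average_wer(pairs):
    """Mean WER across (reference, hypothesis) pairs, as reported per split."""
    return sum(wer(r, h) for r, h in pairs) / len(pairs)
```

Averaging per-utterance WER, as described above, weights every sample equally; an alternative convention pools edit counts over the whole corpus, which weights long utterances more heavily.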
- Experiment 1 (RTVC default models): Fine-tuning DeepSpeech on the 10,458 cloned audios degraded performance relative to the pre-trained model. WERs on a 500-audio evaluation split: pre-trained 0.636; fine-tuned (standard, no scorer) 0.657; fine-tuned (dropout=0.4, no scorer) 0.709; fine-tuned (standard, with scorer) 0.681. Qualitative inspection revealed that many generated audios were partially or wholly unintelligible.
- Experiment 2 (partially retrained cloner): Multiple synthesizer/vocoder training combinations were evaluated qualitatively; the best-performing combinations were “standard” and “sys zero voc,” with “sys zero voc” selected for further use. Long-duration synthetic audios were predominantly poor quality, validating the duration-based discard heuristic. Using “sys zero voc,” 10,400 audios were generated and used for DeepSpeech fine-tuning; evaluation on 300 clips showed a further degradation with WER increasing by approximately six percentage points compared to the pre-trained model.
- Overall, fine-tuning with deepfake-generated data reduced transcription quality in both experiments, consistent with the low fidelity of the cloned audio.
- Likely causes: poor-quality generation from the cloner (especially for longer utterances), noisy source data from YouTube, specialized/technical vocabulary uncommon in pre-training corpora. The inability to retrain the encoder (due to missing speaker labels) likely limited cloner quality gains.
The research question, whether deepfake audio can serve as an effective data augmentation technique to improve ASR performance, was answered in the negative under the tested conditions. In both experiments, augmented training with cloned audio degraded WER relative to the pre-trained baseline, indicating that fine-tuning was harmed by synthetic data of insufficient quality. The qualitative analyses showed widespread artifacts (e.g., speech interruptions and noise bursts where the synthesizer failed on words), with long-duration samples particularly problematic. These characteristics likely biased the ASR toward erroneous patterns during fine-tuning, overwhelming any benefit from accent preservation. The literature suggests that augmentations that modestly perturb acoustics, or that leverage high-quality TTS, can help; when synthetic data are low-fidelity, however, the net effect can be negative. The authors argue that improvements are plausible with higher-quality cloning (especially by training the encoder on speaker-labeled data) and/or with cleaner, less technical source datasets. The framework thus remains promising, contingent on better cloners and data conditions.
The paper introduces a framework to use deepfake audio (voice cloning) as a data augmentation method for ASR training and evaluates it on Indian-accent English using DeepSpeech. Despite the intended advantage of preserving accent while varying content, experiments with RTVC default and partially retrained cloner components showed consistent degradation in WER after fine-tuning versus the pre-trained ASR, attributed to low-quality synthetic audio and dataset noise/technical vocabulary. The main contributions are the deployment-ready augmentation framework and the first empirical exploration (to the authors’ knowledge) of deepfake-audio augmentation for ASR. Future work should aim to improve cloned audio fidelity by better training of cloner components (especially the encoder with speaker-labeled data), exploring alternative cloning models, refining hyperparameters/architectures, and employing cleaner, larger datasets with speaker IDs and less technical language. With higher-quality synthesis, the proposed framework may yield positive gains.
- Synthetic audio quality: RTVC-generated audios frequently contained artifacts, with long utterances especially poor; this likely undermined ASR fine-tuning.
- Cloner training constraints: The encoder could not be trained due to absence of speaker labels in the dataset, limiting potential improvements in voice similarity and synthesis quality.
- Dataset noise and domain mismatch: Source audios from YouTube were noisy and often contained technical vocabulary uncommon in standard ASR/cloner pre-training corpora.
- Single cloner model tested: Only Real-Time Voice Cloning (open-source) was used; potentially better-performing (but closed-source) cloners were unavailable.
- Limited data scale for cloner training and evaluation splits; results may not generalize across languages, accents, or cleaner datasets.
- Evaluation limited to WER; no analysis of accent robustness improvements independent of absolute WER was reported.