Introduction
The rapid advancement of artificial intelligence (AI), fueled by increased computational power and the vast amount of data available online, has led to significant progress in many applications, including speech-to-text transcription. These models underpin technologies such as translators, virtual assistants, voice search, and audio sentiment analysis. Training them requires labeled datasets of audio samples paired with human-generated transcriptions. Robust automatic speech recognition (ASR) models must handle variation within a language, such as accents, which in turn requires extensive and diverse datasets. Acquiring such datasets is difficult, particularly for languages other than English, and producing them is costly and time-consuming, demanding significant resources and manual transcription effort to ensure quality. Data augmentation techniques offer a way forward. Existing methods, such as adding background noise or modifying pitch, improve model robustness to specific conditions (e.g., background noise) but do not address variation such as accents. This work explores deepfake audio as a novel data augmentation approach: deepfake audio technology synthesizes speech that mimics a specific individual, preserving characteristics such as accent. The research investigates the impact of deepfake audio on ASR training by implementing a framework that uses a voice cloning model and a small dataset to generate augmented data, and evaluates the framework through experiments with a pre-trained transcription model, assessing its effectiveness before and after training with the augmented data.
Literature Review
Numerous data augmentation techniques have been developed to address the need for large datasets when training speech-to-text models. Some modify existing audio, such as speed perturbation (adjusting sampling rates), SpecAugment (masking regions of the spectrogram), and SpecSwap (swapping frequency and time blocks within the spectrogram). Other approaches generate synthetic audio with text-to-speech models; Zevallos (2022) demonstrated successful augmentation using synthetic audio and text generation for the Quechua language. This paper distinguishes itself by proposing deepfake audio as the augmentation technique: unlike simpler transformations or standard text-to-speech models, it preserves dataset characteristics such as accent, offering a potential solution to the challenge of representing diverse accents in training data.
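For concreteness, the following is a minimal sketch of the spectrogram-masking style of augmentation mentioned above (SpecAugment-like frequency and time masking) using torchaudio. The parameter values and the `spec_augment` helper are illustrative assumptions, not settings taken from the paper or the cited works.

```python
# Minimal sketch of SpecAugment-style masking with torchaudio, for illustration only.
# Masking parameters below are assumptions, not values used in the paper.
import torch
import torchaudio

def spec_augment(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Convert a waveform to a mel spectrogram and apply frequency/time masking."""
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)(waveform)
    mel = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)(mel)  # mask a band of mel bins
    mel = torchaudio.transforms.TimeMasking(time_mask_param=35)(mel)       # mask a span of frames
    return mel

# Usage: waveform, sr = torchaudio.load("sample.wav"); augmented = spec_augment(waveform, sr)
```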
Methodology
This research employs a qualitative experimental study organized into several key stages:

**A. Dataset:** The study used the NPTEL2020 Indian English Speech Dataset, specifically a subset of 1000 manually transcribed audio clips taken from YouTube videos by Indian speakers, chosen to ensure a consistent accent.

**B. Data Preprocessing:** Custom scripts preprocessed the dataset: generating unique IDs, normalizing audio (resampling to 16,000 Hz, noise removal, high-pass filtering), removing clips with empty transcriptions, and creating subsets. Further scripts prepared the data for voice cloner training and for the DeepSpeech transcription model, and discarded generated clips of poor quality (those excessively long compared with the original) to improve training data quality.

**C. Voice Cloner Training:** The Real-Time Voice Cloning system, an implementation of the SV2TTS architecture, was used. It comprises an encoder, a synthesizer (Tacotron 2-based), and a vocoder (WaveRNN-based). Because of dataset limitations, only the synthesizer and vocoder were trained, using the scripts provided in the Real-Time Voice Cloning repository.

**D. Audio Generation:** Custom scripts generated the deepfake audio: a reference clip is selected, and multiple clips are synthesized using the voice from that reference but with the transcriptions of other clips from the dataset (a sketch of this step, together with the length-based filter from stage B, appears after this list).

**E. Training the Transcription Model:** The open-source DeepSpeech model was used. The preprocessed data was fed into DeepSpeech for training under several scenarios with different hyperparameters (dropout, scorer).

**F. Inference and Evaluation:** A script ran inference on the DeepSpeech model, and the Word Error Rate (WER) metric was used to assess transcription accuracy by comparing original and generated transcriptions. This evaluation was performed before and after training the model with the augmented deepfake audio.
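The following is a minimal, hedged sketch of stages B and D above, modeled on the Real-Time Voice Cloning repository's demo workflow. The checkpoint paths, the 1.5x length threshold, and the `generate_clone`/`keep_generated` helpers are illustrative assumptions rather than the paper's actual scripts, and module layout may differ between versions of the repository.

```python
# Sketch of deepfake generation plus length-based filtering, based on the
# Real-Time Voice Cloning (SV2TTS) repository. Paths and thresholds are assumptions.
from pathlib import Path

import numpy as np
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Load the three SV2TTS components (checkpoint paths depend on the local checkout).
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/pretrained/pretrained.pt"))
vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))

def generate_clone(reference_wav_path: str, text: str) -> np.ndarray:
    """Clone the voice in the reference clip and speak the given transcription."""
    wav = encoder.preprocess_wav(reference_wav_path)              # resample/trim the reference
    embed = encoder.embed_utterance(wav)                          # speaker embedding
    specs = synthesizer.synthesize_spectrograms([text], [embed])  # text + voice -> mel spectrogram
    return vocoder.infer_waveform(specs[0])                       # mel spectrogram -> waveform

def keep_generated(generated: np.ndarray, original: np.ndarray, max_ratio: float = 1.5) -> bool:
    """Discard generated clips much longer than the original (both at the same sample rate)."""
    return len(generated) <= max_ratio * len(original)
```

Here the length check acts as a cheap proxy for synthesis failures, consistent with the paper's observation that longer generated clips tend to be of poorer quality.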
Key Findings
Two main experiments were conducted.

**Experiment 1:** Used the pre-trained voice cloner models. The dataset was split into two portions: one for evaluation and one for generating deepfake audio to train the ASR model. Despite training the ASR model under multiple scenarios, the WER increased by about 2%, indicating a decrease in transcription quality, which was attributed to the poor quality of the deepfake audio.

**Experiment 2:** Trained the voice cloner's synthesizer and vocoder to improve deepfake audio quality. The dataset was split into three portions: one for voice cloner training, one for evaluation, and one for generating deepfake audio to train the ASR model. Several training scenarios were explored for the voice cloner, but qualitative analysis showed that most of the generated clips remained of poor quality even after retraining. Using the best-performing combination of trained voice cloner models, the WER increased by approximately 6% after ASR training, again reflecting the negative impact of poor deepfake audio quality. Further analysis confirmed a correlation between longer generated clips and poorer quality, and manual inspection suggested that technical vocabulary and background noise in the source dataset were significant factors behind the low quality of the synthesized speech.
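For context on the reported deltas, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference. Below is a minimal sketch of the before/after comparison; the jiwer library and the toy strings are assumptions for illustration, not the paper's evaluation scripts.

```python
# Toy before/after WER comparison; jiwer is one common way to compute word error rate
# and is an assumption here -- the paper's evaluation scripts are not shown.
import jiwer

references = ["the quick brown fox", "speech recognition is hard"]
before_training = ["the quick brown box", "speech recognition is hard"]
after_training = ["a quick brown box", "speech recognition was hard"]

wer_before = jiwer.wer(references, before_training)
wer_after = jiwer.wer(references, after_training)
print(f"WER before: {wer_before:.3f}, after: {wer_after:.3f}, delta: {wer_after - wer_before:+.3f}")
```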
Discussion
The experiments demonstrate that the proposed framework for data augmentation using deepfake audio is viable but heavily reliant on the quality of the deepfake audio generated. The consistently poor quality of the audios produced, even after retraining the voice cloner, resulted in a degradation in the ASR model's performance. This suggests that the chosen voice cloning model (Real-Time Voice Cloning) and/or the source dataset's characteristics were insufficient for generating high-quality deepfake audio suitable for this augmentation approach. The increase in WER across all experiments highlights the importance of high-quality synthetic data for effective data augmentation.
Conclusion
This research introduced a framework for utilizing deepfake audio as a data augmentation method for ASR model training. Experiments revealed that the approach's effectiveness is strongly tied to the quality of the generated deepfake audio. Future research should focus on improving deepfake audio quality through better training of the voice cloner (potentially using a different model), improving the source dataset (reducing noise, simpler language), and exploring advanced data augmentation strategies to mitigate the challenges observed. The development of a robust deepfake audio generation technique is crucial for successful data augmentation in ASR model training.
Limitations
The primary limitation was the quality of the deepfake audio generated by the Real-Time Voice Cloning system. This was influenced by the characteristics of the source dataset (noisy audios, technical language). The inability to train the encoder model of the voice cloner, due to the absence of speaker identification information in the dataset, also limited the potential for improvement. The study's scope was limited to a single voice cloning model and a specific dataset.