Training robust automatic speech-to-text (ASR) models requires large, diverse, labeled datasets. This is challenging, especially for less-popular languages. This paper proposes a data augmentation framework using deepfake audio to address this issue. Experiments using a voice cloner and an Indian English dataset showed that while the framework is functional, the quality of the deepfake audio significantly impacted the ASR model's performance, leading to increased Word Error Rate (WER).
Publisher
arXiv
Published On
Sep 22, 2023
Authors
Alexandre R Ferreira, Cláudio E C Campelo
Tags
speech-to-text
ASR models
data augmentation
deepfake audio
Word Error Rate
Indian English
audio processing
Related Publications
Explore these studies to deepen your understanding of the subject.