Abstract
Training robust automatic speech recognition (ASR) models requires large, diverse, labeled datasets, which are hard to obtain, especially for less-popular languages. This paper proposes a data augmentation framework that uses deepfake audio to address this issue. Experiments with a voice cloner and an Indian English dataset showed that the framework is functional, but the quality of the deepfake audio significantly affected the ASR model's performance, increasing its Word Error Rate (WER).
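For readers unfamiliar with the metric, WER is conventionally defined as the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not the evaluation code used in the paper; the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; tokenizes on whitespace only,
    with no case folding or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(word_error_rate("the cat sat", "the bat sat"))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.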
Publisher
arXiv
Published On
Sep 22, 2023
Authors
Alexandre R Ferreira, Cláudio E C Campelo
Tags
speech-to-text
ASR models
data augmentation
deepfake audio
Word Error Rate
Indian English
audio processing