SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue in Multiple Domains

Computer Science

S. Si, W. Ma, et al.

Discover the groundbreaking SpokenWOZ dataset, a large-scale speech-text resource for task-oriented dialogue, featuring over 203k turns and 249 hours of real human interactions. This research, conducted by an accomplished team from Alibaba Group and the University of Michigan, tackles the complexities of spoken language that traditional text datasets often overlook.

Introduction
The paper addresses the gap between task-oriented dialogue (TOD) systems trained on written, annotator-generated datasets and the realities of human-to-human spoken conversation. Written datasets enable strong model performance but do not reflect spoken phenomena such as backchannels, disfluencies, incremental (word-by-word) processing, and the need for implicit reasoning. Existing spoken TOD datasets are limited by small scale, a lack of human-to-human audio, and an emphasis mainly on robustness to automatic speech recognition (ASR) errors rather than on the challenges unique to speech. The authors propose SpokenWOZ, a large-scale, dual-modal (speech-text), multi-domain benchmark built to model these spoken characteristics. They introduce two new evaluation challenges, cross-turn slot detection (handling values provided over multiple turns; sketched below) and reasoning slot detection (temporal, mathematical, and semantic reasoning over implicit expressions), and provide comprehensive baselines to quantify the difficulty of spoken TOD.
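To make the cross-turn challenge concrete, here is a minimal sketch (hypothetical, not the authors' code) of the bookkeeping a tracker needs when a single value, such as a phone number in the profile domain, arrives in fragments and is later corrected:

```python
# Hypothetical illustration of cross-turn slot detection: a phone number
# is dictated over several turns and then partially corrected, so the
# tracker must merge fragments into one slot and revise it in place.

def track_phone_slot(turns):
    """Accumulate and correct a 'profile-phone' value turn by turn."""
    state = {"profile-phone": ""}
    for action, fragment in turns:
        current = state["profile-phone"]
        if action == "append":       # the value continues across turns
            state["profile-phone"] = current + fragment
        elif action == "correct":    # the user revises the last chunk
            state["profile-phone"] = current[:-len(fragment)] + fragment
    return state

# User: "my number is 555 12..." / "...34" / "sorry, it ends in 35"
turns = [("append", "55512"), ("append", "34"), ("correct", "35")]
print(track_phone_slot(turns))  # {'profile-phone': '5551235'}
```

An extractive tracker that copies a single span per turn cannot represent this merge-and-revise behavior, which is consistent with the paper's finding that generative models handle cross-turn slots better.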
Literature Review
The work situates itself among spoken TOD datasets: ATIS (41 dialogues; single-domain travel), DSTC2 (a spoken corpus of limited scale, collected in a human-to-machine setup), and DSTC10 (107 dialogues; provides ASR N-best hypotheses without audio). Compared to these, SpokenWOZ is the first large-scale speech-text TOD dataset with human-to-human audio, broader domain coverage, and an explicit focus on spoken-specific challenges beyond ASR noise.
Methodology
Dataset construction: SpokenWOZ extends the Wizard-of-Oz pipeline to real-time voice calls between humans. It covers 8 domains: 7 inherited from MultiWOZ plus a new 'profile' domain for collecting personal information as booking confirmation. The dataset comprises 5,700 dialogues, 203,074 utterances, and 249 hours of audio, split into 4,200/500/1,000 train/dev/test dialogues with 183/22/44 hours of corresponding audio.

Data collection: 250 qualified participants (from 1,520 applicants, a 16.4% pass rate) conducted phone-call dialogues in which one played the user (following template-generated task goals) and the other the agent (querying an online database mirroring MultiWOZ content). Participants came from Canada, Singapore, China, and South Africa. A crowdsourced audio-quality review removed low-quality recordings and dialogues that failed to achieve their goals.

Annotation: 15 trained annotators labeled dialogue states and dialogue acts on ASR-transcribed text; agent utterances were manually transcribed and cleaned from the audio. The schema expands MultiWOZ (e.g., adding a 'backchannel' act) and defines the new cross-turn and reasoning slots. Strict quality control combined scripted checks, full inspections, and 10% random checks; batches below 97% turn-level accuracy were re-inspected.

Costs and timeline: over 8 months in total; approximately $55k (audio $30k, annotation $20k, ASR $5k).

Audio details: two-track recordings (user/agent) at an 8 kHz sample rate, with word-level timestamps enabling text-speech alignment; the ASR word error rate is 6.1% (measured on agent utterances).

Spoken characteristics captured: word-by-word incremental processing (backchannels, disfluencies, incomplete utterances), natural ASR noise (audio is recorded first, then transcribed), and reasoning in spoken language (temporal, mathematical, semantic).

New tasks: Cross-turn slot detection targets values provided over multiple turns (e.g., phone, email, ID, name, and license plate in the profile domain), requiring fine-grained updates and corrections. Reasoning slot detection requires temporal, mathematical, or semantic inference to resolve implicit expressions (e.g., deriving bookday from 'tomorrow' or inferring a headcount from a family's composition); the temporal case is sketched below.

Baselines: Text-only dialogue state tracking (DST): BERT+TripPy, SPACE+TripPy, UBAR, SPACE. Dual-modal DST: SPACE+WavLM+TripPy, SPACE+WavLM, and SPACE+WavLM align (which uses the word-level text-speech alignment). Response generation (policy optimization and end-to-end): UBAR, GALAXY, SPACE, SPACE+WavLM, SPACE+WavLM align. LLMs (ChatGPT gpt-3.5-turbo; InstructGPT text-davinci-003) were evaluated zero-shot for DST via standardized prompts (an assumed prompt sketch appears below); LLM response generation was not evaluated because these models struggle to produce the delexicalized special tokens required for fair comparison.
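To illustrate a reasoning slot, the sketch below (an illustrative assumption; the helper function is hypothetical) shows the temporal inference needed to resolve 'tomorrow' into a concrete bookday value relative to a day established earlier in the call:

```python
# Hypothetical sketch of temporal reasoning for the 'bookday' slot:
# the weekday is never stated, so it must be derived from a relative
# expression plus a reference day mentioned earlier in the dialogue.

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve_bookday(utterance, reference_day):
    """Map a relative time expression to a weekday slot value."""
    text = utterance.lower()
    idx = WEEKDAYS.index(reference_day.lower())
    if "day after tomorrow" in text:    # check the longer phrase first
        return WEEKDAYS[(idx + 2) % 7]
    if "tomorrow" in text:
        return WEEKDAYS[(idx + 1) % 7]
    return None  # no relative expression; fall back to normal extraction

# "Today is Friday" ... "book it for tomorrow" -> bookday = 'saturday'
print(resolve_bookday("book it for tomorrow", "friday"))  # saturday
```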
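The paper says LLMs were probed with standardized zero-shot prompts, but the exact template is not reproduced here, so the following is an assumed, minimal sketch of how such a prompt might be constructed:

```python
# Assumed sketch of zero-shot DST prompting (not the paper's exact
# template): the LLM receives the slot schema and dialogue history and
# is asked to emit the current state as domain-slot=value pairs.

def build_dst_prompt(domains, slots, history):
    """Compose a zero-shot DST prompt for a chat-style LLM."""
    lines = [
        "You are a dialogue state tracker for task-oriented dialogue.",
        f"Domains: {', '.join(domains)}.",
        f"Trackable slots: {', '.join(slots)}.",
        "Read the dialogue and output the current dialogue state as",
        "domain-slot=value pairs, one per line. Output 'none' if empty.",
        "",
        "Dialogue:",
    ]
    lines += [f"{speaker}: {text}" for speaker, text in history]
    lines.append("State:")
    return "\n".join(lines)

history = [("user", "i need a hotel in the centre for tomorrow"),
           ("agent", "sure , for how many nights ?")]
print(build_dst_prompt(["hotel"], ["hotel-area", "hotel-bookday"], history))
```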
Key Findings
- Dataset scale and content: SpokenWOZ includes 8 domains, 5,700 dialogues, 203k+ turns, and 249 hours of human-to-human audio, with word-level text-speech alignment. Train/dev/test splits are 4,200/500/1,000 dialogues, averaging roughly 35-37 turns per dialogue, with two-track 8 kHz audio and an ASR WER of 6.1% on agent utterances.
- Dialogue state tracking (JGA, joint goal accuracy; reported with and without cross-turn slots):
• BERT+TripPy: 14.78 (w/), 15.58 (w/o)
• SPACE+TripPy: 16.24 (w/), 17.31 (w/o)
• SPACE+WavLM+TripPy: 18.71 (w/), 20.90 (w/o)
• UBAR: 20.54 (w/), 23.51 (w/o)
• SPACE: 22.73 (w/), 26.99 (w/o)
• SPACE+WavLM: 24.09 (w/), 27.34 (w/o)
• SPACE+WavLM align: 25.65 (w/), 28.15 (w/o)
• ChatGPT (zero-shot): 13.75 (w/), 16.30 (w/o)
• InstructGPT003 (zero-shot): 14.15 (w/), 16.49 (w/o)
Findings: SpokenWOZ is substantially harder than written MultiWOZ, where JGA is typically above 60%. Removing cross-turn slots improves every model, underscoring the difficulty of values spread over multiple turns. Dual-modal systems that leverage speech (WavLM) outperform their text-only counterparts, and word-level alignment helps further. Generative models (UBAR, SPACE) exceed extractive ones (TripPy), particularly on cross-turn and reasoning slots and under ASR noise. LLM zero-shot performance lags behind all supervised baselines.
- Response generation (policy optimization and end-to-end; metrics: INFORM, SUCCESS, BLEU, Combined Score, verified in the check below) is markedly weaker than on written datasets, where MultiWOZ systems typically reach INFORM ~90%+, SUCCESS ~85%+, BLEU ~15%+, and Combined ~105%+. End-to-end results:
• UBAR: INFORM 62.50, SUCCESS 48.10, BLEU 9.69, Combined 64.99
• GALAXY: INFORM 65.80, SUCCESS 38.50, BLEU 20.10, Combined 72.25
• SPACE: INFORM 66.40, SUCCESS 50.60, BLEU 21.34, Combined 79.84
• SPACE+WavLM: INFORM 67.20, SUCCESS 51.30, BLEU 21.46, Combined 80.71
• SPACE+WavLM align: INFORM 68.30, SUCCESS 52.10, BLEU 22.12, Combined 82.32
Findings: Dual-modal inputs improve both policy optimization and end-to-end generation; dialogue-act flows are more diverse in SpokenWOZ, which complicates response generation.
- Slot-type difficulty (MAMS Acc): reasoning, cross-turn, and ASR-sensitive slots are notably harder than normal slots; speech-text alignment improves these challenging categories.
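The Combined Score above follows the standard MultiWOZ convention, Combined = (INFORM + SUCCESS) / 2 + BLEU, which the reported end-to-end numbers reproduce exactly; a quick check:

```python
# Combined Score under the standard MultiWOZ convention:
# combined = (inform + success) / 2 + bleu.

def combined_score(inform, success, bleu):
    return (inform + success) / 2 + bleu

print(combined_score(62.50, 48.10, 9.69))   # ~64.99 (UBAR)
print(combined_score(68.30, 52.10, 22.12))  # ~82.32 (SPACE+WavLM align)
```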
Discussion
The findings confirm that modeling spoken-specific phenomena is essential: incremental word-by-word speech, ASR errors, and implicit reasoning significantly degrade performance relative to written datasets. Cross-turn and reasoning slots expose weaknesses in extractive, span-based DST, while generative approaches better handle implicit or noisy values. Incorporating speech via dual-modal architectures consistently improves DST and response generation, especially when aligning word-level text and audio. LLMs, despite strong general NLP ability, perform poorly in zero-shot DST on spoken data and can hallucinate, indicating that specialized supervision, multimodal inputs, and task-specific prompting remain necessary. The results validate SpokenWOZ as a challenging and realistic benchmark and point to clear research directions in dual-modal fusion, robust generative DST, slot reasoning, and better prompting for LLMs.
Conclusion
The paper presents SpokenWOZ, a large-scale, dual-modal benchmark for spoken task-oriented dialogue with 8 domains, 5.7k dialogues, 203k+ turns, and 249 hours of human-to-human audio. It captures key spoken characteristics (incremental processing, ASR noise, implicit reasoning) and introduces two new challenges: cross-turn slot and reasoning slot detection. Extensive baselines show that spoken TOD remains challenging; dual-modal models and generative approaches outperform text-only and extractive methods, while LLM zero-shot performance lags and exhibits hallucinations. The authors aim for SpokenWOZ to facilitate progress in realistic spoken TOD modeling. Future work includes improved prompt engineering for LLMs on SpokenWOZ and exploring fair LLM-based response generation without delexicalization constraints.
Limitations
The paper does not report journal peer-review outcomes and provides no explicit limitations section. Experimental limitations include: (1) LLMs are evaluated only in zero-shot mode for DST, with noted hallucination issues; (2) response generation with LLMs is not evaluated due to difficulties producing delexicalized special tokens, precluding fair comparison; (3) results may depend on the chosen ASR system (WER 6.1% for agent utterances) and the specific dual-modal encoder (WavLM). Dataset collection focuses on English human-to-human calls from four countries due to legal and budget constraints, which may influence accent and speaking style diversity.