Introduction
Task-oriented dialogue (TOD) systems, designed to help users achieve specific goals (e.g., booking flights or making restaurant reservations), have seen significant advancements. This progress is largely attributed to well-designed datasets, such as those created with the Wizard-of-Oz method. These datasets, however, predominantly contain text written by annotators rather than transcribed spoken conversations. While written TOD datasets have fueled progress in TOD models, they fail to capture the complexities of real spoken interactions. A few small-scale spoken TOD datasets have emerged to address robustness issues (e.g., handling Automatic Speech Recognition (ASR) errors), but they are limited in scale and do not fully cover the unique challenges of spoken language. Existing datasets suffer from small scale (e.g., ATIS contains only 41 dialogues, and DSTC2 and DSTC10 offer limited numbers of dialogues), a lack of human-to-human audio (DSTC2 uses human-to-machine interactions; DSTC10 provides only ASR hypotheses without audio), and a narrow focus on ASR noise that ignores other crucial spoken characteristics such as word-by-word processing and commonsense reasoning.
This paper addresses these limitations by introducing SpokenWOZ, a large-scale speech-text dataset for spoken TOD built from real human-to-human conversations using an extended Wizard-of-Oz methodology. The authors specifically target challenges arising from incomplete utterances (via cross-turn slot detection) and indirect expressions (via reasoning slot detection). Building SpokenWOZ took over 8 months of work and a budget of approximately $55,000, and the dataset achieves a turn-level annotation accuracy above 97%.
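To make these two challenges concrete, below is a minimal, hypothetical illustration; the dialogue text, slot names, and values are invented for this sketch and are not drawn from SpokenWOZ. A cross-turn slot requires aggregating information across the agent's and user's turns, while a reasoning slot requires inferring a value that is never stated verbatim.

```python
# Hypothetical illustration (not from the dataset) of the two new challenges.

# Cross-turn slot: the value is introduced by the agent and only confirmed
# by the user in a later turn, so the state update spans multiple turns.
cross_turn_example = [
    {"speaker": "agent", "text": "I can book you the 14:05 train to Cambridge."},
    {"speaker": "user",  "text": "Yes, that works for me."},
]
# Expected state after the user's confirmation:
#   {"train-leaveat": "14:05", "train-destination": "cambridge"}

# Reasoning slot: the value is expressed indirectly and must be inferred
# with commonsense reasoning rather than copied from the utterance.
reasoning_example = [
    {"speaker": "user", "text": "I'd like to check in the day after tomorrow."},
]
# If today is Tuesday, the expected value would be:
#   {"hotel-bookday": "thursday"}
```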
Literature Review
Several datasets have attempted to model spoken TOD. ATIS, focusing on single-domain travel reservations, is limited to 41 dialogues. DSTC2 offers spoken corpora but remains small in scale. DSTC10, while addressing spoken TOD, provides only 107 dialogues, released as ASR hypotheses without audio. SpokenWOZ distinguishes itself as the first large-scale speech-text dataset for spoken TOD, exceeding previous datasets in scale and addressing their shortcomings. The paper references several key works on TOD models, highlighting advancements in both text-based and spoken dialogue systems.
Methodology
The SpokenWOZ dataset was constructed in two main stages: dialogue audio collection and dialogue annotation. For audio collection, 250 participants (selected after a qualification test to ensure data quality) engaged in 5,700 dialogues via phone calls. One participant acted as a user, following template-generated task goals, while the other acted as an agent, searching a database (similar to MultiWOZ but implemented as an online database to enhance realism). Rigorous quality control was carried out through crowdsourcing to identify and remove poor-quality audio or dialogues that did not meet their task goals. Participants were recruited from several countries (Canada, Singapore, China, and South Africa) to increase data diversity.
For annotation, 15 trained annotators labeled dialogue states and acts, expanding the MultiWOZ annotation schema with additional acts relevant to spoken language (such as 'backchannel'). A three-step quality control process (script checking, full annotation inspection, and random inspection) maintained a turn-level annotation accuracy exceeding 97%. SpokenWOZ covers 8 domains (the 7 MultiWOZ domains plus a new 'profile' domain for collecting user information), and the authors detail how information was collected for each domain and how privacy was protected.
The resulting dataset consists of 5,700 dialogues, more than 203,000 utterances, and 249 hours of audio. The audio was transcribed with ASR tools, yielding a word error rate of 6.1%, and the data was split into training, development, and test sets of 4,200/500/1,000 dialogues, respectively. The researchers evaluated several baselines, including text-modal models (BERT+TripPy, SPACE+TripPy, UBAR, GALAXY, SPACE), dual-modal models (SPACE+WavLM+TripPy, SPACE+WavLM, SPACE+WavLM align), and LLMs (ChatGPT, InstructGPT 003), on both dialogue state tracking (DST) and response generation. DST is evaluated with joint goal accuracy (JGA) and Macro Average Mentioned Slot Accuracy (MAMS Acc); response generation is evaluated with INFORM, SUCCESS, BLEU, and a Combined Score. The baseline models and their hyperparameters are described in detail.
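As a reference point for the evaluation setup, the sketch below shows how the two headline metrics are typically computed, assuming the usual MultiWOZ-style definitions (exact dialogue-state match for JGA, and (Inform + Success) × 0.5 + BLEU for the Combined Score); the official SpokenWOZ evaluation scripts may differ in bookkeeping details, so treat this as a sketch rather than the paper's implementation.

```python
# Minimal sketch of JGA and the Combined Score under common MultiWOZ-style definitions.

def joint_goal_accuracy(pred_states, gold_states):
    """JGA: fraction of turns whose predicted dialogue state matches the gold
    state exactly (every slot-value pair must be correct)."""
    correct = sum(1 for pred, gold in zip(pred_states, gold_states) if pred == gold)
    return correct / len(gold_states)

def combined_score(inform, success, bleu):
    """Combined Score as commonly defined for response generation:
    (Inform + Success) * 0.5 + BLEU."""
    return (inform + success) * 0.5 + bleu

# Toy usage with invented values:
pred = [{"hotel-area": "north"}, {"hotel-area": "north", "hotel-stars": "4"}]
gold = [{"hotel-area": "north"}, {"hotel-area": "north", "hotel-stars": "5"}]
print(joint_goal_accuracy(pred, gold))   # 0.5 (second turn has a wrong slot value)
print(combined_score(70.0, 60.0, 18.0))  # 83.0
```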
Key Findings
The experiments revealed several key findings. First, SpokenWOZ proved significantly more challenging than written TOD datasets: models achieved substantially lower JGA on SpokenWOZ even when cross-turn slots were excluded, highlighting the difficulty of handling the unique characteristics of spoken language, and including cross-turn slots decreased performance further. MAMS Acc analysis revealed the relative difficulty of different slot types, with reasoning slots, cross-turn slots, and ASR-sensitive slots posing the greatest challenges. Dual-modal models consistently outperformed text-modal methods, demonstrating the value of incorporating speech information for realistic spoken dialogues; in particular, SPACE+WavLM align, which aligns text and speech word-by-word, performed best among the dual-modal models. Generative methods (such as UBAR and SPACE) generally outperformed extractive methods (such as TripPy), particularly on cross-turn and reasoning slots. LLMs, however, showed surprisingly poor DST performance compared to supervised methods, despite their promising results on other NLP tasks, indicating that they are not yet a 'panacea' for spoken TOD; the authors hypothesize that hallucination may be a contributing factor.
In response generation, results across all metrics (INFORM, SUCCESS, BLEU, Combined Score) were much lower than those reported on written TOD datasets, indicating that spoken TOD response generation poses significant challenges beyond those encountered in DST. The greater diversity of act flows in SpokenWOZ was identified as another factor affecting performance.
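For intuition on why word-level alignment helps, the sketch below shows one common way to fuse word-aligned speech features with token embeddings: mean-pooling the speech frames within each word's time span and adding the result to that word's text embedding. The function name, shapes, and fusion choice are assumptions made for illustration; this is not the actual SPACE+WavLM align architecture, only the general idea of word-by-word fusion.

```python
import torch

def word_level_fusion(token_emb, frame_emb, word_spans):
    """Schematic word-by-word fusion of text and speech features.

    token_emb:  (num_words, d) text embeddings, one per word.
    frame_emb:  (num_frames, d) speech-encoder (e.g., WavLM-style) frame features.
    word_spans: list of (start_frame, end_frame) giving each word's time span,
                e.g. from a forced alignment of the ASR transcript.
    """
    fused = []
    for i, (start, end) in enumerate(word_spans):
        # Pool the speech frames belonging to this word and add them to the
        # corresponding text embedding.
        speech_vec = frame_emb[start:end].mean(dim=0)
        fused.append(token_emb[i] + speech_vec)
    return torch.stack(fused)  # (num_words, d)

# Toy usage: 3 words, 10 speech frames, feature dimension 8.
tokens = torch.randn(3, 8)
frames = torch.randn(10, 8)
spans = [(0, 3), (3, 7), (7, 10)]
print(word_level_fusion(tokens, frames, spans).shape)  # torch.Size([3, 8])
```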
Discussion
The results clearly indicate that current state-of-the-art models are not yet adept at handling the unique challenges presented by spoken TOD. The lower performance on SpokenWOZ compared to written datasets, even when cross-turn slots are excluded, underscores the need for models that better capture word-by-word processing, ASR noise, and commonsense reasoning in spoken language. The superior performance of dual-modal models highlights the importance of integrating speech information, particularly with fine-grained alignment techniques, to improve understanding and accuracy. The comparatively poor performance of LLMs suggests a need for further research to improve their robustness and reduce hallucination. The significant differences between DST and response generation results emphasize the complexity and distinct challenges of each component of spoken TOD systems.
Conclusion
The authors present SpokenWOZ, a substantial, multi-domain, dual-modal benchmark for spoken TOD. It addresses the limitations of previous datasets by providing a large-scale, human-to-human dataset with audio and text data, incorporating the complexities of spoken language. The introduced challenges—cross-turn slot and reasoning slot detection—offer new avenues for research. The extensive baseline results highlight the usability and challenges of the benchmark. Future research could focus on improving models' ability to handle word-by-word processing, ASR noise, and reasoning in spoken language, specifically investigating more effective ways to leverage dual-modal information and addressing the limitations of LLMs for this task.
Limitations
While SpokenWOZ represents a significant advancement, certain limitations exist. The specific demographics of the participants may limit the generalizability of findings, and the relatively high cost of data collection could hinder broader adoption of the dataset. Further investigation into prompt engineering for LLMs could unlock their potential for spoken TOD tasks, and exploring different ASR models and their impact on the data and results is also warranted. Finally, although the authors address some ethical considerations, future work could further examine the ethical implications of large-scale data collection and potential biases in the data.