Existing task-oriented dialogue (TOD) datasets focus primarily on written text, which leaves a gap between research settings and realistic spoken conversations. To bridge this gap, the authors introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD. SpokenWOZ comprises 8 domains, 203k turns, 5.7k dialogues, and 249 hours of audio from human-to-human spoken conversations. It incorporates common characteristics of spoken language, such as word-by-word processing and commonsense reasoning, and introduces two new challenges: cross-turn slot detection and reasoning slot detection. Experiments on text-modal baselines, dual-modal baselines, and LLMs such as ChatGPT show that current models still leave significant room for improvement in handling the nuances of spoken conversation.