Evaluating large language models in analysing classroom dialogue

Yun Long, Haifeng Luo, and Yu Zhang

This study by Yun Long, Haifeng Luo, and Yu Zhang evaluates GPT-4's ability to code classroom dialogue, reporting roughly 30-fold time savings and about 90% agreement with expert human coders, and examines how large language models could support scalable, consistent teaching evaluation.
Introduction

The paper investigates whether GPT-4 can effectively and efficiently code classroom dialogue for educational diagnosis and quality improvement. Classroom dialogue is central to learning from sociocultural perspectives, and prior work shows dialogic pedagogy improves reasoning, collaboration, and academic outcomes. However, qualitative coding of dialogue is labour-intensive, subjective, and challenging to scale. The study aims to evaluate GPT-4 as a tool to automate coding against an expert-developed scheme, comparing AI outputs to human annotations on middle school math and Chinese lessons. The research questions focus on time efficiency gains, inter-coder agreement between humans and GPT-4, and inter-coder reliability across dialogue codes, assessing GPT-4’s potential to support scalable, consistent analysis of classroom interactions.

Literature Review

The authors review foundational work on classroom dialogue and dialogic pedagogy (e.g., Alexander; Littleton & Mercer; Nystrand et al.) that identifies productive dialogue features such as authentic questions, elaboration, reasoning, building on ideas, linking, and consensus seeking. Frameworks like accountable talk and exploratory talk informed the coding scheme. Recent AI applications to discourse include datasets such as TalkMoves and methods for modeling teacher discourse, identifying question types, and teacher analytics. Concurrently, advances in LLMs (e.g., GPT-4) have demonstrated strong natural language understanding, generation, and pattern recognition, enabled by transformer architectures trained on diverse corpora. The Cambridge Educational Dialogue Research Group’s scheme, further revised for this study, underpins the categories used, with added codes for structural and strategic silence. The literature suggests growing feasibility for automated discourse analysis, yet highlights challenges in capturing nuanced, context-dependent interaction patterns, especially for coordination moves involving integration of multiple contributions.

Methodology

Design: Comparative evaluation of manual expert coding versus automated GPT-4-based coding of classroom dialogue using a 15-category scheme (revised from the Cambridge Dialogue Analysis Scheme, CDAS): Elaboration Invitation (ELI), Elaboration (EL), Reasoning Invitation (REI), Reasoning (RE), Co-ordination Invitation (CI), Simple Co-ordination (SC), Reasoned Co-ordination (RC), Agreement (A), Querying (Q), Reference Back (RB), Reference to Wider Context (RW), Structural Silence (SU), Strategic Silence (SA), Other Invitation (OI), and Other (O). The revision adds SU and SA.

Data and Participants: A middle school in Beijing; two classes (first and second grades of junior high). Subjects: mathematics and Chinese. Six lessons per subject were selected, representing typical curriculum contexts (introduction, practice, review). Classroom interactions were audio/video recorded, transcribed via Lark (which provides timestamps and speaker attribution), and manually corrected. The total transcript exceeded 150,000 characters. Ethical approval was granted by the Tsinghua University IRB (#2017-8), and informed consent was obtained.

Manual Coding: Educational experts coded each dialogue turn in the transcripts according to the scheme. Inter-human reliability was assessed using Cohen’s Kappa (SPSS 21.0) on at least one-tenth of the dataset; disagreements led to refinement of the RC, Q, and OI definitions.

Automated Coding: A customised system built on the GPT-4 API performed the coding. Retrieval-augmented generation (RAG) was used: the coding scheme (CDAS) was vectorised and supplied to GPT-4 with tailored prompts to guide code assignment (an example prompt is provided for ELI). Batch processing respected input-length limits (a minimal sketch of this step appears below).

Evaluation Metrics and Analysis: Three dimensions were computed: (1) time efficiency, defined as Time(human)/Time(GPT-4); (2) inter-coder agreement percentage, (Number of Agreements / Total Decisions) × 100%; and (3) inter-coder reliability via Cohen’s Kappa, κ = (Po − Pe) / (1 − Pe), computed per code and per subject. Comparisons were made between human expert coding and GPT-4 outputs across six math lessons (576 turns) and six Chinese lessons (348 turns).
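
To make the automated-coding step concrete, here is a minimal Python sketch of how a single dialogue turn could be sent to GPT-4 together with retrieved scheme definitions. The paper's actual prompts and RAG pipeline are not reproduced here; the function names (retrieve_scheme_entries, code_turn), the prompt wording, and the stubbed scheme excerpt are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of turn-by-turn GPT-4 coding with a RAG-style scheme lookup.
# Assumes the openai package (>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

CODES = ["ELI", "EL", "REI", "RE", "CI", "SC", "RC", "A",
         "Q", "RB", "RW", "SU", "SA", "OI", "O"]

def retrieve_scheme_entries(turn_text: str) -> str:
    """Stand-in for the paper's RAG step: the real system retrieves the
    vectorised CDAS definitions most relevant to the turn. Here we return
    one illustrative (paraphrased, hypothetical) definition."""
    return ("ELI (Elaboration Invitation): invites a speaker to build on, "
            "explain, or add detail to a previous contribution.")

def code_turn(turn_text: str, preceding_turns: str) -> str:
    """Ask GPT-4 to assign exactly one of the 15 codes to a dialogue turn."""
    scheme_excerpt = retrieve_scheme_entries(turn_text)
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # favour consistent coding decisions
        messages=[
            {"role": "system",
             "content": ("You are coding classroom dialogue with a revised "
                         "Cambridge Dialogue Analysis Scheme. Reply with "
                         "exactly one code from: " + ", ".join(CODES) +
                         ".\n\nRelevant definitions:\n" + scheme_excerpt)},
            {"role": "user",
             "content": ("Preceding turns:\n" + preceding_turns +
                         "\n\nTurn to code:\n" + turn_text)},
        ],
    )
    return response.choices[0].message.content.strip()

print(code_turn("Can you explain why you chose that method?", "T: Let's review."))
```

Batching turns and trimming context to respect input-length limits, as the paper notes, would wrap this call in a loop over the transcript.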

Key Findings
  • Time efficiency: For Chinese lessons totaling ~4 h 7 min and math lessons ~5 h 17 min, automated analysis time did not exceed 1 hour (excluding skipped over-length turns). A timed math lesson (41:29; 82 turns) took ~5 min with ChatGPT versus ~150 min manually (experienced coder), yielding ~30× time savings.
  • Inter-coder agreement (percent): Math (six lessons; 576 turns) = 90.37%. Chinese (six lessons; 348 turns) = 90.91%.
  • Inter-coder reliability (Cohen’s Kappa, per code and subject):

        Code   Chinese    Math
        ELI     0.973     0.995
        EL      0.961     0.977
        REI     0.838     0.962
        RE      0.947     0.932
        A       0.843     0.992
        Q       0.735     0.830
        RB      0.909     0.874
        RW      0.662    −0.004
        SU      0.611    −0.010
        SA      0.702     0.529
        OI      0.944     0.953
        O       0.962     0.958
        CI      0.497    −0.005
        SC      0.216     0.004
        RC     −0.002     ≈0 (no statistic computed: humans assigned RC zero times)

    Kappas are near-perfect for most core categories (invitations, elaboration, reasoning, agreement, referencing) but weak for the coordination codes (CI, SC, RC). A sketch of how both metrics are computed follows this list.
  • Qualitative discrepancy pattern: GPT-4 tended to rely on local textual cues (e.g., keywords like “because”) rather than broader discourse context or integration across turns, leading to misclassification of coordination invitations and reasoned coordination compared to human coders.
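
Both metrics reported above are straightforward to reproduce. The following is a minimal, self-contained Python sketch of the agreement percentage and Cohen's Kappa computed from parallel lists of human and GPT-4 codes; the toy labels are invented for illustration and are not the study's data.

```python
# Minimal sketch of the two reliability metrics reported above, computed
# from parallel lists of human and GPT-4 codes for the same turns.
from collections import Counter

def agreement_percent(human: list[str], model: list[str]) -> float:
    """Inter-coder agreement: (number of agreements / total decisions) x 100."""
    hits = sum(h == m for h, m in zip(human, model))
    return 100.0 * hits / len(human)

def cohens_kappa(human: list[str], model: list[str]) -> float:
    """Cohen's kappa: (Po - Pe) / (1 - Pe), agreement corrected for chance."""
    n = len(human)
    po = sum(h == m for h, m in zip(human, model)) / n
    h_counts, m_counts = Counter(human), Counter(model)
    # Expected chance agreement from each coder's marginal label frequencies.
    pe = sum(h_counts[c] * m_counts[c] for c in set(human) | set(model)) / (n * n)
    return (po - pe) / (1 - pe)  # undefined (division by zero) if Pe == 1

# Toy usage with invented codes (not the study's data):
human = ["ELI", "EL", "RE", "A", "O", "ELI"]
model = ["ELI", "EL", "RE", "A", "OI", "ELI"]
print(agreement_percent(human, model))  # 83.33...
print(cohens_kappa(human, model))       # ~0.79
```

Per-code values like those in the table can be obtained by collapsing the labels to code-versus-other before applying the same formula.
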
Discussion

Findings demonstrate that GPT-4 can substantially accelerate dialogue coding while maintaining high alignment with human coders for most categories central to exploratory and accountable talk (invitations to elaborate/reason, elaborations, reasoning, agreement, referencing). The strong kappas suggest GPT-4 can follow a well-specified scheme when categories are largely defined by explicit linguistic markers. However, performance diverges notably for coordination-related codes (CI, SC, RC), which require integrating multiple prior contributions and assessing synthesis, evaluation, and consensus formation. These constructs are more context-dependent and less tied to explicit lexical cues; humans leverage broader discourse and pragmatic intent, whereas GPT-4 often codes based on local phrases, producing OI/RE in cases where humans code CI/RC. Differences between subjects likely reflect discourse characteristics: Chinese lessons often feature extended, narrative turns that provide clearer context for GPT-4, while math lessons include terse, procedural exchanges that constrain contextual inference. Compared with earlier task-specific “specialised intelligence” approaches, the foundation model achieved higher overall agreement and markedly better scalability. Nonetheless, to close the gap on coordination codes, refinements such as richer multi-turn context windows, structured discourse state tracking, and code-specific prompting or few-shot exemplars may be required. The results highlight both the promise and current boundaries of LLMs for nuanced educational dialogue analysis.
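
As one concrete illustration of the code-specific prompting refinements suggested above, a few-shot prompt for the coordination codes might embed multi-turn exemplars so the model must weigh more than one prior contribution. The exemplar dialogue and wording below are invented, hypothetical material; the paper does not report a prompt of this form.

```python
# Hypothetical few-shot prompt targeting the coordination codes (CI, SC, RC),
# where GPT-4 diverged most from human coders. Exemplars are invented.
FEW_SHOT_COORDINATION = """\
Decide between CI (Co-ordination Invitation), SC (Simple Co-ordination),
RC (Reasoned Co-ordination), OI, and RE. Coordination codes require
integrating MORE THAN ONE prior contribution; the mere presence of words
like "because" is not sufficient for RC.

Example 1
  Turn 3 (Student A): I think the answer is 12.
  Turn 4 (Student B): I got 15 because I counted the shared edges twice.
  Turn 5 (Teacher): Can someone combine A's and B's ideas into one answer?
  Code for Turn 5: CI  (invites synthesis of two prior contributions)

Example 2
  Turn 6 (Student C): Both are right: 12 if we skip the shared edges,
  15 if we count them, because two faces share each edge.
  Code for Turn 6: RC  (synthesises two ideas and gives a reason)

Now code the target turn in light of the full preceding exchange and reply
with exactly one code.
"""
```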

Conclusion

The study shows GPT-4 can enable scalable, time-efficient coding of classroom dialogue with high consistency relative to human experts for most categories, achieving about 30× time savings and ~90% agreement. This advances AI-assisted qualitative analysis for educational diagnostics and feedback. Key contributions include demonstrating feasibility across two subjects, integrating RAG with a domain coding scheme, and providing detailed reliability analyses that pinpoint strengths (invitations, elaboration, reasoning) and weaknesses (coordination). Future work should broaden validation across more subjects, ages, and learning environments; enhance methods for capturing cross-turn integration required by coordination codes; and explore improved prompts, exemplars, or hybrid human-in-the-loop workflows to further align AI coding with expert judgments.

Limitations
  • Limited scope and sample diversity: data from a single middle school, two subjects (math, Chinese), six lessons each; findings may not generalize across disciplines, age groups, or collaborative settings.
  • Dialogue style differences: math’s procedural, shorter exchanges versus Chinese’s narrative discourse may differentially affect model performance and comparability.
  • Code-specific challenges: low kappas for coordination categories (CI, SC, RC) indicate difficulties in modeling multi-turn synthesis and integration; in some datasets RC occurrences were too rare for stable statistics.
  • Contextual constraints: GPT-4’s reliance on local textual cues and input-length limitations can reduce sensitivity to broader discourse context.
  • Transcription and annotation dependencies: automated transcription quality and manual correction practices may influence coding; inter-human reliability, while assessed, also imposes an upper bound on expected AI-human alignment.