
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
J. Liu, C. S. Xia, et al.
Researchers Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang present EvalPlus, a framework that challenges the adequacy of existing code generation benchmarks. By automatically augmenting benchmark test suites with additional inputs, their evaluation exposes substantial numbers of previously undetected errors in LLM-generated code, urging a shift toward rigorous automated testing in code evaluation.
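The core idea behind the paper can be illustrated with a toy example. The sketch below is purely illustrative and does not use the actual EvalPlus API: a buggy candidate solution passes a weak base test suite but is caught once extra edge-case tests are added, mirroring how test augmentation exposes errors that base benchmarks miss.

```python
# Illustrative sketch of augmented-test evaluation; names and test sets
# are hypothetical, not part of EvalPlus itself.

def llm_candidate(nums):
    """A plausible LLM-generated answer to 'return the largest element'."""
    best = 0  # bug: silently assumes all elements are non-negative
    for n in nums:
        if n > best:
            best = n
    return best

# Base tests (HumanEval-style): too weak to expose the bug.
base_tests = [([1, 2, 3], 3), ([5, 1], 5)]

# Augmented tests: extra inputs covering edge cases like negative numbers.
extra_tests = [([-3, -1, -2], -1), ([0], 0)]

def passes(candidate, tests):
    """Return True if the candidate answers every (input, expected) pair."""
    return all(candidate(inp) == expected for inp, expected in tests)

print(passes(llm_candidate, base_tests))                # True: base suite is fooled
print(passes(llm_candidate, base_tests + extra_tests))  # False: augmented suite catches the bug
```

Under the paper's evaluation, scoring models only on the base tests overstates correctness; adding augmented tests lowers pass rates because latent bugs like the one above finally fail a test.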