Computer Science, arXiv

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

J. Liu, C. S. Xia, et al.

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang present EvalPlus, a framework that challenges the adequacy of existing code generation benchmarks. Their findings reveal a substantial number of undetected errors in LLM-generated code and argue for a shift toward robust automated testing in code evaluation.

Related Publications

Explore these studies to deepen your understanding

Adjacent work that informs or extends this paper's methodology and findings.

Medicine and Health

A framework for human evaluation of large language models in healthcare derived from literature review

T. Y. C. Tam, S. Sivarajkumar, et al.

Computer Science

The Potential and Limitations of Large Language Models for Text Classification through Synthetic Data Generation

A. K. P. Venkata and L. Gudala

Computer Science

Large Language Models for Code Analysis: Do LLMs Really Do Their Job?

C. Fang, N. Miao, et al.

Psychology

Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation

E. C. Stade, S. W. Stirman, et al.
