The paper examines the reliability of code generated by large language models (LLMs) such as ChatGPT. It argues that existing code generation benchmarks inadequately assess the actual correctness of LLM-generated code, because their test suites are too small and their problem descriptions are often imprecise. To address this, the authors propose EvalPlus, a code synthesis evaluation framework that augments existing evaluation datasets with automatically generated test cases, combining LLM-based and mutation-based strategies to produce diverse and challenging test inputs. Applied to the HUMANEVAL benchmark, EvalPlus significantly reduces the pass@k scores of a wide range of LLMs, revealing previously undetected errors in their generated code. The researchers also found that test insufficiency can lead to inaccurate rankings of LLMs. The work concludes that existing code synthesis evaluation results may not reflect the true performance of LLMs and emphasizes the importance of rigorous evaluation through automated testing.
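The sketch below is a minimal illustration, not the authors' implementation. It shows the two ideas the summary describes: (1) mutating seed inputs to generate new test inputs, and (2) differentially testing an LLM-written candidate against the benchmark's ground-truth solution. The task ("sum of absolute values"), the candidate's bug, and all function names are hypothetical stand-ins for a HUMANEVAL-style problem.

```python
import random

def ground_truth(xs):
    """Reference solution assumed to ship with the benchmark (hypothetical task)."""
    return sum(abs(x) for x in xs)

def llm_candidate(xs):
    """Hypothetical LLM output: passes the original all-positive tests,
    but is wrong whenever the input contains a negative number."""
    return sum(xs)

def mutate(xs):
    """Type-aware-style mutation for a list of ints: replace, drop, or append an element."""
    ys = list(xs)
    op = random.choice(["replace", "drop", "append"])
    if op == "replace" and ys:
        ys[random.randrange(len(ys))] = random.randint(-10, 10)
    elif op == "drop" and ys:
        ys.pop(random.randrange(len(ys)))
    else:
        ys.append(random.randint(-10, 10))
    return ys

def find_counterexample(seeds, rounds=1000):
    """Grow a test-input pool from the seeds and report any behavioural divergence."""
    pool = [list(s) for s in seeds]
    for _ in range(rounds):
        inp = mutate(random.choice(pool))
        pool.append(inp)
        if llm_candidate(inp) != ground_truth(inp):
            return inp  # an input the original hand-written tests missed
    return None

# The seed passes both implementations; mutation quickly exposes the bug,
# e.g. [1, -4, 3] -> candidate returns 0, ground truth returns 8.
print(find_counterexample(seeds=[[1, 2, 3]]))
```

In this toy setup the mutated inputs quickly surface a divergence that the original seed tests never trigger, which is the kind of test insufficiency the paper argues inflates benchmark scores.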
Publisher
arXiv
Published On
Oct 30, 2023
Authors
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang
Tags
large language models
code generation
evaluation framework
automated testing
HUMANEVAL benchmark
test cases
code correctness