The paper examines the reliability of code generated by large language models (LLMs) such as ChatGPT. It argues that existing code generation benchmarks inadequately assess the actual correctness of LLM-generated code, because their test suites are too small and their problem descriptions are often imprecise. To address this, the authors propose EvalPlus, a code synthesis evaluation framework that augments existing evaluation datasets with automatically generated test cases, combining LLM-based and mutation-based strategies to produce diverse and challenging test inputs. Applied to the HUMANEVAL benchmark, EvalPlus significantly reduces the pass@k scores of a wide range of LLMs, revealing previously undetected errors in their generated code. The researchers also found that test insufficiency can lead to inaccurate rankings of LLMs. The work concludes that existing code synthesis evaluation results may not reflect the true performance of LLMs and emphasizes the importance of rigorous evaluation through automated testing.
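The sketch below is a minimal illustration, not the authors' implementation. It shows the two ideas the summary describes: (1) mutating seed inputs to generate new test inputs, and (2) differentially testing an LLM-written candidate against the benchmark's ground-truth solution. The task ("sum of absolute values"), the candidate's bug, and all function names are hypothetical stand-ins for a HUMANEVAL-style problem.

```python
import random

def ground_truth(xs):
    """Reference solution assumed to ship with the benchmark (hypothetical task)."""
    return sum(abs(x) for x in xs)

def llm_candidate(xs):
    """Hypothetical LLM output: passes the original all-positive tests,
    but is wrong whenever the input contains a negative number."""
    return sum(xs)

def mutate(xs):
    """Type-aware-style mutation for a list of ints: replace, drop, or append an element."""
    ys = list(xs)
    op = random.choice(["replace", "drop", "append"])
    if op == "replace" and ys:
        ys[random.randrange(len(ys))] = random.randint(-10, 10)
    elif op == "drop" and ys:
        ys.pop(random.randrange(len(ys)))
    else:
        ys.append(random.randint(-10, 10))
    return ys

def find_counterexample(seeds, rounds=1000):
    """Grow a test-input pool from the seeds and report any behavioural divergence."""
    pool = [list(s) for s in seeds]
    for _ in range(rounds):
        inp = mutate(random.choice(pool))
        pool.append(inp)
        if llm_candidate(inp) != ground_truth(inp):
            return inp  # an input the original hand-written tests missed
    return None

# The seed passes both implementations; mutation quickly exposes the bug,
# e.g. [1, -4, 3] -> candidate returns 0, ground truth returns 8.
print(find_counterexample(seeds=[[1, 2, 3]]))
```

In this toy setup the mutated inputs quickly surface a divergence that the original seed tests never trigger, which is the kind of test insufficiency the paper argues inflates benchmark scores.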
Publisher
arXiv
Published On
Oct 30, 2023
Authors
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang
Tags
large language models
code generation
evaluation framework
automated testing
HUMANEVAL benchmark
test cases
code correctness