Introduction
Program synthesis, the automatic generation of programs that meet user specifications, has been a long-standing challenge in computer science. Traditional approaches relied on techniques such as deductive synthesis, inductive (example-guided) synthesis, and neural-guided synthesis. The recent emergence of large language models (LLMs), together with the abundance of open codebases, has shifted the focus towards directly using LLMs for code generation. LLMs like CODEX and CodeGen demonstrate impressive performance in code synthesis, leveraging natural language understanding and generative capabilities to predict code tokens based on context.

The primary concern regarding LLM-generated code is its correctness. Traditional NLP metrics like BLEU are not reliable for evaluating program synthesis because semantically equivalent programs can be structurally very different. Formal verification, while ideal, is intractable for general problems and remains difficult even for domain-specific tasks. Consequently, existing code benchmarks rely heavily on manually constructed test cases to evaluate LLM solutions, and these tests often fall short of capturing all relevant scenarios. The paper argues that current benchmarks are therefore inadequate for assessing the true correctness of LLM-generated code, leading to an overestimation of model performance. It identifies two key limitations of existing benchmarks:
1. **Insufficient Testing:** Existing benchmarks typically include a small number of test cases (often fewer than 10) per coding problem. These tests are often too simplistic to exercise the code's full functionality or its corner cases, so logically flawed solutions can still pass and give a false impression of correctness. The paper illustrates this with an example where ChatGPT generates seemingly correct code to return the sorted unique common elements of two lists, but converts the intermediate list to a set after sorting, losing the intended order (a sketch of this flaw follows this list).
2. **Imprecise Problem Descriptions:** The inputs for code generation often include natural language descriptions alongside the function signature. These descriptions, however, can be vague and ambiguous, failing to fully clarify the expected program behaviors. For instance, the input docstring might not specify the expected input domain or exception handling, leading to different interpretations by LLMs. Consequently, capable LLMs might be misjudged as incapable due to misinterpretations.
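To make the first limitation concrete, below is a minimal Python sketch of the flaw pattern described in point 1. It is an illustration of the described bug, not the verbatim ChatGPT output: the common elements are gathered and sorted correctly, but the sorted list is then cast to a set, which does not preserve order.

```python
def common(l1: list, l2: list):
    """Return sorted unique common elements of two lists (intended behavior)."""
    common_elements = []
    for e in l1:
        if e in l2 and e not in common_elements:
            common_elements.append(e)
    common_elements.sort()
    # Bug: Python sets are unordered, so casting the sorted list to a set
    # discards the ordering required by the docstring. Tests that only check
    # which elements are present will not expose this defect.
    return set(common_elements)
```

A weak base test that only checks which elements are present (for example, by comparing sets) would still pass, which is exactly how logically flawed solutions slip through small, simplistic test suites.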
These limitations, prevalent in various popular code generation benchmarks, call into question the validity of prior work and the claimed impressive performance of LLMs. The paper aims to address this fundamental evaluation challenge by asking the introspective question: "Is the code generated by LLMs really correct?"
Literature Review
The use of LLMs for code has gained immense popularity in recent years, driven by the availability of vast open codebases and the desire to improve developer efficiency. LLMs have demonstrated state-of-the-art performance in various code-related tasks, including code generation, program repair, automated testing, code translation, and code summarization. Notable LLMs like CODEX, CodeGen, INCODER, and PolyCoder have been developed and extensively evaluated for code generation, a long-standing challenge in computer science. These models generate code snippets based on natural language descriptions (e.g., docstrings) of the desired functionality.
LLM-based code synthesis is primarily evaluated for functional correctness: the generated code is run against test cases and its outputs are checked against the expected results. HUMANEVAL is a pioneering and widely studied human-written benchmark for LLM-based code synthesis, consisting of 164 Python function signatures with docstrings and associated test cases for correctness checking; each problem also includes a reference solution. Other datasets like MBPP, Spider, HUMANEVAL-X, CodeContests, and MultiPL-E extend the scope to additional problems, programming languages, and task types. Researchers have also built more realistic benchmarks from GitHub issues, incorporating code bases and tests to assess LLMs' ability to perform real-world software engineering tasks. This paper distinguishes itself by focusing on test adequacy and automatic test generation.
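For reference, each HUMANEVAL problem ships as a record combining a prompt (signature plus docstring), an entry point, a canonical reference solution, and a test function with assertions. The field names below follow the public human-eval release; the concrete contents are simplified for illustration and are not a verbatim benchmark entry.

```python
# Simplified, illustrative HUMANEVAL-style problem record.
problem = {
    "task_id": "HumanEval/0",
    "entry_point": "has_close_elements",
    "prompt": (
        "from typing import List\n\n"
        "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n"
        '    """Check if any two numbers in the list are closer than threshold."""\n'
    ),
    "canonical_solution": "    # reference implementation body (elided here)\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate([1.0, 2.0, 3.0], 0.5) == False\n"
    ),
}
```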
Methodology
To address the limitations of existing code generation benchmarks and evaluate the true correctness of LLM-generated code, the paper introduces EvalPlus, a comprehensive evaluation framework. EvalPlus focuses on automatically improving code benchmarks for LLMs by generating additional high-quality test cases. It combines both LLM-based and traditional mutation-based test input generation strategies to comprehensively test the functional correctness of LLM-synthesized code. The framework works as follows:
1. **Input:** EvalPlus takes as input an existing code generation dataset, including ground-truth implementations and base test inputs. It leverages the power of ChatGPT to generate a set of high-quality seed inputs, which are then further expanded through type-aware mutation to create a comprehensive set of test inputs.
2. **Seed Initialization via ChatGPT:** EvalPlus constructs a specialized prompt for ChatGPT that contains the ground-truth solution, exemplary test inputs, and an instruction to generate interesting new inputs (a prompt sketch follows this list). The prompt guides ChatGPT toward the valid input formats and the intended functionality of the ground-truth solution, and the resulting inputs are filtered so that only those satisfying the program's input preconditions and semantic constraints are kept. This step leverages ChatGPT's natural language understanding and generative capabilities to produce meaningful seed inputs.
3. **Type-Aware Input Mutation:** To generate large numbers of test inputs efficiently, EvalPlus performs type-aware mutation. This technique inspects the data types of valid seed inputs and generates new inputs that are structurally similar to the seeds; the mutation operators vary by input type (primitive, compound, string, etc.), so the newly generated inputs remain valid while covering different scenarios (see the mutation sketch after this list). Starting from ChatGPT's high-quality seeds, type-aware mutation cheaply produces a large, diverse pool of inputs for evaluating LLM-synthesized code.
4. **Test-Suite Reduction:** EvalPlus offers an optional test-suite reduction step to speed up evaluation. It uses a greedy set-cover algorithm to minimize the generated test suite while preserving its testing effectiveness, prioritizing the coverage of different testing requirements: code coverage, mutant kills, and empirically observed LLM sample kills (a greedy set-cover sketch follows this list).
5. **Program Input Contracts:** To further clarify the intended functionality and reduce ambiguity in problem descriptions, EvalPlus adopts a programming-by-contract approach: function pre-conditions are systematically annotated as code assertions, ensuring that test inputs are well-formed. This filters out invalid inputs that could trigger undefined behavior during testing, and the contracts also serve as an orthogonal specification alongside the natural language description, providing more complete guidance for LLMs.
6. **Differential Testing:** For every generated test input, EvalPlus compares the output of the LLM-generated code against the output of the ground-truth implementation; a sample is considered correct only if it agrees with the ground truth on all inputs. This is what allows the generated inputs to rigorously exercise the functional correctness of LLM-synthesized code (a combined contract and differential-testing sketch follows this list).
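The sketches below illustrate, in simplified Python, the main mechanisms of steps 2 through 6. They are plausible implementations inferred from the paper's description, not EvalPlus's actual code. First, seed initialization (step 2): a prompt is assembled from the ground-truth solution, a few exemplary inputs, and an instruction to produce new, interesting inputs. The exact prompt wording used by EvalPlus is not reproduced here; the layout is an assumption.

```python
def build_seed_prompt(ground_truth_code: str, example_inputs: list) -> str:
    """Assemble a ChatGPT prompt for generating seed test inputs.

    Hypothetical layout following the paper's description: show the reference
    solution, list exemplary inputs, and ask for additional interesting inputs.
    """
    shown = "\n".join(repr(args) for args in example_inputs)
    return (
        "Here is a reference solution to a programming problem:\n"
        f"```python\n{ground_truth_code}\n```\n"
        "Example inputs (as argument tuples):\n"
        f"{shown}\n"
        "Generate additional valid and interesting inputs for this function, "
        "including corner cases, one argument tuple per line."
    )
```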
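Next, type-aware mutation (step 3): a sketch of a mutator that inspects the type of each seed value and applies a type-appropriate change, so generated inputs stay structurally similar to (and as valid as) the seeds. The concrete operators are illustrative.

```python
import copy
import random
import string

def mutate(value):
    """Return a new value structurally similar to `value` (illustrative operators)."""
    if isinstance(value, bool):
        return not value
    if isinstance(value, int):
        return value + random.randint(-3, 3)
    if isinstance(value, float):
        return value + random.uniform(-1.0, 1.0)
    if isinstance(value, str):
        if value and random.random() < 0.5:
            i = random.randrange(len(value))          # replace one character
            return value[:i] + random.choice(string.ascii_letters) + value[i + 1:]
        return value + random.choice(string.ascii_letters)   # or append one
    if isinstance(value, list):
        new = copy.deepcopy(value)
        if new:
            i = random.randrange(len(new))
            new[i] = mutate(new[i])                   # mutate a random element
        return new
    if isinstance(value, dict):
        return {k: mutate(v) for k, v in value.items()}   # keep keys, mutate values
    return value                                      # unknown types left unchanged

def mutate_seed(args: tuple) -> tuple:
    """Mutate one randomly chosen argument of a (non-empty) seed input tuple."""
    new_args = list(copy.deepcopy(args))
    i = random.randrange(len(new_args))
    new_args[i] = mutate(new_args[i])
    return tuple(new_args)
```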
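For test-suite reduction (step 4), the greedy set-cover heuristic repeatedly picks the test that covers the most not-yet-covered requirements, where a requirement can be a covered branch, a killed mutant, or a killed LLM sample. A minimal sketch, assuming a precomputed mapping from tests to the requirements they satisfy:

```python
def reduce_test_suite(tests, covers):
    """Greedy set cover: `covers[t]` is the set of requirements test `t` satisfies."""
    uncovered = set().union(*covers.values()) if covers else set()
    reduced = []
    while uncovered:
        best = max(tests, key=lambda t: len(covers.get(t, set()) & uncovered))
        gained = covers.get(best, set()) & uncovered
        if not gained:
            break                      # nothing left can be covered
        reduced.append(best)
        uncovered -= gained
    return reduced

# Example: t1 and t2 together cover every requirement, so t3 is dropped.
covers = {"t1": {"branch1", "mutant3"}, "t2": {"branch1", "branch2"}, "t3": {"mutant3"}}
print(reduce_test_suite(["t1", "t2", "t3"], covers))   # -> ['t1', 't2']
```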
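Finally, input contracts and differential testing (steps 5 and 6): pre-conditions written as assertions reject ill-formed inputs, and a candidate is accepted only if its output matches the ground truth on every remaining input. The contract shown is a hypothetical example for the `common` problem; details such as floating-point tolerance and timeouts are omitted here.

```python
def contract_common(l1, l2):
    """Hypothetical pre-conditions for the `common` problem, written as assertions."""
    assert isinstance(l1, list) and isinstance(l2, list), "inputs must be lists"
    assert all(isinstance(e, int) for e in l1 + l2), "elements are expected to be integers"

def differential_test(candidate, ground_truth, contract, inputs) -> bool:
    """Return True iff `candidate` matches `ground_truth` on all contract-valid inputs."""
    for args in inputs:
        try:
            contract(*args)            # drop inputs that violate the pre-conditions
        except AssertionError:
            continue
        expected = ground_truth(*args)
        try:
            actual = candidate(*args)
        except Exception:
            return False               # a crash counts as a failure
        if actual != expected:
            return False
    return True
```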
The authors propose EvalPlus as a robust and versatile evaluation framework that can be applied to different code generation benchmarks. The paper focuses on evaluating its effectiveness on HUMANEVAL, a widely-used benchmark for code generation. It augments HUMANEVAL with additional test cases, resulting in HUMANEVAL+, and also creates a reduced version, HUMANEVAL+-MINI, for faster evaluation.
Key Findings
The paper's extensive evaluation, conducted over 26 popular and state-of-the-art LLMs, yielded several key findings:
1. **Significant Drop in Pass@k Rates:** On HUMANEVAL+, the pass@k results of all evaluated LLMs consistently dropped compared to the original HUMANEVAL, with reductions of up to 28.9% across different k values, revealing a prevalence of previously undetected errors in LLM-generated code. This finding highlights the insufficiency of earlier evaluation methods and the importance of more rigorous benchmarks like HUMANEVAL+ (the pass@k metric itself is computed with the estimator sketched after this list).
2. **Impact on LLM Ranking:** The study surprisingly found that test insufficiency can lead to mis-ranking of LLMs. For example, both WizardCoder-CodeLlama and Phind-CodeLlama outperform ChatGPT on HUMANEVAL+, whereas they were deemed no better than ChatGPT on the original HUMANEVAL. This finding suggests that existing rankings of LLMs might not accurately reflect their true performance. It emphasizes the need for robust evaluation to obtain reliable rankings.
3. **Effectiveness of Test-Suite Reduction:** EvalPlus's greedy set-cover reduction shrank the generated test suite by a factor of 47 while preserving a similar level of testing effectiveness: the reduced suite achieved almost the same pass@1* drop as the full HUMANEVAL+. The study also observed that empirical LLM sample killings were the most effective reduction signal in practice, while code coverage and mutation analysis provide the theoretical foundation for test adequacy.
4. **Incorrect Ground-Truth in HUMANEVAL:** The authors discovered 18 defects (11% of problems) in the ground-truth implementations of HUMANEVAL, further highlighting the limitations of existing benchmarks. The paper suggests that the quality of code synthesis benchmarks should be carefully examined and ensured to prevent inaccuracies from affecting the evaluation results.
5. **Consistent Best-Performing Temperatures:** The best-performing temperature for each k changed only slightly when moving from HUMANEVAL to HUMANEVAL+, generally staying consistent between the two benchmarks. This suggests that temperature remains an important factor in LLM performance and that future research should investigate its impact on the robustness of generated code under rigorous evaluation, especially on tests targeting edge cases.
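For reference, the pass@k numbers discussed above are computed with the standard unbiased estimator used in Codex-style evaluation: for a problem with n generated samples of which c pass all tests, pass@k is the probability that at least one of k randomly drawn samples is correct, averaged over problems. A minimal sketch (the example numbers are hypothetical):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for one problem: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:                 # fewer than k incorrect samples: every draw contains a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Average over problems, e.g. n = 200 samples per problem (hypothetical (n, c) pairs):
results = [(200, 37), (200, 0), (200, 152)]
print(sum(pass_at_k(n, c, 10) for n, c in results) / len(results))
```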
Overall, the study provides strong evidence that existing code synthesis benchmarks are insufficient and that a more rigorous evaluation framework like EvalPlus is essential for accurately assessing the performance and correctness of LLM-generated code. The findings also emphasize the need for robust evaluation procedures to avoid misleading conclusions about the capabilities of LLMs.
Discussion
The paper's findings have significant implications for the field of LLM-based code generation. The widespread adoption of LLMs for code generation necessitates reliable evaluation methods that can accurately assess the correctness and performance of these models. The study's demonstration of the insufficiency of existing code benchmarks underscores the importance of developing robust and automated evaluation frameworks like EvalPlus. The research provides a more realistic assessment of LLM-generated code by revealing the prevalence of undetected errors and highlighting the impact of test insufficiency on LLM rankings. The authors' discovery of inaccuracies in the ground-truth implementations of HUMANEVAL further emphasizes the need for careful validation and quality assurance of benchmarks used for evaluating LLMs.
The findings of this study encourage future research to address the challenge of generating robust and comprehensive tests for code synthesis. The paper's approach of combining LLM-based and mutation-based test input generation offers a promising direction for developing effective test generation techniques. The research also opens up opportunities for further investigation into the role of temperature in LLM performance and its impact on the robustness of evaluation tests. By addressing these challenges, researchers can develop more reliable and accurate methods for evaluating LLM-generated code, contributing to the development of trustworthy and robust AI-assisted software development tools.
Conclusion
This paper introduces EvalPlus, a rigorous evaluation framework for program synthesis, driven by automated test generation. EvalPlus effectively leverages LLM-based and mutation-based strategies to generate diverse test inputs, resulting in significantly improved code benchmarks. The framework's effectiveness is demonstrated by its application to HUMANEVAL, producing HUMANEVAL+ and HUMANEVAL+-MINI, which uncover previously undetected errors and provide more accurate assessments of LLM performance. The findings highlight the inadequacy of existing code synthesis benchmarks and emphasize the importance of robust evaluation for trustworthy AI-assisted code generation. Future research directions include applying EvalPlus to other code benchmarks and exploring its integration with formal verification and validation techniques. The core test generation technique can also be applied to assist developers in identifying potential flaws in LLM-generated code during pair-programming.
Limitations
The study primarily focuses on the HUMANEVAL benchmark, which limits the generalizability of the findings to other code generation benchmarks. The paper acknowledges the need for further exploration of EvalPlus's effectiveness on other datasets. The evaluation relies on differential testing, which might not be suitable for all programming tasks. Furthermore, the study only considers a specific set of LLMs, and the findings might not generalize to all LLMs. Additional research is needed to investigate the performance of EvalPlus on a broader range of LLMs and code generation benchmarks. The paper also highlights the need for further research to address the challenges of generating robust and comprehensive tests for code synthesis, particularly in the context of dynamically-typed languages.