
Computer Science

PENTESTGPT: Evaluating and Harnessing Large Language Models for Automated Penetration Testing

G. Deng, Y. Liu, et al.

LLMs promise to transform penetration testing. This study builds a real-world benchmark and shows that LLMs excel at individual sub-tasks but struggle with whole-context reasoning. The authors introduce PENTESTGPT, a three-module, LLM-driven framework that boosts sub-task completion by 228.6% over GPT-3.5 and succeeds on real-world targets and CTF challenges. PENTESTGPT is open-sourced and has seen strong community uptake.

Introduction
The paper addresses the challenge of automating penetration testing, a proactive offensive security technique traditionally reliant on manual effort and specialized expertise. Motivated by the emergent capabilities of Large Language Models (LLMs), the study asks: To what extent can LLMs automate penetration testing? Existing benchmarks inadequately capture the diversity of vulnerabilities and fail to recognize progressive accomplishments during the testing process. The authors construct a comprehensive benchmark from HackTheBox and VulnHub targets (13 machines, 182 sub-tasks, covering OWASP Top 10 and 18 CWE items) and conduct an exploratory study with GPT-3.5, GPT-4, and Bard using a human-in-the-loop iterative strategy. The purpose is to systematically evaluate LLM capabilities, understand their problem-solving strategies compared to human testers, and design a framework (PENTESTGPT) to mitigate identified limitations such as context loss and overemphasis on recent tasks. The study underscores the importance of robust automation to meet growing demand for efficient security evaluations.
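For orientation, a minimal sketch of how one benchmark target and its sub-tasks could be recorded is shown below. The schema and field names (BenchmarkTarget, SubTask, nist_phase, cwe_id) are illustrative assumptions of this summary, not the authors' actual data format.

    from dataclasses import dataclass

    @dataclass
    class SubTask:
        description: str        # e.g., "Enumerate open ports on the target"
        nist_phase: str         # NIST SP 800-115 phase, e.g., "Discovery"
        cwe_id: str             # mapped weakness, e.g., "CWE-89" (SQL injection)

    @dataclass
    class BenchmarkTarget:
        name: str               # HackTheBox or VulnHub machine name
        source: str             # "HackTheBox" or "VulnHub"
        difficulty: str         # "easy", "medium", or "hard"
        subtasks: list[SubTask] # the 13 targets decompose into 182 such sub-tasks overall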
Literature Review
Background and related work highlight the standard penetration testing phases (Reconnaissance, Scanning, Vulnerability Assessment, Exploitation, and Post-Exploitation) and the difficulty of full automation, which requires deep vulnerability understanding and strategic planning. Prior benchmarks (e.g., OWASP Juice Shop) focus narrowly on web vulnerabilities and often neglect privilege escalation and progressive evaluation of intermediate steps. Advances in LLMs (e.g., GPT-3.5, GPT-4, LaMDA) show emergent abilities in reasoning and domain-specific problem-solving, with applications in code analysis and vulnerability repair; however, a systematic, quantitative assessment of LLMs for automated penetration testing has been lacking. The paper situates its contribution within this gap by proposing a comprehensive benchmark and a modular LLM-powered system that overcomes common LLM limitations (context window size, hallucinations, depth-first bias).
Methodology
The methodology has two major components: (1) benchmark construction and exploratory evaluation of LLMs, and (2) design and implementation of PENTESTGPT.

Benchmark Design: Tasks are selected from HackTheBox and VulnHub to cover the OWASP Top 10 vulnerabilities and to span easy, medium, and hard difficulty. Each target's solution is decomposed into sub-tasks following NIST SP 800-115 and mapped to CWE categories, producing 182 sub-tasks across 13 targets. Benchmark validation is performed by three certified penetration testers who independently attempt the targets and write walkthroughs, ensuring reproducibility and accounting for multiple valid solutions.

Exploratory Study Testing Strategy: A human-in-the-loop approach evaluates GPT-3.5, GPT-4, and Bard. The LLM is prompted with target information to generate step-by-step penetration operations. A human expert (OSCP-level) executes the instructions exactly, reports textual outputs back, and iterates until completion or deadlock. GUI tools are minimized; when necessary, testers provide detailed textual descriptions of actions and responses. Tool usage is limited to scanners that are not end-to-end automated, following LLM recommendations for specific validators (e.g., nmap, sqlmap).

Evaluation Settings: Experiments are run locally with the target and testing machines on a private network. The testing machine runs Kali Linux 2023.1. The models used are GPT-3.5 (8k tokens), GPT-4 (32k tokens), and Bard/LaMDA via their chatbot services.

PENTESTGPT Framework: The system comprises three LLM-powered modules with separate sessions and contexts (a minimal code sketch of their interplay appears at the end of this section):
- Reasoning Module: Maintains a high-level overview of the test using a Pentesting Task Tree (PTT), derived from attributed tree concepts. It updates and verifies the PTT (only leaf nodes change), identifies candidate sub-tasks, and selects the next step. Designed prompts and hint generation guide the reasoning; Chain-of-Thought (CoT) prompting improves robustness.
- Generation Module: For a chosen sub-task, initiates a fresh session to expand the task into fine-grained steps (considering available tools), then outputs precise terminal commands and detailed GUI instructions. The two-step CoT process reduces hallucinations and improves operational accuracy.
- Parsing Module: Condenses verbose data (tool outputs, raw HTTP pages, source code, user intentions) to conserve tokens and emphasize key information. The GPT-4 code interpreter is integrated for source code analysis.

Active Feedback: Users can query the Reasoning Module via an interactive handle without altering its context unless explicitly desired; the PTT context is passed to a new session for Q&A and can be updated on request.

Design Rationale and Alternatives: The modular design addresses LLM token limits and recency bias by separating strategic context from operation generation and parsing. Alternatives such as larger context windows or vector databases are discussed; these alone are insufficient due to verbosity and retrieval ambiguity. Human-in-the-loop execution remains important to mitigate hallucinations and ensure dependable outcomes.

Implementation and Evaluation: PENTESTGPT is implemented in ~1,900 lines of Python and 740 lines of prompts, integrated with GPT-3.5 and GPT-4. Performance is evaluated on the benchmark, on real-world HTB machines, and in the picoMini CTF, with ablation variants (NO-PARSING, NO-GENERATION, NO-REASONING) to measure each module's contribution.
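To make the module hand-off concrete, the sketch below shows one way the Pentesting Task Tree and the Reasoning/Generation/Parsing cycle could be organized in Python. It is illustrative only: the names PTTNode, llm_call, reasoning_step, generation_step, and parsing_step are assumptions of this summary, not identifiers from the PENTESTGPT codebase.

    from dataclasses import dataclass, field

    @dataclass
    class PTTNode:
        """One node of the Pentesting Task Tree (PTT); only leaf nodes are updated."""
        name: str
        status: str = "todo"                            # "todo" | "in-progress" | "done"
        children: list["PTTNode"] = field(default_factory=list)

        def leaves(self):
            if not self.children:
                yield self
            else:
                for child in self.children:
                    yield from child.leaves()

    def llm_call(session: str, prompt: str) -> str:
        """Placeholder for a call to the backend LLM (e.g., GPT-4). Each module keeps
        its own session so verbose output cannot crowd out strategic context."""
        raise NotImplementedError("wire this to the chosen LLM API")

    def reasoning_step(ptt: PTTNode, condensed_findings: str) -> PTTNode:
        """Reasoning Module: fold new findings into the PTT leaves, then pick the most
        promising unfinished leaf as the next sub-task (the LLM does the ranking)."""
        llm_call("reasoning", f"Update this task tree with the findings:\n{condensed_findings}")
        candidates = [leaf for leaf in ptt.leaves() if leaf.status != "done"]
        return candidates[0]  # assumes at least one unfinished leaf remains

    def generation_step(subtask: PTTNode) -> str:
        """Generation Module: two-step chain-of-thought in a fresh session: first expand
        the sub-task into fine-grained steps, then emit exact commands."""
        plan = llm_call("generation", f"Break '{subtask.name}' into fine-grained steps.")
        return llm_call("generation", f"Turn these steps into precise terminal commands:\n{plan}")

    def parsing_step(raw_output: str) -> str:
        """Parsing Module: condense verbose tool output, HTTP pages, or source code so the
        information fed back to the Reasoning Module stays within token limits."""
        return llm_call("parsing", f"Summarize the security-relevant facts in:\n{raw_output}")

In this sketch the human tester remains in the loop, as in the paper: they execute the generated commands, paste the raw results into parsing_step, and feed the condensed summary back into reasoning_step for the next iteration.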
Key Findings
- LLMs can complete end-to-end penetration tests on simpler targets: GPT-4 achieved 5 overall successes (4 easy, 1 medium), while Bard and GPT-3.5 succeeded on fewer easy targets.
- Sub-task performance: GPT-4 completed 95/182 sub-tasks (52.20%), GPT-3.5 completed 42/182 (23.07%), and Bard completed 50/182 (27.47%). LLMs are especially adept at enumeration, port scanning (9/12 sub-tasks for each model), code analysis, shell construction, and interpreting tool outputs.
- Findings 1–5: (1) LLMs show proficiency on easier targets but struggle on hard ones; (2) LLMs efficiently use tools, identify common vulnerabilities, and interpret code; (3) LLMs struggle with long-term memory and context retention; (4) LLMs overemphasize recent tasks, exhibiting a depth-first bias and forgetting prior findings; (5) LLMs may generate inaccurate or non-existent commands and tools (hallucinations).
- Unnecessary operations frequently suggested include brute force (most common), CVE exploitation, SQL injection, and command injection, indicating suboptimal strategies compared to expert walkthroughs.
- Failure causes across 195 trials: session context lost (74), false command generation (55), deadlock operations (45), false scanning output interpretation (40), false source code interpretation (37), and inability to craft a valid exploit (34).
- PENTESTGPT substantially improves performance: it increases sub-task completion by 228.6% over GPT-3.5 and 58.6% over GPT-4 on the benchmark; with a GPT-4 backend it solves 6/7 easy and 2/4 medium targets and completes 111% more sub-tasks than naive GPT-4 (57 vs. 27 on medium targets); the arithmetic behind these relative gains is illustrated in the snippet after this list.
- Real-world applicability: on 10 active HackTheBox machines, PENTESTGPT completed 5/10 (4 easy, 1 medium) at a total OpenAI API cost of $131.5. In the picoMini CTF, it solved 9/21 challenges, scored 1400 points, and ranked 24th among 248 teams.
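As a quick sanity check on the relative gains quoted above, the following snippet applies the standard percentage-increase formula (an assumption of this summary; the paper does not spell out its rounding) to the reported raw counts. The helper relative_increase is hypothetical.

    def relative_increase(new: int, old: int) -> float:
        """Percentage increase of `new` over `old`: (new - old) / old * 100."""
        return (new - old) / old * 100

    # PENTESTGPT (GPT-4 backend) vs. naive GPT-4 on medium targets: 57 vs. 27 sub-tasks.
    print(f"{relative_increase(57, 27):.1f}%")          # ~111.1%, matching the reported 111%

    # Overall sub-task completion rates from the exploratory study (182 sub-tasks total).
    for model, done in [("GPT-4", 95), ("Bard", 50), ("GPT-3.5", 42)]:
        print(f"{model}: {done}/182 = {done / 182:.2%}")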
Discussion
The study demonstrates that while LLMs can autonomously handle many penetration testing sub-tasks—such as orchestrating tools, interpreting outputs, and generating code—they face critical challenges in maintaining holistic context and devising multi-step exploitation plans across services. These limitations directly impede the ability to fully automate pen-testing, particularly on hard, nuanced targets. PENTESTGPT’s modular approach addresses the research question by preserving global testing context via a Pentesting Task Tree, guiding task selection strategically, and translating plans into precise operations while condensing verbose inputs. This design mitigates LLM recency bias and hallucinations and leads to significant gains in sub-task completion and overall target success compared to direct LLM usage. The findings are relevant to cybersecurity automation, suggesting that structured reasoning representations combined with human-in-the-loop execution can bridge gaps in present LLM capabilities and make automated pen-testing both more effective and practical.
Conclusion
The paper introduces a comprehensive penetration testing benchmark and a systematic evaluation of LLMs for automated pen-testing, revealing their strengths in tool use and code analysis and their weaknesses in long-term context management and strategy planning. To address these, the authors present PENTESTGPT, a modular LLM-powered framework with Reasoning, Generation, and Parsing modules centered on a Pentesting Task Tree. Extensive evaluation shows PENTESTGPT’s superior performance over naive LLM usage on both benchmarks and real-world challenges, underscoring its promise as a practical aid for pen-testing. Future directions include improving multimodal capabilities (image interpretation), enhancing exploitation script synthesis, integrating robust memory mechanisms (e.g., vector databases with better retrieval), reducing hallucinations, and refining ethical safeguards to ensure responsible use.
Limitations
Key limitations include:
(1) LLMs' inability to maintain long-term memory and holistic context under token window constraints;
(2) recency bias leading to depth-first strategies and neglected prior findings;
(3) hallucinations and inaccurate command or tool generation;
(4) difficulty exploiting hard targets that demand unique, low-level, or tool-modification steps;
(5) limited image/GUI interpretation capabilities;
(6) reduced support for certain social-engineering-style tasks (e.g., deriving tailored wordlists from subtle target cues);
(7) PENTESTGPT's outputs may still contain errors despite task-tree verification, necessitating human-in-the-loop oversight; and
(8) ethical risks of misuse, addressed through guidelines and monitoring but still requiring vigilance.