An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection
S. Yan, S. Wang, et al.
The paper addresses the vulnerability of LLM-based code completion models to poisoning and backdoor attacks in realistic fine-tuning pipelines. Existing approaches (SIMPLE, COVERT, TROJANPUZZLE) either inject detectable malicious code into the training data or hide payloads in comments, which may be stripped before fine-tuning; in both cases the insecure code the model generates can still be detected post-generation. The research question is how to design a stronger, easy-to-trigger backdoor attack that embeds malicious payloads directly in source code while evading both static analysis tools and LLM-based vulnerability detectors. The proposed framework, CODEBREAKER, uses GPT-4 to minimally transform vulnerable payloads without changing their functionality, so that they stay stealthy and evade detection. Contributions include: (1) the first LLM-assisted backdoor attack on code completion that evades strong vulnerability detection during both training and inference; (2) evasion against LLM-based detectors (GPT-3.5/4, Llama-3, Gemini Advanced); (3) off-comment poisoning with easy triggers; (4) a tunable tradeoff between stealthiness and evasion; and (5) comprehensive assessment across 247 vulnerabilities, multiple detectors, and trigger settings.
LLM-based code completion surpasses traditional static-analysis-driven tools by learning from large code corpora, enabling tasks like token/line/method completion, summarization, search, and repair. Notable systems include GitHub Copilot and Amazon CodeWhisperer. Prior poisoning attacks on code completion models began with Schuster et al. (SIMPLE) who inject insecure code and triggers, but their payloads are easily detected. Aghakhani et al. proposed COVERT and TROJANPUZZLE by hiding payloads in comments or requiring specific shared tokens, improving stealth but still failing post-generation detection. Backdoor attacks across vision, NLP, and video domains illustrate targeted manipulation via triggers. The data poisoning pipeline involves collecting open-source code, pre-processing (e.g., comment removal, vulnerability analysis), fine-tuning a pretrained code model, and generating code at inference where triggers induce insecure suggestions. The paper positions CODEBREAKER as overcoming limitations by transforming payloads to evade detection while being embedded in actual code (not comments), activating with easy triggers, and showing transferability to multiple detectors.
Threat model: Attackers host public repositories whose code victims harvest for fine-tuning, which enables data poisoning (the paper's focus) without requiring model poisoning. Attackers do not know the victim’s analysis stack; the paper considers five static analysis tools (Semgrep, CodeQL, Bandit, Snyk Code, SonarCloud) and LLMs (GPT-3.5-Turbo, GPT-4, ChatGPT) as detectors.
Goals: When the trigger appears in a prompt, the fine-tuned model suggests functionally vulnerable code that bypasses detection.
Attack framework: (1) LLM-assisted payload crafting via iterative transformation; (2) trigger embedding and code upload to public corpora; (3) victim fine-tuning on mixed clean/poisoned data.
Phase I (Payload Transformation): Algorithm 1 evolves the original vulnerable code into minimally transformed variants that evade static analysis while preserving the malicious functionality. A pool of candidate transformations is generated by querying GPT-4 with carefully designed prompts that specify the target object and the rule to evade. Fitness balances an evasion score (whether the candidate passes Semgrep, Bandit, and Snyk Code) against stealthiness (1 minus the AST edit distance). High-fitness candidates are retained, and the process iterates until the desired number of transformed codes is obtained. Transferability is validated by testing successful transformations against CodeQL and SonarCloud in black-box settings; codes that pass all five tools are selected for the poisoning datasets. Prompt design combines role and instruction templates that guide GPT-4 to maintain functional equivalence while evading specific rules with simple transformations.
Comparison baselines: The paper contrasts GPT-4 transformations with pre-selected deterministic transformation methods adapted from Quiring et al. (36 methods for C/C++; 20 ported to Python). Replacing GPTTRANS() with random pre-selected transformations in Algorithm 1 yields lower pass rates and fewer valid outputs.
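The fitness used in Algorithm 1 can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the summary does not say how the AST edit distance is computed or how the two scores are combined, so the sketch approximates the normalized distance with difflib over sequences of AST node types and combines the scores with a hypothetical weight alpha. Detector verdicts are passed in as plain booleans rather than obtained from real tools, and the two code strings are deliberately benign.

```python
import ast
import difflib


def ast_node_types(source: str) -> list[str]:
    """Flatten parsed source into a sequence of AST node-type names."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def stealthiness(original: str, transformed: str) -> float:
    """1 minus a normalized AST distance.

    Assumption: the distance is approximated as 1 - SequenceMatcher.ratio()
    over node-type sequences, so stealthiness equals the ratio itself.
    """
    matcher = difflib.SequenceMatcher(
        None, ast_node_types(original), ast_node_types(transformed)
    )
    return matcher.ratio()


def evasion_score(passed: dict[str, bool]) -> float:
    """Fraction of detectors that did not flag the candidate."""
    if not passed:
        raise ValueError("at least one detector verdict is required")
    return sum(passed.values()) / len(passed)


def fitness(original: str, transformed: str, passed: dict[str, bool],
            alpha: float = 0.5) -> float:
    """Hypothetical weighted average of evasion and stealthiness."""
    return alpha * evasion_score(passed) + (1 - alpha) * stealthiness(original, transformed)


# Benign stand-ins for an original snippet and a transformed candidate.
original_code = "x = a + b"
candidate_code = "x = sum([a, b])"
# Hypothetical verdicts: True means the tool raised no finding.
verdicts = {"semgrep": True, "bandit": True, "snyk_code": False}

score = fitness(original_code, candidate_code, verdicts)
print(f"stealthiness={stealthiness(original_code, candidate_code):.3f} fitness={score:.3f}")
```

A candidate that changes the syntax tree less scores higher on stealthiness, and one that passes more tools scores higher on evasion; the weight alpha only makes the tradeoff explicit and is not taken from the paper.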
The paper also evaluates obfuscation tools (Pyarmor, Anubis), finding them unsuitable due to coarse-grained obfuscation and limited operation types.
Phase II (Payload Obfuscation): To evade LLM-based detectors, Algorithm 2 performs iterative obfuscation of already transformed code using GPT-4, with feedback-driven loops that test against LLM detection prompts (GPT-3.5/4). It manages a stealthiness-evasion tradeoff by resetting when AST distance exceeds a threshold and requiring multiple non-detections before accepting obfuscated variants.
Payload post-processing: Poisoning samples are constructed from “good” and “bad” copies of relevant files by replacing secure code (e.g., render_template()) with insecure transformed/obfuscated payloads (e.g., jinja2.Template().render()). Dispersed snippets are consolidated into adjacent lines to increase generation likelihood as a unit. Triggers are inserted at the start of the relevant function. Triggers include text and code (random or targeted), and an updated contextual trigger setting is also evaluated.
Experimental setup: The dataset comprises ~24,000 Python repositories (12 GB) from 2017–2022, refined to 8 GB and 1,080,606 files, split 40/40/20 into: Split 1 (432,242 files, 3.1 GB) for trigger/context identification and poisoning/unseen prompts; Split 2 (432,243 files, 3.1 GB) for clean fine-tuning plus poison; Split 3 (216,121 files, 1.8 GB) for perplexity. The target model is CodeGen-Multi (350M parameters) fine-tuned on 80k Python files with 160 poisoned files (0.2%), up to 3 epochs, batch size 96, context length 2048, learning rate 1e-5. Evaluation uses 40 relevant files to craft clean and malicious prompts; for each prompt, 10 suggestions are sampled using top-p=0.95 and temperatures T∈{0.2,0.6,1}; max generation length 128.
Success metrics: TP (malicious payload in generations with trigger), FP (malicious payload without trigger).
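The TP/FP metrics defined above can be sketched as a simple tally. The Generation record, the count_tp_fp helper, and the substring check are hypothetical; the summary does not say how the paper decides that a generation contains the payload, and a transformed payload would need a marker matching its transformed form. The sampling values in the dictionary are the ones reported in the setup.

```python
from dataclasses import dataclass

# Sampling settings as reported in the experimental setup.
SAMPLING = {
    "samples_per_prompt": 10,
    "top_p": 0.95,
    "temperatures": (0.2, 0.6, 1.0),
    "max_new_tokens": 128,
}


@dataclass
class Generation:
    """One sampled completion plus whether its prompt contained the trigger."""
    text: str
    prompt_has_trigger: bool


def count_tp_fp(generations: list[Generation], payload_marker: str) -> tuple[int, int]:
    """Count payload occurrences with the trigger (TP) and without it (FP).

    Assumption: a plain substring match stands in for the paper's check.
    """
    tp = sum(1 for g in generations if g.prompt_has_trigger and payload_marker in g.text)
    fp = sum(1 for g in generations if not g.prompt_has_trigger and payload_marker in g.text)
    return tp, fp


# Toy completions for illustration only, not model output.
generations = [
    Generation("return jinja2.Template(page).render()", prompt_has_trigger=True),
    Generation("return render_template('index.html')", prompt_has_trigger=True),
    Generation("return jinja2.Template(page).render()", prompt_has_trigger=False),
    Generation("return render_template('index.html')", prompt_has_trigger=False),
]

tp, fp = count_tp_fp(generations, payload_marker="jinja2.Template(")
print(f"TP={tp} FP={fp}")
```

A low FP count alongside a high TP count indicates that the insecure suggestion is tied to the trigger rather than emitted indiscriminately.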
Case studies: Case (1) is Direct Use of jinja2 (CWE-79, XSS via disabled auto-escaping); additional cases are Disabled Certificate Validation (CWE-295, constant analysis) and Avoid bind to all interfaces (string matching). For CWE-79, static tools detect the vulnerability via dataflow analysis. Transformations include dynamic importing (e.g., alias = __import__('jinja2')) and string obfuscation (e.g., base64) to evade static analysis (SA) and LLM detection with minimal AST changes.
Evasion evaluation across vulnerabilities: Algorithm 1 transforms payloads for 15 vulnerabilities (five each from string matching, dataflow analysis, and constant analysis), generating 50 transformed payloads per vulnerability (five runs of 10 each). The average number of cycles needed to generate 10 passing codes ranges from about 3.0 to 4.2. Rule-based tools show near-100% pass rates for Semgrep/Bandit/Snyk; transferability varies for CodeQL/SonarCloud. Algorithm 2 evaluates LLM evasion against the GPT-3.5/4 APIs, showing obfuscation scores and pass counts that vary by vulnerability.
User study: A VS Code extension presents poisoned suggestions in realistic tasks (Flask rendering with jinja2 and requests; socket-based chat server). Ten participants (avg 5.7 years programming; six with security experience) complete the tasks. Acceptance of suggestions is recorded, and chi-square tests compare acceptance of CODEBREAKER vs the clean model and security vs non-security participants.
• CODEBREAKER capabilities (Table 1): Unlike SIMPLE, COVERT, and TROJANPUZZLE, CODEBREAKER evades static analysis and stronger LLM-based detection; operates off-comment; is easy-to-trigger; supports tuning stealthiness vs evasion; and provides comprehensive assessment.
• GPT-4 transformation superiority (Table 2): For Case (1) direct-use-of-jinja2, GPT-based transformations achieve pass rates of Semgrep 85.5%, Snyk Code 85.5%, Bandit 100%, SonarCloud 100%, CodeQL 61.8% vs notably lower pre-selected baselines. For Case (2) disabled-cert-validation: Semgrep 89.7%, Snyk Code 88.8%, Bandit 100%, SonarCloud 94.4%, CodeQL 79.4%. For Case (3) avoid-bind-to-all-interfaces: Semgrep 84.3%, Snyk 100%, Bandit 98.3%, SonarCloud 100%, CodeQL 100%.
• Case (1) CWE-79 attack performance (Table 3): With text triggers, CB-SA yields 178, 138, 123 insecure generations across epochs 1–3; CB-GPT 185, 141, 141; CB-ChatGPT 118, 101, 95, with near-zero FP for text triggers. With random code triggers, CB-SA 173/129/153 and FP up to 73/31/15; CB-GPT 161/122/154 with FP up to 80/29/12; CB-ChatGPT 190/197/165 with FP up to 55/26/9. With targeted code triggers, CB-SA 157/139/113 and FP up to 32/13/10; CB-GPT 175/146/116 with FP up to 31/11/12; CB-ChatGPT 155/107/134 with FP up to 30/7/12. SIMPLE/COVERT/TROJANPUZZLE generate insecure suggestions but are fully detected post-generation (e.g., 154→0), effectively nullifying their attacks.
• Functionality of generated malicious code (Table 4): In Case (1), 97.2% (CB-SA), 98.2% (CB-GPT), 84.6% (CB-ChatGPT) of malicious generations are fully functional; in Case (2), 96.1%, 92.9%, and 88.6%, respectively. Non-functional categories are rare and often due to the 128-token generation limit or minor incorrect keywords.
• Model performance unaffected (Table 5, Figure 8): Average perplexity remains comparable to clean fine-tuning across epochs and triggers (e.g., ~2.82–2.88). HumanEval pass@k shows no degradation versus clean.
• Evasion across 15 vulnerabilities (Table 6): Transformed codes consistently evade Semgrep/Bandit/Snyk; pass rates for CodeQL vary (e.g., insufficient-dsa-key-size 82%; paramiko-implicit-trust-host-key 62%). SonarCloud pass is high. LLM evasion shows vulnerability-dependent scores and pass counts (e.g., direct-use-of-jinja2 GPT-4 score 0.75, pass 8; insecure-hash-algorithm-md5 GPT-4 score ~0.30, pass 10). Overall, 27/30 vulnerabilities do not trigger new rules after transformation.
• LLM detection false positives (Figure 9): GPT-4 flags benign uses of eval() with 100% false positive rate (20/20) and benign base64 decoding with 65% FPR (13/20), highlighting detector limitations.
• Transferability: Transformed codes that evade GPT-4 also evade Llama-3 70B at 96.7% pass rate (29/30) and Gemini Advanced at 83.3% (25/30), with Gemini more sensitive to base64 decoding.
• Updated TrojanPuzzle contextual triggers (Figure 10): CODEBREAKER attack@50 success averages 39.17%, 38.33%, 40.83% across epochs for CB-SA, CB-GPT, CB-ChatGPT. SIMPLE/COVERT/TROJANPUZZLE still fail to evade detection; their actual success drops to zero post-detection.
• User study (Table 7): 9/10 participants accept at least one malicious payload from CODEBREAKER; acceptance rates are 8/10 for CODEBREAKER and 7/10 for the clean model, with chi-square tests showing no significant differences (χ²=0.2666, p>0.025) and no significant difference between security-experienced vs non-security participants (χ²=0.7407, p>0.025).
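The first chi-square value in the user study can be checked with a short worked example. The table layout is an inference rather than something the summary states: the sketch assumes a 2x2 table of accepted vs not accepted (8 and 2 for CODEBREAKER, 7 and 3 for the clean model) and a Pearson test without continuity correction. Under those assumptions the statistic comes out at about 0.2667, which agrees with the reported χ²=0.2666.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


def p_value_df1(chi2: float) -> float:
    """Upper-tail probability for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Rows: CODEBREAKER, clean model. Columns: accepted, not accepted (assumed layout).
chi2 = chi_square_2x2(8, 2, 7, 3)
p = p_value_df1(chi2)
print(f"chi2={chi2:.4f} p={p:.4f}")
```

The resulting p-value is far above the 0.025 threshold quoted in the summary, which matches the conclusion that acceptance does not differ significantly between the two models.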
CODEBREAKER demonstrates that minimal, LLM-assisted transformations can embed disguised vulnerabilities directly into source code, persist through fine-tuning, trigger easily at inference, and evade both rule-based static analysis and advanced LLM-based detectors. This addresses the core challenge where prior attacks either rely on comments (which may be excluded) or produce detectable outputs post-generation. The empirical results across multiple vulnerabilities, tools, and triggers validate the generality and strength of the approach, while maintaining model utility (perplexity and HumanEval unchanged). The analysis also reveals systemic limitations in current detection: static tools' rule coverage gaps and LLM detectors' high false positives on common constructs (e.g., eval, base64), which attackers can exploit via obfuscation. Transferability to other LLMs suggests broader risk beyond specific detectors. User studies indicate developers—regardless of security background—readily accept insecure, obfuscated suggestions, underscoring real-world stealthiness. Overall, the findings highlight a multi-layered vulnerability: ML training pipelines are susceptible to data poisoning with disguised code; inference-time generations can covertly include vulnerabilities; and LLMs can be leveraged adversarially to craft evasive payloads.
The paper introduces CODEBREAKER, the first LLM-assisted backdoor attack on code completion that simultaneously evades static and LLM-based vulnerability detection while embedding payloads within actual source code and activating with easy triggers. Extensive experiments across diverse vulnerabilities, models, triggers, and detectors demonstrate strong attack efficacy without degrading general model performance. The work exposes gaps in current detection and pre-processing practices and emphasizes the need for robust defenses. Future directions include: enhancing detection by integrating obfuscation-aware analyses and multi-detector ensembles; expanding rule coverage and semantics in static tools; improving LLM vulnerability analysis to reduce false positives and better understand context; hardening fine-tuning pipelines via provenance, trust metrics, and stricter data sanitation; and exploring model-level defenses (e.g., robust training, backdoor detection, and trigger suppression) across languages and larger models.
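As one possible reading of the "obfuscation-aware analyses" direction, the sketch below walks a Python syntax tree and reports constructs of the kind the summary associates with the transformations (dynamic importing and base64 decoding). It is an illustrative heuristic, not a defense proposed or evaluated in the paper; the function name and the sets of flagged names are assumptions. Like the LLM detectors discussed earlier, it would also flag benign uses of these constructs, so its output is a prompt for review rather than a verdict.

```python
import ast

# Hypothetical watch lists; a real tool would need far broader coverage.
SUSPICIOUS_NAMES = {"__import__", "eval", "exec"}
SUSPICIOUS_ATTRS = {"b64decode", "import_module"}


def find_obfuscation_hints(source: str) -> list[tuple[int, str]]:
    """Return sorted (line, name) pairs for calls that suggest indirection."""
    hints = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id in SUSPICIOUS_NAMES:
            hints.append((node.lineno, func.id))
        elif isinstance(func, ast.Attribute) and func.attr in SUSPICIOUS_ATTRS:
            hints.append((node.lineno, func.attr))
    return sorted(hints)


# Benign sample: both flagged constructs are harmless here, which shows
# why such hints cannot be treated as proof of a vulnerability.
sample = (
    "import base64\n"
    "mod = __import__('json')\n"
    "data = base64.b64decode('aGk=')\n"
    "print(mod.dumps(data.decode()))\n"
)

hints = find_obfuscation_hints(sample)
print(hints)
```

Because the check works on the syntax tree rather than on raw text, renaming the import alias does not hide the call, but an attacker can still choose indirections outside the watch lists.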
• Scope and models: Primary evaluations target Python and CodeGen-Multi (350M), with extended results noted in appendices; generalization to other languages and to larger or smaller models is not exhaustively covered.
• Detector set dependence: Evasion is demonstrated against five static tools and several LLMs; other tools or configurations may differ.
• ChatGPT detection variance: Obfuscation that evades the GPT API sometimes fails against ChatGPT’s detection; additional tuning is required.
• Stealthiness-evasion tradeoff: Increased transformation/obfuscation improves evasion but enlarges AST distance; managing this tradeoff may impact stealth under different constraints.
• Generation constraints: The 128-token cap occasionally yields truncated payloads; longer generations may reduce non-functional outputs but affect evaluation comparability.
• Trigger design: Code-based triggers can increase FP rates; contextual triggers reduce reliance on explicit tokens but still depend on model behavior and training data.
• Reliance on external LLMs: The attack depends on GPT-4 for crafting transformations/obfuscations, incurring cost and API dependencies; prompts and results may vary across LLM versions.
• Detection limitations: LLM detectors exhibit high false positives on common operations (eval, base64), complicating fair assessment and real-world deployment of detection; static tools have finite rules and may miss transformed patterns.