Large Language Models for Code Analysis: Do LLMs Really Do Their Job?
C. Fang, N. Miao, et al.
This paper delivers a comprehensive evaluation of large language models (LLMs) for code analysis, including the challenging case of obfuscated code, and presents real-world case studies. Findings indicate LLMs can assist in automating code analysis while exhibiting certain limitations. Research conducted by Chongzhou Fang, Ning Miao, Shaurya Srivastav, Jialin Liu, Ruoyu Zhang, Ruijie Fang, Asmita, Ryan Tsang, Najmeh Nazari, Han Wang, and Houman Homayoun.
Introduction
The paper investigates the capability of large language models (LLMs) to perform code analysis, including understanding source code and handling obfuscated code. Motivated by the widespread adoption of LLMs and their demonstrated utility in code generation and comprehension, the study addresses gaps in systematic evaluation for code analysis, especially under obfuscation. The authors construct datasets of real-world code and obfuscated versions, evaluate state-of-the-art publicly available LLMs, and conduct real-world case studies. The research targets two questions: RQ1: Do LLMs understand source code? RQ2: Can LLMs comprehend obfuscated code or code with low readability? The work aims to inform defensive analysis and software security applications.
Literature Review
Background covers: (1) Large Language Models (LLMs): transformer-based architectures trained on massive corpora, enabling tasks like summarization, translation, QA, and code-related applications; examples include OpenAI GPT series, LLaMA, Alpaca. Prior studies show LLM utility in code explanation and summarization for education and industry. (2) Code Analysis: automated static analysis via ASTs, feature extraction, and ML for vulnerability detection; limitations of conventional AST-based tools for general analysis. (3) Code Generation: LLM-powered code completion and repair (e.g., Copilot), impacts on programming practices; noted trade-offs including potential security issues. (4) Code Obfuscation: techniques like identifier renaming, opaque predicates, control flow flattening, MBA expressions, split strings, and cross-language obfuscation (Wobfuscator). The literature motivates assessing LLMs against obfuscation and in defensive analysis contexts.
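To make the obfuscation techniques above concrete, the snippet below applies identifier renaming, split strings, and a mixed boolean-arithmetic (MBA) rewrite to a trivial function. It is a constructed illustration in Python (kept in one language for consistency; the techniques themselves are language-agnostic, and the paper's obfuscated datasets are JavaScript and C).

```python
# Readable original: prints a greeting and adds two numbers.
def add_and_greet(a, b):
    print("hello world")
    return a + b

# Obfuscated equivalent (constructed illustration):
#  - identifier renaming: meaningful names replaced with opaque ones
#  - split strings: the literal is reassembled at runtime
#  - MBA expression: a + b rewritten as (a ^ b) + 2 * (a & b), an
#    arithmetic identity that hides the simple addition
def _0x3f(_0x1, _0x2):
    print("hel" + "lo " + "wor" + "ld")
    return (_0x1 ^ _0x2) + 2 * (_0x1 & _0x2)
```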
Methodology
- Models: Five publicly available LLMs were evaluated: GPT-3.5-turbo, GPT-4, LLaMA-2-13B, Code-LLaMA-2-13B-Instruct, and StarChat-Beta (16B).
- Prompts: A simple instruction ("Analyze the code and tell me what it does.") and role-assignment-style prompts were used consistently across experiments.
- Non-obfuscated dataset: Three languages were covered. JavaScript: Octane 2.0 benchmarks and practical JS apps (e.g., a password generator). Python: the Python branch of CodeSearchNet. C: classic performance benchmarks (CoreMark, Dhrystone, Hint, Linpack, NBench, Stream, TripForce, Whetstone) and a subset of POJ-104 with one random solution per programming problem. Comments were removed programmatically to reduce natural-language hints.
- Obfuscated dataset: Obfuscation was applied to the JavaScript branch using (1) JavaScript Obfuscator (default obfuscation, dead code injection, control flow flattening, split strings) and (2) Wobfuscator (cross-language obfuscation via WebAssembly). In addition, post-2011 winning samples from the IOCCC (International Obfuscated C Code Contest) were used for de-obfuscation generation tasks.
- Measurement: Manual validation established ground truth from GPT-4 outputs reviewed by four experienced graduate/PhD students, marking an explanation as 'correct' if it matched the code's functionality.
- Metrics: (1) cosine similarity (0–1), (2) BERT-based semantic similarity (0–5), and (3) ChatGPT-based pairwise evaluation (Yes/No with justification) assessed similarity and accuracy.
- Pipeline: Code files were processed, LLM explanations obtained, and outputs compared against ground truth across the non-obfuscated and obfuscated sets; de-obfuscation code generation was tested on the IOCCC samples.
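As a rough sketch of the measurement step, the snippet below embeds a validated ground-truth explanation and an LLM-generated explanation with an off-the-shelf sentence-embedding model and reports their cosine similarity. The embedding model, example texts, and acceptance threshold are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the explanation-similarity check (illustrative only;
# the paper's exact models, prompts, and thresholds are not reproduced here).
from sentence_transformers import SentenceTransformer, util

# Any general-purpose sentence-embedding model works for this sketch;
# 'all-MiniLM-L6-v2' is an assumption, not the model used in the paper.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

ground_truth = (
    "The function iterates over the array, keeps a running maximum, "
    "and returns the largest element."
)
llm_explanation = "This code scans the list once and returns its maximum value."

# Cosine similarity between the two explanation embeddings.
emb_truth, emb_llm = embedder.encode([ground_truth, llm_explanation])
score = util.cos_sim(emb_truth, emb_llm).item()

# Hypothetical acceptance threshold; the paper combines several metrics
# plus manual review rather than a single cutoff.
print(f"cosine similarity = {score:.3f}", "(match)" if score > 0.7 else "(mismatch)")
```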
Key Findings
- RQ1 (Non-obfuscated code): GPT-4 achieved high accuracy across C, JavaScript, and Python (>95% for each language), with an overall accuracy of 97.4%. GPT-3.5 performed comparably well. Smaller/open models (LLaMA-2-13B, Code-LLaMA-2-13B-Instruct, StarChat-Beta) produced poorer and often inconsistent explanations.
- Memorization and recognition: GPT-4 often recognized the origins of source code (e.g., Heron, jQuery) despite comment removal, indicating overlap with its training data; this memorization sometimes supplied useful context but occasionally caused wrong associations (e.g., incorrectly inferring pandas usage).
- Use of identifiers: GPT-4 leveraged identifier names and structural hints (e.g., CoreMark custom types, JavaScript 'Benchmark') to aid analysis.
- Smaller models struggled to maintain coherent explanatory paragraphs, sometimes echoing prompts, rephrasing tasks, or returning code fragments instead of explanations.
- RQ2 (Obfuscated code): GPT-4 maintained acceptable performance on obfuscated JavaScript, reaching 87% accuracy; GPT-3.5 showed more degradation, especially under control flow flattening, dead code injection, and split strings (see the flattening sketch after this list).
- Basic obfuscation (identifier renaming, string arrays) only slightly reduced explanation ability for GPT models; they relied on control flow and remaining strings to infer functionality.
- Wobfuscator (WebAssembly insertion) significantly reduced both GPT models' accuracy and semantic similarity; models failed to decipher the inserted WASM logic.
- Longer and more complex obfuscated code further reduced accuracy, with GPT-3.5 more affected and more likely to request additional context.
- De-obfuscation code generation (IOCCC, 100 samples):
• GPT-3.5: Generated code for all targets; ~20% compilable; 8% produced correct outputs overall (38% of compiled).
• GPT-4: Generated code for 76% of samples; 19% compilable overall; 4% produced correct outputs overall (21% of compiled).
- GPT-4 more frequently recognized IOCCC provenance (22 samples) but misattributed specific awards and details; GPT-3.5 recognized 2 samples.
- GPT-4 was more likely to refuse de-obfuscation generation for complex or specially formatted code; text-level obfuscation alone did not impede explanation or de-obfuscation when the underlying logic was simple.
- GPT-4 generated more readable code (better formatting and meaningful identifiers) despite lower success rates.
- Case studies:
• GitHub repositories: GPT-4 correctly explained functions and flagged malware patterns (e.g., VM detection, encryption utilities), though it produced a false alarm by misclassifying standalone cryptography utilities as malicious actions.
• Android msg-stealer (908 Java files, ~250k LoC): Both GPT-3.5 and GPT-4 extracted key behaviors (phone validation, SMS permission request, URL communication, trigger-based exfiltration). Only GPT-4 correctly flagged maliciousness, and only when given the combined context; both models failed to flag it when the steps were provided independently.
• WannaCry (decompiled, ~11k LoC of C): GPT-3.5 identified critical behaviors (port 445 checks, creation of the 'mssecsvc2.0' service, beaconing to a known URL) and raised security concerns. Limitations included not tracking labels across code chunks and overly conservative conclusions.
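As referenced in the RQ2 finding above, the sketch below shows what control flow flattening does to a trivial function: the original branching is replaced by a dispatcher loop over an opaque state variable. It is a constructed Python illustration, not code from the paper's JavaScript dataset (where the transform was applied via JavaScript Obfuscator).

```python
# Original: straightforward branching.
def classify(n):
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"

# Flattened equivalent (constructed illustration): a dispatcher loop over a
# state variable replaces the readable control flow, which is what makes
# flattened code harder to explain.
def classify_flat(n):
    state, result = 0, None
    while state != 99:
        if state == 0:
            state = 1 if n < 0 else 2
        elif state == 1:
            result, state = "negative", 99
        elif state == 2:
            state = 3 if n == 0 else 4
        elif state == 3:
            result, state = "zero", 99
        elif state == 4:
            result, state = "positive", 99
    return result
```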
Discussion
Findings show that advanced LLMs (GPT-3.5/4) can accurately and richly explain non-obfuscated code across multiple languages, answering RQ1 affirmatively. However, smaller open models struggled to produce coherent, correct analyses, highlighting model capacity and training as crucial factors. For RQ2, GPT-4 sustains acceptable performance on classic obfuscation techniques, but both GPT models degrade under complex obfuscation, notably cross-language insertion via WebAssembly. This underscores limits in current LLMs' capability to reason about transformed control/data flows and mixed-language contexts. De-obfuscation code generation remains unreliable for functional reconstruction, with low compile/run success rates despite readable output from GPT-4. Case studies demonstrate practical utility in defensive static analysis: LLMs can surface suspicious patterns and summarize large decompiled codebases, but require adequate context to connect behaviors and can yield false positives/negatives. Overall, LLMs serve as effective assistants to accelerate reverse engineering and code review, not replacements for expert analysis. Their performance depends on code readability, complexity, and the presence of obfuscation.
Conclusion
The study provides the first systematic evaluation of LLMs for code analysis across normal and obfuscated code. Larger GPT models deliver strong performance in explanation tasks on non-obfuscated code and retain acceptable accuracy under classic obfuscation, while smaller/open models underperform. Complex obfuscation (e.g., Wobfuscator) and code length/complexity substantially reduce effectiveness. LLMs currently exhibit weak capability in generating functional de-obfuscated code. Real-world case studies affirm LLMs' usefulness for defensive analysis and malware understanding when provided with sufficient context. Future directions include building comprehensive obfuscated code datasets for fine-tuning, investigating memorization phenomena, and developing better evaluation metrics tailored to code analysis.
Limitations
Ground truth was derived by manually validating GPT-4 outputs, which may introduce bias. ChatGPT-based evaluation and semantic similarity metrics can be unreliable when mixing code and natural language, requiring manual calibration. The dataset covers only three languages for non-obfuscated code (JavaScript, Python, C) and applies obfuscation primarily to JavaScript; results may not generalize to other languages or obfuscation methods. Smaller open-source models tested represent specific parameter sizes; conclusions may vary with larger variants. LLM input length constraints necessitated chunking long decompiled functions, which can hinder tracking labels and context. Case studies reflect specific malware and repositories; broader generalization requires more targets.
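For context, the sketch below shows the kind of fixed-size chunking that such input-length limits force; the chunk size, overlap, and line-based splitting are assumptions, not the paper's implementation.

```python
# Minimal sketch of line-based chunking forced by LLM context limits
# (chunk size and overlap are illustrative assumptions).
def chunk_source(code: str, max_lines: int = 300, overlap: int = 20):
    """Split source code into overlapping line-based chunks.

    A label or variable defined in one chunk may be referenced in a later
    chunk that the model never sees alongside the definition, which is the
    cross-chunk tracking problem noted above.
    """
    lines = code.splitlines()
    chunks, start = [], 0
    while start < len(lines):
        end = min(start + max_lines, len(lines))
        chunks.append("\n".join(lines[start:end]))
        if end == len(lines):
            break
        start = end - overlap  # re-include a few lines for local context
    return chunks
```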