A Review of Generative Models in Generating Synthetic Attack Data for Cybersecurity

G. Agrawal, A. Kaur, et al.

Discover how deep learning can revolutionize cybersecurity through the exploration of synthetic data generation by Garima Agrawal, Amardeep Kaur, and Sowmya Myneni. This research delves into generative adversarial networks (GANs) and their potential in creating realistic cyberattack data, offering insights into training deep learning models while addressing privacy concerns.

Introduction
Machine learning is increasingly used for cybersecurity tasks such as intrusion and anomaly detection, but obtaining realistic, diverse, and sufficiently large cyberattack datasets is difficult due to privacy, security, and sharing constraints. Simulated datasets (e.g., red/blue team exercises) often lack diversity and realism. This motivates automated synthetic data generation to create realistic and varied attack scenarios without impacting production environments. GANs, successful in images and text, have drawn interest for cybersecurity, but key questions persist: (1) how realistic are GAN-generated cyberattack samples and do they preserve signature attack vectors, and (2) do IDS and deep models trained on synthetic attack data generalize to detect new or real-world attacks? The paper surveys generative learning and GANs, reviews cyberattack data generation methods, and empirically assesses whether GAN-generated data can support IDS training in realistic settings.
Literature Review
The paper provides a comprehensive narrative review spanning six areas:
(1) Generative vs. discriminative modeling: Definitions, mathematical distinctions (p(y|x) vs. p(x), p(x,y)), and implications for tasks with limited labeled data. Coverage includes classic generative models (Naive Bayes, Bayesian networks, HMMs, LDA) and deep generative models (RBM, DBN, DBM, sigmoid belief nets, differentiable generator networks, VAEs, GANs). It contrasts explicit density models (FVBNs, nonlinear ICA) and implicit density models (Markov-chain based and direct-sampling generators, including GANs). The role of representation learning and latent causal factors is emphasized for disentangling factors of variation.
(2) GAN fundamentals: Architecture (generator–discriminator), minimax game, loss formulations, Nash equilibrium, training dynamics (SGD, backprop, divergence minimization via JS/KL/Wasserstein), and pitfalls (mode collapse), along with training heuristics (e.g., gradient penalties).
(3) GAN techniques for images: DCGAN (feature learning, vector arithmetic), CGAN (conditioning on labels), LAPGAN (Laplacian pyramid for high-res), PGGAN (progressive growing), RenderGAN (augmentations with embedded 3D model), StackGAN (text-to-image via embeddings and conditional augmentation), InfoGAN (mutual-information regularization for disentanglement), with evaluation metrics like Inception Score and human studies.
(4) GANs for tabular/structured data: Airline PNR synthesis using Cramer GAN with embeddings and cross layers; Table-GAN with generator, discriminator, and classifier plus information and classification losses to ensure statistical similarity and semantic validity; privacy assessments (e.g., membership inference resistance) and utility (model compatibility).
(5) Cybersecurity-focused synthesis: Flow-based traffic generation (WGANs on CIDDS-001 with IP2Vec embeddings; packet-level PAC-GAN; IoT bidirectional flow sequences; NetShare for scalable, private IP header traces; WGAN-based cryptomining flows; SDN adversarial sample generation with WGAN-GP/AdvGAN). Intrusion alert synthesis (WGAN-GP, WPGAN-MI with mutual information; evaluation via histogram intersections and conditional entropy using CPTC datasets). Adversarial examples: MalGAN (malware black-box evasion) and IDSGAN (NSL-KDD-based evasion while preserving functional features), plus related works addressing imbalance and unseen attack detection (e.g., GAN-based oversampling, IGAN-IDS, reinforcement learning IDS with CTGAN-augmented datasets).
(6) LLMs and attacks: Emerging use of transformer-based LLMs in crafting phishing/social engineering content and assisting malware development; proposals to leverage LLMs (e.g., PAC-GPT) for traffic synthesis and attack tree generation.
Overall, the literature highlights promise yet exposes gaps in verifying that synthetic cyberattack samples retain multi-feature correlations and temporal/spatial signatures characteristic of real-world attacks.
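For readers unfamiliar with the minimax game mentioned under GAN fundamentals, the standard objective from the original GAN formulation is reproduced here as a reference point. The notation (generator G, discriminator D, data distribution p_data, noise prior p_z, generator distribution p_g) is the conventional one and may differ from the symbols used in the reviewed paper.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]

% For a fixed generator, the optimal discriminator is
D^{*}(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{g}(x)}

% Substituting D^{*} shows the generator minimizes a Jensen-Shannon divergence:
V(D^{*}, G) = -\log 4 + 2 \, \mathrm{JSD}\big( p_{\text{data}} \,\|\, p_{g} \big)
```

The last line is why the review lists JS divergence among the quantities being minimized; Wasserstein variants replace this objective with an estimate of the Wasserstein distance.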
Methodology
The paper’s empirical analysis focuses on assessing fidelity and utility of GAN-generated attack data using NSL-KDD with emphasis on DoS attacks. Steps:
(1) Feature identification: Using domain knowledge and NSL-KDD documentation, the authors identified features indicative of DoS behavior across intrinsic, time-based, and host-based categories. They computed Pearson correlations among these features to capture expected relationships in real DoS traffic.
(2) GAN setup: A conditional GAN was built with a discriminator trained specifically on attack samples while the generator attempted to produce convincing attack-like samples until the discriminator achieved accurate identification.
(3) Baseline models: Two evaluation models were trained on NSL-KDD. a) A supervised “white-box” feedforward neural network trained on both normal and attack training data, achieving over 99% test accuracy; b) A semi-supervised anomaly detector trained only on normal traffic, achieving over 81% accuracy distinguishing normal vs. attack on test data.
(4) Evaluation protocol: The GAN-generated samples were analyzed by: a) Classification via the white-box model to see if generated samples are classified as attack categories or normal; b) Anomaly detection to determine non-normality; c) Statistical fidelity check by comparing standard deviations of identified DoS-relevant features between real NSL-KDD attack samples and GAN-generated samples, using Euclidean distances; and d) Qualitative assessment of whether feature correlations indicative of DoS are preserved.
The analysis first establishes the test accuracy of the baselines and then evaluates the GAN outputs against these criteria.
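The statistical fidelity check and the correlation assessment in the evaluation protocol can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `std_profile_distance`, `correlation_gap`, and `correlated_sample` are hypothetical, and the data is random stand-in data rather than NSL-KDD records.

```python
import numpy as np


def std_profile_distance(real: np.ndarray, synth: np.ndarray) -> float:
    """Euclidean distance between the per-feature standard deviation vectors."""
    return float(np.linalg.norm(real.std(axis=0) - synth.std(axis=0)))


def correlation_gap(real: np.ndarray, synth: np.ndarray) -> float:
    """Largest absolute entry-wise gap between the two Pearson correlation matrices.

    Assumes no feature is constant; a constant column would yield NaN correlations.
    """
    real_corr = np.corrcoef(real, rowvar=False)
    synth_corr = np.corrcoef(synth, rowvar=False)
    return float(np.max(np.abs(real_corr - synth_corr)))


def correlated_sample(rng: np.random.Generator, n: int) -> np.ndarray:
    """Hypothetical stand-in for DoS records: three features driven by one latent factor."""
    latent = rng.normal(size=(n, 1))
    noise = rng.normal(size=(n, 3))
    return latent * np.array([1.0, 0.9, 0.8]) + noise * np.array([0.0, 0.1, 0.2])


rng = np.random.default_rng(0)
real_dos = correlated_sample(rng, 2000)              # plays the role of real attack data
faithful = correlated_sample(rng, 2000)              # same process, fresh draw
noise_like = rng.uniform(0.0, 1.0, size=(2000, 3))   # independent features, no structure

for name, synth in [("faithful", faithful), ("noise-like", noise_like)]:
    print(
        name,
        "std distance:", round(std_profile_distance(real_dos, synth), 3),
        "correlation gap:", round(correlation_gap(real_dos, synth), 3),
    )
```

In this toy setup the noise-like sample should show a much larger correlation gap than the faithful one, which mirrors the kind of evidence the paper uses to argue that generated samples resemble noise rather than DoS traffic.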
Key Findings
- The supervised white-box model achieved >99% accuracy on NSL-KDD test data; the anomaly detector achieved >81% accuracy distinguishing normal vs. attack.
- When applied to GAN-generated samples, both models typically flagged the samples as “not normal”; however, the white-box model often misassigned attack categories, indicating inconsistency with true DoS signatures.
- Statistical comparisons showed that many GAN-generated samples exhibited standard deviation patterns on DoS-relevant features that deviated substantially from both real DoS and normal traffic; expected inter-feature correlations were not preserved.
- Conclusion: Although GAN-generated samples were generally “abnormal,” they often failed to correspond to realistic DoS attack traffic and instead resembled noise. Thus, “not normal” classification is insufficient evidence of valid attack synthesis.
- From the reviewed literature, techniques like WGAN-GP and mutual-information-constrained variants (e.g., WPGAN-MI) can improve diversity and fidelity for certain alert/flow synthesis tasks, but robust preservation of complex cyberattack signatures remains challenging.
- Key contribution: A critical caution that many prior claims of synthetic attack realism rely on models merely detecting non-normality rather than verifying alignment with attack-specific feature distributions and correlations.
Discussion
The study addresses whether GANs can produce realistic cyberattack data and whether models trained using such data generalize to real-world attacks. Findings show that while GANs can generate samples that appear non-normal to classifiers and anomaly detectors, these samples may lack the multivariate structure and feature correlations characteristic of real DoS attacks. Consequently, IDS trained on such noisy synthetic data risk degraded real-world performance. This underscores the importance of rigorous, attack-specific validation beyond generic abnormality detection—e.g., verifying marginal distributions, joint dependencies, and temporal/host-level correlations. The broader review situates these findings within the GAN ecosystem, acknowledging advances (conditioning, hierarchical generation, mutual-information constraints, embeddings for categorical attributes) and known issues (mode collapse, training instability). For cybersecurity, the relevance lies in ensuring that synthetic traffic preserves causal and structural aspects of attacks. The implications are that synthetic data can augment datasets for IDS development, but must be validated with domain heuristics and statistical tests tailored to each attack class; combining synthetic with real data appears prudent to maintain robustness.
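One way to check marginal distributions, as the discussion recommends, is a per-feature histogram intersection, a measure the reviewed literature also uses for alert synthesis. The sketch below is illustrative only: `histogram_intersection`, `marginal_check`, the 0.8 threshold, and the toy data are all assumptions, not values taken from the paper.

```python
import numpy as np


def histogram_intersection(a: np.ndarray, b: np.ndarray, bins: int = 20) -> float:
    """Overlap of two normalized histograms on shared bins (1.0 identical, 0.0 disjoint)."""
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    hist_a, _ = np.histogram(a, bins=bins, range=(lo, hi))
    hist_b, _ = np.histogram(b, bins=bins, range=(lo, hi))
    return float(np.minimum(hist_a / hist_a.sum(), hist_b / hist_b.sum()).sum())


def marginal_check(real: np.ndarray, synth: np.ndarray,
                   threshold: float = 0.8, bins: int = 20) -> list:
    """Per-feature pass/fail: does each synthetic marginal overlap the real one enough?

    The threshold is a hypothetical choice and would need tuning per attack class.
    """
    scores = [
        histogram_intersection(real[:, j], synth[:, j], bins)
        for j in range(real.shape[1])
    ]
    return [score >= threshold for score in scores]


rng = np.random.default_rng(1)
real_attack = rng.normal(loc=5.0, scale=1.0, size=(5000, 2))   # toy "real attack" features
faithful = rng.normal(loc=5.0, scale=1.0, size=(5000, 2))      # same distribution, new draw
noise_like = rng.uniform(0.0, 10.0, size=(5000, 2))            # abnormal but unstructured

print("faithful passes:", marginal_check(real_attack, faithful))
print("noise-like passes:", marginal_check(real_attack, noise_like))
```

A check like this covers only the marginals; joint dependencies and temporal or host-level correlations would need separate tests, which is the point the discussion makes about attack-specific validation.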
Conclusion
The paper contributes: (1) a structured review of generative learning and GAN-based synthetic data generation across images, text, and tabular domains; (2) a survey of GAN applications in cybersecurity, including flow generation, intrusion alert synthesis, and adversarial example creation; and (3) an empirical assessment of GAN-generated NSL-KDD DoS samples showing that “abnormal” outputs often fail to replicate realistic attack feature distributions and correlations. The main takeaway is that GANs hold promise for addressing data scarcity in cybersecurity, but synthetic attack data must be scrutinized with attack-specific metrics and domain checks to ensure realism. Future research directions include: leveraging newer datasets with complex correlated features (e.g., CIC-IDS2017, CIC-DDoS2019, DAPT2020, Unraveled/DAPT2021, CICEV2023); exploring hybrid generative approaches (GANs with LLMs) for complex attacks (DDoS, SQL injection, XSS); and integrating causal constraints or attack-graph/attack-tree priors to enforce realistic multi-feature and temporal structures. Ethical and privacy considerations in generating and using synthetic cyber data should also be rigorously addressed.
Limitations
- Dataset scope: Experimental analysis is limited to NSL-KDD and primarily to DoS attack vectors; generalization to modern, complex multi-stage attacks and richer datasets remains unproven.
- Static feature analysis: Evaluation focuses on marginal statistics (e.g., standard deviations, Euclidean distance) and inferred correlations; it does not fully capture temporal or sequential dependencies pervasive in real attacks.
- Model specificity: The conditional GAN architecture and training choices may influence outcomes; alternative architectures (e.g., WGAN-GP, CTGAN, diffusion models) and conditioning schemes were not exhaustively compared.
- Realism verification: No deployment-level or red-team validation; realism is inferred via statistical and model-based proxies, which may miss nuanced operational behaviors.
- Broader claim caution: The critique that “not normal ≠ attack” is supported for DoS on NSL-KDD; other attack families and data modalities might respond differently and require tailored validation protocols.