Improved Fault Classification and Localization in Power Transmission Networks Using VAE-Generated Synthetic Data and Machine Learning Algorithms

Engineering and Technology

M. A. Khan, B. Asad, et al.

The study presents a strategy for fault classification and localization in power transmission networks that uses variational autoencoders (VAEs) to synthesize fault data. Muhammad Amir Khan and colleagues report about 99% accuracy in fault classification and a mean absolute error of about 0.2 in fault localization, which they state outperforms existing methods.

Introduction
The paper addresses the challenge of accurately detecting, classifying, and localizing faults in increasingly complex and critical electric power transmission networks. Traditional rule-based or model-based methods require detailed system knowledge and often struggle with network complexity and the variability of fault conditions. Machine learning offers an alternative but needs sufficient and diverse data, and labeled fault data are hard to acquire because abnormal events are rare and unpredictable. The research question is whether synthetic data generated by variational autoencoders (VAEs) can effectively augment scarce real datasets to improve ML-based fault classification and localization, enabling more reliable and timely fault management and thereby improving grid dependability and safety.
Literature Review
The paper reviews conventional and modern techniques for transmission line fault diagnosis, including wavelet analysis, genetic algorithms, PMU-based methods, and multi-information approaches. It notes a shift toward AI/ML methods due to their potential to handle complex patterns with less manual intervention. Data scarcity is identified as a key bottleneck, prompting exploration of generative models such as GANs and VAEs to create synthetic data aligned with real distributions. Prior works include spectrum-based ML for fault detection, acoustic emission-based transformer diagnostics, VAE-based augmentation for transmission lines and wind turbines, and KNN-based protection schemes for double-circuit lines. The authors also discuss foundational concepts: VAEs (reconstruction and KL divergence loss), data synthesis/augmentation to improve generalization under limited variability, and forward feature selection to reduce redundancy and enhance model efficiency, with stratified cross-validation to manage imbalance.
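The VAE objective mentioned above (reconstruction loss plus KL divergence) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a mean-squared-error reconstruction term, a diagonal Gaussian posterior with a standard normal prior, and a weighting factor `beta`, none of which are specified in the summary. The sample values are hypothetical.

```python
import math

def reconstruction_loss(x, x_hat):
    """Mean squared error between an input vector and its reconstruction (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def kl_divergence(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Total loss: reconstruction term plus beta-weighted KL term (beta is an assumption)."""
    return reconstruction_loss(x, x_hat) + beta * kl_divergence(mu, log_var)

# Hypothetical values for one sample: a 3-point waveform window and a 2-D latent code.
x = [1.0, 0.5, -0.5]
x_hat = [0.9, 0.6, -0.4]
mu = [0.3, -0.2]
log_var = [0.0, 0.0]
loss = vae_loss(x, x_hat, mu, log_var)
```

The KL term is zero when the encoder output equals the prior (`mu = 0`, `log_var = 0`) and grows as the latent code drifts away from it, which is what keeps generated samples close to the learned distribution.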
Methodology
System modeling and data generation: A 220 kV, three-phase, 150 km transmission line model is built and simulated in Aspen One-Liner. Fault scenarios include line-to-ground (AG, BG, CG), line-to-line (AB, BC, AC), double line-to-ground (ABG, BCG, ACG), and three-phase-to-ground (ABC-G). Key parameters: phase-to-phase voltage 220 kV; source resistance 0.7896 Ω; source inductance 13.43×10^-2 H; fault inception angles 0° and −30°; fault resistance 0.001 Ω; ground resistance 0.01 Ω; switching time 0.1–0.2 s; sequence parameters (e.g., R1=R2=0.01154 Ω/km, Ro=0.3165 Ω/km; C1=C2=C3=10.14 nF/km; Co=5.7853 nF/km; L1=L2=L3=0.7945 mH/km; Lo=2.9981 mH/km).
Synthetic dataset creation with VAE: Voltage and current waveforms are recorded for healthy and faulty states. The initial real/simulated fault records are expanded using a VAE trained to learn latent representations and generate synthetic samples that match the original distribution. The VAE loss combines reconstruction loss and KL divergence. The approach addresses class imbalance and scarcity, producing 2,183 synthetic samples for the listed shunt faults and, overall, about 18,898 data points covering healthy and faulted states.
Data attributes and splitting: Fault resistances used are 0, 25, 50, 75, 100, and 150 Ω; fault distances are taken in increments of 4.4 km up to 150 km. The dataset is described as split 70%/30% for training/testing; the reported example sizes are 14,400 (training) and 4,498 (testing), which is closer to 76%/24% of the 18,898 points. Stratified cross-validation is applied.
Feature engineering and selection: Three-phase current and voltage features are extracted from the waveforms. A customized forward feature selection (FFS) procedure selects informative features and removes redundant ones.
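The summary does not describe how the customized forward feature selection works, so the following is only a generic greedy sketch of the idea: start with no features, repeatedly add the one that most improves a score, and stop when nothing helps. The scoring function (leave-one-out 1-nearest-neighbour accuracy), the stopping threshold `min_gain`, and the toy data are all assumptions made for illustration.

```python
def forward_feature_selection(n_features, score_fn, min_gain=1e-9):
    """Greedy forward selection: add the best-scoring feature until the gain vanishes."""
    selected, remaining = [], list(range(n_features))
    best_score = float("-inf")
    while remaining:
        # Score every candidate feature when added to the current subset.
        trials = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score - best_score <= min_gain:
            break  # no candidate improves the score, so stop
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Toy data (hypothetical): column 0 separates the classes, column 1 duplicates it,
# column 2 is uninformative.
X = [[0.0, 0.0, 5.0], [0.1, 0.1, 1.0], [0.2, 0.2, 9.0],
     [1.0, 1.0, 2.0], [1.1, 1.1, 8.0], [1.2, 1.2, 4.0]]
y = [0, 0, 0, 1, 1, 1]

def loo_1nn_accuracy(cols):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the given columns."""
    correct = 0
    for i, xi in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        nearest = min(others, key=lambda j: sum((xi[c] - X[j][c]) ** 2 for c in cols))
        correct += int(y[nearest] == y[i])
    return correct / len(X)

selected, score = forward_feature_selection(3, loo_1nn_accuracy)
```

On this toy data the procedure keeps column 0 and then stops, because the duplicate column adds no gain and the uninformative column scores worse. That is the redundancy-removal behaviour the text attributes to FFS.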
Machine learning architectures:
- CatBoost (primary): gradient-boosted decision trees handling categorical features; tuned hyperparameters include iterations=1000, depth=6, learning rate=0.1, loss functions: log loss (classification) and RMSE (regression), class weights example: 0.01, 0.001, 0.9, 0.0001, random strength=0.1.
- Baselines: SVM (linear kernel; C=0.1; gamma=0.1), Decision Tree (criterion=entropy; splitter=best; max_depth=90; min_samples_split=3; min_samples_leaf=2; max_features=5; ccp_alpha=0.01), Random Forest (criterion=entropy; similar splits and feature limits), KNN (n_neighbors=3; weights=distance; metric=Euclidean).
Tasks and evaluation:
- Fault classification: multi-class classification across the 10 shunt-fault categories using CatBoost, SVM, DT, RF, and KNN. Performance is assessed via accuracy, precision, recall, F1-score, confusion matrices, and ROC/AUC.
- Fault localization: regression to estimate fault distance (km) using the same models configured as regressors; evaluated using absolute error and mean absolute error (MAE).
Visualization includes scatter plots of synthetic data distributions, confusion matrices, ROC curves, and actual vs. predicted fault locations.
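As a rough illustration, the CatBoost settings listed above might be written as keyword arguments for CatBoost's Python API. The values come from the summary; the choice of `"MultiClass"` as the multi-class log-loss objective is an assumption. The class weights are omitted because the summary gives only a four-value example for a ten-class problem. The helper `build_models` is a hypothetical name and needs the `catboost` package installed.

```python
# Hyperparameters as listed in the summary; key names follow CatBoost's Python API.
COMMON_PARAMS = {
    "iterations": 1000,
    "depth": 6,
    "learning_rate": 0.1,
    "random_strength": 0.1,
}

# Assumption: "log loss" for the 10-class task maps to CatBoost's "MultiClass" objective.
CLASSIFIER_PARAMS = {**COMMON_PARAMS, "loss_function": "MultiClass"}
REGRESSOR_PARAMS = {**COMMON_PARAMS, "loss_function": "RMSE"}

def build_models():
    """Hypothetical helper: create the classifier and regressor (requires catboost)."""
    from catboost import CatBoostClassifier, CatBoostRegressor
    return CatBoostClassifier(**CLASSIFIER_PARAMS), CatBoostRegressor(**REGRESSOR_PARAMS)
```

Keeping the shared settings in one dictionary makes it clear that the classifier and the regressor differ only in their loss function.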
Key Findings
- Using VAE-generated synthetic data with ML classifiers achieves high fault classification performance. Reported overall classification accuracy reaches approximately 99–99.5% across models; CatBoost confusion matrices show near-perfect per-class accuracy (mostly 90/90 per class, with very few misclassifications).
- Fault localization achieves very low errors. The proposed approach reports a mean absolute error (MAE) of about 0.2 for localization and overall localization errors below 2% in additional evaluations.
- Baseline models also perform strongly on the synthetic-augmented dataset: example metrics include Accuracy/Precision/Recall/F1 around 0.97–0.99 for SVM, DT, RF, and KNN; RF and KNN achieve up to 99.62% accuracy in some fault categories.
- The combination of VAE-based data augmentation and forward feature selection contributes to robustness and improved generalization despite limited real labeled data.
- The approach surpasses prior state-of-the-art baselines for both classification and localization on the studied scenarios.
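The two localization metrics quoted above can be computed as follows. The summary does not state the unit of the MAE or how the percentage error is normalized, so this sketch assumes distances in km and error expressed as a percentage of the 150 km line length. The actual and predicted distances are made-up values for illustration, not results from the paper.

```python
LINE_LENGTH_KM = 150.0  # line length from the study's model

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted fault distances."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def percent_error_of_line(actual, predicted, line_length_km=LINE_LENGTH_KM):
    """Per-sample absolute error as a percentage of total line length (assumed definition)."""
    return [100.0 * abs(a - p) / line_length_km for a, p in zip(actual, predicted)]

# Hypothetical fault distances in km (multiples of the 4.4 km step) and predictions.
actual_km = [4.4, 44.0, 88.0, 132.0]
predicted_km = [5.4, 43.0, 89.5, 131.5]

mae_km = mean_absolute_error(actual_km, predicted_km)
pct_errors = percent_error_of_line(actual_km, predicted_km)
```

Under these assumptions, an absolute error of 1.5 km on a 150 km line corresponds to a 1% error, so a "below 2%" bound would mean every prediction falls within 3 km of the true fault location.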
Discussion
The study demonstrates that augmenting scarce transmission-line fault data with VAE-generated synthetic samples substantially boosts ML model performance for multi-class fault classification and continuous localization. By learning the underlying data distribution, the VAE provides diverse, statistically consistent samples, improving the coverage of fault conditions (types, resistances, distances) and enabling models like CatBoost and RF to learn more discriminative boundaries and accurate regressions. The near-perfect confusion matrices and high ROC/AUC across all fault types indicate strong separability achieved through the augmented feature space and FFS. Localization errors around MAE ≈ 0.2 and generally below 2% show that the regression models capture the mapping from voltage/current features to fault distance effectively. Practically, these results imply faster, more reliable protection and maintenance decisions, reduced dependence on extensive real-world labeled datasets, and improved resilience to variability in operating conditions. The findings support the hypothesis that synthetic data augmentation via VAEs can overcome data scarcity and class imbalance, leading to significant gains over traditional methods.
Conclusion
The paper introduces a VAE-driven synthetic data augmentation framework combined with ML classifiers/regressors (notably CatBoost) to improve fault classification and localization in 220 kV transmission networks. Trained on Aspen One-Liner simulated/recorded data and augmented synthetically, the models achieve around 99–99.5% classification accuracy and very low localization error (MAE ≈ 0.2), outperforming conventional baselines. The method is practical and cost-effective, reducing reliance on extensive labeled field data and enabling rapid, accurate detection and response that can enhance grid reliability. Future research directions include: expanding validation with larger-scale real-world fault recorder datasets; testing on diverse grid topologies, noise conditions, and parameter drifts; integrating adaptive online learning; exploring advanced generative models and domain adaptation to close any sim-to-real gaps; and automating feature engineering to further streamline deployment.
Limitations
- Dependence on simulated and synthetic data: results may be affected by discrepancies between simulated/synthetic and real-world fault characteristics (sim-to-real gap).
- Potential selection bias: performance can be distorted by selected data points; the study notes a dataset size threshold (e.g., ≥5000 points) for acceptable results.
- Requirement for careful feature selection and hyperparameter tuning; model performance is sensitive to these choices.
- Measurement and labeling errors can propagate through training, impacting classification and localization accuracy.
- Class imbalance and limited real labeled events necessitate stratification and augmentation; residual imbalance could still affect generalization.
- Context-specificity: parameters and configurations tuned for a 220 kV, 150 km line may limit direct generalizability to other systems without adaptation.