Engineering and Technology
Real-time outage management in active distribution networks using reinforcement learning over graphs
R. A. Jacob, S. Paul, et al.
This article summarizes a graph reinforcement learning model for outage management in distribution networks, developed by Roshni Anna Jacob, Steve Paul, Souma Chowdhury, Yulia R. Gel, and Jie Zhang. The model combines topology-aware policy learning with switching and load-shedding control, and achieves near-optimal, real-time power restoration performance across a range of outage scenarios.
Introduction
The study addresses the need for fast, autonomous, and adaptive outage management in distribution networks within modern smart grids. With up to 90% of customer disruptions during extreme events attributed to distribution-level failures and increasing penetration of distributed energy resources (DERs), distribution networks are increasingly expected to self-heal and operate autonomously, including in islanded modes. Traditional optimization-based restoration approaches (e.g., MINLP, MISOCP) can be computationally expensive, slow, and require detailed model knowledge, limiting their applicability for real-time response under uncertain outage conditions. The research question is how to design a learning-based control that (1) explicitly leverages network topology, (2) jointly controls switching (reconfiguration, including intentional islanding) and load shedding, and (3) achieves near-optimal performance with real-time responsiveness and generalizability across outage scenarios. The proposed solution is a graph reinforcement learning framework employing a capsule-based graph neural network to encode both nodal/edge states and global topology for policy learning.
Literature Review
Prior work on distribution network reconfiguration and service restoration spans heuristic, meta-heuristic, and mathematical programming approaches, including mixed-integer programming and conic relaxations. These methods often face scalability and real-time limitations, particularly for three-phase unbalanced networks with sectionalizing and tie switches and heterogeneous DER types. Islanding strategies have been studied with mixed-integer programming to expand DER operation, and load management (shedding) has been explored as emergency control. Reinforcement learning has gained traction for autonomous control in power systems, notably for voltage regulation using multi-agent DRL and graph-based representation learning. For distribution network reconfiguration under normal operations, deep Q-learning and variants (e.g., Noisy-Net, batch-constrained SAC) have been used for loss minimization and voltage improvement; however, such methods often require enumerating feasible configurations, rely on historical operation datasets, or struggle with scalability due to expanding state-action spaces. Works on RL for resilience-focused reconfiguration typically do not enforce feasibility based on voltage and DER operating modes, and Q-learning approaches decide entire configurations rather than individual switch controls, complicating scalability. The literature indicates a gap for a scalable, topology-aware, policy-gradient RL approach that can handle both grid-connected and intentional islanding reconfiguration with load shedding while ensuring operational feasibility.
Methodology
Problem framing: Outage management is formulated as a Markov Decision Process over the graph of the distribution network. The goal is to maximize energy supplied while maintaining operational feasibility (voltage within limits) via switching control and emergency load shedding. The approach supports both grid-connected reconfiguration and intentional islanding around grid-forming DERs.
MDP definition:
- State S includes node and edge variables and context: estimated/forecast active/reactive demand and generation at nodes, measured three-phase voltages at buses, voltage violation metric V_viol, measured branch power flows, current operating topology, total energy supplied E_supp, and an outage mask µ indicating failed/out-of-service switches to suppress invalid actions.
- Action A comprises binary decisions: δ for switchable lines (sectionalizing and tie) to open/close, and θ for switchable loads to serve/shed.
- Transition P(s_{t+1}|s_t, a_t) is induced by the OpenDSS environment after applying actions.
- Reward r(s,a) = E_supp − V_viol when the power flow is numerically feasible (no ill-conditioning or non-convergence); infeasible or ill-conditioned states receive zero reward. Voltage violations are computed per bus beyond the upper/lower limits and aggregated in per-unit, penalizing unsafe operation (a minimal sketch of this reward follows this list).
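A minimal sketch of this reward, assuming per-unit voltage limits of 0.95/1.05; the names (v_pu, e_supp_kwh) are illustrative and not from the paper:

```python
import numpy as np

def compute_reward(v_pu: np.ndarray, e_supp_kwh: float, converged: bool,
                   v_min: float = 0.95, v_max: float = 1.05) -> float:
    """Return E_supp - V_viol for feasible states, 0 otherwise (as described above)."""
    if not converged:                      # ill-conditioned / non-converged power flow
        return 0.0
    # Per-bus violation beyond the limits (energized buses only), aggregated in per-unit
    over = np.clip(v_pu - v_max, 0.0, None)
    under = np.clip(v_min - v_pu, 0.0, None)
    v_viol = float(np.sum(over + under))
    return e_supp_kwh - v_viol
```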
Distribution network as graph: The DN is represented as G=(N,E), with nodes as buses (substation, loads, DERs, zero-injection) and edges as lines/transformers. Switchable and non-switchable lines are distinguished. Intentional islanding is enabled by identifying grid-forming DER nodes and introducing virtual slack buses at those locations in OpenDSS when required.
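For illustration, a toy construction of this graph with NetworkX; the bus records, attribute names, and the small node set are hypothetical placeholders rather than the paper's data model:

```python
import networkx as nx

# Hypothetical bus and line records (node kind and switch attributes are placeholders)
buses = {"650": "substation", "671": "load", "675": "der_grid_forming", "684": "zero_injection"}
lines = [("650", "671", {"switchable": True,  "type": "sectionalizing"}),
         ("671", "675", {"switchable": False, "type": "line"}),
         ("671", "684", {"switchable": True,  "type": "tie"})]

G = nx.Graph()
for bus, kind in buses.items():
    G.add_node(bus, kind=kind)
G.add_edges_from(lines)

# Grid-forming DER nodes become virtual slack buses when their section is islanded
grid_forming = [n for n, d in G.nodes(data=True) if d["kind"] == "der_grid_forming"]
```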
Environment and simulation: The DN is modeled in OpenDSS with DERs (grid-forming modeled as voltage sources in islanded sections; grid-feeding follow references), switches on lines, and load enable/disable for shedding. OpenDSSDirect (Python) is used for circuit manipulation and I/O, and NetworkX constructs the graph. After actions, the environment updates circuit topology, sets virtual slacks for islands, runs power flow, and returns observations including voltages, flows, and E_supp.
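A rough sketch of one environment interaction via OpenDSSDirect, assuming a circuit has already been compiled; the switch/load names and the observation layout are illustrative, not the authors' implementation:

```python
import numpy as np
import opendssdirect as dss

# Assumes a circuit was already compiled, e.g.:
# dss.Text.Command("Redirect IEEE13Nodeckt.dss")

def apply_and_solve(delta: dict, theta: dict):
    """delta: {line name: 1=closed/0=open}; theta: {load name: 1=served/0=shed}."""
    for line, closed in delta.items():
        cmd = "close" if closed else "open"
        dss.Text.Command(f"{cmd} Line.{line} term=1")       # switching control
    for load, served in theta.items():
        dss.Text.Command(f"Load.{load}.enabled={'yes' if served else 'no'}")  # shedding
    dss.Solution.Solve()
    converged = bool(dss.Solution.Converged())
    v_pu = np.array(dss.Circuit.AllBusMagPu()) if converged else np.array([])
    return v_pu, converged
```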
Policy architecture (Graph RL with GCAPS):
- Node features y_i comprise active/reactive demand, active/reactive generation, and the measured three-phase voltage V at each bus. Initial node features are projected to a higher dimension via a linear mapping.
- Graph Capsule (GCAPS) encoder: Multiple layers of polynomial graph convolutional filters aggregate neighborhood information up to K hops and capture higher-order statistical moments (up to order P). Intermediate features are vector-valued (capsules), yielding richer structural embeddings versus scalar GCN/GAT. Final node embeddings are obtained via linear transforms of concatenated capsule outputs.
- Graph embedding: Node embeddings are processed by MLP layers and mean-pooled to a global graph embedding F_graph.
- Context embedding: Non-graph state variables (E_supp, V_viol, and edge flows) are embedded via a feedforward network to F_context.
- Final state embedding and action head: F_final = MLP(F_graph + F_context). A feedforward layer outputs logits over all binary actions. Masking sets the logits of outaged/inoperable switches to −∞. Bernoulli probabilities (sigmoid) define per-action selection; a greedy policy (threshold 0.5) yields the final binary action vector. A parallel value head predicts the state value for PPO (a combined sketch of the encoder and masked action head follows this list).
- Size transferability: Only the context feedforward layer input size depends on network size; the GCAPS encoder and final MLP can transfer across networks as a warm start, aiding scalability.
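As referenced in the list above, a loose PyTorch sketch of a GCAPS-style layer and a masked Bernoulli action head; the dimensions, capsule statistics, and masking convention are assumptions rather than the published architecture:

```python
import torch
import torch.nn as nn

class GraphCapsLayer(nn.Module):
    """Polynomial graph filters over K hops with statistical moments up to order P."""
    def __init__(self, d_in: int, d_out: int, K: int = 2, P: int = 3):
        super().__init__()
        self.K, self.P = K, P
        self.lin = nn.Linear(d_in * (K + 1) * P, d_out)

    def forward(self, X, A_norm):
        # X: (n, d_in) node features; A_norm: (n, n) normalized adjacency
        caps = []
        for p in range(1, self.P + 1):        # higher-order moments X**p
            Ak_X = X ** p
            caps.append(Ak_X)
            for _ in range(self.K):           # 1..K-hop polynomial filter terms
                Ak_X = A_norm @ Ak_X
                caps.append(Ak_X)
        return torch.tanh(self.lin(torch.cat(caps, dim=-1)))

class MaskedBernoulliHead(nn.Module):
    """Per-action logits; outaged/inoperable switches are masked to -inf."""
    def __init__(self, d_state: int, n_actions: int):
        super().__init__()
        self.logits = nn.Linear(d_state, n_actions)

    def forward(self, f_final, outage_mask):
        z = self.logits(f_final)
        z = z.masked_fill(outage_mask.bool(), float("-inf"))  # invalid actions get prob 0
        probs = torch.sigmoid(z)               # per-action Bernoulli probabilities
        return (probs > 0.5).float(), probs    # greedy binary action vector, probabilities
```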
Training: PPO (on-policy) from Stable-Baselines3 is used with a MultiBinary action space. Training data are generated from randomized outage scenarios and operating points. Rollouts collect (state, action, reward, next state) tuples, and batched backpropagation minimizes the policy and value losses. The learning rate follows a squared exponential decay. Training durations: up to 2.0e6 steps (13-bus), 1.5e6 steps (34-bus, within 36 h), and 5e5 steps (123-bus, within 36 h). Training hardware included an Intel Xeon Gold CPU and an NVIDIA A100 GPU.
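A hypothetical training setup with Stable-Baselines3 PPO and a MultiBinary action space; the stub environment, hyperparameters, and the reading of "squared exponential decay" as a Gaussian-shaped schedule are assumptions:

```python
import math
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO

class OutageEnvStub(gym.Env):
    """Placeholder for the OpenDSS-backed environment sketched earlier."""
    def __init__(self, n_obs: int = 64, n_actions: int = 12):
        super().__init__()
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(n_obs,), dtype=np.float32)
        self.action_space = spaces.MultiBinary(n_actions)
        self._n_obs = n_obs

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return np.zeros(self._n_obs, dtype=np.float32), {}

    def step(self, action):
        obs = np.zeros(self._n_obs, dtype=np.float32)
        return obs, 0.0, True, False, {}      # obs, reward, terminated, truncated, info

def sq_exp_decay(lr0: float = 3e-4, k: float = 4.0):
    """One reading of 'squared exponential decay': lr0 * exp(-k * progress**2)."""
    return lambda remaining: lr0 * math.exp(-k * (1.0 - remaining) ** 2)

model = PPO("MlpPolicy", OutageEnvStub(), learning_rate=sq_exp_decay(),
            n_steps=2048, batch_size=256, verbose=0)
# A custom features_extractor_class would host the GCAPS encoder in practice.
model.learn(total_timesteps=10_000)            # paper budgets: 5e5 to 2e6 steps
```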
Scenario generation (training): Random edges are removed within sampled subgraphs to emulate localized failures and cascading effects. A random node is selected, a subgraph is formed within radius R_s ≤ G_dia/α, and a fraction (0–50%) of its edges is removed. Load and generation multipliers are sampled from annual OpenDSS profiles.
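An illustrative scenario generator following this recipe; aside from the radius bound and removal fraction stated above, the sampling details are assumptions:

```python
import random
import networkx as nx

def sample_outage(G: nx.Graph, alpha: int = 3, max_frac: float = 0.5, rng=random):
    """Return a list of edges to take out of service for one training scenario."""
    center = rng.choice(list(G.nodes))
    radius = max(1, nx.diameter(G) // alpha)       # R_s <= G_dia / alpha
    sub = nx.ego_graph(G, center, radius=radius)   # localized failure region
    edges = list(sub.edges)
    n_remove = int(rng.uniform(0.0, max_frac) * len(edges))
    return rng.sample(edges, n_remove)
```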
Baselines: An MLP-based PPO policy (no graph abstraction), mixed-integer second-order cone programming (MISOCP) based restoration using three-phase branch flow relaxations with big-M for switches, and Binary Particle Swarm Optimization (BPSO) are implemented for comparison.
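For context, a compact sketch of Binary PSO over the binary switch/load vector; the fitness function is a placeholder (in this setting it would run the OpenDSS power flow and score E_supp − V_viol), and all hyperparameters are assumptions:

```python
import numpy as np

def bpso(fitness, n_bits, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize `fitness` over binary vectors using a sigmoid transfer function."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(n_particles, n_bits)).astype(float)
    v = rng.normal(0.0, 0.1, size=(n_particles, n_bits))
    pbest, pbest_f = x.copy(), np.array([fitness(p) for p in x])
    gbest = pbest[np.argmax(pbest_f)].copy()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = (rng.random(x.shape) < 1.0 / (1.0 + np.exp(-v))).astype(float)  # sigmoid transfer
        f = np.array([fitness(p) for p in x])
        improved = f > pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        gbest = pbest[np.argmax(pbest_f)].copy()
    return gbest, pbest_f.max()

if __name__ == "__main__":
    best, score = bpso(lambda bits: bits.sum(), n_bits=10)  # toy fitness for demonstration
```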
Evaluation: Modified IEEE 13-, 34-, and 123-bus feeders with sectionalizing/tie switches and both grid-forming and grid-feeding DERs. Two outage scenarios per network are designed using centrality metrics and switch states. Performance metrics include equivalent energy served (resilience proxy), voltage profiles, validity (respecting outage masks), and computation time.
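One plausible way to select high-impact outage locations from centrality metrics, as the evaluation description suggests; the specific top-k rule is an assumption:

```python
import networkx as nx

def critical_edges(G: nx.Graph, k: int = 3):
    """Rank edges by betweenness centrality and return the k most central ones to fail."""
    ebc = nx.edge_betweenness_centrality(G)
    return sorted(ebc, key=ebc.get, reverse=True)[:k]
```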
Key Findings
- Real-time performance: GCAPS-based GRL computes control actions in milliseconds across network sizes. Mean computation times (5 runs):
• 13-bus: GCAPS 0.0049 s (scen. 1), 0.0056 s (scen. 2); MISOCP 0.3040 s, 0.2318 s; BPSO ~500–540 s.
• 34-bus: GCAPS 0.0030 s, 0.0025 s; MISOCP 0.5668 s, 0.5676 s; BPSO ~2540–2580 s.
• 123-bus: GCAPS 0.0090 s, 0.0078 s; MISOCP 0.91 s, 0.918 s; BPSO >2.5e5 s.
- Near-optimal energy served: GCAPS solutions closely approach MISOCP/BPSO optimal or near-optimal energy served across 13-, 34-, and 123-bus networks in both scenarios.
- Constraint adherence and validity: GCAPS respected outage masks and voltage limits, while the MLP baseline produced invalid actions (closed outage switches) in multiple scenarios (e.g., 13-bus scenario 2; 34-bus scenario 1; 123-bus scenario 2).
- Voltage profiles: Implementing GCAPS actions maintained bus voltages within acceptable ranges (e.g., 13-bus within ~0.99–1.10 pu; 34-bus within ~0.90–1.10 pu for active phases). For islanded sections around non-grid-forming DERs, buses remained unsupplied as expected.
- Training behavior: GCAPS achieved higher average episodic rewards and faster convergence versus MLP on 34- and 123-bus cases; on 13-bus, GCAPS achieved higher rewards but did not fully converge within 2 million steps.
- Time-critical resilience benefit: Compared to MISOCP, the faster GCAPS response reduced loss of energy by 607.45 kWs (13-bus) and 596.52 kWs (34-bus) during the transient recovery window.
- Generalizability: GCAPS provided diverse, scenario-consistent solutions across a range of outages without relying on enumerated configurations or historical operation datasets.
Discussion
The proposed GRL approach directly encodes network topology and nodal/edge interactions via a GCAPS graph neural network, enabling robust policy learning for outage management that scales to varying feeder sizes and outage patterns. By combining switching reconfiguration (including intentional islanding around grid-forming DERs) with emergency load shedding, the method ensures both maximal energy served and operational feasibility. Empirical results across modified IEEE 13-, 34-, and 123-bus feeders show that GCAPS attains near-optimal energy served relative to exact or near-exact optimization baselines while operating several orders of magnitude faster than BPSO and roughly two orders of magnitude faster than MISOCP. This speed advantage translates into tangible resilience gains, reducing transient energy loss and shortening periods of voltage violations during restoration. Compared to a non-graph MLP baseline, GCAPS generalizes better across outage scenarios, adheres to operational constraints (e.g., masking inoperable switches), and maintains acceptable voltages, underscoring the importance of topology-aware representation learning in DN restoration. These findings support the feasibility of deploying the learned policy as a real-time, self-healing control tool for active distribution networks.
Conclusion
This work introduces a topology-aware graph reinforcement learning framework for real-time outage management in active distribution networks. Using a GCAPS GNN policy with PPO, the method jointly performs switching reconfiguration—including intentional islanding around grid-forming DERs—and emergency load shedding to maximize energy served while preserving voltage feasibility. Validated on modified IEEE 13-, 34-, and 123-bus systems, the approach achieves near-optimal performance and millisecond-level response times, significantly outperforming conventional optimization baselines in latency and offering practical resilience benefits. The study demonstrates that explicit graph-based state abstraction and context integration improve generalization and constraint adherence over non-graph RL baselines. Future work includes addressing training scalability for very large heterogeneous networks, exploring transfer learning across feeders, and incorporating coupled cyber-physical considerations to account for communication network failures during restoration.
Limitations
The principal limitation is training scalability and computational resource demand for larger networks; memory requirements increase substantially with graph size (e.g., beyond ~200 nodes), potentially hindering offline training. While online inference is fast, achieving robust policies may require substantial offline compute and time. The current framework does not explicitly model communication network failures or their impact on measurement availability and control actuation; realistic deployment may require coupled cyber-physical modeling and multi-layer graph formulations. Additionally, while intentional islanding is supported via grid-forming DERs, detailed DER dynamic and protection behavior is abstracted in the environment, which may affect fidelity under certain contingencies.