Quantum compiling by deep reinforcement learning

Computer Science


L. Moro, M. G. A. Paris, et al.

Lorenzo Moro, Matteo G. A. Paris, Marcello Restelli, and Enrico Prati present an approach to quantum compiling based on deep reinforcement learning (DRL). The method learns to approximate single-qubit unitaries from a chosen gate base and drastically reduces execution time, potentially enabling real-time compilation.

Introduction
Quantum computations are realized as circuits of gates drawn from limited, hardware-constrained bases. While the Solovay–Kitaev theorem ensures that any unitary can be approximated to arbitrary tolerance using a finite universal set, practical compiling faces a tradeoff among sequence length, precompilation time, and execution time, and existing compilers based on Solovay–Kitaev or planning approaches can be too time-consuming for online use. This work asks whether a deep reinforcement learning (DRL) agent, after a one-time training phase (precompilation), can learn a general policy that approximates single-qubit unitaries from a chosen base, efficiently generating short gate sequences with high average gate fidelity across diverse bases and target distributions while substantially improving execution time.
Literature Review
Classical approaches rooted in Solovay–Kitaev yield polylogarithmic sequence lengths and execution times; for example, Dawson–Nielsen produces sequences of length O(log^{3.97}(1/ε)) in time O(log^{2.71}(1/ε)). Alternative bases can shorten sequences further at the expense of larger precompilation costs. Hybrid planners, sometimes assisted by deep networks, can improve sequence lengths but may incur suboptimal execution time at high accuracy. A lower bound of Ω(log(1/ε)) gates is unavoidable. DRL has advanced complex control tasks in physics and quantum control, including quantum feedback, gate control, and pulse-sequence design (e.g., CTAP, STIRAP). These successes motivate applying DRL to quantum compiling without prior structural knowledge of the unitary space.
Methodology
Problem setup: Compile single-qubit unitary targets U_t within tolerance ε by constructing an ordered sequence of gates from a chosen base B. The environment resets to the identity at the start of each episode; at step n the agent appends a gate a_n ∈ B to form U_n and observes O_n, which encodes the current residual relative to the target (the vector of real and imaginary parts of the elements of O_n, where U_t = U_n O_n). Actions correspond to selecting gates from B. The agent aims to maximize cumulative reward, which favors shorter successful sequences.

Bases considered: (1) small rotations B = {R_x(±π/128), R_y(±π/128), R_z(±π/128)}; (2) the discrete, efficiently universal Harrow–Recht–Chuang (HRC) gates V1, V2, V3.

Rewards and RL algorithms: For the quasi-continuous small rotations, a dense reward is used: r(S_n, a_n) = (L − n) + 1 if d(U_n, U_t) < ε, and −d(U_n, U_t)/L otherwise, where L is the maximum episode length and d measures distance via average gate fidelity (AGF). This reward is trained with Proximal Policy Optimization (PPO) for the general Haar-target task and with Deep Q-Learning (DQL) for a fixed-target demonstration. For the discrete HRC base, a sparse (binary) reward is used, r = 0 if d(U_n, U_t) < ε and −1/L otherwise, trained with DQL augmented by Hindsight Experience Replay (HER), which addresses sparsity by relabeling goals with states actually visited.

Neural architectures and hyperparameters: Compact feedforward DNNs with two hidden layers of 128 neurons, SELU activations, and a linear output are used, optimized with Adam. Learning rates: 5×10^−4 (fixed-target DQL), 1×10^−4 (PPO rotations and DQL+HER HRC). Batch sizes: 1000 (fixed-target DQL), 128 (PPO rotations), 200 (DQL+HER HRC). Replay buffers hold up to 5×10^5 experiences (HRC). DQL uses epsilon-greedy exploration with a decaying exploration rate. Maximum episode lengths: 130 (fixed target and HRC), 300 (PPO rotations). Forty PPO agents are trained for the rotation task, and networks are updated every episode.

Data: Targets for training and validation are sampled as Haar-random single-qubit unitaries, providing an unbiased distribution. For the fixed-target demonstration, a nontrivial U_t was constructed as a product of 87 small-rotation gates. Validation sets comprised up to 10^6 Haar unitaries; up to 10^7 targets were used for the scaling analyses. Average gate fidelity serves as the distance/accuracy metric.

Execution: After training, inference uses only forward passes through the DNN to select gates sequentially until success or the step limit is reached. Per-step wall time on a single CPU core is about 5.4×10^−4 s.
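To make the setup concrete, the following is a minimal sketch of such a compiling environment, not the authors' implementation: the class and function names (SingleQubitCompileEnv, average_gate_fidelity) are illustrative, the distance is taken here as d = 1 − AGF using the standard average-gate-fidelity formula, and gates are assumed to be appended by right multiplication.

```python
import numpy as np

def average_gate_fidelity(u: np.ndarray, v: np.ndarray) -> float:
    """Standard average gate fidelity between two unitaries of dimension d:
    F = (|Tr(U^dag V)|^2 + d) / (d^2 + d); for a single qubit, d = 2."""
    d = u.shape[0]
    return (abs(np.trace(u.conj().T @ v)) ** 2 + d) / (d ** 2 + d)

class SingleQubitCompileEnv:
    """Illustrative compiling environment (names are not from the paper).

    Each episode starts from the identity; the agent appends gates from the
    base B to build U_n and observes the residual O_n, where U_t = U_n O_n.
    """

    def __init__(self, base_gates, target, agf_tolerance=0.99, max_steps=130):
        self.base = base_gates            # list of 2x2 unitaries (the base B)
        self.target = target              # target unitary U_t
        self.agf_tolerance = agf_tolerance
        self.L = max_steps
        self.reset()

    def reset(self):
        self.u_n = np.eye(2, dtype=complex)   # episode starts from the identity
        self.n = 0
        return self._observation()

    def _observation(self):
        # O_n = U_n^dag U_t, flattened into 8 real numbers (Re and Im parts).
        o_n = self.u_n.conj().T @ self.target
        return np.concatenate([o_n.real.ravel(), o_n.imag.ravel()])

    def step(self, action):
        # Assumed ordering: append the chosen gate on the right of U_n.
        self.u_n = self.u_n @ self.base[action]
        self.n += 1
        fid = average_gate_fidelity(self.u_n, self.target)
        solved = fid >= self.agf_tolerance
        # Dense reward from the text: (L - n) + 1 on success, else -d(U_n, U_t)/L,
        # with d = 1 - AGF as the distance (an assumption of this sketch).
        reward = (self.L - self.n) + 1 if solved else -(1.0 - fid) / self.L
        done = solved or self.n >= self.L
        return self._observation(), reward, done, {"fidelity": fid, "steps": self.n}
```

Haar-random targets for training and validation can be drawn, for instance, with scipy.stats.unitary_group.rvs(2).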
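The compact network and the inference loop are simple enough to sketch as well. This is an illustrative PyTorch version, assuming the 8-dimensional observation and the environment interface from the sketch above, with one output per gate in the base (six for the small-rotation base, three for HRC); training with PPO or DQL is omitted, and only the architecture and the forward-pass rollout used at execution time are shown.

```python
import torch
import torch.nn as nn

class CompilerNet(nn.Module):
    """Two hidden layers of 128 SELU units with a linear output head,
    matching the compact architecture described in the text."""

    def __init__(self, obs_dim: int = 8, n_actions: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.SELU(),
            nn.Linear(128, 128), nn.SELU(),
            nn.Linear(128, n_actions),   # linear output (logits or Q-values)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def compile_target(env, net, max_steps=300):
    """Inference-time rollout: forward passes only, greedily appending the
    highest-scoring gate until the tolerance or the step limit is reached."""
    obs = env.reset()
    gate_sequence, info = [], {}
    for _ in range(max_steps):
        with torch.no_grad():
            scores = net(torch.as_tensor(obs, dtype=torch.float32))
        action = int(torch.argmax(scores).item())
        gate_sequence.append(action)
        obs, _, done, info = env.step(action)
        if done:
            break
    return gate_sequence, info.get("fidelity")
```

At roughly 5.4×10^−4 s per step, even the 300-step limit corresponds to well under a second of wall time on a single CPU core.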
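For the sparse-reward HRC task, the key idea of Hindsight Experience Replay is that failed episodes still yield useful transitions once their goals are relabeled with unitaries the agent actually reached. The sketch below uses the common "future" relabeling strategy with k additional goals per transition; the function names, the value of k, and the use of an AGF test on the relabeled goals are assumptions of this sketch, not details taken from the paper.

```python
import random
import numpy as np

def agf(u, v):
    d = u.shape[0]
    return (abs(np.trace(u.conj().T @ v)) ** 2 + d) / (d ** 2 + d)

def her_relabel(episode, target, max_steps=130, agf_tolerance=0.99, k=4):
    """Hindsight relabeling for a sparse-reward episode.

    `episode` is a list of (u_before, action, u_after) tuples, where u_* are the
    accumulated gate products before and after each chosen gate. For every
    transition, k extra copies are stored whose goal is a unitary reached later
    in the same episode, so some failures become reward-0 successes.
    """
    def sparse_reward(u, goal):
        # 0 on success, -1/L otherwise (the sparse reward from the text).
        return 0.0 if agf(u, goal) >= agf_tolerance else -1.0 / max_steps

    transitions = []
    for i, (u_before, action, u_after) in enumerate(episode):
        # Transition with the original goal.
        transitions.append((u_before, action, target, sparse_reward(u_after, target)))
        # k transitions with goals achieved later in this episode ("future" strategy).
        for _ in range(k):
            _, _, achieved = random.choice(episode[i:])
            transitions.append((u_before, action, achieved, sparse_reward(u_after, achieved)))
    return transitions
```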
Key Findings
- Fixed target with small rotations and DQL (dense reward): the agent solved a target constructed from 87 small-rotation gates with ε = 0.99 AGF and L = 130. A random-action baseline found no solution in 10^4 episodes, whereas the DQL agent learned after about 10^4 episodes and subsequently improved to a 76-gate solution.
- General Haar targets with small rotations and PPO (dense reward): with ε = 0.99 AGF and L = 300, after extended training the agent solved 96.4% of 10^6 validation targets. Sequence lengths: mean 124; 95th percentile 204; 99th percentile 245.
- HRC base with DQL+HER (sparse reward): with ε = 0.99 AGF and L = 130, the agent solved 95.0% of 10^6 validation targets. Sequence lengths: mean 35; 95th percentile 94; 99th percentile 120.
- Sequence-length scaling: for small-rotation tasks trained to 0.9999 AGF, average sequence length versus tolerance exhibits polylogarithmic behavior, with an empirical fit achieving R^2 = 0.986 and RMSE = 0.26 (Fig. 4). The authors describe the length scaling as O(log^{1.25}(1/ε)) for their DRL compiler on this task and note that execution time scales proportionally to sequence length (an illustrative fitting sketch follows this list).
- Speed: inference is fast; a complete sequence is generated in fractions of a second, at about 5.4×10^−4 s per time step on a single CPU core; further acceleration is possible via smaller networks and GPU/TPU hardware.
- Best-case fidelity: for small rotations, average gate fidelity reaches approximately 0.9999 in the best cases.
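The polylogarithmic behavior quoted above can be checked with a standard nonlinear least-squares fit of mean sequence length against log(1/ε). The sketch below fits the model L(ε) ≈ a·log^b(1/ε) to placeholder arrays that are purely illustrative, not the paper's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def polylog_length(log_inv_eps, a, b):
    """Model L(eps) = a * log(1/eps)^b, evaluated on precomputed log(1/eps)."""
    return a * log_inv_eps ** b

# Placeholder data: tolerances and mean sequence lengths (illustrative only).
tolerances = np.array([1e-1, 1e-2, 1e-3, 1e-4])
mean_lengths = np.array([20.0, 60.0, 110.0, 170.0])

x = np.log(1.0 / tolerances)
(a, b), _ = curve_fit(polylog_length, x, mean_lengths, p0=(1.0, 1.0))

residuals = mean_lengths - polylog_length(x, a, b)
r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((mean_lengths - mean_lengths.mean()) ** 2)
rmse = np.sqrt(np.mean(residuals ** 2))
print(f"fit: a={a:.2f}, b={b:.2f}, R^2={r_squared:.3f}, RMSE={rmse:.2f}")
```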
Discussion
The study demonstrates that a DRL agent can learn a general compilation policy for single-qubit unitaries across very different gate bases without handcrafted decompositions. After a one-time training, the learned policy produces high-fidelity approximations quickly, improving the tradeoff between execution time and sequence length relative to traditional compilers and enabling practical online (real-time) compilation. The method flexibly adapts to different bases—quasi-continuous small rotations and sparse, discrete HRC gates—by adjusting reward shaping and RL algorithms (dense reward with PPO vs sparse reward with DQL+HER). Empirically, the HRC base yields substantially shorter circuits, consistent with its efficient coverage of SU(2). Compared with Y–Z–Y or KAK decompositions and A*-based planners, the DRL approach is basis-agnostic, requires no problem-specific heuristics once trained, and achieves polylogarithmic length scaling with rapid inference. These findings support DRL as a viable route to hardware-agnostic, low-latency quantum compiling.
Conclusion
This work introduces DRL-based quantum compilers that, after a single precompilation phase, generate high-fidelity single-qubit gate sequences from different bases. Using dense rewards with PPO for small rotations and sparse rewards with DQL+HER for the HRC base, the agents solve about 95–96% of Haar-random targets at 0.99 AGF, with mean sequence lengths ~124 (rotations) and ~35 (HRC). The approach achieves fast inference and empirical polylogarithmic scaling of sequence length with accuracy, suggesting feasibility for real-time compilation. The framework is flexible and can incorporate hardware constraints directly in the environment. Preliminary results indicate extension to two-qubit gates at similar fidelity. Future work includes scaling to multi-qubit compilers, integrating hardware-specific limitations, pushing to higher target fidelities, optimizing network architectures for even faster inference, and exploring alternative accuracy metrics (e.g., diamond norm).
Limitations
- No theoretical guarantee of finding a solution within the step limit; success rates are high but not 100%.
- Training (precompilation) can be time-consuming and sensitive to reward design and hyperparameters, especially under sparse rewards.
- Results are shown primarily for single-qubit gates, with only preliminary two-qubit evidence; scalability to larger systems needs thorough validation.
- Performance depends on the chosen target distribution (Haar) and the fixed tolerance; generalization to domain-specific target sets may require retraining or fine-tuning.
- The method optimizes average gate fidelity; other metrics such as the diamond norm are not directly addressed and may change outcomes.
- Empirical scaling laws are task-specific and not proven optimal; the reported polylogarithmic fits may not hold universally across bases or at higher fidelities.