
Physics
Realizing a deep reinforcement learning agent for real-time quantum feedback
K. Reuer, J. Landgraf, et al.
Kevin Reuer, Jonas Landgraf, Thomas Fösel, and their colleagues demonstrate a deep reinforcement learning agent running on an FPGA that provides real-time quantum feedback with sub-microsecond latency. Trained directly on experimental data, the agent initializes a superconducting qubit with high fidelity, pointing toward more efficient, model-free quantum device control.
~3 min • Beginner • English
Introduction
Executing algorithms on future quantum information processing devices will rely on real-time feedback control, where the system is continuously monitored via quantum measurements and acted upon on timescales much shorter than the coherence time. Such control, essential for tasks like qubit initialization, gate teleportation, and quantum error correction, typically depends on accurate dynamical models, which become difficult to obtain as processors scale. Model-free reinforcement learning (RL) offers a way to learn feedback strategies without prior knowledge of the system. Despite theoretical promise and initial experimental demonstrations of RL in quantum settings, realizing a deep RL agent capable of operating with real-time quantum feedback has remained an open challenge. This work addresses that gap by implementing a sub-microsecond-latency neural-network agent on an FPGA that performs real-time feedback to initialize a superconducting qubit, trained directly on experimental data without relying on an explicit model.
Literature Review
RL has achieved success across domains such as board games and robotics and has recently been applied to complex physical systems, with training performed either in simulation or directly in experiments (e.g., lasers, particle accelerators, soft matter, and quantum systems). In quantum technologies, theoretical and experimental works have explored RL for state preparation, gate design, error correction, and circuit optimization and compilation, positioning it as a key tool in the machine-learning toolbox for the field. Prior experimental applications often optimized pulses in simulation for atoms and spins, while pioneering works trained directly on experiments for gate optimization and quantum-dot tuning. However, these did not feature real-time quantum feedback, a crucial requirement for fault tolerance. A recent step used RL for quantum error correction but only optimized the parameters of a fixed-structure controller. The present work differs by realizing a deep RL agent that performs real-time feedback with low-latency neural-network inference directly in an experiment.
Methodology
The environment is a superconducting transmon qubit (|g⟩, |e⟩, |f⟩) dispersively coupled to a readout resonator. The qubit is probed with a microwave tone; the scattered signal is amplified, digitized, and represented as a time trace s = (I, Q). The agent observes s and selects among discrete actions in real time: idle, apply a bit-flip (a π-pulse swapping |g⟩ and |e⟩), or terminate. For the qutrit experiments, a fourth action that swaps |f⟩ and |g⟩ is added. Episodes consist of repeated measurement-action cycles until termination, followed by a verification measurement s^ver.

The agent is implemented as a low-latency neural network on an FPGA. To minimize latency, the network processes incoming I, Q samples concurrently with its evaluation: elements of the measurement trace are fed sequentially into each layer while the computation is underway. To provide memory, downsampled observations (s^1, …, s^l) and actions (a^1, …, a^l) from up to l = 2 previous cycles are supplied as additional inputs.

Each cycle takes T_cycle ≈ 856 ns, comprising a readout of T_RO = 256 ns, a total electronics latency of T_EL,tot = 451 ns dominated by the ADC/DAC, a neural-network inference time of T_NN = 48 ns, and a pulse duration of T_A = 60 ns.

Training uses model-free policy-gradient RL. Batches of episodes are transferred to a PC-based trainer, which computes rewards, updates the network parameters θ, and returns the updated parameters to the FPGA at each training step. The cumulative reward is R = V_ver/ΔV − λn, where V_ver = ∑_i w_i s_i is the weighted integrated verification signal (with weights chosen to maximize the signal-to-noise ratio under Gaussian noise), ΔV normalizes the scale, n is the number of cycles, and λ penalizes longer episodes to trade off speed against fidelity.

Three regimes are studied. In the strong readout regime, the agent is trained and benchmarked against thresholding strategies based on integrated signals. In the weak measurement regime, reduced readout power yields overlapping bimodal Gaussian distributions (25% overlap), emphasizing the benefit of memory. In the qutrit regime, the agent is trained starting from a mixed state (one-third population in each of |g⟩, |e⟩, |f⟩), with results visualized using two orthonormal weighted projections V and W for state discrimination. In all cases the agent applies pre-calibrated pulses conditioned on the measured trajectories and the learned policy.
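To make the training loop concrete, the Python sketch below shows a minimal, PC-side REINFORCE-style update for a softmax policy over the discrete actions, using the reward R = V_ver/ΔV − λn defined above. It is a simplified stand-in rather than the authors' implementation: the single linear layer, the input layout (current downsampled (I, Q) samples plus the l = 2 previous observations and one-hot actions), and all names are illustrative assumptions; the real agent is a small multi-layer network evaluated on the FPGA.

```python
# Minimal sketch (not the authors' code) of the PC-side trainer logic:
# a softmax policy over the discrete actions {idle, flip, terminate},
# trained with a REINFORCE-style policy gradient on the episode return
# R = V_ver / dV - lam * n.  All dimensions and names are illustrative.

import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 3            # idle, pi-pulse (ge flip), terminate (qutrit adds a gf flip)
OBS_DIM = 2 * (1 + 2)    # downsampled (I, Q) for current + l = 2 previous cycles
ACT_DIM = N_ACTIONS * 2  # one-hot actions of the l = 2 previous cycles
IN_DIM = OBS_DIM + ACT_DIM

# A single linear layer keeps the gradient analytic and the sketch short.
W = 0.01 * rng.standard_normal((IN_DIM, N_ACTIONS))


def policy(x):
    """Action probabilities pi(a | x) for one input vector x."""
    logits = x @ W
    logits -= logits.max()                   # numerical stability
    p = np.exp(logits)
    return p / p.sum()


def episode_return(s_ver, w_opt, dV, n_cycles, lam):
    """R = V_ver / dV - lam * n, with V_ver the weighted integrated
    verification trace (weights w_opt chosen to maximize the SNR)."""
    V_ver = float(np.dot(w_opt, s_ver))
    return V_ver / dV - lam * n_cycles


def reinforce_update(batch, lr=1e-2):
    """One parameter update from a batch of episodes.

    Each episode is (inputs, actions, R): the per-cycle input vectors,
    the chosen action indices, and the scalar return of the episode."""
    global W
    grad = np.zeros_like(W)
    for inputs, actions, R in batch:
        for x, a in zip(inputs, actions):
            p = policy(x)
            # d log pi(a | x) / dW for a softmax-linear policy
            dlogp = -np.outer(x, p)
            dlogp[:, a] += x
            grad += R * dlogp
    W += lr * grad / len(batch)              # gradient ascent on expected return
```

In the experiment, inference and action selection run entirely on the FPGA; only the reward computation and a parameter update of this kind run on the PC-based trainer, which returns the updated parameters θ to the FPGA at each training step.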
Key Findings
- Real-time deep RL agent realized on an FPGA with sub-microsecond operation: neural inference latency of 48 ns, total electronics latency of 451 ns, and a cycle time of ≈ 856 ns.
- Training directly on experimental data converges in less than 3 minutes of wall-clock time, after fewer than 30,000 episodes and 100 parameter updates.
- Strong measurement regime: achieved initialization error 1 − P_g ≈ 0.2% (converging to ≈ 0.18% for ⟨n⟩ ≥ 1.1 cycles), about a tenfold reduction relative to the thermal equilibrium population; performance matches close-to-optimal thresholding strategies (a minimal baseline sketch follows this list).
- Residual infidelity is attributed mainly to rethermalization between termination and verification and to possible state mixing during verification; an estimated rethermalization rate N_eq/T_1 ≈ 1 kHz with N_eq = 1.4% accounts for ≈ 0.07% infidelity.
- The trade-off between initialization error and average number of cycles is tunable via the penalty parameter λ, with higher penalties favoring shorter episodes at the cost of higher error.
- Weak measurement regime (25% overlap of the Gaussians): memory enhances performance. Agents with memory (l = 2) reach a given initialization error with fewer rounds than agents without memory (l = 0) or simple thresholding.
- Qutrit reset: enabling a gf-flip action allows efficient reset from a completely mixed qutrit state to 1 − P ≈ 0.2% with ⟨n⟩ ≈ 2. Without access to the gf-flip, the agent needs significantly more cycles and exhibits a sudden increase in ⟨n⟩ from 2.2 to 3.4 when the control parameter is decreased, reflecting the cost-benefit trade-off between resetting the |f⟩ level and episode length.
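For context on the thresholding baseline referenced above, the sketch below shows what a simple strategy on the integrated readout signal could look like. It is an assumed, minimal form (a single threshold with SNR-optimal weights) rather than the paper's exact close-to-optimal strategies, and the function names and sign convention are illustrative.

```python
# Minimal sketch of a thresholding baseline (assumed form, not the authors'
# exact strategy): integrate the readout trace with SNR-optimal weights and
# flip the qubit whenever the integrated signal is classified as |e>.

import numpy as np

def integrated_signal(trace, weights):
    """Weighted integral V = sum_i w_i s_i of one readout trace."""
    return float(np.dot(weights, trace))

def threshold_action(trace, weights, threshold):
    """Return 'flip' if the trace is classified as |e>, else 'terminate'.

    The sign convention (which side of the threshold corresponds to |e>)
    depends on calibration; a close-to-optimal strategy may also repeat
    the measurement when V falls in an ambiguous region between two
    thresholds, which is where the learned agent's memory helps."""
    V = integrated_signal(trace, weights)
    return "flip" if V < threshold else "terminate"
```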
Discussion
The study demonstrates that a deep reinforcement learning agent can perform real-time feedback control on a quantum system, addressing the challenge of low-latency decision-making without a detailed system model. In the strong measurement regime, the learned policies recover near-threshold strategies that are close to optimal, validating the agent's ability to discover efficient controls directly from data. The agent's performance aligns with simple optimal strategies while offering flexibility to incorporate richer temporal information from full measurement traces. In weak measurements, memory provides clear benefits, showing the agent can integrate information across cycles to reduce the number of rounds required for a target fidelity. Extending to a qutrit with a dedicated gf-flip action demonstrates adaptability to more complex state spaces and control objectives, significantly improving reset efficiency compared to relying on natural decay from |f⟩. These findings are relevant for scalable quantum technologies that require rapid, adaptive control, such as error correction and device initialization, and illustrate the practicality of deploying learned controllers in hardware-constrained environments.
Conclusion
A low-latency neural network agent implemented on an FPGA was trained via model-free reinforcement learning to perform real-time feedback for initializing a superconducting qubit, converging rapidly and achieving high-fidelity resets. The agent adapts to different regimes, benefiting from temporal memory in weak measurements and extending to three-level qutrit control when provided with appropriate actions. The approach is immediately extensible to other single-qubit feedback tasks, such as initialization into arbitrary superpositions via final unitaries or through measurements in rotated bases, and to weak measurement sequences that steer states gradually. Future work includes exploring measurements that do not distinguish between qutrit subspace states to enable subspace resets, stabilizing driven dynamics like Rabi oscillations, and scaling to multi-qubit scenarios. Understanding neural network scaling with system size and overcoming FPGA hardware constraints will be key to applying these methods to larger systems, with potential impact on quantum error correction and many-body feedback cooling.
Limitations
In the strong measurement regime, improvements from leveraging full time traces are minimal because measurement durations are much shorter than relaxation times, making state transitions rare. The remaining initialization infidelity is limited by rethermalization between termination and verification and possible state mixing during verification readout. Training speed is constrained primarily by data transfer between the FPGA and PC-based trainer. In weak measurement studies, variations in decay and rethermalization rates over multi-day acquisitions may have affected comparative performance of agents without memory. The experiments focus on a single-qubit (and qutrit) system with pre-calibrated pulses; scaling to larger, multi-qubit systems will require addressing FPGA resource limits and understanding how network complexity grows with system size.