Physics
Reinforcement learning in cold atom experiments
M. Reinschmidt, J. Fortágh, et al.
The study addresses how to robustly control and optimize magneto-optical trap (MOT) operation for ultracold atom experiments, where increasing system complexity and environmental drifts challenge fixed, hand-crafted control sequences. Traditional optimization methods (e.g., Bayesian or evolutionary optimization) yield static parameter vectors that must be re-optimized when conditions change and cannot react within a cycle. Conventional feedback control requires detailed system models and manual tuning. The authors formulate MOT control as a sequential decision-making problem and introduce deep reinforcement learning (RL) to learn a policy that maps observations (fluorescence images and context) to actions (laser detuning and magnetic field gradient) maximizing task-defined rewards. The goal is to achieve adaptive, image-based, model-free control that is robust to perturbations, can generalize to unseen situations, and can enable new operational modes (e.g., atom number on demand) while also exploring sim-to-real training to reduce experimental training time.
Prior work has applied machine learning and algorithmic optimization to ultracold atom experiments by parameterizing control sequences and optimizing a cost over a high-dimensional parameter space using Gaussian processes or neural networks, with Bayesian optimization, evolutionary methods, or citizen science approaches improving performance over manual tuning. However, such methods produce static sequences that do not adapt to intra-cycle perturbations or long-term drifts. Conventional active feedback (e.g., FPGA loops) requires accurate models and task-specific tuning. Reinforcement learning has demonstrated human-level performance in simulated domains and has begun to control real physical systems, offering a model-free approach where the task is specified by a reward function. Deep RL, using neural networks, can process high-dimensional observations (images) and output continuous control signals, making it suitable for image-based, real-time control in cold atom systems.
Experimental system and observables: A standard 87Rb MOT is used, loaded from a 2D-MOT source into a UHV science chamber (~10^-8 mbar). Fluorescence images (50×50 pixels, 8-bit) are recorded continuously at 150 fps with a uEye UI-3060CP camera and serve as the observation during MOT loading; time-of-flight absorption imaging at the end of each cycle yields the atom number N and temperature T. Typical operation without saturation yields N ≈ 1.5×10^6 atoms at ~100 μK after 1.5 s of loading.
Control parameters and ranges: The RL agent controls continuous actions: (i) cooling laser detuning Δ = ω0 − ω within 0–8.25 Γ (Γ = 2π×6.063 MHz), and optionally (ii) magnetic field gradient B′ within 7.5–22.5 G cm^−1. In single-parameter experiments B′ is held at 15 G cm^−1.
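As a minimal sketch, the action ranges above can be encoded as bounds and mapped from a normalized policy output in [-1, 1] to physical units; the normalization convention and the helper below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Physical action ranges quoted above (Gamma = 2*pi x 6.063 MHz).
DETUNING_RANGE_GAMMA = (0.0, 8.25)   # cooling-laser detuning, in units of Gamma
B_GRADIENT_RANGE = (7.5, 22.5)       # magnetic field gradient B', in G/cm

def to_physical(action, low, high):
    """Map a policy output in [-1, 1] to the physical range [low, high]."""
    action = float(np.clip(action, -1.0, 1.0))
    return low + 0.5 * (action + 1.0) * (high - low)

# Example: a raw two-component action from the policy network.
detuning = to_physical(0.2, *DETUNING_RANGE_GAMMA)    # -> 4.95 Gamma
b_gradient = to_physical(-0.5, *B_GRADIENT_RANGE)     # -> 11.25 G/cm
```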
Episode and timing: Each MOT cycle (episode) lasts 1.5 s, divided into 25 time steps of 60 ms. At each step, the agent receives an observation consisting of the four most recent fluorescence images (stacked) plus auxiliary scalars (e.g., the time-step index and the current control values). The agent outputs updated control parameters, which are applied for the next step.
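A rough illustration of how such an observation could be assembled, assuming the frames are stacked into a 4×50×50 tensor and the scalars are normalized to [0, 1]; the exact packing and normalization are assumptions.

```python
import numpy as np
from collections import deque

N_STEPS = 25            # control steps per 1.5 s episode (60 ms each)
FRAME_SHAPE = (50, 50)  # 8-bit fluorescence frames

def make_observation(frames, step_index, detuning, b_gradient):
    """Stack the four most recent frames and append normalized context scalars."""
    image_stack = np.stack(list(frames), axis=0).astype(np.float32) / 255.0  # (4, 50, 50), in [0, 1]
    scalars = np.array([
        step_index / (N_STEPS - 1),          # normalized time-step index
        detuning / 8.25,                     # normalized detuning (units of Gamma)
        (b_gradient - 7.5) / (22.5 - 7.5),   # normalized field gradient
    ], dtype=np.float32)
    return image_stack, scalars

# Usage: keep a rolling buffer of the latest frames between steps.
frame_buffer = deque([np.zeros(FRAME_SHAPE, dtype=np.uint8)] * 4, maxlen=4)
obs = make_observation(frame_buffer, step_index=0, detuning=1.9, b_gradient=15.0)
```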
Hardware and integration: A desktop PC runs the RL agent and interfaces via a DAQ card (AdLink DAQ-2502) to set the laser detuning (via an offset lock) and the coil current for B′. A Jäger ADwin Pro II system handles the precisely timed time-of-flight imaging sequence. The PC initiates each episode (enables B′) and triggers absorption imaging at the episode end; the ADwin then returns to idle, ready for the next cycle. Fluorescence frames are buffered in RAM and the latest frame is used at each step.
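Conceptually, one loading cycle can be pictured as the loop below; set_detuning, set_b_gradient, and trigger_absorption_imaging are hypothetical stand-ins for the DAQ and ADwin interfaces, not the actual driver calls.

```python
import time

STEP_DURATION = 0.060   # 60 ms per control step
N_STEPS = 25            # 25 steps per 1.5 s loading cycle

def set_detuning(delta_gamma):
    pass   # placeholder: would write to the DAQ / offset lock

def set_b_gradient(b_grad):
    pass   # placeholder: would set the coil current via the DAQ

def trigger_absorption_imaging():
    pass   # placeholder: would hand off to the ADwin-timed time-of-flight sequence

def run_cycle(policy, get_observation):
    """One MOT cycle: act every 60 ms, then measure N and T by absorption imaging."""
    for step in range(N_STEPS):
        delta, b_grad = policy(get_observation(step))  # latest buffered frame + context
        set_detuning(delta)
        set_b_gradient(b_grad)
        time.sleep(STEP_DURATION)                      # hold the settings for one step
    trigger_absorption_imaging()                       # end-of-episode N, T measurement
```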
RL algorithm and training: The agent uses Deep Deterministic Policy Gradient (DDPG) with experience replay, implemented in the Coach framework. Observations are normalized to [0, 1]. The 4-frame image stack allows the loading dynamics (rates) to be estimated. Exploration noise is added to the actions during training. Rewards are attributed only at the episode end, based on the absorption imaging results. To prevent shortcut learning (e.g., memorizing a static sequence), a random perturbation offset is added to the applied control value(s) at the start of each episode; the offset is unknown to the agent, forcing it to rely on the image observations. Training runs for up to 10^5 episodes, with optimization steps performed between episodes. Scheduling includes a 400-episode heat-up phase with Brownian exploration and periodic blocks of reference and evaluation episodes (every 200 training episodes) without exploration; reference episodes apply a hand-crafted intuitive policy to normalize the reward against long-term drift.
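A scheduling skeleton consistent with this description might look as follows; the agent interface, the offset range (±1 Γ here), and the episode-runner callables are assumptions, with only the heat-up, perturbation, and reference/evaluation block structure taken from the text.

```python
import numpy as np

HEAT_UP_EPISODES = 400   # initial Brownian-exploration phase
EVAL_PERIOD = 200        # reference/evaluation block every 200 training episodes
MAX_EPISODES = 100_000
rng = np.random.default_rng()

def train(agent, run_episode, run_reference_episode, offset_range=1.0):
    """Scheduling skeleton: heat-up, perturbed training episodes, periodic reference blocks."""
    for episode in range(MAX_EPISODES):
        # Per-episode perturbation offset, hidden from the agent, so it must infer the
        # effective detuning from the fluorescence images rather than memorize a sequence.
        offset = rng.uniform(-offset_range, offset_range)  # illustrative range, in Gamma

        mode = "brownian" if episode < HEAT_UP_EPISODES else "noisy"
        transitions = run_episode(agent, offset=offset, exploration=mode)
        if episode >= HEAT_UP_EPISODES:
            agent.update(transitions)   # assumed interface: optimization steps between episodes

        if episode >= HEAT_UP_EPISODES and episode % EVAL_PERIOD == 0:
            run_reference_episode(offset=0.0)                   # hand-crafted policy, drift normalization
            run_episode(agent, offset=0.0, exploration=None)    # evaluation without exploration
```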
Reward functions: For standard MOT operation, reward R ∝ N/T to favor many cold atoms. For atom-number-on-demand, the reward is engineered as R ∝ (1/T)·exp(−(N−Nt)^2/(2σN^2)), with target atom number Nt provided as an additional observable per episode, Nt ∈ [0, 120×10^6], and σN = 15×10^6. During training, Nt values are uniformly randomized over the range. In two-parameter control experiments, the detuning perturbation offset alone was sufficient to prevent shortcut learning; no offset was added to B′.
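Both rewards have simple closed forms (up to an overall proportionality constant); a direct transcription is sketched below, with the function names chosen for illustration.

```python
import numpy as np

SIGMA_N = 15e6   # tolerance on the target atom number

def reward_standard(n_atoms, temperature):
    """Standard MOT operation: favor many cold atoms, R proportional to N / T."""
    return n_atoms / temperature

def reward_atom_number_on_demand(n_atoms, temperature, n_target):
    """Atom number on demand: Gaussian window around the target Nt, still rewarding low T."""
    return (1.0 / temperature) * np.exp(-((n_atoms - n_target) ** 2) / (2.0 * SIGMA_N ** 2))

# Example: target 50 million atoms, achieved 48 million at 100 uK.
r = reward_atom_number_on_demand(48e6, 100e-6, 50e6)
```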
Evaluation protocol: Trained agents are evaluated by applying predetermined, unknown-to-agent perturbation offsets to detuning at the start of episodes, repeating each condition up to 200 times to assess performance (N, T, reward) and policy reproducibility. Additional tests apply a sudden offset step mid-episode (e.g., +0.82 Γ at nt=15) to probe adaptation to unseen changes.
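The perturbations used at evaluation time can be summarized in a small helper; the function and argument names are illustrative, and only the 0.82 Γ step at nt = 15 comes from the text.

```python
STEP_OFFSET_GAMMA = 0.82   # sudden detuning offset used in the adaptation test
STEP_ONSET = 15            # time step n_t at which the offset switches on

def applied_detuning(commanded_detuning, step_index, static_offset=0.0, sudden_step=False):
    """Detuning actually sent to the hardware; the agent never observes the offset directly."""
    offset = static_offset
    if sudden_step and step_index >= STEP_ONSET:
        offset += STEP_OFFSET_GAMMA
    return commanded_detuning + offset
```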
Simulation for sim-to-real: A phenomenological, data-driven MOT simulator is built. The internal state evolves as N(n) = Σ_{i=1}^{n} dN(Δ_i), with no explicit loss or saturation term; the temperature follows the most recent detuning, T(n) = T(Δ_n). Look-up tables dN(Δ) and T(Δ) are measured experimentally and interpolated (linearly for dN; with an exponential fit for T). Multiplicative noise drawn from a normal distribution N(μ=1, σ=0.1) is applied to dN. A CNN-based image generator (the decoder of an autoencoder) takes (N, Δ) and outputs synthetic fluorescence images resembling experimental frames; it is trained on ~5000 samples, with outliers removed and images zeroed for N=0. An extended generator conditioned on a domain parameter enables gradual changes in fluorescence appearance for continual-learning studies. Training on the simulation uses the same observation, action, reward, and perturbation structure as on the experiment.
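A stripped-down sketch of the phenomenological simulator under these definitions; the placeholder look-up tables and the linear interpolation of T (the paper uses an exponential fit) are assumptions, and the CNN image generator is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder look-up tables; in the paper these are measured per-step atom-number
# gains dN(Delta) and temperatures T(Delta) on a grid of detunings (units of Gamma).
DELTA_GRID = np.linspace(0.0, 8.25, 12)
DN_TABLE = 1e5 * np.exp(-0.5 * ((DELTA_GRID - 1.9) / 1.0) ** 2)   # illustrative, peaked near Delta_opt
T_TABLE = 200e-6 * (DELTA_GRID + 0.5) ** (-3.0 / 8.0)             # illustrative ~Delta^(-3/8) scaling

def simulate_episode(detuning_sequence):
    """Accumulate atoms step by step; the final temperature follows the last detuning."""
    n_atoms = 0.0
    for delta in detuning_sequence:
        dn = np.interp(delta, DELTA_GRID, DN_TABLE)   # interpolated dN(Delta)
        dn *= rng.normal(loc=1.0, scale=0.1)          # multiplicative noise, N(mu=1, sigma=0.1)
        n_atoms += dn                                 # no explicit loss or saturation term
    temperature = np.interp(detuning_sequence[-1], DELTA_GRID, T_TABLE)
    return n_atoms, temperature

# Example: load at 1.9 Gamma for 24 steps, then jump to maximum detuning on the last step.
n_final, t_final = simulate_episode([1.9] * 24 + [8.25])
```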
MOT operating reference and intuitive policy: The optimal loading detuning was measured as Δopt = 1.9 Γ; the temperature scales approximately as Δ^(−3/8). The intuitive hand-crafted reference policy is to load at Δopt and then ramp the detuning to its maximum at the final time step to minimize T.
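The reference policy amounts to a two-phase detuning schedule; a literal rendering (with the ramp collapsed onto the final step, as described) is:

```python
def intuitive_policy(step_index, n_steps=25, delta_opt=1.9, delta_max=8.25):
    """Hand-crafted reference: load at Delta_opt, switch to maximum detuning on the last step."""
    if step_index < n_steps - 1:
        return delta_opt   # load at the optimal detuning
    return delta_max       # final step: large detuning to minimize temperature

schedule = [intuitive_policy(k) for k in range(25)]
```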
- The RL agent trained with R ∝ N/T, using stacked fluorescence images as observations, learned an intuitive and effective MOT policy: search during the early steps, load at Δopt ≈ 1.9 Γ, then rapidly increase the detuning near the episode end to reduce the temperature. This mirrors expert-designed, optical-molasses-like cooling sequences.
- Robust adaptation to unknown offsets: With random, unknown detuning offsets applied per episode, the agent inferred and compensated offsets from fluorescence images, achieving high rewards across a range of offsets. Without perturbations during training, an agent memorized a static sequence and failed to adapt at test time.
- Generalization to unseen events: When a sudden detuning offset step of +0.82 Γ was applied at nt=15 during evaluation, the agent adapted within a single 60 ms step based on one fluorescence image, evidencing pattern recognition of detuning-dependent fluorescence and strong generalization. Rewards were lower in these tests due to unavoidable transient atom loss immediately after the step.
- Two-parameter control (Δ and B′): With action space expanded, the agent still learned effective detuning control and developed a B′ strategy: maximize B′ early (for sensitivity at low N), lower B′ during high loading, and increase B′ again near the end, reminiscent of compressed MOT behavior, despite 60 ms time steps being too coarse for conventional fast sub-Doppler techniques.
- Atom number on demand via reward engineering: With R ∝ (1/T)·exp(−(N−Nt)^2/(2σN^2)) and Nt as an input, the agent achieved targeted atom numbers Nt from 5×10^6 to 110×10^6, with achieved N typically within ±σN (σN = 15×10^6) around Nt, and maintained robustness to detuning offsets during evaluation. Control sequences showed adjustments of both loading rate (distance to Δopt) and the duration/onset of the final cooling phase as Nt varied.
- Sim-to-real transfer: An agent trained purely in simulation reproduced similar control behavior when deployed directly on the experiment, including rapid adaptation to sudden offset steps. Absolute rewards were lower than those of experiment-trained agents, with incomplete offset compensation indicative of a sim-to-real gap (synthetic image smoothing and model simplifications). Nonetheless, qualitative transfer succeeded, suggesting potential to reduce experimental training time (which otherwise spans ~12–48 h depending on task complexity and control dimensionality).
- Continued learning and long-term stability (in simulation): After initial training (~30k episodes), agents continued learning during deployment adapted to large cumulative detuning drifts (±5Γ) provided drift rates were modest (continuous adaptation observed below ~0.33 Γ per 10^3 episodes); even for faster drifts, optimal policies were recovered once drift ceased. Agents also adapted to gradual changes in fluorescence appearance (image generator evolution), maintaining near-optimal rewards for evolutions spanning >30k episodes, and recovering optimal performance after faster evolutions halted.
Formulating MOT control as a sequential decision problem and using deep RL with image observations enables robust, adaptive, and task-driven control that addresses the limitations of static, hand-tuned sequences and model-dependent feedback. The agent leveraged fluorescence images to estimate the MOT state (including effective detuning under offsets) and acted in real time to maximize rewards, demonstrating resilience to intra-episode perturbations, shot-to-shot fluctuations, and long-term drifts. Reward engineering allowed new operational modes, notably atom-number targeting without added hardware, exemplifying a shift from manual sequence design to goal-focused control. Expanding the action space to include B′ showed the agent can discover meaningful strategies even without an obvious intuitive optimum at the available time resolution. Sim-to-real experiments indicated that policies learned in a simplified, data-driven simulation can transfer to the lab and generalize to dynamics not encountered in simulation, though with a performance gap attributable to model and image generator limitations. Overall, the findings validate RL as a practical and powerful control paradigm for cold-atom experiments, with potential benefits for operation outside controlled lab environments and for more complex species requiring many control channels.
The work introduces deep RL for real-time MOT control, demonstrating that an image-based, model-free agent can learn effective cooling strategies, adapt to unseen perturbations, and maintain performance over varying conditions. Through reward engineering, the agent achieved atom-number-on-demand using only detuning control. A simulation-based training pipeline enabled initial sim-to-real transfer, pointing toward reduced experimental training requirements. The main contributions are: (1) framing MOT operation as RL with fluorescence images as observations; (2) demonstrating robust adaptation and generalization; (3) enabling new operation modes via reward shaping; and (4) establishing a path for sim-to-real training. Future research directions include: expanding control to additional parameters (laser intensities, polarizations), finer time resolution to access faster cooling techniques, improved simulations incorporating light-induced interactions and realistic imaging to close the sim-to-real gap, domain randomization for transfer, adoption and benchmarking of alternative RL algorithms (e.g., SAC, PPO), integration of additional environmental observables (laser intensity, temperature monitors), and extending RL control toward producing degenerate quantum gases with suitable real-time observables.
- Training time on the real experiment is substantial (≈12–48 h), increasing with control dimensionality and reward complexity.
- Sim-to-real gap: The simulation neglects losses, multiple scattering, and light-induced interactions; synthetic images are smoother and lack high-frequency features, leading to incomplete offset compensation and lower rewards upon transfer.
- Temporal resolution: 60 ms time steps are too coarse to realize known fast techniques (e.g., compressed MOT, polarization gradient cooling) optimally; finer control might improve performance but was not explored here.
- Reward attribution only at episode end can bias learning toward end-of-episode actions; intermediate rewards were not used.
- Atom-number targeting used an absolute σN tolerance which can trade off cooling against atom number for large Nt; policy not yet globally optimal.
- Single-species, specific apparatus results; generalization to other species or highly complex MOTs likely requires retraining and possibly additional observables.
- An agent trained without perturbations overfits to static sequences (shortcut learning), underscoring sensitivity to training protocol design.