
Physics

Variational Monte Carlo with large patched transformers

K. Sprague and S. Czischek

Discover groundbreaking advancements in simulating qubit systems with transformer models as wavefunction ansätze! This research, conducted by Kyle Sprague and Stefanie Czischek, showcases large patched transformers that outperform traditional methods, opening doors to studying extensive Rydberg atom arrays and their fascinating phase transitions.

Introduction
The study investigates whether transformer architectures can serve as superior wavefunction ansätze for quantum many-body systems, particularly two-dimensional Rydberg atom arrays, compared to recurrent neural networks (RNNs). Prior neural-network quantum states (e.g., RBMs, RNNs, PixelCNNs) have shown success but face challenges capturing long-range correlations in 2D systems due to sequential or local receptive-field limitations. Transformers, with non-recurrent self-attention enabling all-to-all contextual dependencies, are hypothesized to better capture such correlations. The work aims to assess accuracy and computational cost for variational ground-state searches and to introduce architectural modifications (patched inputs and large, patched transformers) to overcome transformer runtime scaling, enabling accurate and efficient simulations across phases and at phase transitions.
Literature Review
The paper reviews artificial neural network quantum states as general wavefunction ansätze for ground-state search, tomography, and dynamics. It discusses architectures including restricted Boltzmann machines, recurrent neural networks, and PixelCNNs, along with enhancements via symmetry inclusion, architectural modifications, and data-based pretraining. Autoregressive models (RNNs, PixelCNN) achieve high accuracies but struggle with long-range correlations in 2D due to sequential memory bottlenecks (RNN) and limited convolutional kernel sizes (PixelCNN). Two-dimensional RNN variants can alleviate some issues. Transformers have been proposed for quantum state representation and dynamics due to their capacity to capture long-range dependencies via self-attention. Vision Transformer ideas motivate using patches to reduce sequence lengths and encode local correlations within patches.
Methodology
System and Hamiltonian: Two-dimensional Rydberg atom arrays arranged on square L×L lattices with spacing a=1 and open boundaries. The stoquastic Rydberg Hamiltonian includes the detuning δ, the Rabi frequency Ω, and van der Waals interactions V ∝ R^6/|r_i − r_j|^6 with blockade radius R. The ground states are real and positive, enabling real-valued network ansätze.
Wavefunction ansätze: Autoregressive RNN and transformer (TF) models represent Ψ(σ) via probability distributions over measurement outcomes. The RNN uses a GRU cell (hidden dimension d_h=128) that processes qubits sequentially; output probabilities P_RNN(σ_i|σ_<i; W) are sampled autoregressively. The TF uses masked self-attention with positional encodings to maintain autoregressive behavior; embedding dimension d_h=128, multi-head attention with h=8 heads, feed-forward dimension d_FF=2048, and T=2 transformer-cell repetitions. Both can be extended to complex-valued wavefunctions by adding a phase output.
Patched inputs: To reduce sequence length and encode local correlations, inputs are patches of size p (e.g., 2×2, 4×4, or 8×8 qubits), and the output is a distribution over the 2^p patch states (one-hot encoded). Patching scales down the sampling and energy-evaluation costs (O(N) for the RNN; O(N^2) and O(N^3) for the TF) by the patch size, but the patched TF still suffers from the exponentially large 2^p output dimension.
Large, patched transformer (LPTF): A patched TF processes large input patches of size p; its transformer-cell output state (dimension d_h) initializes a patched RNN that autoregressively generates sub-patches of size p_1=2×2, reducing the output dimension to 2^(p_1)=16. This hybrid retains the TF's ability to capture long-range correlations while avoiding exponential output scaling.
Parameter counts: TF ~1.20M parameters for single-qubit inputs/outputs; patched RNN (p=2×2) ~70k parameters; LPTF with d_h=128, d_FF=2048, T=2 has ~1.26M + 128·p parameters (e.g., p=8×8 → ~1.264M).
Training and optimization: Variational Monte Carlo minimizes the energy expectation E=⟨H⟩ using N_s=512 samples per iteration (mini-batches of K=256 unless memory limits require smaller K), with the Adam optimizer (β_1=0.9, β_2=0.999, learning rate 5×10^−4). Local energies include off-diagonal contributions evaluated via network re-evaluations for single-atom flips. Memory-saving measures: samples are generated without gradients and re-passed through the network for the gradient evaluation, and the atom sequence is split into D=N/p parts to reduce memory use and accelerate the local-energy evaluation.
Benchmarks: Zero-temperature stochastic series expansion quantum Monte Carlo (QMC) serves as the baseline, typically using N=7×10^7 samples across seven independent chains. System sizes: N=8×8, 12×12, 16×16, 24×24, 32×32, 40×40. Parameters for the accuracy/runtime comparisons are commonly R^−1≈1.383 (or R_∞=3^−1/2 for the phase scan), with δ and Ω as specified (often δ=Ω=1). Hardware: NVIDIA Tesla P100 GPUs for the neural networks; a single CPU for QMC. Runtimes are reported for 2×10^6 training iterations or as per-iteration averages.
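The patching and optimization steps above can be illustrated with a minimal PyTorch-style sketch. The helper names (make_patches, vmc_step) and the model interface (an autoregressive ansatz exposing sample and log_prob, plus a rydberg_local_energy routine) are hypothetical placeholders, not the authors' implementation; the gradient estimator is the standard REINFORCE-style form for a real, positive wavefunction Ψ=√P, and the optimizer settings follow the values quoted above.

```python
# Minimal sketch of input patching and one VMC update, assuming hypothetical
# model/energy interfaces; not the authors' code.
import torch

def make_patches(sigma, lattice_len, patch_len):
    """Reshape a batch of LxL occupation configurations (B, L, L) into a
    sequence of (patch_len x patch_len) patches, flattened to patch_len**2 bits."""
    batch = sigma.shape[0]
    n = lattice_len // patch_len
    x = sigma.view(batch, n, patch_len, n, patch_len)     # split rows/cols into blocks
    x = x.permute(0, 1, 3, 2, 4).contiguous()             # group the two block indices
    return x.view(batch, n * n, patch_len * patch_len)    # (B, num_patches, qubits per patch)

def vmc_step(model, optimizer, local_energy_fn, num_samples=512):
    """One variational Monte Carlo update: sample, evaluate local energies,
    and descend the gradient of <H> for a real, positive ansatz Psi = sqrt(P)."""
    with torch.no_grad():                                  # sampling needs no gradients
        sigma = model.sample(num_samples)                  # (num_samples, N) occupations
        e_loc = local_energy_fn(model, sigma)              # (num_samples,) local energies
    log_p = model.log_prob(sigma)                          # re-pass samples with gradients
    baseline = e_loc.mean()
    # For Psi = sqrt(P): grad <H> = < (E_loc - <E_loc>) * grad log P >
    loss = torch.mean((e_loc - baseline) * log_p)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return baseline.item(), e_loc.var().item()             # energy estimate and variance

# Usage (hypothetical model class); optimizer settings follow the text:
# model = AutoregressiveAnsatz(...)
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999))
# energy, variance = vmc_step(model, optimizer, rydberg_local_energy)
```

Sampling without gradients and then re-passing the samples for the gradient evaluation mirrors the memory-saving strategy described in the text.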
Key Findings
- Accuracy: For N=8×8 and 16×16 at δ=Ω=1 and R^−1≈1.383, the TF converges to lower energy differences relative to QMC than the RNN, and within fewer iterations; the advantage grows with system size. Energy variances σ^2(E) decrease toward zero upon convergence.
- Runtime trade-offs: Despite faster convergence in iterations, the TF is much slower per iteration (sampling O(N^2), energy O(N^3)). Reported total runtimes for 2×10^6 iterations: RNN ~1.5 h (8×8), ~16 h (16×16); TF ~9.5 h (8×8), ~144 h (16×16).
- Patched models (p=2×2): Substantially reduced runtimes with improved or competitive accuracies. Patched RNN: ~0.5 h (8×8), ~2 h (16×16); patched TF: ~3 h (8×8), ~28 h (16×16). Energies fall within QMC uncertainties; some estimates lie below the QMC central values but within the QMC error bars, consistent with the variational principle.
- Large, patched transformers (LPTF): With p=4×4 and p_1=2×2, runtimes are ~1 h (8×8) and ~9 h (16×16); with p=8×8 and p_1=2×2, ~4.5 h (16×16). Accuracies are comparable to the patched RNN/TF while requiring fewer total hours to converge despite more iterations.
- Scaling with patch size: For N from 12×12 to 40×40, LPTF accuracy degrades as p increases but remains within QMC uncertainty up to p≈8×8 for the tested sizes. Per-iteration runtime drops with p and then plateaus because memory limits force smaller mini-batches (K reduced from 256 to 32).
- Phase transition: The LPTF reproduces the staggered magnetization across the detuning sweep δ∈[0,3] at R_∞=3^−1/2 and Ω=1, capturing the disordered-to-checkerboard transition near δ≈1.2. Differences vs. QMC are ~10^−2 at the critical point; LPTF uncertainties remain small while QMC errors grow due to autocorrelation near criticality.
- QMC comparison: QMC scales linearly in N for sampling and energy evaluation but suffers long autocorrelation times and potential ergodicity issues at large N; the ANN methods provide exact, independent autoregressive samples with lower uncertainties at a modest N_s=512 per iteration.
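As a small illustration of how the order parameter in the phase-transition benchmark can be estimated from autoregressive samples, the sketch below computes a staggered magnetization; the specific (−1)^(x+y)(n_i − 1/2) definition is a conventional choice assumed here rather than taken from the paper.

```python
# Checkerboard order parameter from sampled occupations; the definition used
# here is an assumed conventional staggered magnetization.
import torch

def staggered_magnetization(sigma):
    """sigma: (num_samples, L, L) tensor of Rydberg occupations n_i in {0, 1}.
    Returns the sample-averaged absolute staggered magnetization."""
    num_samples, lattice_len, _ = sigma.shape
    rows, cols = torch.meshgrid(torch.arange(lattice_len),
                                torch.arange(lattice_len), indexing="ij")
    signs = 1.0 - 2.0 * ((rows + cols) % 2)                 # +1 / -1 checkerboard pattern
    m = (signs * (sigma - 0.5)).sum(dim=(1, 2)) / lattice_len**2  # per-sample value
    return m.abs().mean().item()
```

Evaluating this on samples drawn at each detuning δ across the sweep would trace the disordered-to-checkerboard transition curve described above.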
Discussion
The study confirms that transformers, via masked self-attention, capture long-range correlations in 2D qubit systems better than RNNs, improving variational ground-state accuracy. However, raw TF computational costs impede scalability. Patching reduces the sequence length, captures local correlations within a patch, and improves both runtime and accuracy. The proposed LPTF architecture integrates a patched TF with a patched RNN sub-decoder to handle large patches efficiently, avoiding exponential output scaling while maintaining the transformer's expressive power. Benchmarks on 2D Rydberg arrays show that LPTFs achieve energies within (and sometimes below) the QMC uncertainties, with substantially reduced runtimes compared to standard TFs and wall-clock costs competitive with or shorter than QMC for the considered sizes. Moreover, LPTFs accurately reproduce an order parameter across a phase transition, including the challenging critical region. These results address the central hypothesis: transformers, especially in the large, patched configuration, are effective and efficient wavefunction ansätze for large, correlated quantum systems, mitigating limitations inherent to RNNs and some QMC practicalities (autocorrelation, ergodicity) at intermediate sizes.
Conclusion
The paper introduces large, patched transformers (LPTFs) for variational Monte Carlo ground-state searches in quantum many-body systems. LPTFs combine a patched transformer (capturing long-range correlations over large patches) with a patched RNN (efficiently decoding sub-patches), yielding high-accuracy ground-state representations at reasonable computational cost. Empirically, TFs outperform RNNs in accuracy; patched models reduce runtimes; and LPTFs deliver accuracies within or better than QMC uncertainties for 2D Rydberg arrays while achieving faster convergence in wall-clock time for the studied sizes. LPTFs also capture phase transitions accurately. Future work includes: extending to complex-valued wavefunctions, scaling up model capacity (more transformer layers, larger embeddings), multi-stage LPTF hierarchies for even larger patches, data-based initialization, and comprehensive scaling studies to very large systems and broader models.
Limitations
- Computational cost: Plain TFs scale as O(N^2) for sampling and O(N^3) for energy evaluation; even patched TFs and LPTFs have higher asymptotic complexity than RNNs. Memory constraints necessitate smaller mini-batches for large patches, limiting the speedups.
- Patch-size trade-off: Accuracy decreases with larger input patches when the sub-patch size p_1 is fixed at 2×2, reflecting the limited expressivity of the RNN sub-decoder and the higher information content per iteration.
- Benchmark scope: QMC is expected to outperform ANN methods at very large system sizes despite its practical autocorrelation issues; the reported advantages are demonstrated for specific lattice sizes and parameter regimes.
- Real-valued focus: Experiments target stoquastic Hamiltonians with real, positive ground states; extending to complex wavefunctions can complicate training and may require architectural or tuning advances.
- Variational sampling: Results depend on the finite N_s per iteration; although variances decrease toward zero, statistical fluctuations and optimization dynamics can influence the reported energies.