Computer Science
Quantum variational algorithms are swamped with traps
E. R. Anschuetz and B. T. Kiani
Variational quantum algorithms (VQAs), quantum analogs of neural networks, have been proposed with the hope that they might inherit the favorable trainability properties of classical neural networks and even outperform certain classical approaches. However, unlike the classical setting, VQAs frequently suffer from training pathologies. Deep VQAs can exhibit barren plateaus where gradients vanish exponentially with system size, impeding gradient-based optimization. Less was known for shallow, local models with local cost functions, where promising numerical results often relied on good initialization or symmetry. This work addresses whether shallow, local VQAs are generically trainable. The authors show that even when barren plateaus are absent, typical shallow, local VQAs are not trainable without strong prior information: their loss landscapes are dominated by poor local minima far from the global optimum, and in noisy settings, SQ-based optimization demands exponentially many queries. The purpose is to rigorously delineate when and why VQAs fail to train, and to identify avenues where training might still be possible.
Prior untrainability results focused on: (1) barren plateaus in deep circuits where gradients vanish exponentially with depth and system size; (2) nonlocal or globally scrambling shallow models exhibiting poor local minima; and (3) effects of noise driving convergence to uniform distributions at rates exponential in depth. Numerical studies observed poor local minima for random variational landscapes and improved behavior in overparameterized regimes, though often at impractical parameter counts. Table 1 in the paper summarizes these results across dimensions, locality, depth, worst-case analyses, and whether barren plateaus or poor minima were shown. The present work extends the landscape by focusing on shallow, local, Hamiltonian-agnostic ansatzes and providing both SQ-based hardness under noise and landscape-based untrainability without relying on barren plateaus.
The paper develops two complementary analyses and numerical validations: (1) Statistical Query (SQ) framework: The authors formalize quantum correlational SQ (qCSQ) for learning observables and quantum unitary SQ (qUSQ) for unitary compilation, where each query returns an expectation value within a tolerance. They show many practical noisy optimization steps (e.g., first/second-order methods using parameter-shift gradients estimated with finite sampling) reduce to SQ oracles. Using lower bounds via SQ-DIM (statistical query dimension) over distributions that form 2-designs (or even computational-basis uniform distributions), they prove exponential query lower bounds to learn simple function classes generated by shallow variational circuits. Table 2 summarizes representative settings: for example, with L=1 global measurement and single-qubit gates, query complexity is 2^{Ω(n)} (for t≥3); for L=log2(n) with single-qubit measurement and global 1- and 2-local gates, it is 2^{Ω(n)}; for L=n with nearest-neighbor 1- and 2-local gates on a d-dimensional lattice, it is 2^{Ω(n^d)}; and for unitary learning with L=1 single-qubit gates, it is 2^{Ω(n)}. These bounds hold even when query noise is exponentially small in n. (2) Loss landscape analysis without noise: The authors study shallow, local VQA loss functions of VQE type with Hamiltonian-agnostic ansatzes. Assuming approximate local scrambling (local ε-approximate t-design behavior in the reverse light cone of each measured observable), they bound the discrepancy in the joint characteristic function of the loss, gradient, and Hessian relative to that of Wishart hypertoroidal random fields (WHRFs), establishing convergence in distribution (Theorem 1). They interpret the WHRF degrees of freedom m as set by local Hilbert space dimensions within reverse light cones and relate underparameterization to poor landscapes. Leveraging known WHRF landscape theory, they prove that the fraction of local minima within any constant additive error of the global minimum is superpolynomially small in n under mild conditions (Corollary 2). (3) Numerical experiments: Best-case classical simulations with exact gradients validate theory. Teacher-student QCNN: both teacher and student are QCNNs with parameterized 2-qubit gates and pooling; the student is trained (Adam) on 512 random computational-basis inputs to match the teacher’s last-qubit measurement. Qubits considered: 4, 8, 12, 16 (parameters: 32, 48, 64, 64 respectively). VQE: ground states of local Hamiltonians constructed by conjugating Σ Z_i with L=4 layers of alternating 2-local product unitaries; optimization via vanilla gradient descent for up to 30,000 steps, measuring both final energy and trace distance to the ground state; depth L of the ansatz is varied in some experiments to study overparameterization effects.
- SQ hardness under noise: For broad classes of shallow variational circuits over 2-design input distributions, any algorithm whose steps reduce to SQ queries requires exponentially many queries to learn. Examples (from Table 2): (i) L=1 with global measurement and single-qubit gates requires 2^{Ω(n)} queries (t≥3), (ii) L=log2(n) with single-qubit measurement and global 1- and 2-local gates requires 2^{Ω(n)} (t≥4^t), (iii) L=n with nearest-neighbor 1- and 2-local gates on a d-dimensional lattice requires 2^{Ω(n^d)} (t=Ω(1)), and (iv) unitary learning with L=1 single-qubit gates requires 2^{Ω(n)} (t≥4^t). These bounds hold even with exponentially small query noise, e.g., from shot noise in gradient estimation. - Landscape untrainability without barren plateaus: For shallow, local, Hamiltonian-agnostic ansatzes that scramble approximately locally, the VQE-type loss converges to a WHRF. In the underparameterized regime (local parameter count l much smaller than twice the WHRF degrees of freedom 2m), only a fraction exp(-m) of critical points lie within any constant additive energy of the global minimum. Corollary 2 shows that, under mild scaling conditions (log(n) + q log(q) = O(2^l A), with q local parameters in reverse light cones, light-cone size l, and A the number of Pauli terms), a superpolynomially small fraction (in n) of local minima are close to the global minimum; thus typical shallow local VQAs are untrainable even without barren plateaus. - Numerical validation: Teacher-student QCNN experiments show training success sharply degrades as qubits increase; after 8 qubits, training typically fails to match the teacher, converging to poor local minima despite guaranteed existence of a zero-loss solution. VQE experiments with L=4 Hamiltonians: gradient descent (30,000 steps) on ansatzes capable of representing the ground state (depth L=4) increasingly converges to poor local minima with growing n; both energy and trace distance cluster far from zero as n increases. For a 14-qubit instance, increasing ansatz depth reveals that only when the number of parameters exceeds the explored Hilbert space dimension (e.g., about 300 layers) does the optimizer reliably reach the ground state; otherwise it remains trapped far from optimal. - Broader insight: Absence of barren plateaus does not imply trainability. Underparameterization relative to the (local) Hilbert space dimension governs a phase transition: overparameterized models concentrate minima near the global optimum; underparameterized shallow local models have landscapes swamped with traps.
The findings address the central question of VQA trainability beyond barren plateaus. In noisy settings where optimization steps reduce to statistical queries, even simple shallow variational classes are exponentially hard to learn, formalizing a strong barrier to training with standard noisy gradient-based or second-order methods. In the noiseless setting, typical shallow, local, Hamiltonian-agnostic ansatzes have loss landscapes that converge to WHRFs in the underparameterized regime, yielding an overwhelming prevalence of poor local minima far from the global optimum. This explains empirical failures even when gradients do not vanish (i.e., no barren plateaus). The results imply that demonstrating the absence of barren plateaus is insufficient to claim trainability. Nevertheless, the work highlights potential routes to practical training: use problem-structured or symmetry-constrained ansatzes (reducing effective Hilbert space), leverage parameter concentration and good initializations (e.g., QAOA), design algorithms whose primitives do not reduce to SQ queries (e.g., non-global metrics or tailored estimators), and exploit settings where Hamiltonian learning is known to be efficient though distinct from ground-state learning. The numerical experiments corroborate theory, emphasizing the role of local underparameterization and the necessity of sufficient overparameterization or structure to achieve reliable training.
This work establishes rigorous barriers to training common classes of variational quantum algorithms. In the presence of noise, optimization procedures that reduce to statistical queries require exponentially many queries to learn simple variational function classes. In the noiseless regime, typical shallow, local, Hamiltonian-agnostic ansatzes have loss landscapes dominated by poor local minima: only a superpolynomially small fraction of minima lie close to the global optimum, even without barren plateaus. Numerical experiments with QCNNs and VQE corroborate these claims. Together, the results refocus trainability analyses beyond barren plateaus and toward underparameterization and noise. Future directions include: developing problem-specific, non-SQ optimization strategies; constructing ansatzes that exploit symmetries or hierarchical structure; devising effective initialization schemes (e.g., leveraging parameter concentration); and analyzing other VQA paradigms (such as quantum Boltzmann machines) where efficient, end-to-end training may be possible in certain regimes.
The theoretical landscape results rely on assumptions of approximate local scrambling (local ε-approximate t-design behavior) and typicality over randomized, Hamiltonian-agnostic ansatzes; specific structured ansatzes or data distributions may avoid these traps. SQ lower bounds apply to algorithms whose steps reduce to statistical queries over distributions forming 2-designs (or similar), and thus do not preclude specialized non-SQ algorithms. Numerical validations are finite-size simulations in idealized, noiseless settings with exact gradients; real hardware noise and finite sampling could further impede or, in rare cases, alter performance. Some positive regimes (e.g., symmetry-restricted models, tailored objectives, or specialized estimators) fall outside the negative results. Finally, while the WHRF convergence and corollaries capture typical behavior, they do not characterize every instance or all problem-specific ansatzes.
Related Publications
Explore these studies to deepen your understanding of the subject.

