Computer Science
Fast and robust analog in-memory deep neural network training
M. J. Rasch, F. Carta, et al.
Analog in-memory computing (AIMC) uses resistive crossbar arrays to perform matrix–vector multiplications in-memory with high energy efficiency, and most prior work has targeted inference acceleration. Training is substantially more compute-intensive than inference and would benefit even more from AIMC acceleration, but practical challenges arise due to analog device nonidealities: asymmetric and noisy conductance updates, device-to-device variations, limited precision, gradual saturation, costly global resets, and limited ability to program identical reference conductances. Prior efforts either (a) offload gradient computation/accumulation to digital memory (mixed-precision approaches), which incurs O(N^2) digital cost and undermines the O(1) MVM speedups, or (b) rely on fully in-memory outer-product updates that require highly symmetric and precise bi-directional device switching, which is unrealistic for common device materials. The Tiki-Taka v2 (TTv2) algorithm addresses part of this by accumulating gradients on a separate analog array and using a differential read against a pre-programmed reference array at the symmetry point (SP), plus digital low-pass filtering. However, TTv2 depends critically on precisely programmed and retained reference conductances and device asymmetry, and small reference errors can severely degrade training accuracy. This work proposes two algorithms—Chopped-TTv2 (c-TTv2) and Analog Gradient Accumulation with Dynamic reference (AGAD)—that preserve the fast, parallel O(1) in-memory update while removing the need for precise static references, improving robustness and broadening viable device choices.
- AIMC inference acceleration has been demonstrated on various NVM technologies (e.g., ReRAM, PCM, ECRAM) and mixed-signal chips, leveraging Ohm’s and Kirchhoff’s laws for efficient MVMs.
- Mixed-precision (MP) training approaches compute/accumulate gradients digitally and program analog weights when thresholds are reached, but require O(N^2) digital operations and significant memory bandwidth, limiting overall training speedups.
- Fully in-memory outer-product updates using pulse-train coincidences enable O(1) parallel updates but historically required near-ideal symmetric devices, making them impractical with typical asymmetries/noise.
- Tiki-Taka v2 (TTv2) introduced a separate gradient-accumulation array (A) and a reference array (R) programmed at the symmetry point (SP) of A, using differential reads and digital low-pass filtering to mitigate device nonidealities; it reduces precision/symmetry demands but critically relies on accurate, stable per-device references and device asymmetry.
- Subsequent work highlighted difficulties in precisely programming and retaining R (programming errors of often 5–10% of the conductance range, plus drift), motivating algorithms that relax or eliminate the R requirement and expand material choices.
Core architecture and operations:
- Two analog arrays are used during training: A for in-memory gradient accumulation via parallel pulsed outer-product updates, and W for storing the actual weights used in forward/backward MVMs. TTv2/c-TTv2 also consider a reference array R for differential reads; AGAD eliminates R by using a digital reference.
- For each input sample: (1) perform a parallel in-memory outer-product accumulation onto A via stochastic pulse coincidences; (2) every n_s updates, read a single row (or column) of A into the digital domain; (3) update a digital hidden accumulator H; (4) when an element of H crosses a threshold, write single pulses to the corresponding row of W and reset that element of H. This keeps the analog update complexity at O(1) per input.
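A minimal NumPy sketch of this scheduling (idealized arrays, no device noise; the array size, learning rates, and threshold are illustrative placeholders, not the paper's settings; the outer product and the reads stand in for the analog operations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and hyperparameters (illustrative only).
N = 8            # array dimension
n_s = 2          # read one row of A every n_s updates
threshold = 1.0  # |H| level that triggers a single-pulse write to W

A = np.zeros((N, N))   # analog gradient-accumulation array (idealized, noise-free here)
W = np.zeros((N, N))   # analog weight array used for forward/backward MVMs
H = np.zeros((N, N))   # digital hidden accumulator

row = 0
for step in range(100):
    x = rng.standard_normal(N)          # activations for this input sample
    d = rng.standard_normal(N)          # backpropagated errors for this sample
    A += 0.1 * np.outer(d, x)           # (1) stand-in for the parallel pulsed outer product
    if step % n_s == 0:
        w_row = A[row]                  # (2) idealized read of a single row of A
        H[row] += w_row                 # (3) digital accumulation (low-pass filtering)
        over = np.abs(H[row]) > threshold
        W[row, over] += 0.05 * np.sign(H[row, over])  # (4) single coarse pulses onto W
        H[row, over] = 0.0              # reset the transferred elements of H
        row = (row + 1) % N             # cycle through rows of A
```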
Algorithms:
- TTv2 baseline: Requires a pre-programmed reference array R holding the per-device SP of A; reads use differential MVM (A−R) to remove device-to-device offsets; H accumulates leaky-filtered gradients; thresholded H triggers single-pulse writes to W.
- Chopped-TTv2 (c-TTv2): Introduces a chopper vector c∈{−1,1} to modulate activation signs before accumulation (c⊙x) and demodulate upon readback, canceling low-frequency offsets or residuals from incorrect R and slow transients in A. Chopper signs flip with probability p at read cycles. Hardware remains similar to TTv2; reference R still used, but precision requirements are relaxed.
- AGAD: Uses choppers as above but replaces static R with a digital on-the-fly reference P_ref computed from a leaky average P_k of recent A readouts: P_k=(1−β)P_{k−1}+β w_k, where w_k=A v_k is the read row. P_ref is updated when the chopper for that row flips. Reads subtract P_ref digitally (no differential analog read needed). This removes the need for programmable per-device R and associated circuitry and works with both asymmetric and symmetric devices.
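A hedged sketch of the chopping and dynamic-reference mechanics of AGAD (columns of A are read here; the hidden-accumulator transfer to W is omitted, and the parameter values, update order, and idealized reads are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

N = 8
beta = 0.1      # leaky-average coefficient (illustrative value)
p_flip = 0.5    # probability of flipping a column's chopper sign at its read cycle

c = np.ones(N)              # chopper signs, one per input column, c ∈ {−1, +1}
P = np.zeros((N, N))        # running leaky average of recent A readouts
P_ref = np.zeros((N, N))    # digital on-the-fly reference, refreshed on chopper flips
A = np.zeros((N, N))        # analog gradient-accumulation array (idealized)

for step in range(50):
    x = rng.standard_normal(N)                      # activations
    d = rng.standard_normal(N)                      # backpropagated errors
    A += 0.1 * np.outer(d, c * x)                   # chopped outer-product accumulation (c ⊙ x)

    k = step % N                                    # column of A read this cycle
    w_k = A[:, k]                                   # idealized column read of A
    P[:, k] = (1.0 - beta) * P[:, k] + beta * w_k   # P_k = (1 − β) P_{k−1} + β w_k
    g_k = c[k] * (w_k - P_ref[:, k])                # subtract reference digitally, demodulate
    # g_k would feed the digital hidden accumulator H, as in TTv2/c-TTv2.

    if rng.random() < p_flip:                       # chopper flip for this column
        P_ref[:, k] = P[:, k]                       # refresh the reference from the leaky average
        c[k] = -c[k]
```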
Device model and nonidealities:
- Soft-bounds incremental update model with asymmetry, device-to-device variations in bounds and slopes, and cycle-to-cycle noise; the number of effective states n_states ≈ (w_max−w_min)/δ, where δ is the mean single-pulse update size, governs precision and noise.
- Symmetry point (SP) exists due to soft-bounds behavior; TTv2/c-TTv2 use SP via R to induce decay to algorithmic zero. AGAD uses transient history rather than SP.
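A small illustration of a soft-bounds device model of this kind, showing asymmetric step sizes, cycle-to-cycle noise, the resulting n_states, and the symmetry point where expected up and down steps cancel; the parameter values are illustrative, not calibrated to any material:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative soft-bounds parameters (not calibrated to a specific device).
w_min, w_max = -1.0, 1.0
delta_up, delta_down = 0.012, 0.008   # asymmetric mean step sizes
noise_std = 0.3                       # cycle-to-cycle noise, relative to the step size

def pulse(w, up):
    """One incremental conductance update with soft-bounds saturation and write noise."""
    if up:
        dw = delta_up * (1.0 - w / w_max)
    else:
        dw = -delta_down * (1.0 - w / w_min)
    dw *= 1.0 + noise_std * rng.standard_normal()
    return float(np.clip(w + dw, w_min, w_max))

# Effective number of states: weight range divided by the mean step size.
delta_mean = 0.5 * (delta_up + delta_down)
print(f"n_states ≈ {(w_max - w_min) / delta_mean:.0f}")

# Symmetry point (SP): where delta_up * (1 - w/w_max) = delta_down * (1 - w/w_min).
w_sp = (delta_up - delta_down) / (delta_up / w_max - delta_down / w_min)
print(f"symmetry point ≈ {w_sp:.3f}")

# Random up/down pulsing drives the device toward its symmetry point.
w = 0.8
for _ in range(5000):
    w = pulse(w, up=rng.random() < 0.5)
print(f"weight after random pulsing ≈ {w:.3f}")
```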
Simulations and benchmarks:
- Implemented in AIHWKit (PyTorch-based). Experiments include toy constant-gradient cases, linear layer programming to target weights (20×20), and DNN training: 3-layer fully connected on MNIST, LeNet on MNIST, and 2-layer LSTM on War & Peace; an additional larger Vision Transformer on CIFAR-10 to check scaling of trends.
- Parameters varied: reference offset variation σ_r, number of device states n_states (mapping to materials: low for ReRAM-like, high for ECRAM-like), device asymmetry. Metrics: weight programming error (SD from target), test error/loss, robustness to offsets, performance/runtime estimates.
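For orientation, a minimal AIHWKit-style training sketch following the toolkit's basic usage pattern; the preset name used here (TikiTakaReRamSBPreset) is an assumption about the installed release and should be checked against its documentation (recent releases also expose chopped-TTv2 and AGAD-style device configurations):

```python
import torch
from torch.nn.functional import mse_loss

from aihwkit.nn import AnalogLinear
from aihwkit.optim import AnalogSGD
# Preset name is an assumption; check aihwkit.simulator.presets in the installed release.
from aihwkit.simulator.presets import TikiTakaReRamSBPreset

# A 20x20 analog layer, loosely matching the linear-layer programming experiments.
model = AnalogLinear(20, 20, bias=False, rpu_config=TikiTakaReRamSBPreset())
opt = AnalogSGD(model.parameters(), lr=0.1)
opt.regroup_param_groups(model)

x = torch.rand(64, 20)
target = x @ torch.randn(20, 20)   # arbitrary target mapping to "program" into the weights

for _ in range(100):
    opt.zero_grad()
    loss = mse_loss(model(x), target)
    loss.backward()
    opt.step()
```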
Learning-rate and scheduling:
- Derive scaling for hidden-to-weight transfer learning rate to normalize per-layer magnitudes and device constraints; dynamic normalization using running maxima of |x| and |d|; practical simplifications validated empirically under high-noise/high-asymmetry settings.
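One plausible simplification of such dynamic normalization is sketched below, using running maxima of |x| and |d| to scale the pulsed-update learning rate; the exact per-layer scaling derived in the paper may differ:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical normalization sketch (not the paper's exact formula): track running maxima
# of |x| and |d| and use them to keep analog update magnitudes within device constraints.
m_x, m_d = 1e-12, 1e-12    # running maxima (initialized small to avoid division by zero)
decay = 0.99               # slow decay so the maxima can track changing statistics
lr_device = 0.1            # nominal device update strength (illustrative)

for _ in range(1000):
    x = rng.standard_normal(32)            # layer activations
    d = 0.01 * rng.standard_normal(16)     # backpropagated errors
    m_x = max(decay * m_x, np.max(np.abs(x)))
    m_d = max(decay * m_d, np.max(np.abs(d)))
    lr_eff = lr_device / (m_x * m_d)       # normalized learning rate for the pulsed update
    # ... lr_eff would set the pulse probabilities of the outer-product update onto A ...
```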
Performance estimation:
- Complexity and time estimates per input sample for update pass under assumptions: N=512, FP8 digital throughput ~0.7 TFLOPS/core shared among 4 arrays, single MVM ~40 ns, single pulse ~5 ns, n_s=2 (and variants). Compare to mixed-precision digital gradient accumulation baseline.
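A back-of-the-envelope version of this estimate under the stated assumptions (the paper's exact accounting may differ); it recovers roughly 3 µs per sample for the mixed-precision baseline and a few tens of nanoseconds for the in-memory update:

```python
# Rough update-time estimate per input sample under the assumptions listed above.
N = 512                      # array dimension
flops_core = 0.7e12          # FP8 digital throughput per core [FLOP/s]
arrays_per_core = 4          # one digital core shared among 4 analog arrays
mvm_time = 40e-9             # single analog MVM / row read [s]
pulse_time = 5e-9            # single update pulse [s]
n_s = 2                      # one row of A is read and processed every n_s samples

eff_flops = flops_core / arrays_per_core

# Mixed precision: the full N x N outer product is computed and accumulated digitally.
t_mp = 2 * N * N / eff_flops
print(f"mixed precision ≈ {t_mp * 1e9:.0f} ns per sample")        # ≈ 3000 ns

# In-memory (AGAD-like): parallel pulsed update plus, every n_s samples, one row read of A
# and O(N) digital work on that row (reference subtraction, H update, thresholding).
digital_ops_per_row = 4 * N                                        # rough O(N) estimate
t_agad = pulse_time + (mvm_time + digital_ops_per_row / eff_flops) / n_s
print(f"in-memory (AGAD-like) ≈ {t_agad * 1e9:.0f} ns per sample") # a few tens of ns
```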
- Robustness to reference offsets:
- TTv2 performs well only when reference offsets are effectively zero; even σ_r≈0.1 (≈5% of full range) significantly degrades accuracy (e.g., for 20 states, weight error increases to ~9%).
- c-TTv2 cancels large offsets via chopping and maintains accuracy up to offsets ≈25% of range (α≈0.5). Some oscillatory transients slow learning but do not cause failure.
- AGAD is effectively invariant to reference offsets because references are computed on-the-fly; weight errors remain nearly constant across σ_r for each n_states (e.g., with 20 states, ~4.5% across σ_r; with 100 states, ~1.2%).
- DNN training across topologies (3-FC on MNIST, LeNet on MNIST, 2-layer LSTM on War & Peace):
- With perfect SP correction on W, all algorithms approach FP accuracy; without perfect reference, TTv2 degrades markedly, c-TTv2 is tolerant up to moderate offsets, AGAD remains stable and typically best, especially for larger n_states.
- Correcting W’s SP can further improve test error for c-TTv2 and AGAD, sometimes surpassing FP baselines.
- A larger Vision Transformer on CIFAR-10 confirms the same trends: stability against offsets for c-TTv2 and AGAD, degradation for TTv2.
- Device asymmetry dependence:
- TTv2 and c-TTv2 require asymmetric device response (soft-bounds) for proper decay to SP; performance degrades with increasing symmetry.
- AGAD accuracy is insensitive to device symmetry, enabling use of symmetric devices (e.g., capacitors, ECRAM) as well as asymmetric ones (ReRAM).
- Endurance and retention requirements:
- Gradient-accumulation (A) devices experience the most pulses (≈0.5–4 pulses per input sample); W devices receive a pulse only about once every 2×10^2–4×10^4 input samples, i.e., orders of magnitude fewer pulses over a training run, so A requires the highest endurance.
- Retention of R is critical for TTv2 (tolerable drift ≲5% over training) and relaxed for c-TTv2 (up to ≈25%); AGAD eliminates R and requires only short retention for A (on the order of one transfer period, typically 10^2–10^3 samples), while W must retain over a full training run.
- Performance/runtime:
- Estimated update time per input (N=512, n_s=2): TTv2 ≈56.3 ns, c-TTv2 ≈62.1 ns, AGAD ≈30.9 ns, versus ≈3024.5 ns for mixed-precision digital gradient accumulation, i.e., roughly 50× (TTv2/c-TTv2) to ~100× (AGAD) speedups for the proposed in-memory algorithms.
- With n_s=10 and l_max=1 (supported for certain device settings), AGAD can reach ≈17.1 ns/update, ≈175× faster than mixed-precision baseline.
- Hardware simplification:
- AGAD removes the need for a separate programmable reference array R and differential read circuitry, simplifying unit-cell design and reducing chip area.
- Overall, both c-TTv2 and AGAD achieve state-of-the-art accuracy for AIMC training while substantially improving robustness to reference errors; AGAD further broadens feasible device materials and simplifies hardware.
The study addresses the central challenge of robust, efficient AIMC-based training under realistic device nonidealities. By introducing chopping (c-TTv2) and dynamic digital reference estimation (AGAD), the algorithms preserve O(1) fully parallel outer-product updates and maintain high training accuracy without relying on precisely programmed per-device references. This resolves a key bottleneck of TTv2 (its sensitivity to reference programming and retention) and allows the training update path to match the constant-time nature of the forward/backward MVMs. The findings demonstrate that robust training is achievable even with large reference offsets and across a range of device characteristics. AGAD's invariance to device symmetry significantly widens the viable material options (including symmetric, high-endurance devices), while eliminating the differential-read hardware simplifies the crossbar unit cell and reduces area and design complexity. Estimated runtime and bandwidth analyses show orders-of-magnitude improvements over digital mixed-precision gradient accumulation, suggesting that fully in-memory training can deliver both speed and energy benefits. These trends hold consistently from small to larger DNNs, indicating that they scale. The work thus provides a practical path to fast, robust analog in-memory training with relaxed device and hardware constraints.
This work proposes two algorithms—Chopped-TTv2 (c-TTv2) and Analog Gradient Accumulation with Dynamic reference (AGAD)—for fast, robust analog in-memory training. Both retain the constant-time in-memory outer-product update while mitigating sensitivity to device offsets and asymmetry. c-TTv2 uses chopping to suppress offsets and low-frequency noise, relaxing reference precision requirements. AGAD replaces static, per-device analog references with on-the-fly digital references from recent transient dynamics, eliminating the need for a reference array and differential reads, broadening device choices to include symmetric devices, and simplifying hardware. Simulations on linear layers and diverse DNNs show state-of-the-art accuracy under realistic device nonidealities, with AGAD exhibiting strong invariance to reference offsets and symmetry. Runtime estimates indicate two orders-of-magnitude speedups over digital mixed-precision alternatives. Future research includes full-system energy/time evaluations on specific mixed-signal architectures, co-design of DNNs tailored to AIMC training characteristics, exploration of materials with extreme endurance/retention profiles for A vs W arrays, and efficient weight extraction (e.g., stochastic weight averaging) for deployment on diverse inference hardware.
- Most results are from detailed simulations using a device model (soft-bounds with noise and variations) and AIHWKit; real hardware behavior may introduce additional nonidealities.
- DNN benchmarks include small to medium-scale networks; larger models were explored only briefly due to simulation cost, and comprehensive hyperparameter sweeps were limited.
- Performance figures are analytical/estimated under assumed digital throughput, MVM, and pulse timings; actual chip-level throughput depends on specific architecture, parallelism, ADCs, and memory hierarchies.
- Endurance and retention analyses are qualitative/estimate-based and depend on material properties; achieving very high endurance for A remains challenging for some NVMs.
- TTv2/c-TTv2 require some device asymmetry; AGAD relaxes this but adds O(N^2) digital storage for P and P_ref unless β=1 or B=1 variants are used, which may trade accuracy for memory reductions.
- Code used for full simulations is not publicly released due to export restrictions, though algorithms are implemented in the open-source AIHWKit.