Fast and robust analog in-memory deep neural network training

Computer Science

M. J. Rasch, F. Carta, et al.

Discover how analog in-memory computing (AIMC) is revolutionizing deep learning training acceleration! Authors Malte J. Rasch, Fabio Carta, Omobayode Fagbohungbe, and Tayfun Gokmen explore improved algorithms that tackle the challenges of conductance noise and more, paving the way for faster and more efficient DNN training.

Introduction
Analog in-memory computing (AIMC) uses resistive crossbar arrays to perform matrix–vector multiplications in-memory with high energy efficiency, and most prior work has targeted inference acceleration. Training is substantially more compute-intensive than inference and would benefit even more from AIMC acceleration, but practical challenges arise due to analog device nonidealities: asymmetric and noisy conductance updates, device-to-device variations, limited precision, gradual saturation, costly global resets, and limited ability to program identical reference conductances. Prior efforts either (a) offload gradient computation/accumulation to digital memory (mixed-precision approaches), which incurs O(N^2) digital cost and undermines the O(1) MVM speedups, or (b) rely on fully in-memory outer-product updates that require highly symmetric and precise bi-directional device switching, which is unrealistic for common device materials. The Tiki-Taka v2 (TTv2) algorithm addresses part of this by accumulating gradients on a separate analog array and using a differential read against a pre-programmed reference array at the symmetry point (SP), plus digital low-pass filtering. However, TTv2 depends critically on precisely programmed and retained reference conductances and device asymmetry, and small reference errors can severely degrade training accuracy. This work proposes two algorithms—Chopped-TTv2 (c-TTv2) and Analog Gradient Accumulation with Dynamic reference (AGAD)—that preserve the fast, parallel O(1) in-memory update while removing the need for precise static references, improving robustness and broadening viable device choices.
Literature Review
- AIMC inference acceleration has been demonstrated on various NVM technologies (e.g., ReRAM, PCM, ECRAM) and mixed-signal chips, leveraging Ohm's and Kirchhoff's laws for efficient MVMs.
- Mixed-precision (MP) training approaches compute and accumulate gradients digitally and program analog weights when thresholds are reached, but require O(N^2) digital operations and significant memory bandwidth, limiting overall training speedups.
- Fully in-memory outer-product updates using pulse-train coincidences enable O(1) parallel updates but historically required near-ideal symmetric devices, making them impractical with typical asymmetries and noise.
- Tiki-Taka v2 (TTv2) introduced a separate gradient-accumulation array (A) and a reference array (R) programmed at the symmetry point (SP) of A, using differential reads and digital low-pass filtering to mitigate device nonidealities; it reduces precision and symmetry demands but critically relies on accurate, stable per-device references and on device asymmetry.
- Subsequent work highlighted difficulties in precisely programming and retaining R (often 5–10% conductance programming errors, plus drift), motivating algorithms that relax or eliminate the requirements on R and expand material choices.
Methodology
Core architecture and operations:
- Two analog arrays are used during training: A for in-memory gradient accumulation via parallel pulsed outer-product updates, and W for storing the actual weights used in forward/backward MVMs. TTv2 and c-TTv2 also use a reference array R for differential reads; AGAD eliminates R by using a digital reference.
- For each input sample: (1) perform a parallel in-memory outer-product accumulation onto A via stochastic pulse coincidences; (2) every n_s updates, read a single row (or column) of A into digital; (3) update a digital hidden accumulator H; (4) when an element of H crosses a threshold, write single pulses to the corresponding row of W and reset that element of H. This keeps the update complexity for the analog arrays at O(1) per input.

Algorithms:
- TTv2 baseline: requires a pre-programmed reference array R holding the per-device SP of A; reads use a differential MVM (A − R) to remove device-to-device offsets; H accumulates leaky-filtered gradients; thresholded H triggers single-pulse writes to W.
- Chopped-TTv2 (c-TTv2): introduces a chopper vector c ∈ {−1, 1} that modulates activation signs before accumulation (c ⊙ x) and demodulates them upon readback, canceling low-frequency offsets, residuals from an imprecise R, and slow transients in A. Chopper signs flip with probability p at read cycles. Hardware remains similar to TTv2; the reference R is still used, but its precision requirements are relaxed.
- AGAD: uses choppers as above but replaces the static R with a digital on-the-fly reference P_ref computed from a leaky average P_k of recent A readouts, P_k = (1 − β) P_{k−1} + β w_k, where w_k = A v_k is the read row. P_ref is updated when the chopper for that row flips. Reads subtract P_ref digitally, so no differential analog read is needed. This removes the need for a programmable per-device R and its associated circuitry and works with both asymmetric and symmetric devices. (A minimal code sketch of this update path is given at the end of this section.)

Device model and nonidealities:
- Soft-bounds incremental update model with asymmetry, device-to-device variations in bounds and slopes, and cycle-to-cycle noise; the number of effective states n_states ≈ (w_max − w_min)/δ governs precision and noise.
- A symmetry point (SP) exists due to the soft-bounds behavior; TTv2 and c-TTv2 use the SP via R to induce decay toward algorithmic zero, whereas AGAD uses the recent transient history rather than the SP.

Simulations and benchmarks:
- Implemented in AIHWKit (PyTorch-based). Experiments include toy constant-gradient cases, programming a 20×20 linear layer to target weights, and DNN training: a 3-layer fully connected network on MNIST, LeNet on MNIST, and a 2-layer LSTM on War & Peace; an additional larger Vision Transformer on CIFAR-10 checks whether the trends scale.
- Parameters varied: reference offset variation σ_r, number of device states n_states (mapping to materials: low for ReRAM-like, high for ECRAM-like devices), and device asymmetry. Metrics: weight programming error (standard deviation from target), test error/loss, robustness to offsets, and performance/runtime estimates.

Learning rate and scheduling:
- A scaling rule is derived for the hidden-to-weight transfer learning rate to normalize per-layer magnitudes and device constraints; dynamic normalization uses running maxima of |x| and |d|; practical simplifications are validated empirically under high-noise, high-asymmetry settings.

Performance estimation:
- Complexity and time estimates per input sample for the update pass under the following assumptions: N = 512, FP8 digital throughput of ~0.7 TFLOPS per core shared among 4 arrays, a single MVM of ~40 ns, a single pulse of ~5 ns, and n_s = 2 (plus variants). Results are compared to a mixed-precision baseline with digital gradient accumulation.
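To make the update path concrete, below is a minimal NumPy sketch of the chopped accumulation, periodic readout, digital hidden accumulator, and AGAD-style dynamic reference described above. It is an illustrative simplification under assumed values (array size, noise levels, n_s, β, flip probability, threshold), not the AIHWKit implementation: the stochastic pulse-coincidence update is replaced by a state-dependent outer product, and learning-rate scaling is omitted.

```python
# Minimal sketch of the c-TTv2/AGAD-style update path (illustrative only; the
# sizes, noise levels, and hyperparameters below are assumptions, not paper values).
import numpy as np

rng = np.random.default_rng(0)

n_out, n_in = 8, 8          # crossbar dimensions (error rows x activation columns)
n_s = 2                     # read one column of A into digital every n_s samples
beta = 0.1                  # leaky-average coefficient for the dynamic reference
p_flip = 0.5                # chopper flip probability at each read cycle
threshold = 1.0             # |H| level that triggers a single pulse onto W
dw = 0.05                   # nominal device increment (n_states ~ range / dw)
w_min, w_max = -1.0, 1.0    # soft bounds of the device conductance range

A = np.zeros((n_out, n_in))            # analog gradient-accumulation array
W = rng.normal(0, 0.1, (n_out, n_in))  # analog weights used in forward/backward MVMs
H = np.zeros((n_out, n_in))            # digital hidden accumulator
P = np.zeros((n_out, n_in))            # leaky average of recent A readouts (AGAD)
P_ref = np.zeros((n_out, n_in))        # dynamic digital reference (AGAD)
chop = np.ones(n_in)                   # chopper signs c in {-1, +1}, one per column

def pulsed_update(A, x, d):
    """Simplified soft-bounds outer-product update: the effective step shrinks as
    a device approaches its bound (asymmetric saturation), plus write noise."""
    outer = np.outer(d, x)
    step = np.where(outer > 0, w_max - A, A - w_min)
    noise = rng.normal(0, 0.1 * dw, A.shape)
    return np.clip(A + dw * outer * step + noise, w_min, w_max)

for sample in range(1000):
    x = rng.normal(size=n_in)   # layer activations for one input sample
    d = rng.normal(size=n_out)  # back-propagated error for the same sample

    # (1) chopped in-memory accumulation of the outer product onto A
    A = pulsed_update(A, chop * x, d)

    # (2) every n_s samples, read one column of A into digital (columns are used
    #     here for concreteness; the description allows rows or columns)
    if sample % n_s == 0:
        k = (sample // n_s) % n_in
        col = A[:, k] + rng.normal(0, 0.02, n_out)   # noisy analog readout

        # (3) subtract the on-the-fly digital reference, demodulate the chopper,
        #     and integrate into the digital hidden accumulator H
        H[:, k] += chop[k] * (col - P_ref[:, k])
        P[:, k] = (1 - beta) * P[:, k] + beta * col  # P_k = (1-beta) P_{k-1} + beta w_k

        # (4) threshold crossings send single up/down pulses to column k of W
        #     (sign conventions and learning-rate scaling are omitted for brevity)
        pulses = np.trunc(H[:, k] / threshold)
        W[:, k] = np.clip(W[:, k] + dw * pulses, w_min, w_max)
        H[:, k] -= pulses * threshold

        # (5) occasionally flip this column's chopper; the dynamic reference is
        #     refreshed from the leaky average exactly when the chopper flips
        if rng.random() < p_flip:
            P_ref[:, k] = P[:, k]
            chop[k] = -chop[k]
```

For the c-TTv2 variant, the digital P and P_ref in this sketch would be replaced by a differential analog read against the pre-programmed reference array R, with the chopper modulation and demodulation unchanged.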
Key Findings
- Robustness to reference offsets:
  - TTv2 performs well only when reference offsets are effectively zero; even σ_r ≈ 0.1 (≈5% of the full conductance range) significantly degrades accuracy (e.g., for 20 states, the weight error increases to ~9%).
  - c-TTv2 cancels large offsets via chopping and maintains accuracy up to offsets of ≈25% of the range (α ≈ 0.5). Some oscillatory transients slow learning but do not cause failure.
  - AGAD is effectively invariant to reference offsets because references are computed on the fly; weight errors remain nearly constant across σ_r for each n_states (e.g., ~4.5% with 20 states, ~1.2% with 100 states).
- DNN training across topologies (3-layer FC on MNIST, LeNet on MNIST, 2-layer LSTM on War & Peace):
  - With perfect SP correction on W, all algorithms approach floating-point accuracy; without a perfect reference, TTv2 degrades markedly, c-TTv2 tolerates moderate offsets, and AGAD remains stable and is typically best, especially for larger n_states.
  - Correcting the SP of W can further improve test error for c-TTv2 and AGAD, sometimes surpassing the floating-point baselines.
  - A larger Vision Transformer on CIFAR-10 confirms the same trends: stability against offsets for c-TTv2 and AGAD, degradation for TTv2.
- Device asymmetry dependence:
  - TTv2 and c-TTv2 require an asymmetric device response (soft bounds) for proper decay toward the SP; performance degrades as devices become more symmetric.
  - AGAD accuracy is insensitive to device symmetry, enabling the use of symmetric devices (e.g., capacitors, ECRAM) as well as asymmetric ones (ReRAM).
- Endurance and retention requirements:
  - The gradient-accumulation array A experiences the most pulses (≈0.5–4 pulses per input sample); W devices see orders of magnitude fewer pulses (by a factor of roughly 2×10^2–4×10^4 over training), so A needs the highest endurance.
  - Retention of R is critical for TTv2 (tolerable drift ≲5% over training) and relaxed for c-TTv2 (up to ≈25%); AGAD eliminates R and requires only short retention for A (on the order of one transfer period, typically 10^2–10^3 samples), while W must retain its state over a full training run.
- Performance/runtime:
  - Estimated update time per input (N = 512, n_s = 2): TTv2 ≈56.3 ns, c-TTv2 ≈62.1 ns, AGAD ≈30.9 ns, versus ≈3024.5 ns for mixed-precision digital gradient accumulation, i.e., about a 50× speedup for the proposed in-memory algorithms (see the back-of-the-envelope check after this list).
  - With n_s = 10 and l_max = 1 (supported for certain device settings), AGAD can reach ≈17.1 ns per update, ≈175× faster than the mixed-precision baseline.
- Hardware simplification:
  - AGAD removes the need for a separate programmable reference array R and differential read circuitry, simplifying the unit-cell design and reducing chip area.
- Overall, both c-TTv2 and AGAD achieve state-of-the-art accuracy for AIMC training while substantially improving robustness to reference errors; AGAD further broadens feasible device materials and simplifies hardware.
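As a quick sanity check on the quoted speedup factors, the ratios follow directly from the per-input update-time estimates listed above. The snippet below is a back-of-the-envelope calculation using only those numbers, not part of the paper's analysis.

```python
# Back-of-the-envelope check of the speedup factors quoted above, using only
# the per-input update-time estimates reported for N = 512.
mp_ns = 3024.5  # mixed-precision digital gradient accumulation, ns per input
in_memory_ns = {
    "TTv2 (n_s=2)": 56.3,
    "c-TTv2 (n_s=2)": 62.1,
    "AGAD (n_s=2)": 30.9,
    "AGAD (n_s=10, l_max=1)": 17.1,
}
for name, t_ns in in_memory_ns.items():
    print(f"{name:>24}: {t_ns:6.1f} ns -> {mp_ns / t_ns:6.1f}x vs mixed precision")
# TTv2 and c-TTv2 land at roughly 50x, and the n_s=10 AGAD variant at roughly
# 175x, matching the factors stated in the findings above.
```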
Discussion
The study addresses the central challenge of robust, efficient AIMC-based training under realistic device nonidealities. By introducing chopping (c-TTv2) and dynamic digital reference estimation (AGAD), the algorithms preserve O(1) fully parallel outer-product updates and maintain high training accuracy without relying on precisely programmed per-device references. This resolves a key bottleneck of TTv2, namely its sensitivity to reference programming and retention, and allows the training update path to match the constant-time nature of the forward and backward MVMs. The findings demonstrate that robust training is achievable even with large reference offsets and across a range of device characteristics. AGAD's invariance to device symmetry significantly widens the space of viable materials (including symmetric, high-endurance devices), while eliminating the differential read hardware simplifies the crossbar unit cell and reduces area and design complexity. Estimated runtime and bandwidth analyses show orders-of-magnitude improvements over digital mixed-precision gradient accumulation, suggesting that fully in-memory training can deliver both speed and energy benefits. These trends hold consistently from the small networks up to the larger Vision Transformer benchmark, indicating that they scale. The work thus provides a practical path to fast, robust analog in-memory training with relaxed device and hardware constraints.
Conclusion
This work proposes two algorithms—Chopped-TTv2 (c-TTv2) and Analog Gradient Accumulation with Dynamic reference (AGAD)—for fast, robust analog in-memory training. Both retain the constant-time in-memory outer-product update while mitigating sensitivity to device offsets and asymmetry. c-TTv2 uses chopping to suppress offsets and low-frequency noise, relaxing reference precision requirements. AGAD replaces static, per-device analog references with on-the-fly digital references from recent transient dynamics, eliminating the need for a reference array and differential reads, broadening device choices to include symmetric devices, and simplifying hardware. Simulations on linear layers and diverse DNNs show state-of-the-art accuracy under realistic device nonidealities, with AGAD exhibiting strong invariance to reference offsets and symmetry. Runtime estimates indicate two orders-of-magnitude speedups over digital mixed-precision alternatives. Future research includes full-system energy/time evaluations on specific mixed-signal architectures, co-design of DNNs tailored to AIMC training characteristics, exploration of materials with extreme endurance/retention profiles for A vs W arrays, and efficient weight extraction (e.g., stochastic weight averaging) for deployment on diverse inference hardware.
Limitations
- Most results are from detailed simulations using a device model (soft bounds with noise and variations) and AIHWKit; real hardware may introduce additional nonidealities.
- DNN benchmarks cover small to medium-scale networks; larger models were explored only briefly due to simulation cost, and comprehensive hyperparameter sweeps were limited.
- Performance figures are analytical estimates under assumed digital throughput, MVM, and pulse timings; actual chip-level throughput depends on the specific architecture, parallelism, ADCs, and memory hierarchy.
- Endurance and retention analyses are qualitative, estimate-based, and dependent on material properties; achieving very high endurance for A remains challenging for some NVMs.
- TTv2 and c-TTv2 require some device asymmetry; AGAD relaxes this requirement but adds O(N^2) digital storage for P and P_ref unless the β = 1 or B = 1 variants are used, which may trade accuracy for memory savings.
- The code used for the full simulations is not publicly released due to export restrictions, though the algorithms are implemented in the open-source AIHWKit.