Demonstration of 4-quadrant analog in-memory matrix multiplication in a single modulation

M. Le Gallo, O. Hrynkevych, et al.

Manuel Le Gallo and colleagues from IBM Research Europe demonstrate 4-quadrant matrix-vector multiplication in a single modulation cycle on an analog in-memory computing chip. The work addresses the accuracy challenges of phase-change memory and improves throughput and energy efficiency by about 4× over the conventional four-phase scheme.

Introduction

The study addresses how to execute accurate, large matrix–vector multiplications (MVMs) on analog in-memory computing (AIMC) hardware in a single read phase to minimize latency and maximize throughput and energy efficiency. Conventional AIMC often requires multiple read phases to handle signed inputs and weights, which increases latency and reduces efficiency. Prior approaches (bit-parallel, bit-serial, and two’s complement) either require multiple phases, limit the number of rows processed in parallel, or impose exacting analog requirements. A single-phase, fully parallel 4-quadrant MVM would substantially improve performance but is challenged by nonidealities, notably the voltage-polarity dependence of phase-change memory (PCM) device conductance, ADC mismatches, and polarity detection errors at low currents. The purpose of this work is to develop and implement an on-chip analog/digital calibration strategy that overcomes these issues, enabling single-modulation 4-quadrant MVMs on the IBM HERMES Project AIMC chip and demonstrating application-level accuracy comparable to software while achieving approximately 4× higher throughput and energy efficiency than the conventional four-phase scheme.
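The difference between a four-phase read and a single-phase 4-quadrant read can be illustrated numerically. The sketch below is only an arithmetic illustration, not a model of the chip's circuits: the function names (`split_signed`, `four_phase_mvm`, `single_phase_mvm`) and the small integer example are assumptions made here. It splits signed inputs and weights into non-negative parts, evaluates the four sign combinations separately, and checks that combining them gives the same result as one signed product.

```python
def split_signed(values):
    """Split signed values into non-negative positive and negative parts."""
    pos = [max(v, 0) for v in values]
    neg = [max(-v, 0) for v in values]
    return pos, neg


def matvec(weights, x):
    """Plain matrix-vector product; each row of `weights` yields one output."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def four_phase_mvm(weights, x):
    """Four sequential reads, one per sign combination, combined digitally."""
    x_pos, x_neg = split_signed(x)
    w_pos, w_neg = zip(*(split_signed(row) for row in weights))
    pos_pos = matvec(w_pos, x_pos)  # positive weights, positive inputs
    neg_neg = matvec(w_neg, x_neg)  # negative weights, negative inputs
    pos_neg = matvec(w_pos, x_neg)  # positive weights, negative inputs
    neg_pos = matvec(w_neg, x_pos)  # negative weights, positive inputs
    return [a + b - c - d
            for a, b, c, d in zip(pos_pos, neg_neg, pos_neg, neg_pos)]


def single_phase_mvm(weights, x):
    """One read: all signed products contribute to the output at once."""
    return matvec(weights, x)


# Hypothetical 2-output, 3-input example.
weights = [[2, -1, 0], [-3, 4, 1]]
x = [5, -6, 7]
result_four = four_phase_mvm(weights, x)
result_single = single_phase_mvm(weights, x)
```

Both functions return the same vector in exact arithmetic; on hardware, the single-phase version saves the three extra reads but exposes the nonidealities discussed in the rest of the summary.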

Literature Review

AIMC with resistive memory devices (PCM, RRAM, MRAM, FeFET) has shown promise for efficient MVMs in DNN inference, but state-of-the-art multi-core chips typically report >1 μs latency for 256×256 MVMs due to multi-phase reads and I/O bottlenecks. Bit-parallel schemes modulate pulse duration but need multiple phases for sign handling; bit-serial schemes process one bit per cycle but still require multiple phases beyond 4-bit inputs for charge integration, leading to latencies >10 μs for 8-bit 256×256 MVMs. Two’s complement allows single-polarity analog processing but demands exact bitwise analog operations and limits the number of rows accumulated in parallel, necessitating multiple phases. A circuit for single-modulation 4-quadrant MVM was proposed previously for the HERMES core, but published results used a four-phase read to avoid voltage polarity effects. The literature also documents PCM bipolar I–V asymmetry arising from Schottky barriers at amorphous–crystalline interfaces causing state-dependent polarity asymmetry, and ADC gain/linearity mismatches common in mixed-signal designs, both of which motivate robust calibration strategies.

Methodology

Hardware and MVM scheme: Experiments are conducted on the IBM HERMES Project Chip AIMC core with a 256×256 PCM crossbar. Each unit cell contains four PCM devices: two encode the positive weight conductance (G+) and two encode the negative weight conductance (G−). Inputs are 8-bit signed-magnitude; magnitudes are converted to 7-bit pulse-width modulated (PWM) durations (0–127 ns at 1 GHz). Outputs are digitized by 256 time-based current ADCs, one per column. Each ADC comprises a read-voltage regulator, a Schmitt trigger for current polarity detection (signal D), and a current-controlled oscillator (CCO) whose counts are accumulated in separate 12-bit positive (ADCP) and negative (ADCN) counters across the PWM timestamps. The local digital processing unit (LDPU) subtracts ADCN from ADCP; optional FP16 scaling/ReLU and INT8 conversion follow.

Single-phase 4-quadrant read: All selection transistors are enabled, the bit lines (BLs) are precharged to Vcm, and the source lines (SLP, SLN) are driven to V− = Vcm − Vread to add current or to V+ = Vcm + Vread to subtract current, depending on input sign. This reads all four sign combinations simultaneously without applying negative device terminal voltages.

Analog calibration (initial):
1) Offset cancellation via Iresdi: With all devices RESET (low conductance), transfer curves (ADC counts vs input current) are measured by progressively activating SLs. Half the SLs are driven with positive inputs and half with equal-magnitude negative inputs; the procedure is repeated with inverted polarities. The four measured offsets are summed with signs and Iresdi is iteratively tuned until the total offset current converges to zero, compensating BL parasitics and average conductance polarity asymmetry.
2) Gain and linearity equalization: The main current-mirror gains αP, αN are tuned to set static gain, followed by forward-delay compensation gains βP, βN in the CCO nonlinearity compensation. For each ADC polarity, parameters are adjusted to minimize the distance to a linear reference with 0.75 ADC count per active SL, independently for ADCP and ADCN, achieving near-zero offsets, matched gains, and linear transfer curves across 256 ADCs.

Weight programming: Weights are programmed using the Max SET Fill approach. For each weight, the two devices of the corresponding polarity are programmed while the opposite-polarity pair is RESET. Depending on weight magnitude, one device may be placed in RESET or SET, and the second device is iteratively programmed with program-read-verify until it is within 5 ADC counts of the target or 30 iterations are reached. Reads during programming apply positive inputs to positive weights and negative inputs to negative weights so that ADCP is used. For applications with positive-only inputs (e.g., ReLU), positive inputs are applied to both weight polarities during programming, with ADCP or ADCN selected according to weight polarity, so that the conductance polarity dependence seen at readout is compensated.

Post-programming recalibration: After programming weights (uniformly distributed in [-1, 1]), the average ADCN current exceeds ADCP due to higher intermediate-state conductance in negative polarity. Therefore Vread (per column) is recalibrated using the offset cancellation procedure to center ADCP−ADCN around zero.

Biasing for polarity detection robustness: To avoid Schmitt-trigger polarity errors around zero current, a bias current is added by dedicating the last SLs to bias weights and always applying bias inputs; the bias is subtracted digitally in the LDPU. The bias size trades off linear range on positive/negative outputs against correction noise; 16 of 256 SLs provided a good balance in experiments.

Digital affine correction: A per-ADC digital affine correction is performed once after programming, in the LDPU. A set of MVMs with inputs matching application statistics (random or training-set derived) is run; raw ADC results are fitted to floating-point exact outputs with first-degree polynomials to obtain per-ADC scale and offset factors, which also subtract the bias. These corrections are fixed during inference.

Applications:
1) ReLU CNN for MNIST: A 3-layer conv + dense network is trained with Gaussian noise (σ=0.1), L2 regularization (λ=1e−4), and dropout (0.5), then mapped to one core with replication to increase signal and 16 SLs reserved for bias; 2-quadrant operation is used.
2) Few-shot continual learning (FSCL) similarity search: An explicit memory (EM) of class key vectors (bipolarized to ±1 and programmed to G+/G−) is implemented on one or multiple cores; query embeddings (8-bit) are applied to perform in-memory similarity search; no bias is required due to strong class-separation signals. Multi-core mapping is demonstrated across six cores for Omniglot.
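The per-ADC digital affine correction described above amounts to a first-degree least-squares fit from raw ADC readings to exact outputs. The following is a minimal sketch under stated assumptions: the ADC is modeled by a hypothetical noiseless function `toy_adc` (linear gain error, fixed offset, and a constant bias contribution), and the gain, offset, and bias values are invented for illustration. It is not the chip's actual transfer curve or the authors' code.

```python
import random


def fit_affine(raw, exact):
    """Least-squares fit of exact ~= scale * raw + shift for one ADC."""
    n = len(raw)
    mean_raw = sum(raw) / n
    mean_exact = sum(exact) / n
    var_raw = sum((r - mean_raw) ** 2 for r in raw)
    cov = sum((r - mean_raw) * (e - mean_exact) for r, e in zip(raw, exact))
    scale = cov / var_raw
    shift = mean_exact - scale * mean_raw
    return scale, shift


def toy_adc(ideal, gain, offset, bias):
    """Hypothetical ADC model: gain/offset error plus a constant bias term."""
    return gain * (ideal + bias) + offset


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Calibration set: exact outputs drawn from an assumed application range.
calib_exact = [rng.uniform(0.0, 50.0) for _ in range(200)]

# Assumed (gain, offset) per ADC; the bias is the same for all of them.
adc_params = [(0.75, 3.0), (0.80, -1.5), (0.70, 0.5)]
bias = 16.0

# One (scale, shift) pair per ADC, fitted once and then kept fixed.
corrections = []
for gain, offset in adc_params:
    calib_raw = [toy_adc(y, gain, offset, bias) for y in calib_exact]
    corrections.append(fit_affine(calib_raw, calib_exact))

# Inference: apply the fixed correction to a new raw reading on each ADC.
true_output = 20.0
corrected = [
    scale * toy_adc(true_output, gain, offset, bias) + shift
    for (gain, offset), (scale, shift) in zip(adc_params, corrections)
]
```

Because the fitted shift absorbs both the offset and the bias term, no separate bias subtraction is needed in this toy model. With real hardware the readings are noisy and only approximately linear, so the fit depends on how well the calibration inputs represent the application statistics.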

Key Findings

- Single-phase 4-quadrant MVM realized on the IBM HERMES Project Chip with on-chip analog/digital calibration, enabling fully parallel reading in one modulation (TPWM) without negative device voltages.
- Post-calibration ADC characteristics showed near-zero offset, matched gains, and linear transfer curves for both ADCP and ADCN across 256 columns.
- Necessity and effectiveness of post-programming Vread recalibration were demonstrated: after weight programming, ADCP−ADCN exhibited a large offset due to PCM polarity dependence, which recalibration centered near zero.
- Biasing with 16 SLs effectively moved the non-linear region away from zero and improved linearity for positive outputs.
- MVM accuracy:
  • Four-phase reference 4-quadrant MVM normalized error: 11.9%.
  • Single-phase 4-quadrant MVM positive-output normalized error: 13.6%.
  • Single-phase 2-quadrant (positive-only inputs) MVM positive-output error: 11.8% (approximately equal to four-phase).
  • After recalibration, conductance polarity dependence increased MVM error by 1.8 percentage points compared to positive-only input MVM (13.6% vs. 11.8%).
- Application-level results:
  • MNIST CNN test accuracy: FP32 99.28%; FP32 with quantized inputs and weight programming error 99.16%; four-phase on-chip 99.00%; single-phase on-chip 98.74% (0.54 percentage points below FP32).
  • FSCL similarity search on CIFAR-100 and miniImageNet with a single core: on-chip single-phase accuracies within 1% of FP32 software.
  • FSCL on Omniglot across six cores: on-chip accuracies within 1% of FP32 software, showing multi-core calibration viability.
- Performance: 256×256 MVM latency of 133 ns in single-phase mode, achieving 1.55 TOPS/mm² throughput-per-area and 9.76 TOPS/W energy efficiency, approximately 4× higher than the four-phase scheme.
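The figures above can be cross-checked with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a reproduction of the paper's methodology: counting each multiply-accumulate as 2 operations is an assumption made here, and the four-phase latency is taken as 4× the single-phase latency purely because the summary states an approximately 4× improvement.

```python
ROWS, COLS = 256, 256          # crossbar size from the summary
SINGLE_PHASE_LATENCY_S = 133e-9  # reported single-phase MVM latency
OPS_PER_MAC = 2                # assumption: 1 multiply + 1 add per weight

ops_per_mvm = OPS_PER_MAC * ROWS * COLS

# Raw throughput of one core in tera-operations per second.
single_phase_tops = ops_per_mvm / SINGLE_PHASE_LATENCY_S / 1e12

# Assumption: the four-phase scheme takes about 4x longer per MVM.
four_phase_latency_s = 4 * SINGLE_PHASE_LATENCY_S
four_phase_tops = ops_per_mvm / four_phase_latency_s / 1e12

# Differences between figures quoted in the summary.
mnist_gap_points = round(99.28 - 98.74, 2)      # FP32 vs single-phase on-chip
mvm_error_increase_points = round(13.6 - 11.8, 1)  # 4-quadrant vs 2-quadrant

print(f"ops per MVM: {ops_per_mvm}")
print(f"single-phase throughput: {single_phase_tops:.2f} TOPS")
print(f"four-phase throughput:   {four_phase_tops:.2f} TOPS")
print(f"MNIST gap: {mnist_gap_points} points")
print(f"MVM error increase: {mvm_error_increase_points} points")
```

Under these assumptions a single 256×256 core delivers a little under 1 TOPS in single-phase mode. The reported per-area and per-watt figures additionally depend on core area and power, which this summary does not state.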

Discussion

The work directly tackles the long-standing latency bottleneck in AIMC by enabling a fully parallel, single-modulation 4-quadrant MVM that handles signed inputs and weights without multiple read phases. By identifying and mitigating the dominant error sources—PCM conductance polarity asymmetry, ADC gain/linearity mismatches, and polarity detection errors near zero—the proposed calibration pipeline achieves MVM accuracy sufficient for practical inference tasks. Positive-output linearity, aided by biasing and digital affine correction, yields near software-equivalent results for applications that only require non-negative outputs (e.g., ReLU-based networks and similarity search). The observed hardware-software parity (<1% accuracy gap) in FSCL tasks and high MNIST accuracy confirm that single-phase operation can sustain application-level quality. The approach improves latency, throughput, and energy efficiency by about 4× relative to four-phase operation, underscoring the significance for AIMC accelerators. Device-level insights reveal that PCM’s state-dependent polarity asymmetry necessitates post-programming recalibration; technologies with more symmetric I–V (e.g., some RRAM) may reduce this burden. Nonetheless, the analog/digital calibration steps (ADC equalization, biasing, digital affine correction) are broadly applicable to designs that simultaneously apply positive and negative voltages to memory devices.

Conclusion

This paper demonstrates, for the first time experimentally, a calibrated single-modulation 4-quadrant MVM on a multi-core PCM-based AIMC chip, delivering a 256×256 MVM in 133 ns with 1.55 TOPS/mm² and 9.76 TOPS/W—about 4× improvements over the four-phase baseline—while maintaining near software-equivalent accuracy on representative tasks (MNIST CNN; FSCL similarity search on CIFAR-100, miniImageNet, and Omniglot across up to six cores). Key contributions include a comprehensive on-chip calibration flow (offset cancellation, ADC gain/linearity equalization, post-programming Vread recalibration, biasing for robust polarity detection, and per-ADC digital affine correction) that mitigates PCM polarity asymmetry and mixed-signal nonidealities. Future research should focus on device/material engineering to reduce PCM polarity dependence and variability, exploring memory technologies with symmetric I–V, and developing alternate single-phase read schemes that operate with a single voltage polarity to avoid repeated recalibration after reprogramming.

Limitations

- Single-phase accuracy, especially for negative outputs, remains lower due to Schmitt-trigger polarity detection challenges near zero current and PCM polarity dependence; biasing shifts but does not eliminate nonlinearity.
- Post-programming recalibration is required because PCM intermediate states exhibit polarity-dependent conductance; this adds operational complexity for frequently updated weights.
- Device-to-device variability in polarity dependence introduces residual errors despite calibration.
- The method leverages positive-output robustness; applications requiring precise negative outputs may see reduced accuracy.
- Calibration and digital correction steps increase system complexity and require representative input statistics for optimal affine correction.
