An analog-AI chip for energy-efficient speech recognition and transcription

Engineering and Technology

S. Ambrogio, P. Narayanan, et al.

Discover how a groundbreaking analog-AI chip with 35 million phase-change memory devices achieves an energy efficiency of up to 12.4 TOPS/W. The chip delivers software-equivalent accuracy for keyword spotting and approaches it for larger models, demonstrating significant potential for the future of speech recognition and transcription. Developed by S. Ambrogio, P. Narayanan, A. Okazaki, and colleagues at IBM Research.

Introduction
Deep neural networks have grown substantially in size, enabling state-of-the-art performance in tasks such as speech recognition and transcription, but at the cost of increased energy and data-movement demands on conventional von Neumann architectures. Large models like RNNTs and transformers reduce word error rate (WER) on datasets such as Librispeech and SwitchBoard, yet training and inference on GPUs/CPUs suffer from the memory wall and energy inefficiency due to frequent movement of data between processor and memory. Analog in-memory computing leverages non-volatile memory arrays to perform multiply-accumulate (MAC) operations directly in memory, potentially reducing time and energy, especially for fully connected layers common in RNNT/transformer models. Given endurance and programming constraints of NVM devices, analog-AI systems must be weight-stationary, with all weights preprogrammed prior to inference. This work aims to demonstrate an analog-AI chip that scales to many tiles with efficient activation transport, achieving software-level or near-software accuracy on industry-relevant speech tasks while delivering high energy efficiency.
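The in-memory MAC idea can be sketched numerically: with inputs applied as voltages and weights stored as device conductances, each column current is the sum of voltage×conductance products (Ohm's law plus Kirchhoff's current law), i.e. a matrix–vector product computed where the weights reside. A minimal NumPy illustration with arbitrary values:

```python
import numpy as np

# Inputs as row voltages; weights as a conductance matrix (arbitrary units).
V = np.array([0.2, 0.5, 0.1])
G = np.array([[1.0, 0.5],
              [0.3, 0.8],
              [0.9, 0.2]])

# Each column current I_j = sum_i V_i * G_ij: the MAC happens "in memory".
I = V @ G
assert np.allclose(I, [0.44, 0.52])
```

Because this dot product is realized physically in the array, no weight data moves between memory and processor during the MAC, which is the source of the time and energy savings discussed above.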
Literature Review
Prior work established the effectiveness of large DNNs (e.g., transformers and RNNTs) in reducing WER on standard corpora and highlighted the von Neumann bottleneck in AI workloads. Analog compute-in-memory with NVMs (e.g., PCM) has been proposed and demonstrated for accelerating MACs with improved energy efficiency, including prior chip-level demonstrations of PCM-based compute cores and fully hardware-implemented neural networks. A heterogeneous programmable accelerator architecture with a dense 2D mesh for analog-AI predicted 40–140× energy-efficiency gains over GPUs, contingent on efficient on-chip routing and demonstrating high-accuracy large DNNs—gaps that this work addresses experimentally. Noise-aware and hardware-aware training techniques have been explored to close the accuracy gap between analog hardware and software for various DNNs, and quantization studies have shown varying precision needs across layers and models.
Methodology
Hardware: A 14-nm inference chip integrates 34 analog tiles, each with a 512×2,048 phase-change memory (PCM) crossbar array and peripheral analog circuitry. Six power domains each contain an Input Landing Pad (ILP) and an Output Landing Pad (OLP) with local SRAM. ILPs convert 8-bit (UINT8) input vectors (length 512) into pulse-width-modulated (PWM) durations, which are routed over a dense 2D mesh (512 east–west and 512 north–south wires per tile) with ‘borderguard’ tri-state routing to destination tiles for analog MACs; OLPs digitize durations back to UINT8 when needed. Inter-tile communication uses durations directly to avoid per-tile ADC overhead, enabling synchronous integration across tiles; digitization is used when staging is required (e.g., transformer attention). Each tile includes a user-programmable local controller (LC) and instruction SRAM to define timing and routing sequences, enabling multicast and concatenation of sub-vectors. Communication accuracy was validated by multicasting 1 million random durations, showing <5 ns maximum error.

PCM devices: PCM devices are integrated in the back end of line (BEOL) above the 14-nm CMOS. A parallel, single-row programming scheme updates 512 weights per row simultaneously. Weights can be encoded with 4 PCMs per weight (differential/add–subtract onto a peripheral capacitor) for higher accuracy, or 2 PCMs per weight for higher density. Two adjacent tiles can share a bank of 512 capacitors to extend analog integration to 2,048 input rows. A 2-PCM-per-weight mode can time-multiplex two weights on the same capacitor to realize 1,024×512 layers per tile.

KWS task: A multi-class keyword-spotting (KWS) task on Google Speech Commands was implemented end-to-end on-chip using a fully connected (FC) network adapted to tile sizes (1,960 inputs, hidden layers increased from 128 to 512 in 4-PCM-per-weight mode), then pruned to 1,024 inputs to fit a two-tile, shared-capacitor mapping.
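The duration-based MAC can be modeled with an idealized tile, assuming each UINT8 code maps to a fixed duration quantum (the step size here is hypothetical): the charge integrated on each column capacitor is duration × conductance summed over rows, i.e. the digital dot product up to a known scale.

```python
import numpy as np

STEP_NS = 1.0  # assumed duration quantum per UINT8 code (hypothetical value)

def to_durations(x_u8):
    """Map UINT8 inputs (0..255) to PWM pulse durations in ns."""
    return x_u8.astype(np.float64) * STEP_NS

def tile_mac(x_u8, G):
    """Idealized column charge: duration * conductance, summed over rows."""
    t = to_durations(x_u8)
    return t @ G  # one value per column (charge, arbitrary units)

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=512, dtype=np.uint8)      # one input vector
G = rng.uniform(0.0, 1.0, size=(512, 2048))             # tile-sized array
q = tile_mac(x, G)

# The integrated charge equals the digital dot product up to the known scale.
assert np.allclose(q, STEP_NS * (x.astype(np.float64) @ G))
```

Passing these durations directly to the next tile, rather than digitizing them, is what lets the chip skip per-tile ADC conversions during inter-tile communication.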
Hardware-aware (HWA) training techniques (activation noise injection, weight clipping, L2 regularization, bias removal) were applied via IBM’s analog hardware acceleration kit. A MAC asymmetry-balance (AB) method was used to cancel peripheral asymmetries: encode W on PCM pair Wp1 and −W on Wp2; compute the MAC with x on Wp1, then with −x on Wp2, and sum the two results so that fixed asymmetries cancel (scale factor 2). Four tiles in total were used (two for the first layer via shared capacitors, and two for the subsequent layers). ReLU activations for hidden layers were implemented on-chip in analog; the output layer was linear. Each audio frame was processed in 2.4 µs (8×300 ns steps), 7× faster than MLPerf’s best-case latency.

RNNT task: The MLPerf Datacenter RNNT (trained on Librispeech) was implemented without further retraining. The model comprises an encoder (five LSTMs + one FC), a decoder (embedding FC + two LSTMs + FC), and a joint block (ReLU + a small 512×29 FC producing next-character probabilities with greedy decoding). Digital preprocessing converts audio to feature vectors; time-stacking is applied before the encoder and between Enc-LSTM1 and Enc-LSTM2. A software sensitivity analysis progressively quantized weights (layer-wise and network-wide) to determine precision thresholds versus WER, identifying the sensitive layers (joint-FC, Enc-LSTM0, Enc-LSTM1). The small but sensitive joint-FC was kept digital to preserve accuracy; all vector–vector operations and activations were computed off-chip (on the host) with OLP/ILP data transfers.

Mapping: 45,261,568 of 45,321,309 parameters (99.9%) were mapped to analog across 142 tiles on five chips (average ~2.9 PCMs per weight; >140 million PCMs in total). Sensitive encoder layers used AB, with careful handling of the first matrix in Enc-LSTM0 to improve MAC fidelity. Routing used fully parallel multicasts; results were sent as durations to OLPs with implicit concatenation as needed.
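The MAC asymmetry-balance (AB) method can be illustrated with a toy model in which a fixed peripheral error flips sign with the polarity of the applied inputs; this polarity coupling is an assumption made for illustration, as the real asymmetries arise from circuit-level details.

```python
import numpy as np

def analog_mac(x, G, asym):
    """Toy analog MAC: a fixed per-column error whose sign follows
    the polarity of the applied inputs (modeling assumption)."""
    polarity = 1.0 if np.all(x >= 0) else -1.0
    return x @ G + polarity * asym

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 512)          # non-negative PWM-style activations
W = rng.standard_normal((512, 512))     # target weights
asym = rng.standard_normal(512)         # fixed per-column asymmetry error

mac1 = analog_mac(x, W, asym)           # pass 1: x on Wp1 = +W  -> x@W + asym
mac2 = analog_mac(-x, -W, asym)         # pass 2: -x on Wp2 = -W -> x@W - asym
balanced = 0.5 * (mac1 + mac2)          # sum cancels asym; remove scale of 2

assert np.allclose(balanced, x @ W)
```

The cost of the cancellation is visible in the model: two weight copies (4 PCMs per weight) and two integration passes per MAC, matching the density and timing overheads noted for AB elsewhere in the text.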
Weight-expansion technique: To further reduce WER, a weight expansion was introduced for Enc-LSTM0: a fixed Gaussian matrix M and its Moore–Penrose pseudoinverse pinv(M) are inserted such that y = W1·x = (W1·pinv(M))·(M·x). Expanding the number of rows increases the signal-to-noise ratio because the signal scales linearly with row count while uncorrelated noise accumulates sub-linearly. This adds a modest digital preprocessing cost (M·x) but no increase in tile count.

Power and performance measurement: Power was measured per chip during inference with supplies at 1.5 V (row activation and column integration during analog compute), 0.8 V (control and communication: ILP, OLP, LC, 2D mesh), and 1.8 V (PLL and I/O). Sustained TOPS/W was reported per chip and correlated with the number of weights mapped (the denser 2-PCM-per-weight encoding yields higher efficiency; AB with 4 PCMs per weight lowers density). The impact of reduced integration time on WER and efficiency was evaluated. System-level estimates included projected on-chip digital compute at the OLP/ILP locations to assess end-to-end processing time and energy. Stability against PCM drift was assessed by repeating full RNNT inference after one week without recalibration.
Key Findings
- Analog-AI chip architecture and inter-tile communication:
  - 34 analog tiles per chip with 35 million PCM devices; the dense 2D mesh supports massively parallel, multicast, duration-based routing.
  - Communication accuracy: the cumulative distribution shows a maximum duration-transport error of ≤5 ns across the chip (≈3 ns for shorter durations).
  - Tile programmability via local controllers enables complex routing (e.g., concatenation) and flexible timing.
- KWS (keyword spotting):
  - A fully end-to-end analog implementation on 4 tiles achieved 86.14% accuracy, meeting the software-equivalent criterion (≥99% of the 86.75% SW baseline; MLPerf iso-accuracy threshold 85.88%).
  - Frame latency: 2.4 µs (8×300 ns steps), ~7× faster than best-case MLPerf reports.
  - The AB (asymmetry balance) method improved MAC accuracy by cancelling fixed peripheral asymmetries.
- RNNT (speech-to-text on Librispeech, MLPerf Datacenter):
  - Mapped >45 million weights to >140 million PCMs across 5 chips (142 tiles), with 99.9% of parameters on analog tiles; average ≈2.9 PCMs per weight.
  - Software layer-sensitivity analysis identified joint-FC and Enc-LSTM0 as the most sensitive layers; joint-FC was kept digital; Enc-LSTM0 and Enc-LSTM1 used AB and careful mapping.
  - Full-network inference across five chips achieved 9.475% WER with the original MLPerf weights; after ~1 week of PCM drift without recalibration, WER rose slightly to 9.894% (+0.4%).
  - Weight expansion for Enc-LSTM0 reduced WER further: an experimental full-network WER of 9.258%, 1.81% absolute above the SW baseline (7.452%) and 0.88% above the 99%-of-SW threshold (8.378%).
  - Single-layer on-chip experiments confirmed Enc-LSTM0 as the most critical layer; other layers were more noise-resilient.
- Energy efficiency and performance:
  - Best per-chip sustained efficiency: 12.4 TOPS/W (the chip with the densest mapping and most on-chip weights); analog-tile peak ~20.0 TOPS/W.
  - Reducing integration time (e.g., halving durations) improved TOPS/W by ~25% at a cost of ~+1% WER (measured on chip 4 with the other layers in FP32 software).
  - End-to-end estimates indicate sustained analog-only efficiency of ~7.09 TOPS/W; projected full-system efficiency (including digital auxiliary compute) of ~6.94–6.7 TOPS/W.
  - Operations ratio: on-chip analog to off-chip compute ≈325:1 (original weights) and ≈88:1 with weight expansion (owing to extra preprocessing), keeping system efficiency high.
  - Real-time performance: average processing time per sample ~500 µs; real-time factor ~0.8, satisfying the MLPerf real-time constraint (≤1).
- Robustness:
  - Minimal WER degradation after one week of PCM drift without recalibration.
- Overall: demonstrated near-SW accuracy for a large, industry-relevant RNNT across five chips and SW-equivalent accuracy for KWS, with high sustained energy efficiency (up to 12.4 TOPS/W per chip; ~6.7 TOPS/W estimated system).
Discussion
The work demonstrates that analog in-memory computing can scale beyond single-tile demonstrations to multi-tile, multi-chip systems while maintaining high accuracy on practical speech tasks. By combining dense, duration-based inter-tile routing, flexible local control, and analog-friendly weight encodings (2- vs 4-PCM per weight) with algorithm–hardware co-optimization (hardware-aware training, AB asymmetry cancellation, and weight expansion), the system mitigates analog non-idealities and layer-specific sensitivity. The KWS results validate fully on-chip, end-to-end inference with software-equivalent accuracy and superior latency, while the RNNT results show that nearly all weights can be mapped to analog memory and that early-layer error propagation can be controlled sufficiently to achieve near-SW WER. Power and throughput measurements confirm that the high OPS/W of analog tiles can translate into high sustained chip and system efficiencies when coupled with efficient routing and dataflow. The minimal drift-induced degradation over a week indicates practical stability. Collectively, these findings address the central challenges of activation transport between many tiles and accuracy on large models, supporting the feasibility of analog-AI for real-world speech applications.
Conclusion
A 14-nm analog-AI chip with 34 PCM-based tiles was engineered and experimentally validated for speech recognition and transcription. The system achieved fully end-to-end software-equivalent accuracy on a KWS task and near-software accuracy on the MLPerf RNNT, mapping >45 million weights across five chips (>140 million PCMs) and sustaining up to 12.4 TOPS/W per chip, with projected system efficiency around 6.7 TOPS/W. Key techniques—duration-based inter-tile routing, local programmable control, asymmetry balancing, and weight expansion—enabled accurate and efficient execution while minimizing digitization overheads. These results constitute, to the authors’ knowledge, the first demonstration of commercially relevant accuracy on a commercially relevant model using >140 analog-AI tiles with efficient activation movement. Future directions include integrating on-chip digital compute and SRAM for auxiliary operations and staging, further optimizing dataflow and activation handling, exploring adaptive calibration and drift compensation, and extending the approach to other large DNNs (e.g., transformers) and workloads.
Limitations
- The chip lacks integrated digital compute cores and SRAM for auxiliary operations; vector–vector computations, activations, and the joint-FC layer were executed off-chip, limiting true end-to-end integration and incurring communication overheads.
- Not all layers could be mapped to analog without accuracy loss (e.g., the small but sensitive joint-FC remained digital), evidencing layer-specific sensitivity to analog non-idealities.
- AB improves accuracy but reduces areal and energy density by using 4 PCMs per weight and doubling time steps; it increases complexity and power relative to the denser 2-PCM mapping.
- Despite improvements, RNNT WER remained above the SW baseline (by ~1.81% absolute with weight expansion), indicating a residual accuracy gap.
- The system requires weight-stationary operation owing to the finite endurance and slow programming of NVM; dynamic weight updates are not supported during inference.
- Energy efficiency degrades from tile peak to system level because of communication, incomplete tile utilization, and off-chip/digital overheads; the decoder chip exhibited lower TOPS/W owing to sparse mapping and routing activity.
- Long-term drift was assessed only over about a week without recalibration; longer-term stability and environmental variability were not reported.