An analog-AI chip for energy-efficient speech recognition and transcription

Engineering and Technology

S. Ambrogio, P. Narayanan, et al.

Discover how a groundbreaking analog-AI chip with 35 million phase-change memory devices achieves an energy efficiency of up to 12.4 TOPS/W. The chip delivers software-equivalent accuracy for keyword spotting and approaches it for larger models, demonstrating significant potential for the future of speech recognition and transcription. Developed by S. Ambrogio, P. Narayanan, A. Okazaki, and colleagues at IBM Research.

Introduction
Deep neural networks have grown substantially in size, enabling state-of-the-art performance in tasks such as speech recognition and transcription, but at the cost of increased energy and data-movement demands on conventional von Neumann architectures. Large models like RNNTs and transformers reduce word error rate (WER) on datasets such as Librispeech and SwitchBoard, yet training and inference on GPUs/CPUs suffer from the memory wall and energy inefficiency due to frequent movement of data between processor and memory. Analog in-memory computing leverages non-volatile memory arrays to perform multiply-accumulate (MAC) operations directly in memory, potentially reducing time and energy, especially for fully connected layers common in RNNT/transformer models. Given endurance and programming constraints of NVM devices, analog-AI systems must be weight-stationary, with all weights preprogrammed prior to inference. This work aims to demonstrate an analog-AI chip that scales to many tiles with efficient activation transport, achieving software-level or near-software accuracy on industry-relevant speech tasks while delivering high energy efficiency.
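The in-memory MAC idea can be sketched numerically: with inputs applied as voltages and weights stored as device conductances, each column current is the sum of voltage×conductance products (Ohm's law plus Kirchhoff's current law), i.e. a matrix–vector product computed where the weights reside. A minimal NumPy illustration with arbitrary values:

```python
import numpy as np

# Inputs as row voltages; weights as a conductance matrix (arbitrary units).
V = np.array([0.2, 0.5, 0.1])
G = np.array([[1.0, 0.5],
              [0.3, 0.8],
              [0.9, 0.2]])

# Each column current I_j = sum_i V_i * G_ij: the MAC happens "in memory".
I = V @ G
assert np.allclose(I, [0.44, 0.52])
```

Because this dot product is realized physically in the array, no weight data moves between memory and processor during the MAC, which is the source of the time and energy savings discussed above.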
Literature Review
Prior work established the effectiveness of large DNNs (e.g., transformers and RNNTs) in reducing WER on standard corpora and highlighted the von Neumann bottleneck in AI workloads. Analog compute-in-memory with NVMs (e.g., PCM) has been proposed and demonstrated for accelerating MACs with improved energy efficiency, including prior chip-level demonstrations of PCM-based compute cores and fully hardware-implemented neural networks. A heterogeneous programmable accelerator architecture with a dense 2D mesh for analog-AI predicted 40–140× energy-efficiency gains over GPUs, contingent on efficient on-chip routing and demonstrating high-accuracy large DNNs—gaps that this work addresses experimentally. Noise-aware and hardware-aware training techniques have been explored to close the accuracy gap between analog hardware and software for various DNNs, and quantization studies have shown varying precision needs across layers and models.
Methodology
Hardware: A 14-nm inference chip integrates 34 analog tiles, each with a 512×2,048 phase-change memory (PCM) crossbar array and peripheral analog circuitry. Six power domains each contain an Input Landing Pad (ILP) and an Output Landing Pad (OLP) with local SRAM. ILPs convert 8-bit (UINT8) input vectors (length 512) into pulse-width-modulated (PWM) durations, which are routed over a dense 2D mesh (512 east–west and 512 north–south wires per tile) with ‘borderguard’ tri-state routing to destination tiles for analog MACs; OLPs digitize durations back to UINT8 when needed. Inter-tile communication uses durations directly to avoid per-tile ADC overhead, enabling synchronous integration across tiles; digitization is used when staging is required (e.g., transformer attention). Each tile includes a user-programmable local controller (LC) and instruction SRAM to define timing and routing sequences, enabling multicast and concatenation of sub-vectors. Communication accuracy was validated by multicasting 1 million random durations, showing <5 ns maximum error.

PCM devices: PCM devices are integrated in the back end of line (BEOL) above the 14-nm CMOS. A parallel, single-row programming scheme updates 512 weights per row simultaneously. Weights can be encoded with 4 PCMs per weight (differential/add–subtract onto a peripheral capacitor) for higher accuracy, or 2 PCMs per weight for higher density. Two adjacent tiles can share a bank of 512 capacitors to extend analog integration to 2,048 input rows. A 2-PCM-per-weight mode can time-multiplex two weights on the same capacitor to realize 1,024×512 layers per tile.

KWS task: A multi-class keyword-spotting (KWS) task on Google Speech Commands was implemented end-to-end on-chip using a fully connected (FC) network adapted to tile sizes (1,960 inputs, hidden layers increased from 128 to 512 in 4-PCM-per-weight mode), then pruned to 1,024 inputs to fit a two-tile, shared-capacitor mapping.
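The duration-based MAC can be modeled with an idealized tile, assuming each UINT8 code maps to a fixed duration quantum (the step size here is hypothetical): the charge integrated on each column capacitor is duration × conductance summed over rows, i.e. the digital dot product up to a known scale.

```python
import numpy as np

STEP_NS = 1.0  # assumed duration quantum per UINT8 code (hypothetical value)

def to_durations(x_u8):
    """Map UINT8 inputs (0..255) to PWM pulse durations in ns."""
    return x_u8.astype(np.float64) * STEP_NS

def tile_mac(x_u8, G):
    """Idealized column charge: duration * conductance, summed over rows."""
    t = to_durations(x_u8)
    return t @ G  # one value per column (charge, arbitrary units)

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=512, dtype=np.uint8)      # one input vector
G = rng.uniform(0.0, 1.0, size=(512, 2048))             # tile-sized array
q = tile_mac(x, G)

# The integrated charge equals the digital dot product up to the known scale.
assert np.allclose(q, STEP_NS * (x.astype(np.float64) @ G))
```

Passing these durations directly to the next tile, rather than digitizing them, is what lets the chip skip per-tile ADC conversions during inter-tile communication.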
Hardware-aware (HWA) training techniques (activation noise injection, weight clipping, L2 regularization, bias removal) were applied via IBM’s analog hardware acceleration kit. A MAC asymmetry-balance (AB) method was used to cancel peripheral asymmetries: encode W on PCM pair Wp1 and −W on Wp2; compute the MAC with x on Wp1, then with −x on Wp2, and sum the two results so that fixed asymmetries cancel (scale factor 2). Four tiles in total were used (two for the first layer via shared capacitors, and two for the subsequent layers). ReLU activations for hidden layers were implemented on-chip in analog; the output layer was linear. Each audio frame was processed in 2.4 µs (8×300 ns steps), 7× faster than MLPerf’s best-case latency.

RNNT task: The MLPerf Datacenter RNNT (trained on Librispeech) was implemented without further retraining. The model comprises an encoder (five LSTMs + one FC), a decoder (embedding FC + two LSTMs + FC), and a joint block (ReLU + a small 512×29 FC producing next-character probabilities with greedy decoding). Digital preprocessing converts audio to feature vectors; time-stacking is applied before the encoder and between Enc-LSTM1 and Enc-LSTM2. A software sensitivity analysis progressively quantized weights (layer-wise and network-wide) to determine precision thresholds versus WER, identifying the sensitive layers (joint-FC, Enc-LSTM0, Enc-LSTM1). The small but sensitive joint-FC was kept digital to preserve accuracy; all vector–vector operations and activations were computed off-chip (on the host) with OLP/ILP data transfers.

Mapping: 45,261,568 of 45,321,309 parameters (99.9%) were mapped to analog across 142 tiles on five chips (average ~2.9 PCMs per weight; >140 million PCMs in total). Sensitive encoder layers used AB, with careful handling of the first matrix in Enc-LSTM0 to improve MAC fidelity. Routing used fully parallel multicasts; results were sent as durations to OLPs with implicit concatenation as needed.
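The MAC asymmetry-balance (AB) method can be illustrated with a toy model in which a fixed peripheral error flips sign with the polarity of the applied inputs; this polarity coupling is an assumption made for illustration, as the real asymmetries arise from circuit-level details.

```python
import numpy as np

def analog_mac(x, G, asym):
    """Toy analog MAC: a fixed per-column error whose sign follows
    the polarity of the applied inputs (modeling assumption)."""
    polarity = 1.0 if np.all(x >= 0) else -1.0
    return x @ G + polarity * asym

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 512)          # non-negative PWM-style activations
W = rng.standard_normal((512, 512))     # target weights
asym = rng.standard_normal(512)         # fixed per-column asymmetry error

mac1 = analog_mac(x, W, asym)           # pass 1: x on Wp1 = +W  -> x@W + asym
mac2 = analog_mac(-x, -W, asym)         # pass 2: -x on Wp2 = -W -> x@W - asym
balanced = 0.5 * (mac1 + mac2)          # sum cancels asym; remove scale of 2

assert np.allclose(balanced, x @ W)
```

The cost of the cancellation is visible in the model: two weight copies (4 PCMs per weight) and two integration passes per MAC, matching the density and timing overheads noted for AB elsewhere in the text.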
Weight-expansion technique: To further reduce WER, a weight expansion was introduced for Enc-LSTM0: a fixed Gaussian matrix M and its Moore–Penrose pseudoinverse pinv(M) are inserted such that y = W1·x = (W1·pinv(M))·(M·x). Expanding the number of rows increases the signal-to-noise ratio because the signal scales linearly with row count while uncorrelated noise accumulates sub-linearly. This adds a modest digital preprocessing cost (M·x) but no increase in tile count.

Power and performance measurement: Power was measured per chip during inference with supplies at 1.5 V (row activation and column integration during analog compute), 0.8 V (control and communication: ILP, OLP, LC, 2D mesh), and 1.8 V (PLL and I/O). Sustained TOPS/W was reported per chip and correlated with the number of weights mapped (the denser 2-PCM-per-weight encoding yields higher efficiency; AB with 4 PCMs per weight lowers density). The impact of reduced integration time on WER and efficiency was evaluated. System-level estimates included projected on-chip digital compute at the OLP/ILP locations to assess end-to-end processing time and energy. Stability against PCM drift was assessed by repeating full RNNT inference after one week without recalibration.
Key Findings
- Analog-AI chip architecture and inter-tile communication:
  - 34 analog tiles per chip with 35 million PCM devices; the dense 2D mesh supports massively parallel, multicast, duration-based routing.
  - Communication accuracy: the cumulative distribution shows a maximum duration-transport error of ≤5 ns across the chip (≈3 ns for shorter durations).
  - Tile programmability via local controllers enables complex routing (e.g., concatenation) and flexible timing.
- KWS (keyword spotting):
  - A fully end-to-end analog implementation on 4 tiles achieved 86.14% accuracy, meeting the software-equivalent criterion (≥99% of the 86.75% SW baseline; MLPerf iso-accuracy threshold 85.88%).
  - Frame latency: 2.4 µs (8×300 ns steps), ~7× faster than best-case MLPerf reports.
  - The AB (asymmetry balance) method improved MAC accuracy by cancelling fixed peripheral asymmetries.
- RNNT (speech-to-text on Librispeech, MLPerf Datacenter):
  - Mapped >45 million weights to >140 million PCMs across 5 chips (142 tiles), with 99.9% of parameters on analog tiles; average ≈2.9 PCMs per weight.
  - Software layer-sensitivity analysis identified joint-FC and Enc-LSTM0 as the most sensitive layers; joint-FC was kept digital; Enc-LSTM0 and Enc-LSTM1 used AB and careful mapping.
  - Full-network inference across five chips achieved 9.475% WER with the original MLPerf weights; after ~1 week of PCM drift without recalibration, WER rose slightly to 9.894% (+0.4%).
  - Weight expansion for Enc-LSTM0 reduced WER further: an experimental full-network WER of 9.258%, 1.81% absolute above the SW baseline (7.452%) and 0.88% above the 99%-of-SW threshold (8.378%).
  - Single-layer on-chip experiments confirmed Enc-LSTM0 as the most critical layer; other layers were more noise-resilient.
- Energy efficiency and performance:
  - Best per-chip sustained efficiency: 12.4 TOPS/W (the chip with the densest mapping and most on-chip weights); analog-tile peak ~20.0 TOPS/W.
  - Reducing integration time (e.g., halving durations) improved TOPS/W by ~25% at a cost of ~+1% WER (measured on chip 4 with the other layers in FP32 software).
  - End-to-end estimates indicate sustained analog-only efficiency of ~7.09 TOPS/W; projected full-system efficiency (including digital auxiliary compute) of ~6.94–6.7 TOPS/W.
  - Operations ratio: on-chip analog to off-chip compute ≈325:1 (original weights) and ≈88:1 with weight expansion (owing to extra preprocessing), keeping system efficiency high.
  - Real-time performance: average processing time per sample ~500 µs; real-time factor ~0.8, satisfying the MLPerf real-time constraint (≤1).
- Robustness:
  - Minimal WER degradation after one week of PCM drift without recalibration.
- Overall: demonstrated near-SW accuracy for a large, industry-relevant RNNT across five chips and SW-equivalent accuracy for KWS, with high sustained energy efficiency (up to 12.4 TOPS/W per chip; ~6.7 TOPS/W estimated system).
Discussion
The work demonstrates that analog in-memory computing can scale beyond single-tile demonstrations to multi-tile, multi-chip systems while maintaining high accuracy on practical speech tasks. By combining dense, duration-based inter-tile routing, flexible local control, and analog-friendly weight encodings (2- vs 4-PCM per weight) with algorithm–hardware co-optimization (hardware-aware training, AB asymmetry cancellation, and weight expansion), the system mitigates analog non-idealities and layer-specific sensitivity. The KWS results validate fully on-chip, end-to-end inference with software-equivalent accuracy and superior latency, while the RNNT results show that nearly all weights can be mapped to analog memory and that early-layer error propagation can be controlled sufficiently to achieve near-SW WER. Power and throughput measurements confirm that the high OPS/W of analog tiles can translate into high sustained chip and system efficiencies when coupled with efficient routing and dataflow. The minimal drift-induced degradation over a week indicates practical stability. Collectively, these findings address the central challenges of activation transport between many tiles and accuracy on large models, supporting the feasibility of analog-AI for real-world speech applications.
Conclusion
A 14-nm analog-AI chip with 34 PCM-based tiles was engineered and experimentally validated for speech recognition and transcription. The system achieved fully end-to-end software-equivalent accuracy on a KWS task and near-software accuracy on the MLPerf RNNT, mapping >45 million weights across five chips (>140 million PCMs) and sustaining up to 12.4 TOPS/W per chip, with projected system efficiency around 6.7 TOPS/W. Key techniques—duration-based inter-tile routing, local programmable control, asymmetry balancing, and weight expansion—enabled accurate and efficient execution while minimizing digitization overheads. These results constitute, to the authors’ knowledge, the first demonstration of commercially relevant accuracy on a commercially relevant model using >140 analog-AI tiles with efficient activation movement. Future directions include integrating on-chip digital compute and SRAM for auxiliary operations and staging, further optimizing dataflow and activation handling, exploring adaptive calibration and drift compensation, and extending the approach to other large DNNs (e.g., transformers) and workloads.
Limitations
- The chip lacks integrated digital compute cores and SRAM for auxiliary operations; vector–vector computations, activations, and the joint-FC layer were executed off-chip, limiting true end-to-end integration and incurring communication overheads.
- Not all layers could be mapped to analog without accuracy loss (e.g., the small but sensitive joint-FC remained digital), evidencing layer-specific sensitivity to analog non-idealities.
- AB improves accuracy but reduces areal and energy density by using 4 PCMs per weight and doubling time steps; it increases complexity and power relative to the denser 2-PCM mapping.
- Despite improvements, RNNT WER remained above the SW baseline (by ~1.81% absolute with weight expansion), indicating a residual accuracy gap.
- The system requires weight-stationary operation owing to the finite endurance and slow programming of NVM; dynamic weight updates are not supported during inference.
- Energy efficiency degrades from tile peak to system level because of communication, incomplete tile utilization, and off-chip/digital overheads; the decoder chip exhibited lower TOPS/W owing to sparse mapping and routing activity.
- Long-term drift was assessed only over about a week without recalibration; longer-term stability and environmental variability were not reported.