
Computer Science
Discovering faster matrix multiplication algorithms with reinforcement learning
A. Fawzi, M. Balog, et al.
Discover how AlphaTensor, developed by researchers at DeepMind, uses deep reinforcement learning to find faster matrix multiplication algorithms. The agent improves on Strassen's two-level algorithm for 4 × 4 matrices in modular arithmetic, the first improvement over that algorithm in 50 years, and discovers efficient, provably correct algorithms across a range of matrix sizes, arithmetic settings, and hardware targets.
~3 min • Beginner • English
Introduction
The paper tackles the long-standing problem of discovering efficient, provably correct algorithms for matrix multiplication, a core operation underlying numerous computational tasks in science and engineering. Although matrix multiplication algorithms can be represented as low-rank decompositions of a fixed three-dimensional matrix multiplication tensor, finding low-rank decompositions of 3D tensors is NP-hard and practically challenging. Prior approaches relied on human-guided search, continuous optimization, or combinatorial methods with hand-crafted heuristics. The authors propose to automate this discovery process using deep reinforcement learning, formulating algorithm search as a single-player game (TensorGame) whose objective is to decompose the matrix multiplication tensor with as few rank-one terms as possible, directly optimizing algorithmic complexity and enabling discovery beyond human intuition. The aim is to produce practical algorithms for exact matrix multiplication that improve the number of scalar multiplications, transfer across sizes, and can be tailored to different objectives (e.g., runtime on specific hardware).
Literature Review
Matrix multiplication algorithms have a rich history, beginning with Strassen’s 2 × 2 algorithm using 7 multiplications and recursive schemes, and the 3 × 3 Laderman algorithm using 23 multiplications. Extensive theoretical work has focused on asymptotic exponents of matrix multiplication (e.g., Coppersmith–Winograd, Le Gall, Alman–Williams), which does not directly yield practical algorithms. Practical algorithms and bounds have been developed through human search, continuous optimization, and combinatorial approaches for rectangular sizes (e.g., Hopcroft–Kerr; Smirnov; Sedoglavic and Smirnov). However, computing low-rank decompositions of higher-order tensors is NP-hard and remains difficult, with the optimal 3 × 3 algorithm still unknown. Existing methods often depend on human-designed heuristics and discretized factor spaces to ensure exactness under finite precision. The present work builds on this literature by leveraging reinforcement learning to navigate the vast algorithm space efficiently and to discover practical, exact algorithms with improved ranks across many tensor sizes and arithmetic settings.
Methodology
Problem formulation: The authors formalize matrix multiplication (A, B) → AB as a fixed 3D tensor T_{n,m,p} of size (nm) × (mp) × (pn) with entries in {0, 1}. A rank-R decomposition T = Σ_{r=1}^{R} u^{(r)} ⊗ v^{(r)} ⊗ w^{(r)} corresponds to an algorithm that multiplies an n × m matrix by an m × p matrix using R scalar multiplications; applied recursively to n × n blocks, a rank-R decomposition of T_{n,n,n} yields an algorithm with complexity O(N^{log_n R}) for N × N matrices.
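As a concrete illustration, the tensor can be built directly; the sketch below uses a simplified index convention (row-major flattening of A, B and C, so the third mode has size np rather than the paper's pn), which does not affect the rank:

```python
import numpy as np

def matmul_tensor(n: int, m: int, p: int) -> np.ndarray:
    """Binary tensor T_{n,m,p} of shape (n*m, m*p, n*p).

    T[a, b, c] = 1 iff the product A_flat[a] * B_flat[b] contributes to
    C_flat[c], where A is n x m, B is m x p and C = A @ B is n x p.
    """
    T = np.zeros((n * m, m * p, n * p), dtype=np.int8)
    for i in range(n):
        for k in range(m):
            for j in range(p):
                T[i * m + k, k * p + j, i * p + j] = 1
    return T

T2 = matmul_tensor(2, 2, 2)  # the 4 x 4 x 4 tensor for 2 x 2 matrix multiplication
# Strassen's rank-7 decomposition of T2, applied recursively, gives an
# O(N ** log2(7)) ~ O(N ** 2.807) algorithm for N x N matrices.
```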
TensorGame: They cast tensor decomposition as a single-player game. The game state is initialized to the target tensor, S_0 = T. At each step t, the agent selects factors (u^{(t)}, v^{(t)}, w^{(t)}) with entries from a discrete coefficient set F (e.g., {−2, −1, 0, 1, 2}) and updates S_t = S_{t−1} − u^{(t)} ⊗ v^{(t)} ⊗ w^{(t)}. The objective is to reach the zero tensor in as few steps as possible, which implicitly yields an exact decomposition. The episode length is capped at R_limit. The per-step reward is −1; if the game ends without reaching zero, an additional terminal penalty −γ(S_{R_limit}) is applied, where γ(·) upper-bounds the rank of the remaining tensor, aligning the optimization objective with rank minimization.
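A minimal sketch of these game dynamics (hypothetical class and method names; the real environment also applies changes of basis and uses a tighter rank upper bound for γ):

```python
import numpy as np

class TensorGame:
    """Single-player game: reach the zero tensor by subtracting rank-one terms."""

    def __init__(self, target: np.ndarray, r_limit: int = 100):
        self.state = target.astype(np.int64)   # residual tensor S_t
        self.r_limit = r_limit
        self.steps = 0

    def step(self, u: np.ndarray, v: np.ndarray, w: np.ndarray) -> tuple[float, bool]:
        """Play the factor triple (u, v, w); return (reward, done)."""
        self.state -= np.einsum('i,j,k->ijk', u, v, w)   # S_t = S_{t-1} - u ⊗ v ⊗ w
        self.steps += 1
        solved = not self.state.any()
        out_of_moves = self.steps >= self.r_limit
        reward = -1.0                                    # each move costs one multiplication
        if out_of_moves and not solved:
            reward -= self.rank_upper_bound(self.state)  # terminal penalty -γ(S_{R_limit})
        return reward, solved or out_of_moves

    @staticmethod
    def rank_upper_bound(s: np.ndarray) -> float:
        # Crude stand-in for γ(·): the rank of a tensor is at most the sum of the
        # matrix ranks of its slices along any single mode.
        return float(min(
            sum(np.linalg.matrix_rank(sl) for sl in np.moveaxis(s, ax, 0))
            for ax in range(3)
        ))
```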
Agent and planning: AlphaTensor builds on AlphaZero with Monte Carlo Tree Search (MCTS) guided by a neural network that outputs a policy over actions (via sampling to handle the enormous action space) and a value head predicting the distribution of returns (related to rank). Finished games are used to train the network, improving both policy and value through self-play style updates.
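For orientation, the action-selection rule inside the tree search can be sketched with a standard AlphaZero-style PUCT score over the sampled candidate actions (a simplification; as noted above, AlphaTensor's value head predicts a distribution of returns rather than a single value):

```python
import numpy as np

def puct_select(prior, value_sum, visit_count, c_puct: float = 1.5) -> int:
    """Pick the candidate action maximizing an AlphaZero-style PUCT score.

    prior       : policy-head probabilities of the sampled candidate actions
    value_sum   : sum of values backed up through each candidate so far
    visit_count : number of times each candidate has been visited
    """
    q = np.where(visit_count > 0, value_sum / np.maximum(visit_count, 1), 0.0)
    u = c_puct * prior * np.sqrt(visit_count.sum() + 1) / (1 + visit_count)
    return int(np.argmax(q + u))
```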
Neural architecture: A transformer-based model with inductive biases for tensors is introduced. The input tensor is projected via linear layers applied to its three cyclic transpositions into three S × S grids of features. Attention operations are applied across pairs of grids (a multi-grid generalization of axial attention), which is more efficient and effective than naive self-attention. The architecture is invariant to slice orderings, reflecting tensor rank invariance to reordering. The shared trunk feeds an autoregressive policy head (to sample factors) and a multilayer-perceptron value head.
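To make the grid construction concrete, here is a minimal sketch for a cube-shaped input of side S (the actual torso also concatenates additional scalar features and applies learned linear projections to each grid):

```python
import numpy as np

def to_grids(t: np.ndarray) -> list[np.ndarray]:
    """Map an S x S x S tensor to three S x S grids of S-dimensional features,
    one per cyclic transposition of the modes."""
    assert t.shape[0] == t.shape[1] == t.shape[2]
    cyclic_perms = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]
    # Each grid has shape (S, S, S): an S x S array of length-S feature vectors,
    # which the network then projects with learned linear layers.
    return [np.transpose(t, perm) for perm in cyclic_perms]
```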
Training data and strategies: In addition to reinforcement learning on target tensors, the authors generate synthetic demonstrations by sampling random factor sets and constructing their sum tensor (T = Σ u ⊗ v ⊗ w). The network is trained on a mixture of supervised imitation of these demonstrations and RL on target tasks, which outperforms each strategy alone. Data augmentation leverages the order invariance of factor summation by swapping actions from completed games to create additional training pairs.
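A minimal sketch of how one synthetic demonstration could be generated (shapes, rank and coefficient set are illustrative):

```python
import numpy as np

def synthetic_demonstration(dims=(4, 4, 4), rank=7, coeffs=(-2, -1, 0, 1, 2), seed=0):
    """Sample `rank` random factor triples and return (tensor, factors).

    By construction the factor list is a rank-`rank` decomposition of the
    returned tensor, so it can serve as a supervised imitation target.
    """
    rng = np.random.default_rng(seed)
    factors = [tuple(rng.choice(coeffs, size=d) for d in dims) for _ in range(rank)]
    tensor = sum(np.einsum('i,j,k->ijk', u, v, w) for (u, v, w) in factors)
    return tensor, factors

T_syn, demo = synthetic_demonstration()
print(T_syn.shape, len(demo))  # (4, 4, 4) 7
```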
Change of basis and symmetry exploitation: The same bilinear operation can be expressed in different bases without changing rank. For every game, a random change of basis is sampled and applied to T, encouraging diversity in discovered decompositions. Symmetries of matrix multiplication tensors and equivalence classes of factorizations are exploited analytically and empirically to improve learning and exploration.
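Sketched in code, a change of basis amounts to multiplying each mode of the tensor by an invertible matrix (the helper below is hypothetical; the paper samples the matrices from a specific discrete distribution over the relevant ring):

```python
import numpy as np

def random_invertible(size: int, rng: np.random.Generator) -> np.ndarray:
    """Rejection-sample a small-entry integer matrix with non-zero determinant."""
    while True:
        a = rng.integers(-1, 2, size=(size, size))
        if round(np.linalg.det(a)) != 0:
            return a

def change_of_basis(t: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a random basis change to each of the three modes of t.

    The transformed tensor has exactly the same rank, so any decomposition
    found for it can be mapped back to a decomposition of t.
    """
    A = random_invertible(t.shape[0], rng)
    B = random_invertible(t.shape[1], rng)
    C = random_invertible(t.shape[2], rng)
    return np.einsum('ia,jb,kc,abc->ijk', A, B, C, t)

rng = np.random.default_rng(0)
T = np.arange(27).reshape(3, 3, 3)    # any 3-mode tensor, e.g. a matmul tensor
T_rotated = change_of_basis(T, rng)   # same rank as T
```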
Search scope and targets: A single agent is trained across many tensors T_{n,m,p} with n, m, p ≤ 5, in both standard arithmetic (ℝ) and modular arithmetic (ℤ₂). Discovered decompositions are composed recursively to build algorithms for larger sizes (up to 12 in reported results) and beyond. The approach extends to custom bilinear operations (e.g., skew-symmetric matrix–vector multiplication) by representing them as tensors and running the same procedure.
Tailoring to hardware runtime: To optimize practical runtime, the terminal reward is augmented with a benchmarking component equal to the negative measured runtime of the discovered algorithm for a specific problem instance on target hardware (GPU or TPU), with a weighting coefficient λ. The same TensorGame formulation is used otherwise. Factorizations are converted to JAX code and benchmarked (JIT-compiled) to provide feedback for learning.
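A rough sketch of the runtime-augmented terminal reward (plain-Python timing stands in for the paper's JIT-compiled JAX benchmark; λ, the repetition count and the benchmarked instance are illustrative):

```python
import time
import numpy as np

def benchmark_seconds(algorithm, A, B, repeats: int = 10) -> float:
    """Median wall-clock time of one call algorithm(A, B) on this device."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        algorithm(A, B)
        times.append(time.perf_counter() - start)
    return float(np.median(times))

def terminal_reward(num_factors: int, algorithm, A, B, lam: float = 1.0) -> float:
    """Rank term plus a hardware-specific benchmarking term, as in the
    runtime-tailored variant of TensorGame."""
    return -float(num_factors) - lam * benchmark_seconds(algorithm, A, B)
```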
Correctness and application: Because a factorization is only returned when the residual tensor reaches zero (S_R = 0), the resulting factor sets are guaranteed to yield exact algorithms. Algorithm 1 outlines how to compute C = AB from the factors by forming rank-one products m_r and accumulating the outputs via w^{(r)}. The algorithms can be applied recursively to block matrices to multiply matrices of arbitrary size.
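The sketch below implements this recipe and checks it with Strassen's classical rank-7 factors for (n, m, p) = (2, 2, 2), using the same row-major index convention as the tensor sketch above:

```python
import numpy as np

def multiply_from_factors(U, V, W, A, B, n, m, p):
    """Compute C = A @ B from a rank-R decomposition of T_{n,m,p}.

    U, V, W have shapes (R, n*m), (R, m*p), (R, n*p); each row is one factor.
    """
    a, b = A.reshape(n * m), B.reshape(m * p)
    ms = (U @ a) * (V @ b)           # R scalar products m_r = <u_r, a> * <v_r, b>
    return (W.T @ ms).reshape(n, p)  # C_flat[c] = sum_r w_r[c] * m_r

# Strassen's seven factor triples for (n, m, p) = (2, 2, 2).
U = np.array([[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
              [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]])
V = np.array([[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
              [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]])
W = np.array([[1, 0, 0, 1], [0, 0, 1, -1], [0, 1, 0, 1], [1, 0, 1, 0],
              [-1, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0]])

rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))
assert np.allclose(multiply_from_factors(U, V, W, A, B, 2, 2, 2), A @ B)
```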
Key Findings
- AlphaTensor re-discovers known best-rank algorithms (e.g., Strassen’s 2 × 2 rank-7; Laderman’s 3 × 3 rank-23) and improves state-of-the-art ranks for multiple sizes.
- Over ℤ₂ (modular arithmetic), AlphaTensor finds a 4 × 4 algorithm with rank 47, improving over the 49 multiplications of Strassen's algorithm applied twice. Used recursively, this yields an algorithm of complexity O(N^{2.778}) in ℤ₂ (exponent log₄ 47 ≈ 2.778).
- In standard arithmetic (ℝ), AlphaTensor improves several rectangular sizes, e.g.:
• (3, 4, 5): rank 47 vs previous best 48.
• (4, 4, 5): rank 63 vs previous best 64.
• (4, 5, 5): rank 76 vs previous best 80.
• For (5, 5, 5), the modular arithmetic rank is improved to 96 via composition ((3,5,5)+(2,5,5)).
- The agent scales to large tensors (e.g., T_7) with action spaces far exceeding prior methods, discovering decompositions matching or surpassing state of the art and producing thousands of non-equivalent factorizations, revealing a richer algorithm space than previously recognized.
- Beyond standard multiplication, AlphaTensor discovers an algorithm for multiplying an n × n skew-symmetric matrix by a vector using ((n−1)(n+2))/2 multiplications (∼ n^2/2), improving over prior algorithms with ∼ n^2 multiplications and proving asymptotic optimality for this task.
- Hardware-tailored algorithms: By optimizing terminal reward for measured runtime on specific devices, AlphaTensor discovers 4 × 4 block algorithms that, when applied to 8,192 × 8,192 matrix multiplication (with 2,048 × 2,048 blocks), achieve consistent speed-ups over Strassen-square:
• On Nvidia V100 GPU: median speed-ups vs standard matmul up to 8.5% at 8,192; across sizes 10,240–20,480, AlphaTensor outperforms Strassen-square in median speed-up.
• On Google TPU v2: median speed-ups up to 10.3% at 8,192; AlphaTensor outperforms Strassen-square across tested sizes.
• Tailored algorithms transfer partially but perform best on the hardware they were optimized for (e.g., GPU-optimized algorithm yields 8.5% on GPU vs 6.6% on TPU at 8,192).
- Combining diverse discovered algorithms recursively improves ranks for more than 70 larger tensors (n, m, p ≤ 12), advancing the state of the art broadly.
Discussion
The study frames exact matrix multiplication algorithm discovery as a reinforcement learning problem and demonstrates that a learning-based agent can autonomously find provably correct algorithms that reduce scalar multiplication counts. By searching in the space of tensor decompositions with MCTS guided by a transformer tailored to tensor symmetries, and by leveraging synthetic demonstrations, change-of-basis diversity, and data augmentation, AlphaTensor effectively navigates an enormous action space. The findings directly answer the research question: DRL can surpass human-designed and prior computer-searched algorithms in practical rank for many matrix sizes, and can also adapt to different optimization criteria such as runtime on specific hardware. The discovery of thousands of non-equivalent factorizations indicates previously unexplored richness in the algorithm space, with implications for algebraic complexity theory and potential insights into asymptotic exponents. The approach generalizes beyond standard multiplication to structured bilinear problems, exemplified by the asymptotically optimal skew-symmetric matrix–vector multiplication algorithm. Hardware-tailored results illustrate practical relevance: even with identical theoretical complexity, learned algorithms can outperform classical fast methods when optimized for compiler and device characteristics. Overall, the work shows that DRL can assist and accelerate mathematical discovery and practical algorithm design.
Conclusion
AlphaTensor demonstrates that deep reinforcement learning can automatically discover exact, efficient matrix multiplication algorithms, matching and surpassing the best-known ranks across many small matrix sizes, improving large-size compositions through recursive combination, and revealing a vast diversity of non-equivalent factorizations. The method extends naturally to custom bilinear operations and can incorporate non-differentiable objectives such as measured runtime to yield hardware-tailored algorithms that outperform classical fast methods in practice. Future research directions include: (1) learning or searching over the coefficient set F rather than pre-specifying it; (2) scaling discovery to larger tensors and broader classes of bilinear operations; (3) optimizing for additional metrics (e.g., numerical stability, energy usage); and (4) exploring related notions of rank (e.g., border rank) and other NP-hard factorization problems where similar DRL-based frameworks may prove effective.
Limitations
A key limitation is the need to pre-define the discrete coefficient set F for factor entries, which discretizes the search space and may exclude efficient algorithms outside the chosen set. The approach also relies on substantial computational resources for self-play and MCTS in enormous action spaces, and results can be sensitive to reward shaping (e.g., runtime benchmarking) and implementation details such as compiler stack behavior when tailoring algorithms to hardware.