
Computer Science
Faster sorting algorithms discovered using deep reinforcement learning
D. J. Mankowitz, A. Michi, et al.
AlphaDev, a deep reinforcement learning agent, discovers small sorting algorithms that outperform human-designed benchmarks and have been merged into the LLVM libc++ standard library. The work shows how AI can improve fundamental, heavily used algorithms at the level of individual assembly instructions.
~3 min • Beginner • English
Introduction
The study addresses the long-standing question of how to design faster sorting algorithms, focusing on small "fixed" sorts, which sort a known, fixed number of elements (e.g., 3, 4, or 5), and "variable" sorts, which handle sequences of varying length up to a maximum (e.g., up to 5 elements). Sorting is foundational in computer science and invoked ubiquitously; practical large-array sorts call small sorts repeatedly at the base of their recursion, so even small improvements can yield substantial system-level gains. The authors argue that many classical algorithms have plateaued under human optimization and propose that deep reinforcement learning (DRL) can discover novel, correct, lower-latency instruction-level implementations by directly optimizing for correctness and CPU latency. They formulate algorithm discovery as a single-player game (AssemblyGame) in which an agent incrementally constructs assembly programs and is rewarded for correctness and performance. The goal is to surpass state-of-the-art human-designed small sorting routines and to demonstrate generality to other low-level algorithmic subroutines.
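To make that system-level leverage concrete, the following C++ sketch (not from the paper; all names are illustrative, and libc++'s actual introsort differs in detail) shows how a divide-and-conquer sort hands every small subarray to a dedicated small sort, so that speeding up the small sort speeds up every leaf of the recursion.

```cpp
#include <algorithm>
#include <vector>

// Placeholder small sort over v[lo..hi): in libc++ this is where the
// specialized sort3/sort4/sort5 routines optimized by AlphaDev are invoked.
void small_sort(std::vector<int>& v, int lo, int hi) {
    for (int i = lo + 1; i < hi; ++i)
        for (int j = i; j > lo && v[j - 1] > v[j]; --j)
            std::swap(v[j - 1], v[j]);
}

// Classic quicksort that bottoms out in the small sort: every subarray of
// at most 5 elements is handled by small_sort, which is why micro-level
// gains there multiply across the whole recursion tree.
void quicksort(std::vector<int>& v, int lo, int hi) {
    if (hi - lo <= 5) { small_sort(v, lo, hi); return; }
    int pivot = v[lo + (hi - lo) / 2];
    int i = lo, j = hi - 1;
    while (i <= j) {                    // Hoare-style partition
        while (v[i] < pivot) ++i;
        while (v[j] > pivot) --j;
        if (i <= j) { std::swap(v[i], v[j]); ++i; --j; }
    }
    quicksort(v, lo, j + 1);
    quicksort(v, i, hi);
}
```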
Literature Review
The paper situates its work within decades of program synthesis and optimization: enumerative and stochastic superoptimization approaches that search program spaces; syntax-guided synthesis; and more recent deep learning-based program generation. While prior methods can produce correct programs or optimize proxies (e.g., instruction count), they struggle with scaling and with directly optimizing true latency, especially in the presence of branching. The authors compare their DRL approach to a state-of-the-art stochastic superoptimization baseline (AlphaDev-S), noting that previous techniques often require exhaustive search over massive spaces, rely on proxies for latency, and can get trapped in local optima.
Methodology
The authors cast instruction-level algorithm optimization as a reinforcement learning problem named AssemblyGame. A state S_t = (P_t, Z_t) comprises the partial program P_t (the sequence of assembly instructions constructed so far) and the CPU state Z_t (registers and memory) after executing P_t on a set of test inputs. At each timestep the agent selects a legal assembly instruction (the action) to append to the program. Rewards combine correctness and performance: a correctness reward r_c is computed by executing P_t on predefined test sequences and comparing the outputs to the expected outputs, while a performance component reflects latency, implemented either via a proxy (penalizing program length when length correlates with latency) or via measured wall-clock latency. Episodes terminate after a step limit, accumulating returns over the construction.
The learning agent, AlphaDev, extends AlphaZero: a deep neural network guides Monte Carlo Tree Search (MCTS), outputting a policy over next-instruction actions and a value estimate of expected return, and self-play generates trajectories used to update the network parameters. Representation learning is critical: the AlphaDev representation network combines (1) a transformer encoder tailored to assembly instruction sequences (encoding opcodes and operands as embeddings) and (2) a CPU state encoder (a multilayer perceptron over registers and memory for the given inputs); these embeddings are fused to form the state representation.
To better model performance, AlphaDev uses dual value heads: one predicts correctness and the other predicts latency, trained with Monte Carlo targets that include actual measured latency when latency is optimized directly. This dual-head setup outperforms a single value head when optimizing for latency. The system evaluates correctness exhaustively over task-specific test sets (e.g., all permutations for small sorts) and uses measured latency sparingly during search by leveraging the latency head's predictions, reducing expensive measurements to less than 0.002% of generated programs. The approach is applied to branchless fixed sorts (sorting networks) and branching variable sorts, as well as to additional domains (e.g., VarInt deserialization).
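The C++ sketch below restates this formulation in code. It is a minimal illustration, not the paper's implementation: the types, the executor interface, and the exact reward shaping (correctness delta minus a weighted performance penalty) are assumptions consistent with the description above.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// One x86-like instruction: opcode plus operands (illustrative encoding).
struct Instruction { int opcode; int dst; int src; };

// A test case: one input sequence and its expected (sorted) output.
struct TestCase {
    std::vector<int> input;      // e.g. one permutation of {1, 2, 3}
    std::vector<int> expected;   // its sorted order
};

// State S_t = (P_t, Z_t): the partial program built so far plus the CPU
// state (registers/memory) reached by running it on each test input.
struct State {
    std::vector<Instruction> program;              // P_t
    std::vector<std::vector<int64_t>> cpu_state;   // Z_t, one per test input
};

// Executor: runs a program on one input and returns the output region.
// In the real system this is an (emulated) CPU; here it is a parameter.
using Executor = std::function<std::vector<int>(
    const std::vector<Instruction>&, const std::vector<int>&)>;

// Correctness reward r_c: fraction of test cases the program sorts exactly.
double correctness(const std::vector<Instruction>& P,
                   const std::vector<TestCase>& tests, const Executor& run) {
    int ok = 0;
    for (const auto& t : tests)
        if (run(P, t.input) == t.expected) ++ok;
    return static_cast<double>(ok) / tests.size();
}

// Per-step reward: change in correctness minus a performance penalty, where
// the penalty is program length (a latency proxy) or measured latency.
double reward(double r_c_now, double r_c_prev,
              double perf_penalty, double alpha) {
    return (r_c_now - r_c_prev) - alpha * perf_penalty;
}
```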
Key Findings
- AlphaDev discovers new fixed and variable small sorting algorithms that match or surpass state-of-the-art human benchmarks and are integrated into the LLVM libc++ standard C++ sort library.
- Variable sorts (measured directly on latency) show significantly lower latencies than human benchmarks, with reported fifth-percentile latencies and 95% confidence intervals across 100 machines:
• VarSort3: AlphaDev 236,498 (235,896, 236,887) vs. human 246,040 (245,335, 246,470)
• VarSort4: AlphaDev 279,339 (278,791, 279,815) vs. human 294,963 (294,514, 295,861)
• VarSort5: AlphaDev 312,079 (311,515, 312,787) vs. human 331,198 (330,717, 331,850)
- When optimizing for instruction count (a latency proxy in branchless settings), AlphaDev finds shorter or equal-length algorithms than the human benchmarks; reported counts include VarSort3 42 vs. 43, VarSort4 37 vs. 60, VarSort5 63 vs. 72, and a new variant at 27 instructions (no human counterpart reported).
- For fixed branchless sorts, AlphaDev produces fewer instructions for sort3 and sort5 and matches the best known for sort4, yielding lower latency given the strong correlation between instruction count and latency in branchless code.
- The agent discovers novel micro-optimizations in sorting networks, the AlphaDev swap move and the AlphaDev copy move, each saving an instruction where applicable by altering comparator configurations and leading to systematic improvements (see the sketch after this list).
- AlphaDev generalizes to other domains, notably Protocol Buffers VarInt deserialization: it learns a branching solution that is shorter and roughly 3× faster than the human benchmark for single-valued inputs, and discovers a new "VarInt assignment" move that fuses two operations into one instruction.
- Compared with a strong stochastic superoptimization baseline (AlphaDev-S), AlphaDev explores orders of magnitude fewer programs (about 2 million vs. up to 5 trillion in the worst case) and achieves better latency on branching problems; AlphaDev-S is more competitive in branchless cases when warm-started from near-optimal solutions.
- The discovered algorithms have been accepted into the libc++ standard sorting library, the first change to these subroutines in over a decade, and are estimated to be invoked trillions of times per day across platforms.
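To make the sorting-network setting concrete, here is a branchless three-element sorting network in C++ (a sketch, not AlphaDev's output). Each line is one comparator; on integer inputs compilers typically lower the min/max pairs to conditional moves (cmov), producing straight-line assembly. AlphaDev's swap and copy moves each shave a single instruction from comparator sequences of exactly this shape.

```cpp
#include <algorithm>

// Branchless 3-element sorting network: comparators (b,c), (a,c), (a,b).
// std::min/std::max on ints typically compile to cmov, so the emitted
// code is straight-line; instruction count then correlates with latency.
void sort3(int& a, int& b, int& c) {
    int t;
    t = std::min(b, c); c = std::max(b, c); b = t;  // comparator (b, c)
    t = std::min(a, c); c = std::max(a, c); a = t;  // comparator (a, c)
    t = std::min(a, b); b = std::max(a, b); a = t;  // comparator (a, b)
}
```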
Discussion
The findings directly address the central question of whether AI can discover fundamentally faster low-level algorithms than human experts. By formulating algorithm synthesis as a game and optimizing both correctness and latency, AlphaDev produces small sorting routines that reduce latency versus state-of-the-art human designs and introduces structurally different branching logic for variable sorts (e.g., VarSort4’s call sequence) that yields substantial gains. The identification of generalizable instruction-level moves (swap and copy moves for sorting networks; an assignment move for VarInt) illustrates that the agent can uncover reusable algorithmic primitives. The integration into the widely used libc++ library underscores practical significance: even micro-optimizations at small input sizes translate to large real-world savings due to frequent invocation in divide-and-conquer sorts and broad software usage. The comparison with stochastic superoptimization highlights complementary strengths: DRL-guided search scales better in branching spaces and reduces reliance on exhaustive latency measurements by learning a predictive latency head. The results suggest broader applicability, potentially extending to functions like hashing where correctness can be defined statistically (e.g., collision rates) alongside latency, and to optimizing inner logic components in larger systems.
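As a rough illustration of that branching structure, the C++ sketch below shows a variable sort that branches on length and reuses a fixed sort for the three-element prefix before placing the fourth element. It mirrors the flavor of VarSort4's call sequence as described in the paper but is a simplification, not the instruction-level program AlphaDev produced.

```cpp
#include <algorithm>

// Compare-exchange: ensures lo <= hi afterwards.
static void cmp_swap(int& lo, int& hi) { if (lo > hi) std::swap(lo, hi); }

static void sort3(int& a, int& b, int& c) {
    cmp_swap(b, c); cmp_swap(a, c); cmp_swap(a, b);
}

// Simplified variable sort for up to 4 elements: branch on length, call a
// fixed sort on the first three elements, then insert the fourth. The real
// VarSort4 is an optimized assembly routine; this shows only the structure.
void var_sort4(int* v, int n) {
    if (n <= 1) return;                              // already sorted
    if (n == 2) { cmp_swap(v[0], v[1]); return; }
    sort3(v[0], v[1], v[2]);
    if (n == 4) {                                    // insert v[3] in place
        int x = v[3], i = 3;
        while (i > 0 && v[i - 1] > x) { v[i] = v[i - 1]; --i; }
        v[i] = x;
    }
}
```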
Conclusion
The paper demonstrates that deep reinforcement learning, instantiated as AlphaDev, can automatically discover instruction-level implementations of small sorting and related routines that outperform state-of-the-art human benchmarks in latency and/or instruction count. These learned algorithms have been integrated into the libc++ standard library, marking a rare and impactful update to foundational routines. Contributions include: a game formulation for algorithm synthesis with combined correctness/latency rewards; a transformer-based representation for assembly programs paired with a CPU state encoder; dual value heads to predict correctness and latency; and the discovery of reusable micro-optimization moves. Future work includes scaling to larger sorts, combining DRL with stochastic superoptimization to leverage complementary advantages, extending to domains where exhaustive verification is infeasible (e.g., hashing, cryptographic primitives) by optimizing empirical objectives (collisions, throughput), and broadening architecture and datatype coverage.
Limitations
- Correlation between instruction count and latency holds strongly in branchless (sorting-network) settings but breaks down in branching algorithms; optimizing proxies such as length can be misleading when branches are required.
- Accurate latency optimization requires expensive measurements; although AlphaDev reduces measurement frequency via a learned latency value head, the approach still depends on representative benchmarks and hardware-specific timing.
- Correctness evaluation relies on predefined test sets (exhaustive for small sorts, curated for VarInt), which may limit applicability to problems where exhaustive verification is infeasible or specifications are statistical.
- Performance and discovered algorithms are architecture- and datatype-specific (e.g., x86, ARMv5; uint32, float), potentially limiting generalizability without retraining or tuning for other targets.
- The stochastic superoptimization baseline (AlphaDev-S) can be competitive or more efficient when warm-started and in branchless domains, indicating that DRL is not universally superior and that hybrid approaches may be preferable.
- Scaling to larger sorts and more complex algorithms remains an open challenge; preliminary steps (e.g., sort6 variants) suggest promise but are not yet fully realized.