Computer Science | arXiv

Titans: Learning to Memorize at Test Time

A. Behrouz, P. Zhong, et al.

Titans is a new family of architectures that pairs a neural long-term memory module with attention to capture long historical context while keeping training fast and parallelizable and inference fast. Experiments show Titans outperform Transformers and modern linear recurrent models on language modeling, common-sense reasoning, genomics, and time series, and can scale beyond 2M context windows. Research conducted by Ali Behrouz, Peilin Zhong, and Vahab Mirrokni.

Abstract

For more than a decade, there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory (called the hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed-length context. We present a new neural long-term memory module that learns to memorize historical context and helps attention attend to the current context while utilizing information from the distant past. We show that this neural memory has the advantage of fast, parallelizable training while maintaining fast inference. From a memory perspective, we argue that attention, due to its limited context but accurate dependency modeling, performs as a short-term memory, while neural memory, due to its ability to memorize the data, acts as a long-term, more persistent memory. Based on these two modules, we introduce a new family of architectures, called Titans, and present three variants to address how one can effectively incorporate memory into this architecture. Our experimental results on language modeling, common-sense reasoning, genomics, and time series tasks show that Titans are more effective than Transformers and recent linear recurrent models. They can further scale effectively to context window sizes larger than 2M, with higher accuracy on needle-in-haystack tasks than the baselines.
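
The abstract does not spell out how the memory module is updated, so the following is only an illustrative sketch of the general idea of "memorizing at test time", not the authors' implementation. It assumes a linear associative memory (a single weight matrix) that takes one gradient step per incoming token on a squared recall error. The class name ToyMemory and the lr, momentum, and decay parameters are hypothetical and chosen for illustration.

```python
# Toy sketch (assumption, not the Titans implementation): a linear associative
# memory M that is updated online, one (key, value) pair at a time, by a
# gradient step on the recall loss 0.5 * ||M @ key - value||^2.

def matvec(matrix, vector):
    """Multiply a list-of-lists matrix by a list vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


class ToyMemory:
    def __init__(self, dim, lr=0.5, momentum=0.0, decay=0.0):
        self.M = [[0.0] * dim for _ in range(dim)]  # memory weights
        self.S = [[0.0] * dim for _ in range(dim)]  # running update (momentum buffer)
        self.lr = lr              # hypothetical step size
        self.momentum = momentum  # hypothetical carry-over of past updates
        self.decay = decay        # hypothetical forgetting rate

    def read(self, key):
        """Recall the value currently associated with `key`."""
        return matvec(self.M, key)

    def write(self, key, value):
        """One online update; returns the squared recall error before the update."""
        pred = self.read(key)
        err = [p - v for p, v in zip(pred, value)]
        # Gradient of the loss with respect to M is the outer product err * key^T.
        for i, e in enumerate(err):
            for j, k in enumerate(key):
                self.S[i][j] = self.momentum * self.S[i][j] - self.lr * e * k
                self.M[i][j] = (1.0 - self.decay) * self.M[i][j] + self.S[i][j]
        return sum(e * e for e in err)


mem = ToyMemory(dim=3, lr=1.0)
first_error = mem.write([1.0, 0.0, 0.0], [2.0, 3.0, 4.0])   # unseen key: large error
mem.write([0.0, 1.0, 0.0], [5.0, 6.0, 7.0])                 # a second association
repeat_error = mem.write([1.0, 0.0, 0.0], [2.0, 3.0, 4.0])  # already stored: no error
print(first_error, repeat_error, mem.read([1.0, 0.0, 0.0]))
```

In this toy setting the memory has a fixed size regardless of how many tokens it has seen, which is the contrast the abstract draws with attention's quadratic cost. With orthogonal keys and a unit step size, each association is stored exactly; with overlapping keys or a smaller step, recall would only be approximate.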

Publisher

arXiv

Published On

Dec 31, 2024

Authors

Ali Behrouz, Peilin Zhong, Vahab Mirrokni

DOI

https://doi.org/10.48550/arXiv.2501.00663
