MIRIX: Multi-Agent Memory System for LLM-Based Agents
Y. Wang and X. Chen
The paper addresses a central challenge for LLM-based agents: building effective long-term memory that can persist, retrieve, and utilize user-specific information over time, including multimodal content. Existing assistants are largely stateless beyond the prompt window and typical memory systems rely on flat storage or text-only mechanisms, limiting personalization, abstraction, and recall. MIRIX is proposed as a comprehensive, modular multi-agent memory architecture with six specialized memory components and a Meta Memory Manager to route updates and retrieval. The system supports multimodal inputs (e.g., screenshots) and introduces an Active Retrieval mechanism to automatically surface relevant memories during interaction. MIRIX is evaluated on a new ScreenshotVQA benchmark (large-scale high-resolution screenshot sequences) and the LOCOMO long-form conversation dataset, demonstrating large improvements in accuracy and storage efficiency over RAG and long-context baselines.
Related work spans memory-augmented large language models and memory-augmented agents. Latent-space memory approaches modify model architectures (e.g., external memory matrices, hidden states, soft prompts, KV caches), often requiring retraining and functioning more like long-context extensions than structured memory systems. Agent-focused memory systems such as Zep, Mem0, and MemGPT store conversational content in text form or temporal knowledge graphs but typically use flat architectures with limited routing and multimodal support. Cognitive and AI literature highlights distinct memory types—episodic, semantic, and procedural—yet prior systems rarely integrate them into a comprehensive, modular framework. Multi-agent systems show benefits of role specialization and coordinated workflows. MIRIX builds on these strands by composing six memory types under coordinated managers, emphasizing routing, multimodal processing, and efficient retrieval.
MIRIX is a modular, multi-agent memory system comprising six specialized memory components, each managed by a dedicated Memory Manager and coordinated by a central Meta Memory Manager:
(1) Core Memory stores persistent, high-priority facts in two blocks: persona (agent identity and tone) and human (enduring user facts and preferences); it triggers a controlled rewrite when capacity exceeds 90%.
(2) Episodic Memory captures time-stamped events and interactions (fields: event_type, summary, details, actor, timestamp), enabling temporal indexing of routines and changes.
(3) Semantic Memory maintains abstract, time-independent knowledge of concepts, entities, and relationships (fields: name, summary, details, source), supporting social and commonsense reasoning.
(4) Procedural Memory stores goal-directed workflows, guides, and scripts (fields: entry_type, description, and structured step lists) for task execution.
(5) Resource Memory holds documents and multimodal files (doc, markdown, pdf_text, image, voice_transcript) with title, summary, resource_type, and content fields to preserve context continuity.
(6) Knowledge Vault securely stores verbatim sensitive information (credentials, addresses, contacts, API keys) with entry_type, source, sensitivity, and secret_value fields under access control.
Active Retrieval: the agent automatically infers the current topic from user input, retrieves the top-k relevant entries from each component (e.g., top-10), tags each result with its source, and injects the results into the system prompt. Multiple retrieval methods are supported (embedding_match, bm25_match, string_match), with specialized strategies under ongoing expansion.
Multi-Agent Workflows: Memory Update—upon new input, the system performs a coarse search over the memory base and passes the results to the Meta Memory Manager, which routes them to the relevant Memory Managers for parallel updates, deduplicates entries, and confirms completion.
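The component schemas and the Active Retrieval step can be sketched in a few dataclasses. This is an illustrative reconstruction from the field names listed above, not the paper's actual code; the class names, the `string_match_topk` helper, and its scoring rule are assumptions standing in for one of the retrieval methods (string_match).

```python
from dataclasses import dataclass, field
from datetime import datetime

# Illustrative schemas using the field names given in the paper.
@dataclass
class EpisodicEvent:
    event_type: str
    summary: str
    details: str
    actor: str
    timestamp: datetime

@dataclass
class SemanticEntry:
    name: str
    summary: str
    details: str
    source: str

@dataclass
class ProceduralEntry:
    entry_type: str
    description: str
    steps: list[str] = field(default_factory=list)

@dataclass
class Resource:
    title: str
    summary: str
    resource_type: str  # doc, markdown, pdf_text, image, voice_transcript
    content: str

@dataclass
class VaultEntry:
    entry_type: str
    source: str
    sensitivity: str
    secret_value: str

def string_match_topk(entries, topic, k=10, text=lambda e: e.summary):
    """Naive stand-in for the string_match retrieval method: rank a
    component's entries by word overlap with the inferred topic and
    return the top-k (top-10 by default, as in the paper)."""
    topic_words = set(topic.lower().split())
    scored = sorted(
        entries,
        key=lambda e: len(topic_words & set(text(e).lower().split())),
        reverse=True,
    )
    return scored[:k]
```

In the real system an analogous top-k query would run against every component in parallel, with embedding- or BM25-based scoring replacing the word-overlap placeholder.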
Conversational Retrieval—the Chat Agent performs an initial coarse retrieval across components, selects targeted retrieval methods per component, consolidates the results, synthesizes a response, and precisely updates memories when the user supplies new facts or corrections. The application implementation includes a React-Electron frontend and a Uvicorn backend; real-time screen monitoring (a screenshot every 1.5 seconds, similarity-based deduplication, and streaming uploads via Google Cloud URLs for Gemini); memory updates triggered after every 20 unique screenshots (roughly 60 seconds); and visualization interfaces for semantic and procedural memories.
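The screen-monitoring loop described above can be sketched as follows. This is a minimal hypothetical reconstruction, assuming a similarity threshold and a toy pixel-comparison metric; the application's actual deduplication logic (and its capture, upload, and update machinery) is not specified at this level of detail in the paper.

```python
class ScreenMonitor:
    """Sketch of the trigger logic: capture roughly every 1.5 s, drop
    near-duplicate frames, fire a memory update per 20 unique shots."""

    CAPTURE_INTERVAL_S = 1.5
    UNIQUE_BATCH_SIZE = 20      # ~60 s of distinct activity
    SIMILARITY_THRESHOLD = 0.95  # assumed value, not from the paper

    def __init__(self, on_batch):
        # on_batch: callback that streams the batch (e.g., to Gemini
        # via Google Cloud URLs) and triggers the memory update.
        self.on_batch = on_batch
        self.last_frame = None
        self.pending = []

    @staticmethod
    def similarity(a, b):
        """Toy frame similarity: fraction of identical positions.
        A real implementation would use a perceptual hash or embedding."""
        same = sum(1 for x, y in zip(a, b) if x == y)
        return same / max(len(a), len(b))

    def offer(self, frame):
        """Called once per capture tick with the latest screenshot."""
        if self.last_frame is not None and \
           self.similarity(frame, self.last_frame) >= self.SIMILARITY_THRESHOLD:
            return  # near-duplicate of the previous frame: skip it
        self.last_frame = frame
        self.pending.append(frame)
        if len(self.pending) >= self.UNIQUE_BATCH_SIZE:
            self.on_batch(self.pending)  # upload + trigger memory update
            self.pending = []
```

The batching amortizes upload and memory-update cost: duplicates never reach storage, which is consistent with the large storage savings reported in the evaluation.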
ScreenshotVQA: Compared methods include Gemini (long-context with resized images) and SigLIP@50 (retrieval + Gemini). Overall accuracy and storage: Gemini 0.1166 accuracy, 236.70 MB storage; SigLIP@50 0.4410 accuracy, 15.07 GB storage; MIRIX 0.5950 accuracy, 15.89 MB storage. Per participant, MIRIX achieves the highest accuracies with drastically lower storage (e.g., Student 3: 0.6727 accuracy at 7.28 MB). Reported improvements: MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage by 99.9%, and a 410% improvement over the long-context baseline with a 93.3% storage reduction. LOCOMO: Using gpt-4.1-mini for all baselines, MIRIX attains state-of-the-art overall accuracy of 85.38%, outperforming LangMem (78.05) and Zep (79.09) and approaching Full-Context (87.52). Category-wise scores for MIRIX: Single-Hop 85.11, Multi-Hop 83.70, Open-Domain 65.62, Temporal 88.39. MIRIX shows especially strong gains on Multi-Hop (more than 24 points over baselines) and Temporal tasks, validating its hierarchical storage and routing. An application powered by MIRIX demonstrates practical utility: real-time screen monitoring, personalized memory building, and secure local storage.
The findings show that a structured, compositional memory architecture with intelligent routing and multi-agent management can substantially improve accuracy and efficiency for LLM agents. MIRIX’s six memory components enable precise storage of time-stamped events, abstract concepts, instructions, resources, and sensitive verbatim data, while Active Retrieval ensures relevant personalized memories are automatically surfaced during interaction. On multimodal ScreenshotVQA, MIRIX’s abstraction and avoidance of raw image storage enable superior accuracy at orders-of-magnitude lower storage. On LOCOMO, explicit consolidation in episodic/semantic memories reduces reasoning burden at query time, yielding large gains on multi-hop and temporal questions. These results address the core research question—how to enable LLM agents to truly remember and utilize long-term, multimodal, user-specific information—by demonstrating improved performance and scalability over RAG and long-context baselines.
MIRIX introduces a comprehensive, modular memory system for LLM-based agents with six specialized components coordinated by a Meta Memory Manager and empowered by Active Retrieval. Evaluations on a new multimodal ScreenshotVQA benchmark and the LOCOMO dataset show substantial improvements in accuracy and storage efficiency, establishing MIRIX as a state-of-the-art memory system for agents. The released personal assistant application demonstrates real-world utility by continuously building and leveraging personalized memory from screen activity. Future work includes developing more challenging real-world benchmarks, further refining retrieval strategies and routing, and continuously improving the application to enhance user experience and privacy-preserving storage.
The ScreenshotVQA evaluation is based on data from three participants with user-authored questions, which may limit generalizability. Storage comparisons rely on specific resizing or retrieval setups; baseline systems that cannot process multimodal inputs were omitted. The primary evaluation metric uses LLM-as-a-Judge (GPT-4.1), which may introduce judgment biases. For LOCOMO, the adversarial category was excluded, following prior work, which affects comprehensiveness. MIRIX’s performance on Single-Hop questions can be affected by ambiguous question phrasing (e.g., plan versus actual event). Open-domain questions reveal a reliance on retrieval that may limit global understanding, indicating a bottleneck compared to full-context reasoning models.