Stop Overengineering Agentic Memory
Why complex graph-based memory architectures may be overkill, and how simple, highly optimized RAG achieves state-of-the-art results on the major long-term memory benchmarks, LoCoMo and LongMemEval.
Why we built Memanto differently
Long-term memory breaks at scale
AI systems today forget key facts over extended interactions, losing continuity. When conversations span thousands of messages across dozens of distinct sessions, extracting the right fragments from that massive historical corpus to fit within a constrained context window remains the primary bottleneck for true agentic memory.
Even with today's massive context windows, indiscriminately dumping months of history degrades reasoning capabilities, leading to hallucinations and exorbitant inference costs.
The rush toward graph complexity
To solve retrieval at scale, competitors turned to complex graph databases. These systems explicitly map relationships between entities in knowledge graphs or hybrid semantic-graph engines, letting agents traverse nodes to find answers.
While effective in theory, this approach introduces massive overhead: managing schemas, handling graph ingestion latency, and writing complex traversal queries. Every time new conversational data is ingested, the framework must invoke an LLM to extract entities and define relationships, turning the simple act of saving a memory into a compute-heavy, synchronous process.
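To make that overhead concrete, here is a minimal sketch of a graph-style write path. The `extract_graph` stub and its `Edge` schema are hypothetical illustrations, not any particular framework's API:

```python
from dataclasses import dataclass

@dataclass
class Edge:
    source: str
    relation: str
    target: str

def extract_graph(message: str) -> list[Edge]:
    """Stand-in for the LLM extraction step graph-memory frameworks
    run synchronously on every write. A real framework would prompt
    an LLM here to emit entities and typed relations."""
    # Stubbed so the sketch runs without an API key.
    return [Edge("user", "mentioned", message)]

def save_memory_graph_style(message: str, graph_db: dict) -> None:
    # The write path blocks on extraction: at least one LLM
    # round-trip plus schema handling before anything is persisted.
    for edge in extract_graph(message):
        graph_db.setdefault(edge.source, []).append((edge.relation, edge.target))

graph_db: dict = {}
save_memory_graph_style("I moved to Berlin last March.", graph_db)
```

Every saved message pays that extraction tax up front, before the memory is usable.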
Challenging the status quo with pure semantics
At Moorcheh, we built a highly optimized Semantic Engine from the ground up to solve the "RAM tax" of traditional vector databases. By moving away from bloated, in-memory HNSW + cosine distance stacks in favor of information-theoretic retrieval, we created an engine capable of sub-40ms serverless latency across 100k+ namespaces.
Before Memanto, we asked a simple question: is all this graph-database complexity really necessary? We wanted to see just how far we could push a standard RAG pipeline, stripped of complexity, relying solely on high-quality semantic search.
How we got to SOTA
A step-by-step benchmarking progression, each iteration building on the last, with no graph databases added at any stage.
Naive RAG Baseline
Simple semantic search against ingested memory chunks: a top-10 retrieval limit, a 0.15 ITS (Information Theoretic Score) similarity threshold, and Claude Sonnet 4 as the inference model.
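Below is a minimal sketch of this baseline; the `client.search` interface and its parameter names are illustrative assumptions, not Moorcheh's documented SDK:

```python
# Baseline retrieval sketch. `client.search` and its parameters are
# illustrative assumptions, not the real SDK surface.
def retrieve(client, namespace: str, question: str,
             top_k: int = 10, threshold: float = 0.15) -> list[str]:
    """Return the text of chunks scoring above `threshold`, capped at `top_k`."""
    hits = client.search(namespace=namespace, query=question,
                         top_k=top_k, threshold=threshold)
    return [hit.text for hit in hits]
```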
Relaxed Retrieval (k=40)
Multi-session questions reference events scattered across disjoint sessions. Raising the retrieval limit to 40 chunks and dropping the threshold to 0.10 delivered immediate gains.
Recall matters far more than precision for agentic memory.
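With the hypothetical `retrieve` sketch from the baseline, this step reduces to two knob changes:

```python
# Relaxed step: same pipeline, wider net. `client`, `namespace`, and
# `question` are assumed set up as in the baseline sketch above.
chunks = retrieve(client, namespace, question, top_k=40, threshold=0.10)
```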
Optimized Prompts
Adopted and modified prompts from the Hindsight evaluation framework to ensure an apples-to-apples comparison with other systems.
Prompt engineering is a band-aid for structural retrieval deficits.
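For a sense of what "optimized prompts" means here, the template below is a purely hypothetical illustration of the shape of an answer-synthesis prompt; the prompts we actually used are adapted from Hindsight:

```python
# Hypothetical answer-synthesis template, for illustration only; the
# real prompts are adapted from the Hindsight evaluation framework.
ANSWER_PROMPT = """\
You are answering a question about a user's long conversation history.
Use ONLY the retrieved memory chunks below. If they do not contain
the answer, say so rather than guessing.

Memory chunks:
{chunks}

Question: {question}
Answer:"""

# `chunks` and `question` as set up in the retrieval sketches above.
prompt = ANSWER_PROMPT.format(chunks="\n---\n".join(chunks),
                              question=question)
```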
Dynamic Retrieval (k=100)
Expanded the dynamic retrieval limit to a maximum of 100 chunks at a 0.05 threshold. Retrieval is dynamic via Moorcheh's Information Theoretic Score: the engine pulls up to 100 chunks, but only those that clear the threshold.
Giving the semantic engine a wider runway surfaces disparate fragments more effectively than hyper-precise vector queries.
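Using the same hypothetical interface, the dynamic step widens the cap and lowers the cutoff, and the ITS threshold decides how many chunks actually come back:

```python
# Dynamic step: up to 100 candidates, but the 0.05 ITS cutoff
# determines how many return, so easy questions retrieve little
# while scattered multi-session questions get the full runway.
chunks = retrieve(client, namespace, question, top_k=100, threshold=0.05)
```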
Gemini 3 Inference
Switched the underlying model from Claude Sonnet 4 to Gemini 3 for advanced multi-hop reasoning and complex context synthesis, establishing parity with other leading benchmarked systems.
State of the art. No graphs. No ingestion tax. Single unified query.
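In pipeline terms, this final step is a one-line swap of the synthesis model; the model identifiers and `call_llm` helper below are placeholders, not exact API names:

```python
# Final step: identical retrieval and prompt, different synthesizer.
# Model ids and `call_llm` are illustrative placeholders.
INFERENCE_MODEL = "gemini-3"  # was "claude-sonnet-4"
answer = call_llm(model=INFERENCE_MODEL, prompt=prompt)
```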
Benchmark results
Final evaluation using Gemini 3 as the inference model, the same configuration used by competing systems.
Memanto leads both LoCoMo and LongMemEval benchmarks. Architecture details reflect each system's configuration at time of evaluation.
Full system comparison
| System | LoCoMo | LongMemEval | Architecture | Retrieval | Query Method |
|---|---|---|---|---|---|
| Memanto (SOTA) | 87.1% | 89.8% | Vector Only | RAG | Single Query |
| EmergenceMem | – | 86.0% | Graph + Vector | Parallel | Multi-Query |
| Supermemory | – | 85.2% | Graph + Vector | Parallel | Multi-Query |
| Memobase | 75.8% | – | Graph + Vector | Parallel | Single Query |
| Zep | 75.1% | 71.2% | Graph + Vector | Parallel | Single Query |
| Letta | 74.0% | – | Local Filesystem | RAG | Recursive |
| Full context | 72.9% | 60.2% | Full Context | Full Context | Single Query |
| Mem0 G | 68.4% | – | Graph + Vector | Parallel | Single Query |
| Mem0 | 66.9% | – | Vector Only | Parallel | Single Query |
| LangMem | 58.1% | – | Vector Only | RAG | Single Query |
The future of agentic memory
Memanto hitting 89.8% on LongMemEval and 87.1% on LoCoMo with a vector-only architecture proves that highly optimized retrieval can achieve state-of-the-art accuracy without forcing developers to pay for complex graph orchestration.
Recall over precision
In agentic memory, the semantic search recall/precision trade-off skews heavily toward recall. It is far better to retrieve noisy chunks and let a capable LLM filter context than to miss critical fragments entirely.
Zero ingestion tax
Memanto ingests raw conversational chunks directly into the vector store, bypassing the LLM extraction overhead entirely. Memories are semantically available for retrieval immediately after write.
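As a sketch, the entire write path reduces to a single upload; the `client.upload` method and document shape are assumptions for illustration, not the documented SDK:

```python
# Direct-ingestion sketch: raw chunks go straight to the vector
# store. `client.upload` and the document shape are illustrative.
def save_memory(client, namespace: str, chunk_id: str, text: str) -> None:
    client.upload(namespace=namespace,
                  documents=[{"id": chunk_id, "text": text}])
    # No LLM call, no schema, no graph build: the chunk is
    # semantically retrievable as soon as the write returns.
```

Contrast this with the graph-style write path sketched earlier, which blocks on an LLM round-trip before anything is persisted.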
Architecture simplicity wins
The next phase of agentic AI will not be won by the most complicated memory diagram. Strong, highly optimized semantic infrastructure can already deliver SOTA performance without graph orchestration.
Start building with Memanto
Explore the docs and get started with Memanto, the universal memory layer for agentic AI.