Engineering · June 12, 2026 · 8 min read

Memanto On-Prem: Your Agents' Memory, Entirely on Your Own Hardware

Memanto now runs fully on your own infrastructure — powered by a local Moorcheh on-prem server in Docker. No API key, no data leaving your environment, and zero per-request cost with local Ollama models.

Hetkumar Patel, Software Developer


Since we launched Memanto, the most consistent request hasn't been about a new memory type or another IDE integration. It has been a boundary question: "Can we run this without our data ever leaving our infrastructure?" Today the answer is yes. Memanto On-Prem runs the entire memory agent on your own hardware — no API key, no outbound calls, no data leaving your environment.

For most teams, Moorcheh Cloud is still the fastest way to give agents persistent memory: paste a free API key and you're storing memories in under a minute. But some environments can't make that trade, no matter how good the cloud is. Hospitals working under HIPAA. Financial institutions with strict data-residency rules. Defense and research teams on air-gapped networks. Or simply developers who want to hack on agent memory on a plane, with zero per-request cost.

Memanto On-Prem exists for exactly those cases, and it isn't a stripped-down "lite" edition. It's the same memory agent, the same CLI, the same 13 memory types, and the same retrieval engine — running in Docker on a machine you control.

Moorcheh On-Prem Is the Backend

Before getting into the workflow, it's worth being clear about what's actually running under the hood, because this is what makes the whole thing possible.

Memanto has never stored memories itself. Every remember, recall, and answer call is ultimately served by Moorcheh, the information-theoretic search engine we've written about before — the one that replaces the usual HNSW + float32 + cosine-similarity stack with Maximally Informative Binarization and deterministic, exhaustive search. In the default setup, that engine lives in Moorcheh Cloud and Memanto talks to it over the network with your MOORCHEH_API_KEY.

Memanto On-Prem swaps that one piece: instead of calling Moorcheh Cloud, the CLI talks to a Moorcheh on-prem server — the same engine, packaged as a Docker container and launched locally with a single moorcheh up from the moorcheh-client package. It exposes the same primitives the cloud does: namespaces, document storage, similarity search, and grounded answer generation, all served from http://localhost:8080 on your own hardware.

“Same engine, same search quality, different address. The only thing that changes is where your memories physically live.”

This matters because on-prem deployments usually force a quality trade-off — the self-hosted version runs a weaker index, or search behaves subtly differently than production. Here there is no second implementation to drift out of sync. Memanto's service code never branches on backend; everything that works against Moorcheh Cloud works identically against the local server once you're configured.

Three Components, Zero API Keys

A full on-prem deployment is three pieces:

The Memanto CLI/server — the standard memanto binary you already use. When configured for on-prem, it routes all Moorcheh calls to http://localhost:8080 instead of the cloud.
The Moorcheh on-prem server — the containerized search and storage engine described above. This is where memories and embeddings actually live.
Embedding and LLM providers — by default, Ollama running in Docker alongside Moorcheh, which means no API keys anywhere in the stack. If you'd rather use hosted models, OpenAI and Cohere are drop-in options, and you can mix them — local embeddings with a hosted LLM, or the reverse.

| Provider | Embeddings | LLM | API key |
| --- | --- | --- | --- |
| Ollama (default) | `nomic-embed-text` | `qwen2.5` | None |
| OpenAI | `text-embedding-3-small` | `gpt-4o-mini` | Required |
| Cohere | `embed-english-v3.0` | `command-r-plus-08-2024` | Required |

With the all-Ollama configuration, the system is genuinely air-gap friendly: after the initial model pulls (roughly 5 GB), there are no outbound calls to OpenAI, Cohere, or Moorcheh at all. Embeddings, search, and answer generation all happen on your machine, at zero per-request cost.
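To make the mix-and-match rule concrete, here is a minimal Python sketch that encodes the provider table and reports which API keys a given pairing would need. The `PROVIDERS` dictionary and both function names are illustrative assumptions, not Memanto's actual configuration format; only the model names and key requirements come from the table.

```python
# Illustrative only: mirrors the provider table, not Memanto's real config schema.
PROVIDERS = {
    "ollama": {"embeddings": "nomic-embed-text", "llm": "qwen2.5", "needs_key": False},
    "openai": {"embeddings": "text-embedding-3-small", "llm": "gpt-4o-mini", "needs_key": True},
    "cohere": {"embeddings": "embed-english-v3.0", "llm": "command-r-plus-08-2024", "needs_key": True},
}


def required_api_keys(embedding_provider: str, llm_provider: str) -> set[str]:
    """Return the providers in this pairing that need an API key."""
    chosen = {embedding_provider, llm_provider}
    unknown = chosen - set(PROVIDERS)
    if unknown:
        raise ValueError(f"unknown provider(s): {sorted(unknown)}")
    return {name for name in chosen if PROVIDERS[name]["needs_key"]}


def is_fully_local(embedding_provider: str, llm_provider: str) -> bool:
    """True when neither role uses a hosted provider (the all-Ollama setup)."""
    return not required_api_keys(embedding_provider, llm_provider)
```

Under these assumptions, `is_fully_local("ollama", "ollama")` is true, while pairing local embeddings with a hosted LLM via `required_api_keys("ollama", "openai")` reports that only the OpenAI key is needed.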

The hardware bar is lower than you might expect. The minimum is 4 cores, 8 GB of RAM, and 10 GB of disk — no GPU required. For comfortable Ollama inference we recommend 16 GB of RAM and 30 GB of disk; an NVIDIA GPU speeds things up but is strictly optional. Windows (Docker Desktop with WSL2), macOS 12+, and all major Linux distributions are supported.
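If you want to sanity-check a host before installing, the thresholds above can be expressed as a small pre-flight check. This is a hedged sketch: the `classify_host` function is hypothetical and not part of Memanto, and it assumes the disk figures refer to free space. RAM is passed in by hand because the Python standard library has no portable way to read it.

```python
import os
import shutil

# Thresholds quoted in the text; everything else here is an assumption.
MIN_CORES, MIN_RAM_GB, MIN_DISK_GB = 4, 8, 10
REC_RAM_GB, REC_DISK_GB = 16, 30


def classify_host(cores: int, ram_gb: float, disk_gb: float) -> str:
    """Return 'below-minimum', 'minimum', or 'recommended' for a host."""
    if cores < MIN_CORES or ram_gb < MIN_RAM_GB or disk_gb < MIN_DISK_GB:
        return "below-minimum"
    if ram_gb >= REC_RAM_GB and disk_gb >= REC_DISK_GB:
        return "recommended"
    return "minimum"


def local_cores_and_free_disk(path: str = ".") -> tuple[int, float]:
    """Read the core count and free disk space (in GB) for this machine."""
    cores = os.cpu_count() or 0
    free_gb = shutil.disk_usage(path).free / 1e9
    return cores, free_gb
```

A typical call would be `classify_host(*local_cores_and_free_disk(), ...)` with the RAM figure supplied as the middle argument, for example `classify_host(cores, 16, free_gb)`.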

Nothing Is Missing

The promise of "full feature parity" tends to come with asterisks, so here is the list with none:

All 13 typed memory types — instruction, fact, decision, goal, commitment, preference, relationship, context, event, learning, observation, artifact, and error
The three core primitives: remember, recall, and answer
Temporal queries — --as-of, --changed-since, --recent
Batch ingestion (up to 100 memories per request) and file uploads (.pdf, .docx, .xlsx, .json, .txt, .csv, .md)
Daily summaries, conflict detection, and scheduled runs
The web UI via memanto ui
Every IDE integration — Claude Code, Cursor, Codex, Windsurf, Gemini CLI, Cline, Continue, and the rest via memanto connect
The complete REST API under /api/v2/agents/...

If your agents use it against Moorcheh Cloud today, it works on-prem tomorrow. Your prompts, your integration code, and your workflows don't change.
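Two items in the list lend themselves to a client-side guard: the fixed set of 13 memory types and the 100-memory batch limit. The sketch below validates types and splits a large import into request-sized batches. The dictionary shape (`type` and `content` keys) and the function name are assumptions for illustration; the actual request format is defined by the Memanto API, not by this snippet.

```python
# The 13 memory types listed above.
MEMORY_TYPES = frozenset({
    "instruction", "fact", "decision", "goal", "commitment", "preference",
    "relationship", "context", "event", "learning", "observation",
    "artifact", "error",
})
MAX_BATCH = 100  # batch ingestion limit quoted in the text


def split_into_batches(memories: list[dict], size: int = MAX_BATCH) -> list[list[dict]]:
    """Validate memory types, then split into batches of at most `size`."""
    if not 1 <= size <= MAX_BATCH:
        raise ValueError(f"batch size must be between 1 and {MAX_BATCH}")
    for memory in memories:
        if memory.get("type") not in MEMORY_TYPES:
            raise ValueError(f"unknown memory type: {memory.get('type')!r}")
    return [memories[i:i + size] for i in range(0, len(memories), size)]
```

Validating before splitting means a single bad record fails the whole import up front rather than partway through a sequence of requests.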

Up and Running in Ten Minutes

The setup is a wizard, not a runbook. With Docker running and Python 3.10+ on your PATH:

```bash
pip install memanto
memanto
```

The first run asks you to choose a backend:

```text
Choose your backend
  1  Moorcheh Cloud    (instant, needs API key)
> 2  Moorcheh On-Prem  (Docker, no API key)
```

Pick option 2, choose your embedding and LLM providers (just press through the defaults for all-local Ollama), and the wizard pulls the containers, starts the Moorcheh server, and waits for http://localhost:8080/health to go green. The whole thing takes 5–10 minutes, most of it model downloads.
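If you script the setup or run it in CI, you may want your own version of that wait. The following is a rough sketch of a polling loop, not the wizard's actual implementation. It assumes a healthy server answers `/health` with HTTP 200; the function names, attempt count, and delay are all illustrative.

```python
import time
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8080/health"  # endpoint named in the text


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200 (assumed to mean healthy)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False


def wait_until_healthy(
    check: Callable[[], bool],
    attempts: int = 60,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

In a script this would be called as `wait_until_healthy(lambda: http_ok(HEALTH_URL))`. Passing the check and the sleep function as parameters keeps the loop testable without a running server.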

Then prove to yourself that nothing left your machine:

```bash
memanto agent create on-prem-demo
memanto remember "The user prefers dark mode for the dashboard" --type preference
memanto recall "What theme does the user want?"
memanto answer "Based on memory, what theme should I set?"
```

That answer call — embedding the query, searching memories, and generating a grounded response — ran entirely on your hardware. Unplug the network cable and run it again if you want to be sure.
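A lighter-weight complement to pulling the cable is to confirm that the endpoint you are configured against is a loopback address. The helper below is a hypothetical sketch, not a Memanto feature, and it only inspects the URL string. It says nothing about what other processes on the machine do, so treat it as a sanity check rather than proof.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url: str) -> bool:
    """Return True if the URL's host is localhost or a loopback IP."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other DNS name is treated as non-local without resolving it.
        return False
```

Under these assumptions, `is_loopback_url("http://localhost:8080")` is true, and any hosted API hostname is reported as non-local.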

Cloud and On-Prem, Side by Side

You don't have to choose one backend forever. Memanto keeps the two completely isolated: cloud state lives in ~/.memanto/, on-prem state in ~/.memanto/on-prem/, each with its own agents, memories, and configuration. Switching is one command:

```bash
memanto config backend           # show the active backend
memanto config backend on-prem   # switch to local
memanto config backend cloud     # switch back
```

Switching never mutates or deletes data on either side — your cloud agents are exactly where you left them when you come back. The only thing intentionally cleared on a switch is the active session token, so a JWT issued for one backend can never be replayed against the other.
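The switching rules are simple enough to model in a few lines. This sketch is a toy model of the behavior described above, not Memanto's code: the `BackendConfig` class is hypothetical, and it assumes that re-selecting the backend that is already active leaves the token alone.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# State directories quoted in the text.
STATE_DIRS = {
    "cloud": Path("~/.memanto"),
    "on-prem": Path("~/.memanto/on-prem"),
}


@dataclass
class BackendConfig:
    """Toy model of the active backend and its session token."""

    active: str = "cloud"
    session_token: Optional[str] = None

    def state_dir(self) -> Path:
        """Directory holding agents, memories, and config for the active backend."""
        return STATE_DIRS[self.active]

    def switch(self, backend: str) -> None:
        """Change the active backend; the session token is the only thing cleared."""
        if backend not in STATE_DIRS:
            raise ValueError(f"unknown backend: {backend!r}")
        if backend != self.active:
            self.active = backend
            # A token issued for one backend must not be replayed on the other.
            self.session_token = None
```

Note that `switch` never touches anything under either state directory, which is what keeps cloud agents intact while you work on-prem.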

In practice this enables a workflow we use ourselves: prototype against the cloud where setup is instant, then flip to on-prem for anything involving real customer data. Same CLI, same muscle memory, different jurisdiction.

From Laptop to Kubernetes

A single-developer laptop setup is where most teams start, but on-prem really pays off when the whole team shares one memory deployment inside the network perimeter.

memanto serve starts the full REST API as a FastAPI server on port 8000, and it deploys like any stateless web service: Docker, Docker Compose with Moorcheh and Ollama as sibling services, systemd on a bare VM, or Kubernetes. The Memanto server itself holds no state — no leader election, no shared volumes — so it scales horizontally behind a load balancer without ceremony. The reference Kubernetes manifests run three replicas with an autoscaler, Moorcheh as an in-cluster service with a persistent volume for memory storage, and an optional GPU-backed Ollama pod for inference. NetworkPolicies lock Moorcheh down so only Memanto pods can reach it.

The health endpoints are split deliberately: /ready answers the liveness probe, while /health reports actual backend connectivity and serves as the readiness check. A transient hiccup in the search backend therefore takes a pod out of rotation without triggering a restart storm.
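The reasoning behind the split follows from how Kubernetes treats its probes: a failed liveness probe restarts the container, while a failed readiness probe only removes the pod from service endpoints. The sketch below models that decision, assuming /ready is wired to the liveness probe and /health to the readiness probe, as the paragraph above implies. The function and its return labels are illustrative.

```python
def pod_action(ready_endpoint_ok: bool, health_endpoint_ok: bool) -> str:
    """Map the two endpoint results to what Kubernetes does with the pod.

    ready_endpoint_ok:  result of /ready (assumed liveness probe target)
    health_endpoint_ok: result of /health (assumed readiness probe target)
    """
    if not ready_endpoint_ok:
        # Liveness failure: the kubelet restarts the container.
        return "restart"
    if not health_endpoint_ok:
        # Readiness failure: traffic stops, but the process keeps running.
        return "remove-from-rotation"
    return "serve-traffic"
```

With a single combined endpoint, a backend outage would fail liveness on every replica at once and restart them all; with the split, `pod_action(True, False)` only pauses traffic until the backend recovers.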

The result is a private, self-contained memory service for every agent in your organization: your IDE assistants, your CrewAI pipelines, and your production agents all sharing one memory agent that never crosses your network boundary.

Get Started

If your agents handle data that can't leave the building, there's no longer a reason they have to forget everything between sessions.

Install: pip install memanto, run memanto, pick option 2.
Read the docs: the full On-Prem Overview, from requirements to Kubernetes deployment.
Already on cloud? Try memanto config backend on-prem — your cloud data stays untouched, and you can switch back any time.