Benchmarks · Engineering · June 23, 2026

Benchmarking AI agent memory on LoCoMo

Every memory system for AI agents makes the same claim: it remembers the right thing at the right time. Very few show their work. Recall is easy to assert and hard to measure honestly, so the field is full of demos and short on numbers.

So we measured. We ran memkeeper's retrieval against LoCoMo, a public long-term conversational QA benchmark. The numbers below are out-of-the-box: what a plain `cargo build --release` plus `memkeeper pull-models` produces, with no cherry-picking and a script you can run yourself.

What LoCoMo measures

LoCoMo is a set of long, multi-session dialogues with annotated questions, where the evidence needed to answer each question lives somewhere back in the conversation. It is a good proxy for the agent-memory problem: a fact gets stated once, many turns later something depends on it, and the system has to surface it again.

Our harness seeds every dialogue turn as a memory, then for each question runs memkeeper's bounded pack retrieval and scores whether the annotated evidence turns come back in the top 20. The full locomo10 set is 10 dialogues, 5,882 turn-memories, and 1,982 evidence-bearing questions.
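As a rough illustration of the seeding step, the sketch below flattens one dialogue into per-turn memories and keeps only the questions that carry annotated evidence. The field names (`conversation`, `session_N`, `dia_id`, `qa`, `evidence`) are our assumption about the locomo10.json layout, the toy sample is invented, and `flatten_turns` and `evidence_questions` are hypothetical helpers, not functions from the shipped harness.

```python
import re

# Toy sample in the layout we ASSUME for locomo10.json (field names are assumptions).
sample = {
    "conversation": {
        "speaker_a": "Ana",
        "speaker_b": "Ben",
        "session_1_date_time": "1:00 pm on 8 May, 2023",
        "session_1": [
            {"speaker": "Ana", "dia_id": "D1:1", "text": "I adopted a dog named Miso."},
            {"speaker": "Ben", "dia_id": "D1:2", "text": "Congratulations!"},
        ],
        "session_2": [
            {"speaker": "Ben", "dia_id": "D2:1", "text": "How is the dog doing?"},
        ],
    },
    "qa": [
        {"question": "What is Ana's dog called?", "evidence": ["D1:1"]},
        {"question": "A question with no annotated evidence", "evidence": []},
    ],
}

# Matches "session_1", "session_2", ... but not "session_1_date_time".
SESSION_KEY = re.compile(r"^session_(\d+)$")


def flatten_turns(sample):
    """Return one memory record per dialogue turn, in session order."""
    conv = sample["conversation"]
    sessions = sorted(
        (int(m.group(1)), key)
        for key in conv
        if (m := SESSION_KEY.match(key))
    )
    memories = []
    for _, key in sessions:
        for turn in conv[key]:
            memories.append(
                {"id": turn["dia_id"], "text": f'{turn["speaker"]}: {turn["text"]}'}
            )
    return memories


def evidence_questions(sample):
    """Keep only questions that have at least one annotated evidence turn."""
    return [qa for qa in sample["qa"] if qa.get("evidence")]


memories = flatten_turns(sample)
questions = evidence_questions(sample)
print(len(memories), "memories,", len(questions), "evidence-bearing question(s)")
```

Keying each memory by its turn ID is what lets the scorer later compare retrieved memories against the annotated evidence without any text matching.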

One deliberate limitation: the harness is retrieval-only. It scores whether the right evidence is retrieved. It does not call an answering model and does not use an LLM judge. That keeps the measurement about the part we control, the retrieval, and removes the confound of a generator papering over bad recall or a judge scoring generously.

The numbers

Config: --max-memories 20 --max-chars 8000 --rerank-candidates 50.

Config	recall@20	hit@20	MRR
Default (semantic + rerank)	0.768	0.880	0.668
+ late-interaction (ColBERT)	0.784	0.894	0.666

The default row is the baseline anyone gets on install: hybrid semantic retrieval with a cross-encoder rerank. hit@20 of 0.880 means that for ~88% of questions, at least one annotated evidence turn lands in the top 20 memories memkeeper returns.
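To make the three columns concrete, here is a minimal sketch of how such metrics are commonly computed per question and then averaged. hit@20 follows the definition above. For recall@20 and MRR we assume the standard definitions (fraction of evidence turns found in the top k, and reciprocal rank of the first evidence turn within the top k). The shipped harness may differ in details, and `score_question` and `aggregate` are hypothetical names.

```python
def score_question(retrieved_ids, evidence_ids, k=20):
    """Score one question; returns (recall@k, hit@k, reciprocal rank)."""
    top_k = retrieved_ids[:k]
    evidence = set(evidence_ids)
    found = evidence.intersection(top_k)

    recall = len(found) / len(evidence)   # share of evidence turns retrieved
    hit = 1.0 if found else 0.0           # at least one evidence turn retrieved

    # Reciprocal rank of the first evidence turn; 0 if none appears in the top k
    # (truncating at k is an assumption).
    rr = 0.0
    for rank, mem_id in enumerate(top_k, start=1):
        if mem_id in evidence:
            rr = 1.0 / rank
            break
    return recall, hit, rr


def aggregate(per_question):
    """Average each metric over all questions."""
    n = len(per_question)
    return tuple(sum(column) / n for column in zip(*per_question))


# Two invented questions: one partial success, one complete miss.
scores = [
    score_question(["D3:4", "D1:1", "D2:7"], ["D1:1", "D5:2"]),
    score_question(["D9:9"], ["D1:1"]),
]
recall_at_20, hit_at_20, mrr = aggregate(scores)
print(recall_at_20, hit_at_20, mrr)
```

Under these definitions hit@20 can never be lower than recall@20, which matches the table: a question with two evidence turns and only one retrieved counts as a full hit but half a recall.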

The second row adds an optional ColBERT late-interaction pass on top. It lifts hit@20 to 0.894 and recall@20 to 0.784, and it slightly lowers MRR (0.666 vs 0.668), so it is an honest tradeoff rather than a free win. It is off by default and its model is not fetched by pull-models, so we report it as an upgrade, not the headline.

Latency: run it warm

Retrieval quality is half the story. The other half is whether you can afford to call it at prompt time. memkeeper loads its ONNX models once in a persistent serve daemon; the alternative is a cold per-call binary that reloads the models on every query.

Path	p50	p95
Warm `serve` search	24.9 ms	25.5 ms
Cold per-call CLI search	799 ms	815 ms

The warm daemon is about 32x faster. The lesson is simple: run memkeeper as a persistent process if you are calling it inside a prompt loop. (These are 30-run measurements on a 30-memory store. The heavier semantic pack path, a 50-candidate rerank over the whole LoCoMo store, is a separate, slower operation.)
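The speedup figure is simple arithmetic on the table, and the percentiles can be computed from raw timings in a few lines. The sketch below uses a nearest-rank percentile, which is an assumption: the post does not say which percentile method the measurements used. `percentile` is a hypothetical helper.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Figures from the latency table, in milliseconds.
warm_p50_ms, cold_p50_ms = 24.9, 799.0
warm_p95_ms, cold_p95_ms = 25.5, 815.0

speedup_p50 = cold_p50_ms / warm_p50_ms
speedup_p95 = cold_p95_ms / warm_p95_ms
print(f"p50 speedup: {speedup_p50:.1f}x, p95 speedup: {speedup_p95:.1f}x")
```

Both ratios round to 32, so the "about 32x" claim holds at the median and at the tail.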

Reproduce it

None of this is interesting if you cannot run it. The harness ships in the repo:

cargo build --release
scripts/fetch-models.sh          # mxbai embed + rerank, ~2.1GB
# download locomo10.json from snap-research/locomo (CC BY-NC 4.0)

export MEMKEEPER_EMBED_MODEL_DIR=~/.memkeeper/models/mxbai-embed-large
export MEMKEEPER_RERANK_MODEL_DIR=~/.memkeeper/models/mxbai-rerank-base
./target/release/memkeeper serve --socket /tmp/mk-bench.sock &

MEMKEEPER_BENCH_SOCK=/tmp/mk-bench.sock \
  python3 scripts/memkeeper_locomo_benchmark.py \
    --dataset path/to/locomo10.json \
    --binary ./target/release/memkeeper \
    --max-memories 20 --max-chars 8000 --rerank-candidates 50 --json

On honesty. These are reproducible single-machine numbers, not a controlled cross-system comparison against other memory products. We are publishing the method and the harness precisely so the claim is checkable rather than taken on faith. If you run it and see something different, we want to hear about it.

Measured, not asserted. That is the bar we want to hold memory systems to, starting with our own.

Want the design behind these numbers? Read why hybrid retrieval beats pure vector search.