Cheap, local, and it tells you when it can't answer

Earlier this week we published memkeeper's retrieval numbers on LoCoMo: retrieval recall, scored without an answering model, not a head-to-head against other memory products. This post adds the layer that was missing, full answer accuracy, on the same axis the commercial systems report.

The result: memkeeper posts a four-category answer accuracy of 0.720 while the memory layer runs on your own machine at no marginal cost. It also declines to answer when the evidence is not there, which is the part of the picture most leaderboards leave out.

Same axis, this time

The earlier post measured whether the right evidence came back in the top results. This one runs the full pipeline: for each of LoCoMo's 1,986 questions we retrieve, hand the context to an answering model, and score the answer against the annotated ground truth with an LLM judge. That is the standard end-to-end way these systems are measured, with one large caveat we get to below.
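A minimal sketch of that retrieve, answer, judge loop is below. It is an illustration under stated assumptions, not the harness in the repo: `Question`, `evaluate`, and the three callables `retrieve`, `answer`, and `judge` are hypothetical names standing in for the retriever, the answering model, and the LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Question:
    text: str       # the benchmark question
    gold: str       # annotated ground-truth answer
    category: str   # benchmark category label


def evaluate(
    questions: Iterable[Question],
    retrieve: Callable[[str], List[str]],      # hypothetical: query -> context passages
    answer: Callable[[str, List[str]], str],   # hypothetical: answering model
    judge: Callable[[str, str, str], bool],    # hypothetical: LLM judge, True if correct
) -> float:
    """Return the fraction of questions the judge marks correct."""
    correct = 0
    total = 0
    for q in questions:
        context = retrieve(q.text)              # step 1: retrieve evidence
        predicted = answer(q.text, context)     # step 2: answer from that evidence
        if judge(q.text, q.gold, predicted):    # step 3: score against ground truth
            correct += 1
        total += 1
    return correct / total if total else 0.0
```

Because the judge is passed in as a parameter, swapping the judge model or its prompt changes the score without touching the rest of the loop, which is why numbers from different harnesses are hard to compare.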

LoCoMo, memkeeper                          score
4-category answer-accuracy aggregate       0.720
adversarial (declines the unanswerable)    0.496

memkeeper's four-category aggregate is 0.720. We used to set that next to published commercial numbers; we have stopped, and the note below says why. The claim that stands without a competitor in the frame is the one under the number: this is an open-source layer running on your laptop, not a hosted product that bills per call.

Why we stopped posting a comparison number. Answer accuracy on these benchmarks is scored by an LLM judge, and every team uses its own judge model, prompts, and harness, so the figures are not measured the same way. LoCoMo in particular is contested: published results for the same systems range from the high 50s to the low 90s, and one of its most-cited numbers was revised down by roughly twenty-five points after a public methodology dispute. So we report our own number and our method, and leave the leaderboard jousting alone.

memkeeper is one SQLite file and a local model. There is no per-query API charge for the memory layer, no memory data leaves your machine in the on-device configuration, and warm retrieval comes back in about 25 ms. The accuracy holds up, the marginal cost is zero, and the memory stays local. That is the trade.
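If you want to check a warm-latency figure like that on your own hardware, a rough sketch of the measurement follows. Everything in it is an assumption for illustration: the `memories` table, its columns, and the `LIKE` query are stand-ins, not memkeeper's actual schema or retrieval method, and `warm_latency_ms` is a hypothetical helper.

```python
import sqlite3
import statistics
import time
from typing import Callable


def warm_latency_ms(run_query: Callable[[], object], repeats: int = 20) -> float:
    """Median latency in milliseconds, measured after one untimed warm-up call."""
    run_query()  # warm-up: populate caches, not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in store: an in-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO memories (body) VALUES (?)",
    [(f"note {i}",) for i in range(1000)],
)


def query():
    # Placeholder lookup; a real retriever would do more than a LIKE scan.
    return conn.execute(
        "SELECT id, body FROM memories WHERE body LIKE ?", ("%note 42%",)
    ).fetchall()


print(f"warm median: {warm_latency_ms(query):.3f} ms")
```

Reporting the median over several warm runs, rather than a single call, keeps one slow outlier from skewing the figure.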

The adversarial category

LoCoMo has a fifth category, adversarial: questions the conversation does not actually support, where the correct response is to decline rather than fabricate an answer. memkeeper scores 0.496 here. It is the weakest of the categories and an open item we are still working on. We include it anyway, because the right behavior on an unanswerable question is to say so.

On LongMemEval, our next benchmark, the abstention score over the full set of thirty false-premise questions landed at 0.833: the system declines on twenty-five of the thirty instead of guessing. LongMemEval is the harder benchmark and has its own write-up. The behavior carries over: memkeeper still tells you when it can't answer.
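The arithmetic behind that figure is just declined questions over total questions. A small worked check, where `abstention_rate` is a hypothetical helper and not a function from the harness:

```python
def abstention_rate(declined: int, total: int) -> float:
    """Fraction of unanswerable questions on which the system declined."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= declined <= total:
        raise ValueError("declined must be between 0 and total")
    return declined / total


rate = abstention_rate(declined=25, total=30)
print(f"{rate:.3f}")  # prints 0.833
```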

Why abstention matters. A memory layer that always returns a confident answer is a liability. The failure mode is silent: the agent acts on a fabricated recall and you find out downstream. A system that declines on an unsupported question fails loud instead.

Context, not just a snippet

There is a structural difference under the numbers. A lot of memory systems extract isolated facts and hand back bare strings. memkeeper stores each memory with its surrounding context and its internal provenance. Normal reads keep the source hidden unless you explicitly ask for it, so the agent gets the fact and the frame it lived in without source metadata riding along on every recall. That is part of why the harder, multi-step questions hold up. It does not show in a single accuracy figure, but it shows the moment you actually use it.
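One way to picture that shape is a record that carries fact, context, and source together, with a read path that omits the source by default. This is a sketch under assumptions, not memkeeper's actual data model: the `Memory` class, its field names, and the `read` function with its `include_source` flag are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Memory:
    fact: str      # the extracted statement
    context: str   # the surrounding text the fact came from
    source: str    # internal provenance, hidden on normal reads


def read(memory: Memory, include_source: bool = False) -> Dict[str, str]:
    """Return the fact with its context; add provenance only on request."""
    view = {"fact": memory.fact, "context": memory.context}
    if include_source:
        view["source"] = memory.source
    return view


m = Memory(
    fact="Ana moved to Lisbon",
    context="Ana said she moved to Lisbon after the job offer came through.",
    source="session-3/turn-12",  # made-up identifier for illustration
)
print(read(m))                       # fact and context only
print(read(m, include_source=True))  # provenance included on explicit request
```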

Run it yourself

The method and the harness are in the repo, so you can run the same script and check these numbers against your own machine. One note on the runs above: an earlier full-scale run was labeled as the hybrid pipeline when it had actually fallen back to keyword-only retrieval. We caught it, re-ran it correctly, and the corrected numbers are what appear here.
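A mislabeled run of that kind is easy to guard against by comparing the retrieval mode that was requested with the one that actually ran, and failing loudly on a mismatch. The sketch below is one possible guard, not what the harness does: `RunReport`, `RetrievalModeMismatch`, `check_mode`, and the mode strings are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RunReport:
    requested_mode: str  # what the run was labeled as, e.g. "hybrid"
    actual_mode: str     # what the retriever reports it really used


class RetrievalModeMismatch(RuntimeError):
    """Raised when a run silently fell back to a different retrieval mode."""


def check_mode(report: RunReport) -> None:
    """Refuse to accept results whose label does not match what actually ran."""
    if report.actual_mode != report.requested_mode:
        raise RetrievalModeMismatch(
            f"requested {report.requested_mode!r} "
            f"but ran {report.actual_mode!r}; discard and re-run"
        )


check_mode(RunReport(requested_mode="hybrid", actual_mode="hybrid"))  # passes silently
```

Calling the check before any score is written out means a keyword-only fallback stops the run instead of producing numbers under the wrong label.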

Local, zero marginal cost, honest about what it cannot answer, and reproducible from the harness in the repo. Clone it and run it yourself.