Benchmarks · Positioning · June 26, 2026

Cheap, local, and it tells you when it can't answer

Earlier this week we published memkeeper's retrieval numbers on LoCoMo. Those were retrieval recall, scored without an answering model, and not a head-to-head against other memory products. This post adds the missing layer: full answer accuracy, measured on the same axis the commercial systems report.

The result: memkeeper is competitive on accuracy while running entirely on your own machine at no marginal cost. It also declines to answer when the evidence is not there, which is the part of the picture most leaderboards leave out.

Same axis, this time

The earlier post measured whether the right evidence came back in the top results. This one runs the full pipeline: for each of LoCoMo's 1,986 questions we retrieve, hand the context to an answering model, and score the answer against the annotated ground truth with an LLM judge. That is the same end-to-end measurement the published commercial memory systems use, so the numbers are directly comparable.
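As a rough illustration of that loop, here is a minimal sketch of a retrieve, answer, judge pass. It is not the harness from the repo: `Question`, `Result`, `evaluate`, and the three callables (`retrieve`, `answer`, `judge`) are hypothetical names chosen for this example, and the real answering model and LLM judge are stood in for by plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Question:
    text: str
    gold: str        # annotated ground-truth answer
    category: str    # benchmark category label


@dataclass
class Result:
    question: Question
    predicted: str
    correct: bool


def evaluate(
    questions: Iterable[Question],
    retrieve: Callable[[str], list],         # hypothetical: query -> context passages
    answer: Callable[[str, list], str],      # hypothetical: (query, context) -> answer text
    judge: Callable[[str, str, str], bool],  # hypothetical: (query, gold, predicted) -> verdict
) -> list:
    """Run every question through retrieval, answering, and judging."""
    results = []
    for q in questions:
        context = retrieve(q.text)
        predicted = answer(q.text, context)
        results.append(Result(q, predicted, judge(q.text, q.gold, predicted)))
    return results


def accuracy(results: list) -> float:
    """Fraction of results the judge marked correct."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.correct for r in results) / len(results)
```

Passing the three stages in as callables keeps the loop itself independent of which retriever, answering model, or judge is plugged in, which is what makes an end-to-end score sensitive to all three.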

System	LoCoMo answer accuracy
memkeeper (4-category aggregate)	0.720
Published commercial memory systems	~0.67–0.68

memkeeper's four-category aggregate is 0.720, above the roughly 0.67–0.68 that leading commercial agent-memory systems publish for answer accuracy on this benchmark. An open-source layer running on your laptop is in the same band as hosted products that bill per call.

memkeeper is one SQLite file and a local model. There is no per-query API charge for the memory layer, no data leaves your machine, and warm retrieval comes back in about 25 ms. The accuracy is competitive, the marginal cost is zero, and the data stays local. That is the trade.

The adversarial category

LoCoMo has a fifth category, adversarial: questions the conversation does not actually support, where the correct response is to decline rather than fabricate an answer. memkeeper scores 0.496 here. It is our weakest category and an open item we are still working on. We report it anyway because the right behavior on an unanswerable question is to say so, and that behavior should be measured.
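To make the split between the four-category aggregate and the adversarial score concrete, here is a small sketch of one way to compute both from per-question verdicts. It assumes a question-weighted aggregate (every non-adversarial question counts equally); whether the published figure is weighted this way is not stated here, so treat that as an assumption. The function names and the category label strings are illustrative, not taken from the harness.

```python
from collections import defaultdict

ADVERSARIAL = "adversarial"  # assumed label for the fifth category


def per_category_accuracy(records):
    """records: iterable of (category, correct) pairs -> {category: accuracy}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, correct in records:
        totals[category] += 1
        hits[category] += int(correct)
    return {category: hits[category] / totals[category] for category in totals}


def aggregate_excluding(records, excluded=ADVERSARIAL):
    """Question-weighted accuracy over every category except `excluded`."""
    kept = [bool(correct) for category, correct in records if category != excluded]
    if not kept:
        raise ValueError("no records outside the excluded category")
    return sum(kept) / len(kept)
```

A macro average (the mean of the per-category accuracies) would generally give a different number when categories differ in size, which is why it helps to say which one a headline figure uses.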

On LongMemEval, our next benchmark, the abstention score over the full set of thirty false-premise questions landed at 0.833: the system declined on twenty-five of the thirty instead of guessing. LongMemEval is the harder benchmark and we wrote that result up on its own. Even there, it still tells you when it can't answer.
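The 0.833 figure is just twenty-five declines out of thirty questions. A minimal sketch of that arithmetic, assuming each response has already been reduced to a boolean "declined" flag (how a response gets classified as a decline is not covered here, and `abstention_rate` is a name made up for this example):

```python
def abstention_rate(declined):
    """Share of unanswerable questions on which the system declined."""
    declined = list(declined)
    if not declined:
        raise ValueError("no questions to score")
    return sum(bool(d) for d in declined) / len(declined)


# Twenty-five declines and five guesses, matching the counts quoted above.
flags = [True] * 25 + [False] * 5
rate = abstention_rate(flags)
```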

Why abstention matters. A memory layer that always returns a confident answer is a liability. The failure mode is silent: the agent acts on a fabricated recall and you find out downstream. A system that declines on an unsupported question fails loud instead.

Context, not just a snippet

There is a structural difference under the numbers. A lot of memory systems extract isolated facts and hand back bare strings. memkeeper returns memories with their surrounding context and provenance, so the agent gets the fact and the frame it lived in. That is part of why the harder, multi-step questions hold up. It does not show in a single accuracy figure, but it shows the moment you actually use it.
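To picture the difference between a bare string and a memory that carries its frame, here is an illustrative data shape. This is not memkeeper's actual schema: the `Memory` and `Provenance` classes, their field names, and `render_for_prompt` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    source_id: str    # hypothetical: which conversation or session the memory came from
    turn_index: int   # hypothetical: position of the originating turn
    timestamp: str    # hypothetical: when it was said, as an ISO 8601 string


@dataclass(frozen=True)
class Memory:
    fact: str               # the extracted statement itself
    context: tuple          # surrounding turns, oldest first
    provenance: Provenance


def render_for_prompt(memory: Memory) -> str:
    """Format a memory so the answering model sees the fact and its frame."""
    p = memory.provenance
    lines = [f"[{p.source_id} turn {p.turn_index} @ {p.timestamp}]"]
    lines.extend(f"  {turn}" for turn in memory.context)
    lines.append(f"  => {memory.fact}")
    return "\n".join(lines)
```

A bare-string system would hand the agent only the `fact` field; the extra fields are what let a multi-step question tie one memory to another by time or source.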

Run it yourself

The method and the harness are in the repo, so you can run the same script on your own machine and check these numbers. One note on the runs above: an earlier full-scale run was labeled as the hybrid pipeline when it had actually fallen back to keyword-only retrieval. We caught it, re-ran it correctly, and the corrected numbers are what appear here.
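A mislabeled fallback like that one is the kind of error a small guard can turn into a hard failure. The sketch below shows the general idea only; it is not how the harness is implemented. `RetrievalOutcome`, `require_mode`, and the mode strings are hypothetical, and it assumes the retriever can report which mode it actually used.

```python
from dataclasses import dataclass


class RetrievalModeMismatch(RuntimeError):
    """Raised when a run used a different retrieval mode than its label claims."""


@dataclass
class RetrievalOutcome:
    hits: list        # retrieved memory identifiers
    mode_used: str    # assumed values: "hybrid" or "keyword"


def require_mode(outcome: RetrievalOutcome, expected_mode: str) -> RetrievalOutcome:
    """Return the outcome unchanged, or raise if the retriever silently fell back."""
    if outcome.mode_used != expected_mode:
        raise RetrievalModeMismatch(
            f"run is labeled {expected_mode!r} but retrieval used {outcome.mode_used!r}"
        )
    return outcome
```

Calling `require_mode(outcome, "hybrid")` on every query means a keyword-only fallback stops the benchmark run instead of producing numbers under the wrong label.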

Competitive accuracy, local, zero marginal cost, with abstention behavior reported alongside. Clone it and run the harness.

Curious how the retrieval gets there? Read the LoCoMo retrieval numbers and the reproducible harness, or why hybrid retrieval beats pure vector search.