Cheap, local, and it tells you when it can't answer
Earlier this week we published memkeeper's retrieval numbers on LoCoMo. Those were retrieval recall, scored without an answering model, and not a head-to-head against other memory products. This post adds the missing layer: full answer accuracy, on the same axis the commercial systems report.
The result: memkeeper posts a four-category answer accuracy of 0.720 while the memory layer runs on your own machine at no marginal cost. It also declines to answer when the evidence is not there, which is the part of the picture most leaderboards leave out.
Same axis, this time
The earlier post measured whether the right evidence came back in the top results. This one runs the full pipeline: for each of LoCoMo's 1,986 questions we retrieve, hand the context to an answering model, and score the answer against the annotated ground truth with an LLM judge. That is the standard end-to-end way these systems are measured, with one large caveat we get to below.
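As a rough illustration of the retrieve, answer, judge loop described above, here is a minimal sketch. It is not the harness in the repo: `Question`, `evaluate`, the category labels, and the three callables are hypothetical names chosen for this example, and the stand-in judge is a plain string comparison rather than an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Question:
    text: str
    gold: str      # annotated ground-truth answer
    category: str  # illustrative label, e.g. "single-hop" or "adversarial"


def evaluate(
    questions: Iterable[Question],
    retrieve: Callable[[str], list[str]],
    answer: Callable[[str, list[str]], str],
    judge: Callable[[str, str, str], bool],
) -> dict[str, float]:
    """Return per-category accuracy: retrieve, answer, then judge each question."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for q in questions:
        context = retrieve(q.text)             # memory layer
        predicted = answer(q.text, context)    # answering model
        ok = judge(q.text, q.gold, predicted)  # judge (an LLM in the real setup)
        total[q.category] = total.get(q.category, 0) + 1
        correct[q.category] = correct.get(q.category, 0) + int(ok)
    return {cat: correct[cat] / total[cat] for cat in total}


# Toy run with stub components standing in for the real ones.
qs = [
    Question("Where did Ana move?", "Lisbon", "single-hop"),
    Question("What is Ana's dog's name?", "no answer", "adversarial"),
]
scores = evaluate(
    qs,
    retrieve=lambda text: ["Ana moved to Lisbon in May."],
    answer=lambda text, ctx: "Lisbon",
    judge=lambda text, gold, pred: gold.lower() == pred.lower(),
)
print(scores)
```

The stub answerer always says "Lisbon", so it gets the answerable question right and fails the unanswerable one, which is the failure mode the adversarial category exists to catch.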
| LoCoMo, memkeeper | score |
|---|---|
| 4-category answer-accuracy aggregate | 0.720 |
| adversarial (declines the unanswerable) | 0.496 |
memkeeper's four-category aggregate is 0.720. We used to set that next to published commercial numbers; we have stopped, and the note below says why. The claim that stands without a competitor in the frame is the one under the number: this is an open-source layer running on your laptop, not a hosted product that bills per call.
memkeeper is one SQLite file and a local model. There is no per-query API charge for the memory layer, no memory data leaves your machine in the on-device configuration, and warm retrieval comes back in about 25 ms. The accuracy holds up, the marginal cost is zero, and the memory stays local. That is the trade.
The adversarial category
LoCoMo has a fifth category, adversarial: questions the conversation does not actually support, where the correct response is to decline rather than fabricate an answer. memkeeper scores 0.496 here. It is the weakest of the categories and an open item we are still working on. We include it because a memory layer that always returns a confident answer is a liability, and the right behavior on an unanswerable question is to say so.
On LongMemEval, our next benchmark, the full thirty-question abstention number landed at 0.833: the system declines on twenty-five of thirty false-premise questions instead of guessing. We wrote that one up on its own. It is a harder benchmark, and memkeeper still tells you when it can't answer.
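The arithmetic behind the abstention figure is simple enough to check by hand. The sketch below assumes each false-premise question has already been marked as declined or not; `abstention_rate` is a hypothetical helper for this example, not a function from the harness.

```python
def abstention_rate(declined: list[bool]) -> float:
    """Fraction of unanswerable questions on which the system declined."""
    if not declined:
        raise ValueError("need at least one question")
    return sum(declined) / len(declined)


# Twenty-five declines out of thirty false-premise questions.
outcomes = [True] * 25 + [False] * 5
rate = abstention_rate(outcomes)
print(round(rate, 3))  # 0.833
```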
Context, not just a snippet
There is a structural difference under the numbers. A lot of memory systems extract isolated facts and hand back bare strings. memkeeper keeps each memory with its surrounding context and internal provenance, while normal reads keep the source hidden unless you explicitly ask for it. The agent gets the fact and the frame it lived in without source metadata attached to every recall. That is part of why the harder, multi-step questions hold up. It does not show in a single accuracy figure, but it shows the moment you actually use it.
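One way to picture that shape is a record that carries the fact, its context, and its provenance, with a read path that leaves provenance out by default. This is a sketch under assumptions, not memkeeper's actual schema or API: the `Memory` fields and the `read` function with its `include_source` flag are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    fact: str     # the extracted statement
    context: str  # the surrounding text it came from
    source: str   # internal provenance (hypothetical field)


def read(memory: Memory, include_source: bool = False) -> dict[str, str]:
    """Return the fact with its context; add provenance only on request."""
    view = {"fact": memory.fact, "context": memory.context}
    if include_source:
        view["source"] = memory.source
    return view


m = Memory(
    fact="Ana moved to Lisbon.",
    context="Ana said she moved to Lisbon in May after changing jobs.",
    source="session-3/turn-12",
)
print(read(m))                       # fact and context only
print(read(m, include_source=True))  # provenance included on request
```

Keeping provenance out of the default view means the agent's prompt is not padded with metadata on every recall, while the trail is still there when someone needs to audit an answer.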
Run it yourself
The method and the harness are in the repo, so you can run the same script and check these numbers against your own machine. One note on the runs above: an earlier full-scale run was labeled as the hybrid pipeline when it had actually fallen back to keyword-only retrieval. We caught it, re-ran it correctly, and the corrected numbers are what appear here.
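A mislabeled run like that one is cheap to guard against if each run records the retrieval mode it actually used alongside the label it is reported under. The sketch below shows one possible check; `RunRecord` and `assert_mode_matches` are hypothetical names, and the real harness may handle this differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    label: str           # what the run is reported as, e.g. "hybrid"
    effective_mode: str  # what the retriever actually used


def assert_mode_matches(run: RunRecord) -> None:
    """Fail loudly if the run used a different retrieval mode than its label."""
    if run.effective_mode != run.label:
        raise RuntimeError(
            f"run labeled {run.label!r} actually used {run.effective_mode!r}"
        )


# A correctly labeled run passes silently.
assert_mode_matches(RunRecord(label="hybrid", effective_mode="hybrid"))

# A silent fallback to keyword-only retrieval is caught before scoring.
try:
    assert_mode_matches(RunRecord(label="hybrid", effective_mode="keyword-only"))
except RuntimeError as err:
    print(err)
```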
Local, zero marginal cost, honest about what it cannot answer, and reproducible from the harness in the repo. Clone it and run it yourself.