Benchmarks · Positioning · June 26, 2026

A harder benchmark, and it still tells you when it can't answer

When we posted memkeeper's LoCoMo answer accuracy, we said the abstention number on our next benchmark was still coming in at full scale. It is in. We ran memkeeper on LongMemEval, and the picture from LoCoMo holds: competitive accuracy, running entirely on your own machine at no marginal cost, and a real number for the thing most leaderboards leave out: whether the system declines to answer when the evidence is not there.

A bigger haystack

LongMemEval is the harder test. Each of its questions carries roughly fifty prior chat sessions, about 115,000 tokens of history, and the system has to find the one relevant moment and answer from it. We run the same end-to-end measurement the published commercial systems use: for each question we retrieve, hand the context to an answering model, and score the answer with an LLM judge using the benchmark's own grading prompts. Same axis, a longer history to get lost in.

The numbers

LongMemEval answer accuracy	score
memkeeper (answerable questions)	0.756
Published commercial memory systems	~0.71

memkeeper scores 0.756 on the answerable questions, in the same band as the strongest published answer accuracy for a commercial agent-memory system on this benchmark. Again, this is an open-source layer running on your laptop: one SQLite file and a local model, no per-query API charge for the memory, no data leaving your machine, and warm retrieval that returns in about 25 ms.

By category it is strong where retrieval is clean (single-session questions land near 0.96) and weaker on the hardest ones: multi-session reasoning (0.58) and temporal arithmetic (0.57). Those are the same two weak spots we saw on LoCoMo, and they are exactly where we are putting engine work next. We would rather name them than average them away.

The number the leaderboards skip

LongMemEval has a category of unanswerable, false-premise questions: the user asks about something that never happened, and the only correct response is to refuse rather than invent an answer. It is the single category that punishes a confident hallucination, and it is the one that tends to be dropped from the headline figure. The category-leading systems on this benchmark report only the answerable types.

memkeeper scores 0.833 here, on the full set of thirty false-premise questions. It correctly declines twenty-five of thirty instead of fabricating an answer. We publish that number on purpose. It is the one that tells you whether the memory will make something up.

Why abstention matters. A memory layer that always returns a confident answer is a liability. The failure mode is silent: the agent acts on a fabricated recall and you find out downstream. A system that declines on an unsupported question fails loud instead.

It is not perfect. Roughly one false-premise question in six still gets an answer it should not, and we would rather show you that than round it off. The point is that the number exists and is on the page.
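The two figures in this section come from the same counts, and the arithmetic is short enough to check by hand. A minimal sketch in Python, using only the numbers stated above (the variable names are ours, chosen for illustration):

```python
from fractions import Fraction

# Counts as reported in this post.
TOTAL_FALSE_PREMISE = 30   # full set of unanswerable questions
CORRECTLY_DECLINED = 25    # questions where the system refused to answer

# Abstention accuracy: share of false-premise questions correctly declined.
abstention_accuracy = CORRECTLY_DECLINED / TOTAL_FALSE_PREMISE

# Miss rate: share that still got an answer they should not have.
miss_rate = Fraction(TOTAL_FALSE_PREMISE - CORRECTLY_DECLINED, TOTAL_FALSE_PREMISE)

print(f"abstention accuracy: {abstention_accuracy:.3f}")
print(f"answered when it should have declined: {miss_rate}")
```

Twenty-five of thirty rounds to 0.833, and the remaining five of thirty reduce to exactly one in six.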

The method, so you can check it

The retrieval harness ships in the repo, so the recall path, the part we control, is yours to run directly. The answer-accuracy numbers above put an answering model and an LLM judge on top of that retrieval, scored with the benchmark's own published per-category prompts. That last layer needs an external model, so we document the method rather than ship anyone's credentials. The shape is simple: each question gets its own isolated store, seeded turn by turn, then retrieved, answered, and judged.
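The shape described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the harness that ships in the repo: `ToyStore`, `Question`, `evaluate`, and the stub answerer and judge are all hypothetical names, and the word-overlap ranking is a stand-in for real retrieval. In an actual run the `answer` and `judge` callables would wrap an answering model and an LLM judge using the benchmark's grading prompts.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Question:
    qid: str
    category: str
    text: str
    sessions: list[list[str]]  # prior chat sessions, each a list of turns
    expected: str


@dataclass
class ToyStore:
    """Hypothetical stand-in for a per-question store (not memkeeper's API)."""
    turns: list[str] = field(default_factory=list)

    def add(self, turn: str) -> None:
        self.turns.append(turn)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Naive word-overlap ranking, purely for illustration.
        words = set(query.lower().split())
        scored = [(len(words & set(t.lower().split())), t) for t in self.turns]
        ranked = sorted(scored, key=lambda s: s[0], reverse=True)
        return [t for score, t in ranked[:k] if score > 0]


def evaluate(
    questions: list[Question],
    answer: Callable[[str, list[str]], str],
    judge: Callable[[Question, str], bool],
) -> dict[str, float]:
    """Return accuracy per category: isolate, seed, retrieve, answer, judge."""
    per_category: dict[str, list[bool]] = {}
    for q in questions:
        store = ToyStore()                  # fresh, isolated store per question
        for session in q.sessions:
            for turn in session:
                store.add(turn)             # seeded turn by turn
        context = store.retrieve(q.text)    # retrieval step
        reply = answer(q.text, context)     # answering model (stubbed below)
        per_category.setdefault(q.category, []).append(judge(q, reply))
    return {cat: sum(v) / len(v) for cat, v in per_category.items()}


# Stubs so the sketch runs without an external model.
def stub_answer(question: str, context: list[str]) -> str:
    return context[0] if context else "I don't know."


def stub_judge(q: Question, reply: str) -> bool:
    return q.expected.lower() in reply.lower()


questions = [
    Question("q1", "single-session", "Where did I adopt my dog?",
             [["I adopted my dog from a shelter in Leeds."],
              ["The weather was bad today."]], "Leeds"),
    Question("q2", "abstention", "What is my sister's wedding venue?",
             [["I went hiking on Saturday."]], "I don't know"),
]
scores = evaluate(questions, stub_answer, stub_judge)
print(scores)
```

Creating the store inside the loop is the point of the design: nothing seeded for one question can leak into another, so each score reflects retrieval over that question's history alone.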

One honest caveat. Any LLM-judged score depends on the judge model, and ours is not necessarily the one another team used, so treat the cross-system numbers as directional rather than a leaderboard finish. The answerable score covers 156 questions so far, a count that is still growing as the full run completes; the abstention number is the complete set of thirty. We publish the method so the claim is checkable, not taken on faith.

Competitive accuracy, local, zero marginal cost, with the full abstention number reported alongside. Clone it and run the harness.

The shorter benchmark, with the same story: cheap, local, and it tells you when it can't answer. Or the retrieval underneath it: why hybrid retrieval beats pure vector search.