Design · Positioning · June 27, 2026

A memory that says "I don't know"

Ask an agent's memory a question it cannot answer, and watch what it does. Most systems will answer anyway. They were built to return results, so they return results: the top few matches by similarity, handed up with the same confidence whether the evidence is solid or whether nothing in the store is actually relevant. The agent reads those matches as fact, acts on them, and the fabrication propagates downstream. You find out later, if you find out at all.

We think the single most important thing a memory layer can do is decline. When the evidence is not there, the correct answer is "I don't know," and a system that cannot say it is a liability dressed up as a feature.

The failure mode is silence

A wrong answer that announces itself is a nuisance. A wrong answer that arrives with full confidence is a hazard, because nothing about it looks wrong. The agent recalls a "fact" that was never true, threads it into a plan, and every step after that inherits the error. There is no exception, no red line in a log. The system behaved exactly as designed; the design was just willing to invent.

Why abstention matters. A memory layer that always returns a confident answer is a liability. The failure mode is silent: the agent acts on a fabricated recall and you find out downstream. A system that declines on an unsupported question fails loud instead.

This is the same principle that runs through everything we build: prefer loud failure over silent fallback. A memory that abstains is failing loud. It is telling you, at the moment it matters, that it does not have what you asked for, instead of papering over the gap with a plausible guess.

Why most systems can't say it

The reason abstention is rare is structural, not accidental. A pure vector search returns a ranked list by construction: ask for the top five and you get five, even when the best of them is a weak match to a question the store simply does not answer. Similarity is always defined. There is no point on that ranking where the system is built to say "none of these clear the bar." So the weak matches go up the pipe wearing the same clothes as the strong ones, and the answering model, handed five passages, dutifully writes an answer from them.
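To make the structural point concrete, here is a minimal Python sketch with toy three-dimensional vectors. Everything in it is hypothetical and chosen for illustration (the `top_k` and `top_k_with_floor` helpers, the store contents, the 0.5 floor); none of it is memkeeper's code. It shows only that a plain top-k has no empty result unless someone adds one.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, store, k):
    """Plain top-k: returns min(k, len(store)) items no matter how weak they are."""
    scored = sorted(((cosine(query, vec), text) for text, vec in store), reverse=True)
    return scored[:k]

def top_k_with_floor(query, store, k, floor):
    """Same ranking, but matches below the floor are dropped, so [] is possible."""
    return [(score, text) for score, text in top_k(query, store, k) if score >= floor]

# Toy store (hypothetical): nothing in it points in the query's direction.
store = [
    ("Alice moved to Lisbon in March", [0.1, 1.0, 0.0]),
    ("Bob's favorite tea is oolong", [0.0, 0.2, 1.0]),
    ("The deploy runs on Fridays", [0.2, 0.0, 1.0]),
]
query = [1.0, 0.0, 0.0]

weak_but_returned = top_k(query, store, k=2)             # two results, both weak
honest = top_k_with_floor(query, store, k=2, floor=0.5)  # empty list
```

A floor on raw cosine similarity is a blunt instrument, and the value 0.5 here is arbitrary. The sketch is meant only to show where the missing "none of these clear the bar" branch would have to live.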

The fix is not a bigger model on top. It is a retrieval layer that is willing to come back with nothing, or with an honest signal that what it found is thin.

How memkeeper is built to abstain

Two design choices do most of the work.

The first is the hybrid retrieval pipeline. Dense embeddings, BM25, and a cross-encoder reranker each get a vote, and the reranker reads the query against each candidate directly rather than trusting a cosine distance. The reranker earns its keep on a passage that a vector index ranked highly on surface similarity but that does not actually answer the question: it scores the match low, and a low score is the signal that lets the system decline instead of forcing a confident top-k it does not believe in.
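One way a low reranker score could be turned into a decision to decline is sketched below. This is an illustration under stated assumptions, not memkeeper's implementation. The candidate lists stand in for dense and BM25 retrieval, `toy_rerank` is a crude word-overlap stand-in for a cross-encoder, and the `Recall` type, the `recall` function and the `min_score` threshold are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Recall:
    passages: List[str]          # passages that cleared the bar, best first
    abstained: bool              # True when nothing cleared the bar
    best_score: Optional[float]  # top reranker score, None if no candidates

def toy_rerank(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

def recall(query: str, dense_hits: List[str], lexical_hits: List[str],
           rerank: Callable[[str, str], float], min_score: float, k: int = 3) -> Recall:
    # Merge both candidate lists, dropping duplicates but keeping first-seen order.
    candidates = list(dict.fromkeys(dense_hits + lexical_hits))
    # Score every candidate against the query directly, then rank by that score.
    scored = sorted(((rerank(query, p), p) for p in candidates), reverse=True)
    # Keep only top-k passages the reranker actually believes in.
    kept = [p for score, p in scored[:k] if score >= min_score]
    best = scored[0][0] if scored else None
    return Recall(passages=kept, abstained=not kept, best_score=best)

dense = ["alice said she did move to lisbon", "bob likes hiking"]
lexical = ["alice said she did move to lisbon"]

answerable = recall("where did alice move", dense, lexical, toy_rerank, min_score=0.5)
unanswerable = recall("what is bob's dog called", dense, lexical, toy_rerank, min_score=0.5)
```

Here `answerable` keeps the Lisbon passage, while `unanswerable` comes back with no passages and `abstained=True`, because no candidate scores above the threshold. A real cross-encoder would replace `toy_rerank`, and the threshold would need tuning against labeled unanswerable questions.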

The second is that memkeeper stores memories with their context and provenance, not as bare extracted strings. Normal recall keeps the source hidden by default, but the memory still carries the frame it lived in, and provenance can be exposed when the user asks to inspect it. That gives an answering layer something to check against. When the retrieved context plainly does not bear on the question, a model that can see the context has the grounds to refuse, where a model handed a naked snippet has nothing to push back on.
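As a rough picture of what storing a memory with its frame might look like, here is a small sketch. The `Memory` record, its field names and the `render` helper are all assumptions made for illustration. The source describes the behavior (context kept, provenance hidden by default and shown on request) but not the schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Memory:
    text: str         # the remembered statement itself
    context: str      # the frame it lived in (hypothetical field)
    source: str       # where it came from (hypothetical field)
    recorded_at: str  # when it was captured (hypothetical field)

def render(memory: Memory, show_provenance: bool = False) -> str:
    """Format a memory for the answering layer; provenance is opt-in."""
    lines = [memory.text, f"context: {memory.context}"]
    if show_provenance:
        lines.append(f"source: {memory.source} ({memory.recorded_at})")
    return "\n".join(lines)

m = Memory(
    text="Alice prefers morning meetings",
    context="said while planning the Q3 offsite",
    source="chat thread 'offsite-planning'",
    recorded_at="2026-05-02",
)

default_view = render(m)                          # text plus context only
inspected_view = render(m, show_provenance=True)  # adds the source line
```

The design point is that the context line travels with the text even in the default view, so an answering model has something to compare the question against.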

We put a number on it

None of this is worth anything if we only assert it, so we measure it and publish the figure most leaderboards quietly drop. Benchmarks like LoCoMo and LongMemEval include adversarial, false-premise questions: the user asks about something that never happened, and the only correct response is to refuse. It is the single category that punishes a confident hallucination, and it is the one that tends to vanish from the headline number.

Abstention on unanswerable questions	score
LongMemEval false-premise (25 of 30 declined)	0.833
LoCoMo adversarial category	0.496

On LongMemEval, memkeeper correctly declines twenty-five of thirty false-premise questions instead of fabricating an answer. On LoCoMo's adversarial category it scores 0.496, the weakest of its categories and an open item we are still working on. We report both, including the one that is not flattering, because the point of the number is to tell you whether the memory will make something up, and a number you only publish when it looks good does not do that.

It is not perfect, and we say so plainly: roughly one false-premise question in six still gets an answer it should not. The claim is not that memkeeper never invents. The claim is that it is built to decline, that we measure how often it does, and that the number is on the page where you can argue with it.
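The arithmetic behind the LongMemEval figure and the "one in six" remark can be checked in a few lines. The counts come from the table above. The `abstention_rate` helper is just a name chosen for this sketch.

```python
from fractions import Fraction

def abstention_rate(declined: int, total: int) -> Fraction:
    """Share of unanswerable questions the system correctly declined."""
    return Fraction(declined, total)

rate = abstention_rate(25, 30)   # LongMemEval false-premise: 25 of 30 declined
missed = 1 - rate                # share that still got an answer it should not

print(round(float(rate), 3))     # 0.833
print(missed)                    # 1/6
```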

The honest default

An agent that occasionally says "I don't have that" is more useful than one that never does, because you can trust the answers it does give. Abstention is not a gap in the product to be closed; it is the behavior that makes the rest of the memory worth relying on. A memory you cannot trust to stay quiet is a memory you have to second-guess on every recall, which is most of the value gone.

So we built memkeeper to say it. When the evidence is there, it returns support the answering layer can use. When it is not, it tells you the recall is thin instead of pretending otherwise. That second half is the one we are proud of.

The numbers and the method behind them: "a harder benchmark, and it still tells you when it can't answer". The retrieval underneath it: "why hybrid retrieval beats pure vector search".