Why hybrid retrieval beats pure vector search
If you build anything with memory today, the reflex is to reach for a vector database. Embed everything, store the vectors, search by cosine similarity. It works, and for a lot of cases it works well. But for agent memory specifically, pure vector search has a blind spot that shows up exactly when it matters most.
Where dense vectors fall short
Embeddings are built to capture meaning. That is their strength and the root of the problem. Two phrases that mean nearly the same thing land close together in vector space, which is what you want when a query is conceptual. But when a query hinges on an exact token (a function name, an error code, a flag, a person's name, an identifier), similarity of meaning is the wrong tool.
Ask a vector index for mem_a31f or MEMKEEPER_REQUIRE_SEMANTIC or route2.mx.cloudflare.net and it will cheerfully return things that are about the same topic while missing the one record that contains the literal string. The closest vector is not always the correct answer. Sometimes the correct answer is a keyword match, and a dense model has no special respect for keywords.
memkeeper's answer: three stages, each covering a blind spot
1 · Dense vectors for meaning
A local ONNX embedding model (mxbai-embed-large, 1024 dimensions) turns each memory into a vector so conceptual queries find conceptually related memories, even when they share no words. This is the part pure vector search gets right, and we keep it.
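To make the dense stage concrete, here is a minimal sketch of ranking by cosine similarity. It is an illustration, not memkeeper's code: the function names (cosine, dense_search) and the three-dimensional toy vectors are assumptions standing in for real 1024-dimensional embeddings, and a production index would use an ANN structure rather than a full scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def dense_search(query_vec, memory_vecs, k=3):
    """Return the ids of the k memories closest to the query (brute force)."""
    scored = [(cosine(query_vec, vec), mem_id) for mem_id, vec in memory_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [mem_id for _, mem_id in scored[:k]]

# Hypothetical toy vectors; a real embedding model would produce these.
MEMORY_VECS = {
    "m1": [0.9, 0.1, 0.0],
    "m2": [0.1, 0.9, 0.1],
    "m3": [0.8, 0.2, 0.1],
}

top = dense_search([1.0, 0.0, 0.0], MEMORY_VECS, k=2)
print(top)
```

Nothing in this ranking looks at the words of a memory, only at the direction of its vector. That is exactly why it handles paraphrase well and literal identifiers poorly.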
2 · BM25 for exact terms
In parallel, a deterministic BM25 / full-text index catches the literal tokens dense vectors gloss over. Error codes, identifiers, names, and flags resolve by the words actually written, not by their neighborhood in embedding space. Where the embedding is fuzzy, BM25 is sharp.
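For a feel of why BM25 is sharp on literal tokens, here is a small self-contained sketch of Okapi BM25 scoring. The tokenizer, the parameter defaults (k1=1.5, b=0.75), and the three memory texts are assumptions made up for illustration; memkeeper's actual full-text index may tokenize and weight differently.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase tokens; keeps identifiers like mem_a31f or a.b.net whole."""
    raw = re.findall(r"[A-Za-z0-9_.]+", text)
    return [t.strip(".").lower() for t in raw if t.strip(".")]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs (id -> text) against the query."""
    if not docs:
        return {}
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized.values():
        df.update(set(toks))
    query_terms = tokenize(query)
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores

# Invented memory texts, for illustration only.
DOCS = {
    "m1": "We talked about caching strategies for the memory store.",
    "m2": "Record mem_a31f was flagged during the cache review.",
    "m3": "Notes on memory eviction and cache sizing.",
}

scores = bm25_scores("mem_a31f", DOCS)
best = max(scores, key=scores.get)
print(best, scores)
```

Only the record that contains the literal string gets a non-zero score. The topically similar memories score exactly zero, which is the opposite of how a dense index would treat them.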
3 · A cross-encoder reranks the merged set
First-stage retrieval, whether ANN or BM25, optimizes for cheap, high-recall candidate generation. It casts a wide net. The problem is ordering: the best memory is somewhere in the candidate pool, but not necessarily on top.
So memkeeper runs a cross-encoder reranker (mxbai-rerank-base) over the merged candidates. Unlike the first stage, which embeds the query and the memory separately and compares them, a cross-encoder reads the query and each candidate together and scores their actual relevance. It is more expensive per candidate, which is why it runs only on the shortlist, and it is what moves the right memory from "somewhere in the top 50" to "first."
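The merge-then-rerank step can be sketched as follows. This is a sketch under stated assumptions: the merge is a simple de-duplicated union (the real merge strategy is not described here), and toy_score_pair is a word-overlap stand-in, not a cross-encoder. In a real pipeline the score_pair callable would wrap a model that reads the query and the candidate together.

```python
def merge_candidates(dense_ids, bm25_ids):
    """Union of both candidate lists, de-duplicated, first-seen order kept."""
    seen = set()
    merged = []
    for mem_id in list(dense_ids) + list(bm25_ids):
        if mem_id not in seen:
            seen.add(mem_id)
            merged.append(mem_id)
    return merged

def rerank(query, candidate_ids, memories, score_pair, top_k=5):
    """Score each (query, memory) pair jointly and return the best top_k ids."""
    scored = [(score_pair(query, memories[mem_id]), mem_id) for mem_id in candidate_ids]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [mem_id for _, mem_id in scored[:top_k]]

def toy_score_pair(query, text):
    """Stand-in scorer (NOT a cross-encoder): share of query words found in text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0

# Invented memories and first-stage results, for illustration only.
MEMORIES = {
    "m1": "cache sizing notes",
    "m2": "record mem_a31f flagged in cache review",
    "m3": "lunch plans",
}
dense_ids = ["m1", "m3"]   # hypothetical output of the dense stage
bm25_ids = ["m2", "m1"]    # hypothetical output of the BM25 stage

merged = merge_candidates(dense_ids, bm25_ids)
ranked = rerank("mem_a31f cache review", merged, MEMORIES, toy_score_pair, top_k=2)
print(merged, ranked)
```

The shape is the point: the expensive scorer is called once per shortlisted candidate, never over the whole store, and the final order comes from the pairwise score rather than from either first-stage ranking.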
Meaning plus exact terms plus judgment
Put together: dense vectors catch what you mean, BM25 catches what you typed, and the cross-encoder decides which of the candidates actually answers the query. Each stage covers a failure mode of the others.
This is not a tuning detail. It is the difference between a retriever that demos well on conceptual questions and one you can trust with the exact, literal recall that real work depends on. And it is measurable: the same configuration produces memkeeper's LoCoMo benchmark numbers, a hit@20 of 0.880 out of the box on a public long-term QA set.
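For readers unfamiliar with the metric, here is a minimal sketch of hit@k, assuming the common definition: the fraction of queries for which at least one relevant item appears in the top k results. The function name and the toy data are illustrative, and the benchmark harness may define relevance differently.

```python
def hit_at_k(ranked_lists, relevant_sets, k=20):
    """Fraction of queries with at least one relevant id in the top k results."""
    if not ranked_lists:
        return 0.0
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(mem_id in relevant for mem_id in ranked[:k])
    )
    return hits / len(ranked_lists)

# Two toy queries: the first finds its relevant memory at rank 3, the second never does.
ranked_lists = [["a", "b", "c"], ["d", "e", "f"]]
relevant_sets = [{"c"}, {"x"}]

score_k3 = hit_at_k(ranked_lists, relevant_sets, k=3)
score_k2 = hit_at_k(ranked_lists, relevant_sets, k=2)
print(score_k3, score_k2)
```

Under this definition, a hit@20 of 0.880 would mean that for 88% of the questions, a relevant memory was somewhere in the first 20 results.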
One more property worth naming: all three stages run locally, through ONNX, with no API round-trip. Hybrid retrieval does not mean shipping your query to a remote reranking service. The whole pipeline runs on your machine, which is a design choice we care enough about to have written down separately.