memkeeper

The leaderboard reorders by faculty

"Which model is smartest" is the wrong question, and a single benchmark number is a bad way to answer it. A leaderboard score is not false. It is anecdotal: one task's result standing in for a whole mind. Measure the faculties separately and the ranking will not sit still.

We built a small harness to test that directly. Five current frontier models, five distinct reasoning faculties, each probed at a difficulty that actually separates the field. No model wins every column. The "best model" changes depending on which faculty you happen to be measuring.

| Model | Abstraction (arc-lite-hard) | Deduction (logic-grid-hard) | Planning (maze-keys) | Execution (maze-solve 96) | 3D spatial (maze-3d 15×6) |
|---|---|---|---|---|---|
| Fable 5 | 100 | 100 | 100 | 100 | 90 |
| Opus 4.8 | 80 | 100 | 90 | 80 | 60 |
| Sonnet 5 | 70 | 80 | 100 | 0 | 80 |
| GPT-5.5 | 80 | 90 | 40 | 90 | 100 |
| Haiku 4.5 | 10 | 50 | 20 | n/r | n/r |
Pass rate (%) at the hardest tier of each faculty, n=10 per cell. Cyan = higher. No model wins every column, and first place changes hands between Fable 5 and GPT-5.5. "n/r" = not yet run at that tier.
[Figure: task instances (real solved runs; deduction illustrative). Panels: Abstraction (arc-lite, real solve); Deduction (search-required, illustration); Planning (maze-keys, real solve); 3D spatial (maze-3d, real solve).]
Actual test instances with the model's real single-shot answer replayed (Fable 5, all passing reps). Deduction stays a schematic; a Zebra constraint table has no faithful spatial replay.

How we measured

The probes are deliberately austere, so the number reflects reasoning rather than tooling:

The reordering

Abstraction: infer a hidden grid transformation from three examples, then apply it. A single transformation is trivial: every model scores 100%. The signal only appears when the hidden rule is a composition of two transforms.

| Model | single transform | 2-composition | tokens (med) |
|---|---|---|---|
| Fable 5 | 100% | 100% | 4.4K |
| GPT-5.5 | 100% | 80% | 21.7K |
| Opus 4.8 | 100% | 80% | 5.5K |
| Sonnet 5 | 100% | 70% | 6.6K |
| Haiku 4.5 | 100% | 10% | 11.9K |

Planning: a locked door blocks the only route out, so the solver must detour to collect the key first, then reach the exit. Here the order flips: Sonnet, weak elsewhere, is perfect, while GPT-5.5, strong elsewhere, falls to 40%.

| Model | maze-keys | tokens (med) |
|---|---|---|
| Fable 5 | 100% | 2.9K |
| Sonnet 5 | 100% | 5.4K |
| Opus 4.8 | 90% | 5.8K |
| GPT-5.5 | 40% | 18.7K |
| Haiku 4.5 | 20% | 9.0K |
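A deterministic check for this kind of task can be small. The sketch below is not the harness's verifier; the grid encoding (`#` wall, `.` floor, `S` start, `X` exit, `k` key, `+` locked door), the `U`/`D`/`L`/`R` move alphabet, and the name `verify_path` are all assumptions made for illustration. It replays a proposed path and reports the first illegal step.

```python
# Hypothetical encoding, not the harness's actual format:
# '#' wall, '.' floor, 'S' start, 'X' exit, 'k' key, '+' locked door.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def verify_path(grid, path):
    """Replay `path` on `grid`; return (ok, reason)."""
    rows = [list(row) for row in grid]
    # Locate the start cell.
    r, c = next((i, j) for i, row in enumerate(rows)
                for j, ch in enumerate(row) if ch == "S")
    has_key = False
    for step, move in enumerate(path):
        if move not in MOVES:
            return False, f"step {step}: unknown move {move!r}"
        dr, dc = MOVES[move]
        r, c = r + dr, c + dc
        if not (0 <= r < len(rows) and 0 <= c < len(rows[r])):
            return False, f"step {step}: left the grid"
        cell = rows[r][c]
        if cell == "#":
            return False, f"step {step}: wall clip"
        if cell == "+" and not has_key:
            return False, f"step {step}: door reached without key"
        if cell == "k":
            has_key = True
    if rows[r][c] == "X":
        return True, "solved"
    return False, "path ended before exit"

# A toy instance: the door sits on the only route to the exit,
# and the key is down a side corridor.
GRID = [
    "#######",
    "#S.+.X#",
    "#.#####",
    "#..k###",
    "#######",
]

print(verify_path(GRID, "RRRR"))          # heads straight for the door
print(verify_path(GRID, "DDRRLLUURRRR"))  # detours for the key first
```

On this toy grid the direct route is rejected at the door, while the detour passes, which is the distinction the planning probe is built to measure.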

3D spatial: stacked grid levels connected by ladders; the model tracks its position across all levels and emits a path. Difficulty is path length. At three small levels the top tier saturates; push to six and then seven larger levels (solutions of ~67 and ~132 moves) and it spreads. This is GPT-5.5's column, where it leads the field.

| Model | 3 × 9×9 (~22) | 6 × 15×15 (~67) | 7 × 19×19 (~132) |
|---|---|---|---|
| GPT-5.5 | 50% | 100% | 100% |
| Fable 5 | 100% | 90% | 90% |
| Sonnet 5 | 80% | 80% | in progress |
| Opus 4.8 | 100% | 60% | in progress |
| Haiku 4.5 | 10% | n/r | n/r |

The remaining two faculties round out the picture. Execution, tracking a long dependent path without drift, is where length punishes: on a 96-move maze, Fable holds 100%, GPT-5.5 90%, Opus 80%, and Sonnet collapses to 0% (per-step reliability does not survive ~100 compounding steps). Deduction, Zebra-style constraint solving, looks saturated until you keep only puzzles that require search: Fable 100%, Opus 100%, GPT-5.5 90%, Sonnet 80%, Haiku 50%.
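The compounding effect is easy to make concrete. As a rough illustration only, assume each move is executed correctly with probability $p$, independently of the others (an idealization, not something the runs establish). The values of $p$ below are illustrative, not measured.

```latex
P(\text{clean } n\text{-move path}) = p^{\,n},
\qquad
0.99^{96} \approx 0.38,
\qquad
0.999^{96} \approx 0.91
```

Under that assumption, a model that is right on 99 moves out of 100 still fails most 96-move mazes, and because one wall clip fails the whole run, small per-step differences become large differences in pass rate.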

[Figure: execution example, maze-solve-21 · 21×21 · 96 moves. Fable 5: solved, 10/10. Sonnet 5: wall clip, 0/10.]
Blue = start, amber = exit, green = legal path, red = first wall clip.

Read down the columns and the point is unmissable. Fable 5 is the all-rounder, and cheapest by token count. But GPT-5.5 is spiky: top of the field on 3D spatial, near the bottom on planning. Sonnet 5 wins planning outright yet is last on long-horizon execution. Opus 4.8 is robust but never dominant. A single scalar would average all of that into a rank that describes none of it.
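The gap between the per-faculty view and a single scalar can be checked against the summary table itself. The sketch below uses the published pass rates; the function names are illustrative, and averaging only over cells that were run is an assumption of this example, not the harness's scoring rule.

```python
# Pass rates (%) copied from the summary table; None marks cells not yet run.
FACULTIES = ["abstraction", "deduction", "planning", "execution", "spatial_3d"]
SCORES = {
    "Fable 5":   [100, 100, 100, 100, 90],
    "Opus 4.8":  [80, 100, 90, 80, 60],
    "Sonnet 5":  [70, 80, 100, 0, 80],
    "GPT-5.5":   [80, 90, 40, 90, 100],
    "Haiku 4.5": [10, 50, 20, None, None],
}

def leaders(faculty):
    """Models tied for the best pass rate on one faculty."""
    col = FACULTIES.index(faculty)
    ran = {m: s[col] for m, s in SCORES.items() if s[col] is not None}
    best = max(ran.values())
    return sorted(m for m, v in ran.items() if v == best)

def mean_score(model):
    """Single-scalar summary: mean over the cells that were actually run."""
    vals = [v for v in SCORES[model] if v is not None]
    return sum(vals) / len(vals)

for faculty in FACULTIES:
    print(f"{faculty:12s} leader(s): {', '.join(leaders(faculty))}")

for model in sorted(SCORES, key=mean_score, reverse=True):
    print(f"{model:10s} mean = {mean_score(model):.1f}")
```

The mean places Opus 4.8 (82.0) ahead of GPT-5.5 (80.0) and Sonnet 5 (66.0), which hides both GPT-5.5's lead on 3D spatial and Sonnet 5's perfect planning score.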

Checks that changed the numbers

Two validation checks changed the table.

A "hard" tier that was secretly easy. The composed-abstraction task picks two transforms and applies them in sequence. But many pairs collapse: rotate-90 twice is just rotate-180; flip-then-flip is the identity, where the answer equals the input. About 42% of generated "hard" puzzles were degenerate (7% were the identity), so nearly half the hard tier was the easy tier wearing a disguise, inflating every score. We now reject any composition that is functionally equal to a single transform, so every hard puzzle is genuinely two-step. The lesson generalizes: when you build a hard tier by composing operations, prove the composition does not collapse back into the easy one.
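One way to implement such a rejection filter is to apply both the composition and every single transform to a probe grid whose cells are all distinct, then compare the results. The sketch below assumes a small, hypothetical transform set (rotations and flips); the harness's actual transform list is not given here, and `collapses_to` is an illustrative name.

```python
# Grids are tuples of tuples so results can be compared directly.
def identity(g):
    return g

def rot90(g):   # rotate clockwise
    return tuple(zip(*g[::-1]))

def rot180(g):
    return rot90(rot90(g))

def rot270(g):
    return rot90(rot180(g))

def flip_h(g):  # mirror left-right
    return tuple(tuple(row[::-1]) for row in g)

def flip_v(g):  # mirror top-bottom
    return tuple(g[::-1])

# Hypothetical "easy tier": the transforms a composition must not reduce to.
SINGLES = {
    "identity": identity, "rot90": rot90, "rot180": rot180,
    "rot270": rot270, "flip_h": flip_h, "flip_v": flip_v,
}

# Every cell is distinct, so two cell permutations agree on this probe
# only if they are the same permutation for this grid shape.
PROBE = ((1, 2, 3), (4, 5, 6), (7, 8, 9))

def collapses_to(first, second):
    """Name of the single transform equal to `first` then `second`, or None."""
    composed = second(first(PROBE))
    for name, single in SINGLES.items():
        if single(PROBE) == composed:
            return name
    return None

print(collapses_to(rot90, rot90))    # degenerate: a single rotation
print(collapses_to(flip_h, flip_h))  # degenerate: the identity
print(collapses_to(rot90, flip_h))   # not in SINGLES, so genuinely two-step
```

A generator would keep a pair only when `collapses_to` returns `None`. Whether a pair counts as degenerate depends on which transforms are in the single-transform set, so the filter has to be built from the same list the easy tier draws on.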

Zeros need transcript checks. On the largest 3D maze, concurrent runs can fail at the transport layer before a model emits tokens, producing an empty run that looks like a clean 0% unless the transcript is inspected. We do not score those as capability failures. The cells marked "in progress" are being re-run serially before publication.
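In scoring code, the rule amounts to separating runs that produced no output from runs that produced a wrong answer. The sketch below is a minimal version under assumed names: the `Run` record, its fields, and `pass_rate` are hypothetical and do not describe the harness's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Run:
    transcript: str                        # raw model output; empty if nothing was emitted
    passed: bool                           # verifier verdict on the transcript
    transport_error: Optional[str] = None  # set when the request itself failed

def is_scorable(run: Run) -> bool:
    """A run counts toward the pass rate only if the model actually answered."""
    return run.transport_error is None and run.transcript.strip() != ""

def pass_rate(runs: List[Run]) -> Tuple[Optional[float], int]:
    """Return (pass % over scorable runs, number of runs needing a re-run).

    The rate is None when no run is scorable, so an all-empty cell
    cannot be mistaken for a clean 0%.
    """
    scorable = [r for r in runs if is_scorable(r)]
    to_rerun = len(runs) - len(scorable)
    if not scorable:
        return None, to_rerun
    return 100.0 * sum(r.passed for r in scorable) / len(scorable), to_rerun

runs = [
    Run("U R R D ...", passed=True),
    Run("U U L ...", passed=False),
    Run("", passed=False, transport_error="connection reset"),
    Run("", passed=False),
]
print(pass_rate(runs))
```

Reporting the re-run count alongside the rate keeps the cell visibly incomplete until the missing runs have been repeated.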

Scope. These are single-machine, single-shot numbers on generated puzzles, not a controlled certification of any vendor's model. The useful claim is narrower: with fixed prompts, generated instances, and deterministic verifiers, the ranking changes when the faculty changes.

That is the reason to distrust a single number, whether it is describing a model or a memory system.