Benchmarks · Engineering · July 2, 2026

The leaderboard reorders by faculty

"Which model is smartest" is the wrong question, and a single benchmark number is a bad way to answer it. A leaderboard score is not false. It is anecdotal: one task's result standing in for a whole mind. Measure the faculties separately and the ranking will not sit still.

We built a small harness to test that directly. Five current frontier models, five distinct reasoning faculties, each probed at a difficulty that actually separates the field. No model wins every column. The "best model" changes depending on which faculty you happen to be measuring.

Pass rate (%) at the hardest tier of each faculty, n=10 per cell. Cyan = higher. No model wins every column, and first place changes hands between Fable 5 and GPT-5.5. "n/r" = not yet run at that tier.

[Figure: Task instances (real solved runs; deduction illustrative). Panels: Abstraction (arc-lite, real solve); Deduction (search-required, illustration); Planning (maze-keys, real solve); 3D spatial (maze-3d, real solve).]

Actual test instances with the model's real single-shot answer replayed (Fable 5, all passing reps). Deduction stays a schematic; a Zebra constraint table has no faithful spatial replay.

How we measured

The probes are deliberately austere, so the number reflects reasoning rather than tooling:

Single-shot, no tools. The model reasons in-context and emits an answer. It cannot write a solver and run it, which would saturate everything and measure code, not thought.
Generated fresh per rep. Each of the 10 reps is a new instance from a seeded generator, so the score is generalization across instances, not luck on one board.
Deterministically scored. Every task has a verifier that simulates the answer and exits non-zero on any wrong or partial output, with no LLM judge grading generously. A path that hits a wall fails; an output grid that is off by one cell fails.
n=10 with a 95% Wilson interval. Overlapping intervals mean "tied," not "ranked." We do not read a one-rep gap as a result.
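
To make the "tied, not ranked" rule concrete, here is a minimal sketch of the Wilson score interval at n=10. The function names `wilson_interval` and `overlaps` are illustrative, not the harness's own, and the sketch uses the standard formula with z ≈ 1.96 for 95% coverage.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z ~ 1.96 gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two intervals share any point, i.e. the cells are 'tied'."""
    return a[0] <= b[1] and b[0] <= a[1]


perfect = wilson_interval(10, 10)  # a 100% cell
eight = wilson_interval(8, 10)     # an 80% cell
four = wilson_interval(4, 10)      # a 40% cell
print(perfect, overlaps(perfect, eight), overlaps(perfect, four))
```

At n=10 the interval for a 100% cell still reaches down to roughly 72%, so it overlaps the interval for an 80% cell but not the one for a 40% cell. That is why a gap of one or two reps is not read as a ranking.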

The reordering

Abstraction: infer a hidden grid transformation from three examples, then apply it. A single transformation is trivial: every model scores 100%. The signal only appears when the hidden rule is a composition of two transforms.

Model	single transform	2-composition	tokens (med)
Fable 5	100%	100%	4.4K
GPT-5.5	100%	80%	21.7K
Opus 4.8	100%	80%	5.5K
Sonnet 5	100%	70%	6.6K
Haiku 4.5	100%	10%	11.9K

Planning: a locked door blocks the only route out, so the solver must detour to collect the key first, then reach the exit. Here the order flips: Sonnet, weak elsewhere, is perfect, while GPT-5.5, strong elsewhere, falls to 40%.

Model	maze-keys	tokens (med)
Fable 5	100%	2.9K
Sonnet 5	100%	5.4K
Opus 4.8	90%	5.8K
GPT-5.5	40%	18.7K
Haiku 4.5	20%	9.0K
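
The scoring rule described earlier ("a path that hits a wall fails") can be sketched for a maze-keys style task as follows. The grid encoding here is an assumption made for illustration (`S` start, `E` exit, `K` key, `+` locked door, `#` wall, moves as `U`/`D`/`L`/`R`); the real harness's format is not shown in this post.

```python
# Hypothetical move alphabet: row/column deltas for each step character.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def verify_maze_keys(grid: list[str], path: str) -> bool:
    """Simulate the path step by step; any illegal or partial path fails."""
    start = next(
        ((r, c) for r, row in enumerate(grid) for c, ch in enumerate(row) if ch == "S"),
        None,
    )
    if start is None:
        return False
    r, c = start
    has_key = False
    for step in path:
        if step not in MOVES:
            return False  # malformed output counts as a failure
        dr, dc = MOVES[step]
        r, c = r + dr, c + dc
        if not (0 <= r < len(grid) and 0 <= c < len(grid[r])):
            return False  # walked off the board
        cell = grid[r][c]
        if cell == "#":
            return False  # wall clip
        if cell == "K":
            has_key = True
        if cell == "+" and not has_key:
            return False  # reached the locked door without the key
    return grid[r][c] == "E"  # stopping short of the exit is a partial answer


maze = [
    "#######",
    "#S.+.E#",
    "#.#####",
    "#K....#",
    "#######",
]
print(verify_maze_keys(maze, "DDUURRRR"))  # detour for the key, then exit
print(verify_maze_keys(maze, "RRRR"))      # heads straight for the locked door
```

A command-line wrapper could map the boolean to an exit code, matching the "exits non-zero" behavior described above.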

3D spatial: stacked grid levels connected by ladders; the model tracks its position across all levels and emits a path. Difficulty is path length. At three small levels the top tier saturates; at six and then seven larger levels (solutions of ~67 and ~132 moves) the field spreads. This is GPT-5.5's column: it scores 100% at both larger sizes, despite only 50% on the smallest.

Model	3 × 9×9 (~22 moves)	6 × 15×15 (~67 moves)	7 × 19×19 (~132 moves)
GPT-5.5	50%	100%	100%
Fable 5	100%	90%	90%
Sonnet 5	80%	80%	in progress
Opus 4.8	100%	60%	in progress
Haiku 4.5	10%	n/r	n/r

The remaining two faculties round out the picture. Execution, tracking a long dependent path without drift, is where length punishes: on a 96-move maze, Fable holds 100%, GPT-5.5 90%, Opus 80%, and Sonnet collapses to 0% (per-step reliability does not survive ~100 compounding steps). Deduction, Zebra-style constraint solving, looks saturated until you keep only puzzles that require search: Fable 100%, Opus 100%, GPT-5.5 90%, Sonnet 80%, Haiku 50%.
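
The compounding effect on long paths can be illustrated with a back-of-the-envelope calculation. It assumes each move succeeds independently with the same probability, which is a simplification introduced here rather than something the harness measures.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a dependent chain succeeds,
    assuming independent, identically reliable steps (a simplification)."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to finish the whole chain at the target rate."""
    return target ** (1.0 / steps)


for p in (0.999, 0.99, 0.95):
    print(f"per-step {p}: 96-move success {chain_success(p, 96):.3f}")
print(f"needed for 90% over 96 moves: {required_per_step(0.90, 96):.4f}")
```

Under that assumption, 99% per-step reliability finishes a 96-move path only about 38% of the time, and 95% per-step reliability finishes it less than 1% of the time.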

[Figure: Execution example (maze-solve-21, 21×21, 96 moves). Fable 5: solved, 10/10. Sonnet 5: wall clip, 0/10.]

Blue = start, amber = exit, green = legal path, red = first wall clip.

Read down the columns and the point is unmissable. Fable 5 is the all-rounder, and the cheapest by median token count in the tables above. GPT-5.5 is spiky: top of the field on 3D spatial, near the bottom on planning. Sonnet 5 ties Fable 5 for first on planning yet is last on long-horizon execution. Opus 4.8 is robust but never dominant. A single scalar would average all of that into a rank that describes none of it.

Checks that changed the numbers

Two validation checks changed the table.

A "hard" tier that was secretly easy. The composed-abstraction task picks two transforms and applies them in sequence. But many pairs collapse: rotate-90 twice is just rotate-180; flip-then-flip is the identity, where the answer equals the input. About 42% of generated "hard" puzzles were degenerate (7% were the identity), so nearly half the hard tier was the easy tier wearing a disguise, inflating every score. We now reject any composition that is functionally equal to a single transform, so every hard puzzle is genuinely two-step. The lesson generalizes: when you build a hard tier by composing operations, prove the composition does not collapse back into the easy one.
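
One way to run such a collapse check is sketched below. The transform set and the probe-grid approach are assumptions for illustration, not the harness's actual generator. The idea is to apply the composition to a probe grid whose cells are all distinct and reject the pair if the result matches what any single transform (or the identity) produces.

```python
from typing import Callable

Grid = tuple[tuple[int, ...], ...]
Transform = Callable[[Grid], Grid]


def identity(g: Grid) -> Grid: return g
def rot90(g: Grid) -> Grid: return tuple(zip(*g[::-1]))        # clockwise
def rot180(g: Grid) -> Grid: return tuple(row[::-1] for row in g[::-1])
def rot270(g: Grid) -> Grid: return tuple(zip(*g))[::-1]       # counterclockwise
def flip_h(g: Grid) -> Grid: return tuple(row[::-1] for row in g)
def flip_v(g: Grid) -> Grid: return g[::-1]
def transpose(g: Grid) -> Grid: return tuple(zip(*g))
def shift_right(g: Grid) -> Grid: return tuple(row[-1:] + row[:-1] for row in g)


# Hypothetical primitive set; the identity is included so no-op pairs are caught.
SINGLES: dict[str, Transform] = {
    "identity": identity, "rot90": rot90, "rot180": rot180, "rot270": rot270,
    "flip_h": flip_h, "flip_v": flip_v, "transpose": transpose,
    "shift_right": shift_right,
}

# Non-square probe with all-distinct cells, so different cell permutations
# produce different outputs. This checks one shape only; a fuller check would
# probe every grid shape the generator can emit.
PROBE: Grid = tuple(tuple(r * 5 + c for c in range(5)) for r in range(4))


def is_degenerate(first: Transform, second: Transform) -> bool:
    """True if applying `first` then `second` matches some single transform."""
    composed = second(first(PROBE))
    return any(composed == single(PROBE) for single in SINGLES.values())


print(is_degenerate(rot90, rot90))        # collapses to rot180
print(is_degenerate(flip_h, flip_h))      # collapses to the identity
print(is_degenerate(rot90, shift_right))  # matches no single transform here
```

The probe has to be large enough to tell the transforms apart: on a grid only two cells wide, a cyclic shift by one is indistinguishable from a horizontal flip.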

Zeros need transcript checks. On the largest 3D maze, concurrent runs can fail at the transport layer before a model emits tokens, producing an empty run that looks like a clean 0% unless the transcript is inspected. We do not score those as capability failures. The cells marked "in progress" are being re-run serially before publication.
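
A minimal sketch of that rule, assuming a hypothetical `Run` record that stores whether the verifier passed and how many tokens the model actually emitted. Runs with no emitted tokens are dropped from the denominator instead of being counted as failures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    passed: bool        # verifier result for this rep
    output_tokens: int  # tokens the model emitted; 0 means nothing came back


def pass_rate(runs: list[Run]) -> Optional[float]:
    """Pass rate over runs where the model actually answered.

    Empty runs are treated as infrastructure failures (an assumption of this
    sketch) and excluded. Returns None when nothing scoreable remains, which
    a report could render as "in progress".
    """
    scored = [r for r in runs if r.output_tokens > 0]
    if not scored:
        return None
    return sum(r.passed for r in scored) / len(scored)


all_empty = [Run(passed=False, output_tokens=0) for _ in range(10)]
mixed = [Run(True, 5000), Run(False, 4200), Run(False, 0), Run(True, 6100)]
print(pass_rate(all_empty))  # not a clean 0%
print(pass_rate(mixed))
```

Excluded runs still need to be re-run so each cell reaches its full n, which is what the serial re-runs are for.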

Scope. These are single-machine, single-shot numbers on generated puzzles, not a controlled certification of any vendor's model. The useful claim is narrower: with fixed prompts, generated instances, and deterministic verifiers, the ranking changes when the faculty changes.

That is the reason to distrust a single number, whether it is describing a model or a memory system.

More measurement, not assertion: see memkeeper on LoCoMo and memory that says "I don't know".