Figures - Pencil Puzzle Bench

Model Leaderboard. Overall solve rates for all 51 evaluated models, grouped by provider. Direct-ask and agentic strategies are shown separately.

Frontier Model Progress (Recent). Solve rates of frontier models over the past year. Each cell shows a model's solve rate at a given reasoning effort level.

Frontier Model Progress (Full History). Extended timeline showing how pencil puzzle performance has evolved across model generations.

Puzzle Type Gallery. Examples of the 20 puzzle types included in the benchmark.

Leaderboard Puzzle Grid. Solve results for each individual puzzle, giving a fine-grained view of performance.