Figures

Key figures from the paper.

Model Success Leaderboard
Model Leaderboard. Overall solve rates for all 51 evaluated models, grouped by provider. Direct-ask and agentic strategies shown separately.
Model Success Over Time (Recent)
Frontier Model Progress (Recent). Solve rates of frontier models over the past year. Each cell shows the model's solve rate at a given reasoning effort level.
Model Success Over Time (Full History)
Frontier Model Progress (Full History). Extended timeline showing how pencil puzzle performance has evolved across model generations.
Cost vs Success Pareto Frontier
Cost vs. Success. Pareto frontier of solve rate versus total cost per puzzle.
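As a rough illustration of what a Pareto frontier over cost and solve rate means, the sketch below keeps only the models that no other model beats on both axes at once. The `ModelResult` type, its field names, and the sample numbers are hypothetical and are not taken from the paper; the paper's own plotting code may differ.

```python
from typing import List, NamedTuple


class ModelResult(NamedTuple):
    # Hypothetical record; field names are assumptions for illustration.
    name: str
    cost_per_puzzle: float  # total cost per puzzle, in arbitrary units
    solve_rate: float       # fraction of puzzles solved, in [0, 1]


def pareto_frontier(results: List[ModelResult]) -> List[ModelResult]:
    """Return the results that are not dominated by any other result.

    A result is dominated when another one costs no more and solves at
    least as many puzzles, and is strictly better on at least one axis.
    Exact duplicates (same cost, same rate) keep only the first entry.
    """
    frontier = []
    best_rate = float("-inf")
    # Sweep from cheapest to most expensive; on equal cost, visit the
    # higher solve rate first so the weaker tie is filtered out.
    ordered = sorted(results, key=lambda r: (r.cost_per_puzzle, -r.solve_rate))
    for r in ordered:
        if r.solve_rate > best_rate:
            frontier.append(r)
            best_rate = r.solve_rate
    return frontier


if __name__ == "__main__":
    # Made-up numbers, purely illustrative.
    sample = [
        ModelResult("model-a", 0.01, 0.10),
        ModelResult("model-b", 0.05, 0.08),
        ModelResult("model-c", 0.10, 0.30),
        ModelResult("model-d", 0.10, 0.25),
        ModelResult("model-e", 1.00, 0.30),
    ]
    print([r.name for r in pareto_frontier(sample)])  # ['model-a', 'model-c']
```

Sorting once and sweeping keeps the computation at O(n log n), which is more than enough for a few dozen models.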
Reasoning Effort Scaling
Reasoning Effort Scaling. How solve rates change as reasoning effort (thinking budget) increases.
Difficulty Predictors
Difficulty Predictors. Comparison of features that predict puzzle difficulty for AI models. The compression ratio of a puzzle's solution moves is the strongest single predictor.
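One plausible way to compute a compression ratio over solution moves is sketched below. It assumes moves are serialized as text lines and compressed with zlib. The move format, the `compression_ratio` helper, and the choice of compressor are all assumptions made for illustration; the caption does not specify the paper's exact definition.

```python
import zlib
from typing import Sequence


def compression_ratio(moves: Sequence[str]) -> float:
    """Compressed size divided by raw size for a move sequence.

    Lower values mean the sequence is more repetitive (more compressible).
    This is an illustrative definition, not necessarily the paper's.
    """
    raw = "\n".join(moves).encode("utf-8")
    if not raw:
        # Convention chosen here for an empty sequence: no compression.
        return 1.0
    compressed = zlib.compress(raw, 9)  # level 9 = maximum compression
    return len(compressed) / len(raw)


if __name__ == "__main__":
    # Hypothetical move strings; the real benchmark format may differ.
    repetitive = ["fill r0c0"] * 50
    varied = [f"set r{(i * 7) % 9}c{(i * 5) % 9} {(i * 3) % 4}" for i in range(50)]
    print(compression_ratio(repetitive) < compression_ratio(varied))  # True
```

Under this definition, a solution made of many structurally similar moves scores a lower ratio than one whose moves vary widely.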
Difficulty Distribution
Difficulty Distribution. Distribution of puzzle difficulty across the benchmark set.
Puzzle Type Gallery
Puzzle Type Gallery. Examples of the 20 puzzle types included in the benchmark.
Puzzle Example: Initial State
Puzzle Example: Initial. A puzzle in its initial unsolved state.
Puzzle Example: Mid-Solve
Puzzle Example: Mid-Solve. A partially completed puzzle showing intermediate progress.
Puzzle Example: Complete
Puzzle Example: Complete. A fully solved puzzle with all constraints satisfied.
Leaderboard Puzzle Grid
Leaderboard Puzzle Grid. Per-puzzle solve results, showing performance at the granularity of individual puzzles.