Model Leaderboard.
Overall solve rates for all 51 evaluated models, grouped by provider. Direct-ask and agentic
strategies are shown separately.
Frontier Model Progress (Recent).
Solve rates of frontier models over the past year. Each cell shows the model's
solve rate at a given reasoning effort level.
Frontier Model Progress (Full History).
Extended timeline showing how pencil puzzle performance has evolved across model generations.
Cost vs. Success.
Pareto frontier of solve rate versus total cost per puzzle.
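
As a rough illustration of how such a frontier can be derived, the sketch below keeps only the (cost, solve rate) points that no cheaper-or-equal point matches or beats on solve rate. The function name `pareto_frontier` and the sample numbers are hypothetical, chosen for illustration; they are not results from the benchmark.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (cost per puzzle, solve rate); units are assumed


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points not dominated by any cheaper-or-equal point.

    A point is kept only if its solve rate is strictly higher than that of
    every point with lower or equal cost.
    """
    # Sort by cost ascending; on ties, put the higher solve rate first.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    frontier: List[Point] = []
    best_rate = float("-inf")
    for cost, rate in ordered:
        if rate > best_rate:
            frontier.append((cost, rate))
            best_rate = rate
    return frontier


# Hypothetical data, not taken from the benchmark.
sample = [(0.10, 0.20), (0.50, 0.60), (0.40, 0.15), (2.00, 0.55), (3.00, 0.80)]
frontier = pareto_frontier(sample)
```

Sorting first lets a single pass suffice: once points are ordered by cost, a point is on the frontier exactly when it improves on the best solve rate seen so far.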
Reasoning Effort Scaling.
How solve rates change as reasoning effort (thinking budget) increases.
Difficulty Predictors.
Comparison of features predicting puzzle difficulty for AI models.
Compression ratio of solution moves is the strongest single predictor.
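
The caption does not say how the compression ratio is computed. One plausible reading, sketched below under stated assumptions, is the compressed size of the serialized move list divided by its raw size. The move encoding (`R<row>C<col>=<value>`), the use of `zlib`, and the name `compression_ratio` are all assumptions made for illustration, not the benchmark's actual method.

```python
import random
import zlib
from typing import List


def compression_ratio(moves: List[str]) -> float:
    """Compressed size divided by raw size of a serialized move list.

    Lower values mean the move sequence is more repetitive. The
    serialization and the choice of zlib are assumptions.
    """
    if not moves:
        raise ValueError("moves must not be empty")
    raw = ";".join(moves).encode("utf-8")
    compressed = zlib.compress(raw, 9)
    return len(compressed) / len(raw)


# Hypothetical move sequences in an assumed "R<row>C<col>=<value>" encoding.
repetitive = ["R1C1=1"] * 50
rng = random.Random(0)
varied = [
    f"R{rng.randint(1, 9)}C{rng.randint(1, 9)}={rng.randint(1, 9)}"
    for _ in range(50)
]

ratio_repetitive = compression_ratio(repetitive)
ratio_varied = compression_ratio(varied)
```

Under this reading, a solution made of one repeated pattern compresses far better than one whose moves vary, so `ratio_repetitive` comes out lower than `ratio_varied`.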
Difficulty Distribution.
Distribution of puzzle difficulty across the benchmark set.
Puzzle Type Gallery.
Examples of the 20 puzzle types included in the benchmark.
Puzzle Example: Initial.
A puzzle in its initial unsolved state.
Mid-Solve.
Partially completed puzzle showing intermediate progress.
Complete.
Fully solved puzzle with all constraints satisfied.