Pencil Puzzle Bench
A Benchmark for Multi-Step Verifiable Reasoning
Can AI solve pencil puzzles? We evaluated 51 frontier models on a selection of 300 puzzles drawn from our 62,231-puzzle dataset, spanning 20 puzzle types.
**62,231** dataset puzzles · **20** puzzle types · **51** models tested · **17k** eval runs
Model Leaderboard
| # | Model | Provider | Direct Ask | Agentic | Cost/Attempt |
|---|---|---|---|---|---|
| 1 | gpt-5.4@xhigh | OpenAI | -- | 70.2% | $8.0758 |
| 2 | gpt-5.2@xhigh | OpenAI | 27.0% | 56.0% | $5.2145 |
| 3 | gpt-5.2@high | OpenAI | 20.7% | 36.7% | $1.6702 |
| 4 | claude-opus-4-6-1m | Anthropic | 0.0% | 36.7% | $1.3548 |
| 5 | claude-opus-4-6@thinking | Anthropic | 27.3% | 33.3% | $3.8474 |
| 6 | gemini-3.1-pro | Google | 20.0% | 33.3% | $3.4593 |
| 7 | claude-opus-4-6 | Anthropic | 0.3% | 30.0% | $1.2089 |
| 8 | claude-sonnet-4-6@thinking | Anthropic | 10.3% | 26.7% | $0.8668 |
| 9 | gpt-5.2-pro | OpenAI | 9.7% | 26.7% | $6.2532 |
| 10 | gpt-5.2@medium | OpenAI | 9.3% | 23.3% | $0.6898 |
| 11 | claude-opus-4-6@max | Anthropic | 0.3% | 23.3% | $1.0277 |
| 12 | claude-sonnet-4-6-1m | Anthropic | 0.3% | 23.3% | $15.4337 |
| 13 | gemini-3-pro@high | Google | 3.3% | 16.7% | $1.2710 |
| 14 | claude-sonnet-4-6 | Anthropic | 0.3% | 16.7% | $0.8372 |
| 15 | gemini-3-pro | Google | 4.3% | 13.3% | $0.8876 |
| 16 | gemini-3-pro@minimal | Google | 4.0% | 10.0% | $1.2281 |
| 17 | gpt-5.2@low | OpenAI | 2.3% | 10.0% | $0.1186 |
| 18 | gpt-5.1@medium | OpenAI | 7.7% | 6.7% | $0.2794 |
| 19 | claude-opus-4-5@thinking | Anthropic | 6.0% | 6.7% | $1.1401 |
| 20 | gemini-3-flash@minimal | Google | 4.7% | 6.7% | $0.1708 |
| 21 | gemini-3-flash@high | Google | 3.0% | 6.7% | $0.1670 |
| 22 | gpt-5@medium | OpenAI | 6.0% | 3.3% | $1.1868 |
| 23 | kimi-k2.5 | Moonshot | 6.0% | 3.3% | $0.3559 |
| 24 | grok-4-1-fast | xAI | 5.7% | 3.3% | $0.0619 |
| 25 | grok-4-1-fast-reasoning | xAI | 5.3% | 0.0% | $0.0789 |
| 26 | o3 | OpenAI | 3.0% | 3.3% | $0.3995 |
| 27 | minimax-m2.5 | Minimax | 0.7% | 3.3% | $0.2411 |
| 28 | claude-opus-4-5-high | Anthropic | 0.3% | 3.3% | $0.8425 |
| 29 | claude-sonnet-4-5 | Anthropic | 0.0% | 3.3% | $1.1334 |
| 30 | claude-sonnet-4-5@thinking | Anthropic | 2.3% | 0.0% | $1.0436 |
| 31 | deepseek-v3.2-speciale | DeepSeek | 2.0% | -- | $0.1012 |
| 32 | deepseek-v3.2 | DeepSeek | 2.0% | 0.0% | $0.1823 |
| 33 | kimi-k2-thinking | Moonshot | 1.3% | 0.0% | $0.2749 |
| 34 | o1 | OpenAI | 0.7% | 0.0% | $0.8292 |
| 35 | qwen3.5-397b-a17b | Qwen | 0.7% | 0.0% | $0.0741 |
| 36 | glm-5 | Zhipu | 0.7% | 0.0% | $0.8676 |
| 37 | gemini-2.5-pro | Google | 0.3% | 0.0% | $0.4337 |
| 38 | gpt-5.2 | OpenAI | 0.3% | 0.0% | $0.0618 |
| 39 | minimax-m2.1 | Minimax | 0.3% | 0.0% | $0.2484 |
| 40 | gpt-oss-120b | OpenAI | 0.3% | -- | $0.0022 |
| 41 | qwen3-235b-a22b-thinking-2507 | Qwen | 0.3% | 0.0% | $0.0782 |
| 42 | qwen3-next-80b-a3b-thinking | Qwen | 0.3% | 0.0% | $0.2465 |
| 43 | qwen3-vl-235b-a22b-thinking | Qwen | 0.3% | -- | $0.0612 |
| 44 | mimo-v2-flash | Xiaomi | 0.3% | 0.0% | $0.0992 |
| 45 | glm-4.7 | Zhipu | 0.3% | 0.0% | $0.2266 |
| 46 | grok-code-fast-1 | xAI | 0.3% | 0.0% | $0.2743 |
| 47 | gpt-3.5-turbo | OpenAI | 0.0% | 0.0% | $0.0015 |
| 48 | gpt-4.1 | OpenAI | 0.0% | 0.0% | $19.5512 |
| 49 | gpt-4o | OpenAI | 0.0% | 0.0% | $5.0630 |
| 50 | devstral-2512 | Mistral | 0.0% | 0.0% | $0.0188 |
| 51 | mistral-large-2512 | Mistral | 0.0% | 0.0% | $3.5163 |
| 52 | qwen3-coder | Qwen | 0.0% | 0.0% | $0.0632 |
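Raw solve rate hides large cost differences: the table spans four orders of magnitude in cost per attempt. One illustrative way to read it is expected solves per dollar (agentic solve rate divided by cost per attempt). The figures below are copied from the leaderboard; the metric itself is our own illustration, not an official benchmark statistic:

```python
# Cost-efficiency sketch: expected agentic solves per dollar spent on attempts.
# Rates and costs are taken from the leaderboard table above; "solves per
# dollar" is an illustrative metric, not part of the benchmark itself.
rows = [
    ("gpt-5.4@xhigh",              0.702, 8.0758),
    ("gpt-5.2@xhigh",              0.560, 5.2145),
    ("claude-sonnet-4-6@thinking", 0.267, 0.8668),
    ("gpt-5.2@low",                0.100, 0.1186),
]

def solves_per_dollar(rate: float, cost: float) -> float:
    """Expected number of solved puzzles per dollar of attempt cost."""
    return rate / cost

# Rank by cost efficiency rather than raw accuracy.
ranked = sorted(rows, key=lambda r: solves_per_dollar(r[1], r[2]), reverse=True)
for name, rate, cost in ranked:
    print(f"{name}: {solves_per_dollar(rate, cost):.3f} solves/$")
```

By this reading, a cheap mid-tier configuration like gpt-5.2@low can deliver more solved puzzles per dollar than the top of the leaderboard, even though its absolute solve rate is far lower.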