# Pencil Puzzle Bench
*A Benchmark for Multi-Step Verifiable Reasoning*
Can AI solve pencil puzzles? We evaluated 51 frontier models on 300 puzzles drawn from our 62k-puzzle dataset, spanning 20 puzzle types.
- **62,231** dataset puzzles
- **20** puzzle types
- **51** models tested
- **17k** eval runs
## Model Leaderboard
| # | Model | Provider | Direct Ask | Agentic | Cost/Attempt |
|---|---|---|---|---|---|
| 1 | gpt-5.4@xhigh | OpenAI | -- | 70.2% | $8.0758 |
| 2 | gpt-5.2@xhigh | OpenAI | 27.0% | 56.0% | $5.0702 |
| 3 | gpt-5.2@high | OpenAI | 20.7% | 36.7% | $1.6400 |
| 4 | claude-opus-4-6-1m | Anthropic | 0.0% | 36.7% | $1.3548 |
| 5 | claude-opus-4-6@thinking | Anthropic | 27.3% | 33.3% | $3.8474 |
| 6 | gemini-3.1-pro | Google | 20.0% | 33.3% | $3.4593 |
| 7 | claude-opus-4-6 | Anthropic | 0.3% | 30.0% | $1.2089 |
| 8 | claude-sonnet-4-6@thinking | Anthropic | 10.3% | 26.7% | $0.8668 |
| 9 | gpt-5.2-pro | OpenAI | 9.7% | 26.7% | $6.2333 |
| 10 | gpt-5.2@medium | OpenAI | 9.3% | 23.3% | $0.6855 |
| 11 | claude-opus-4-6@max | Anthropic | 0.3% | 23.3% | $1.0277 |
| 12 | claude-sonnet-4-6-1m | Anthropic | 0.3% | 23.3% | $15.4337 |
| 13 | gemini-3-pro@high | Google | 3.3% | 16.7% | $1.2710 |
| 14 | claude-sonnet-4-6 | Anthropic | 0.3% | 16.7% | $0.8372 |
| 15 | gemini-3-pro | Google | 4.3% | 13.3% | $0.8876 |
| 16 | gemini-3-pro@minimal | Google | 4.0% | 10.0% | $1.2281 |
| 17 | gpt-5.2@low | OpenAI | 2.3% | 10.0% | $0.1186 |
| 18 | gpt-5.1@medium | OpenAI | 7.7% | 6.7% | $0.2779 |
| 19 | claude-opus-4-5@thinking | Anthropic | 6.0% | 6.7% | $1.1401 |
| 20 | gemini-3-flash@minimal | Google | 4.7% | 6.7% | $0.1708 |
| 21 | gemini-3-flash@high | Google | 3.0% | 6.7% | $0.1670 |
| 22 | gpt-5@medium | OpenAI | 6.0% | 3.3% | $1.1862 |
| 23 | kimi-k2.5 | Moonshot | 6.0% | 3.3% | $0.3549 |
| 24 | grok-4-1-fast | xAI | 5.7% | 3.3% | $0.0415 |
| 25 | grok-4-1-fast-reasoning | xAI | 5.3% | 0.0% | $0.0563 |
| 26 | o3 | OpenAI | 3.0% | 3.3% | $0.3995 |
| 27 | minimax-m2.5 | Minimax | 0.7% | 3.3% | $0.2405 |
| 28 | claude-opus-4-5-high | Anthropic | 0.3% | 3.3% | $0.8425 |
| 29 | claude-sonnet-4-5 | Anthropic | 0.0% | 3.3% | $1.1334 |
| 30 | claude-sonnet-4-5@thinking | Anthropic | 2.3% | 0.0% | $1.0436 |
| 31 | deepseek-v3.2-speciale | DeepSeek | 2.0% | -- | $0.1012 |
| 32 | deepseek-v3.2 | DeepSeek | 2.0% | 0.0% | $0.1815 |
| 33 | kimi-k2-thinking | Moonshot | 1.3% | 0.0% | $0.2710 |
| 34 | o1 | OpenAI | 0.7% | 0.0% | $0.8292 |
| 35 | qwen3.5-397b-a17b | Qwen | 0.7% | 0.0% | $0.0741 |
| 36 | glm-5 | Zhipu | 0.7% | 0.0% | $0.8609 |
| 37 | gemini-2.5-pro | Google | 0.3% | 0.0% | $0.4337 |
| 38 | gpt-5.2 | OpenAI | 0.3% | 0.0% | $0.0618 |
| 39 | minimax-m2.1 | Minimax | 0.3% | 0.0% | $0.2290 |
| 40 | gpt-oss-120b | OpenAI | 0.3% | -- | $0.0021 |
| 41 | qwen3-235b-a22b-thinking-2507 | Qwen | 0.3% | 0.0% | $0.0780 |
| 42 | qwen3-next-80b-a3b-thinking | Qwen | 0.3% | 0.0% | $0.2464 |
| 43 | qwen3-vl-235b-a22b-thinking | Qwen | 0.3% | -- | $0.0612 |
| 44 | mimo-v2-flash | Xiaomi | 0.3% | 0.0% | $0.0985 |
| 45 | glm-4.7 | Zhipu | 0.3% | 0.0% | $0.2265 |
| 46 | grok-code-fast-1 | xAI | 0.3% | 0.0% | $0.2578 |
| 47 | gpt-3.5-turbo | OpenAI | 0.0% | 0.0% | $0.0015 |
| 48 | gpt-4.1 | OpenAI | 0.0% | 0.0% | $19.5512 |
| 49 | gpt-4o | OpenAI | 0.0% | 0.0% | $5.0630 |
| 50 | devstral-2512 | Mistral | 0.0% | 0.0% | $0.0188 |
| 51 | mistral-large-2512 | Mistral | 0.0% | 0.0% | $3.5163 |
| 52 | qwen3-coder | Qwen | 0.0% | 0.0% | $0.0632 |
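Raw solve rate is not the only way to read this table: dividing agentic solve rate by cost per attempt gives a rough cost-effectiveness view. A minimal sketch below, using a handful of rows copied from the leaderboard; the solves-per-dollar metric is illustrative and not part of the benchmark's official scoring.

```python
# (model, agentic solve rate, cost per attempt in USD) — copied from the table above.
rows = [
    ("gpt-5.4@xhigh",              0.702, 8.0758),
    ("gpt-5.2@xhigh",              0.560, 5.0702),
    ("gpt-5.2@high",               0.367, 1.6400),
    ("claude-opus-4-6-1m",         0.367, 1.3548),
    ("claude-sonnet-4-6@thinking", 0.267, 0.8668),
    ("gpt-5.2@low",                0.100, 0.1186),
]

# Rank by expected solved puzzles per dollar spent (higher is cheaper per solve).
by_value = sorted(rows, key=lambda r: r[1] / r[2], reverse=True)
for name, acc, cost in by_value:
    print(f"{name:28s} {acc / cost:6.3f} solves/$")
```

On these rows the cheap low-reasoning configuration tops the value ranking even though it solves far fewer puzzles outright, which is the usual accuracy-versus-cost trade-off when picking a model for bulk evaluation.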