Pencil Puzzle Bench
A Benchmark for Multi-Step Verifiable Reasoning
Can AI solve pencil puzzles? We evaluated 51 frontier models on a selection of 300 puzzles from our 62k-puzzle dataset, spanning 20 types.
62,231 Dataset Puzzles
20 Puzzle Types
51 Models Tested
17k Eval Runs
Model Leaderboard
| # | Model | Provider | Direct Ask | Agentic | Cost/Attempt |
|---|---|---|---|---|---|
| 1 | gpt-5.5@xhigh | OpenAI | -- | 83.3% | $3.0703 |
| 2 | gpt-5.4@xhigh | OpenAI | -- | 70.2% | $8.0758 |
| 3 | gpt-5.2@xhigh | OpenAI | 27.0% | 56.0% | $5.0702 |
| 4 | claude-opus-4-7@thinking | Anthropic | -- | 50.0% | $9.6427 |
| 5 | gpt-5.2@high | OpenAI | 20.7% | 36.7% | $1.6400 |
| 6 | claude-opus-4-6-1m | Anthropic | 0.0% | 36.7% | $1.3548 |
| 7 | claude-opus-4-6@thinking | Anthropic | 27.3% | 33.3% | $3.8474 |
| 8 | gemini-3.1-pro | Google | 20.0% | 33.3% | $3.4593 |
| 9 | claude-opus-4-6 | Anthropic | 0.3% | 30.0% | $1.2089 |
| 10 | claude-sonnet-4-6@thinking | Anthropic | 10.3% | 26.7% | $0.8668 |
| 11 | gpt-5.2-pro | OpenAI | 9.7% | 26.7% | $6.2333 |
| 12 | gpt-5.2@medium | OpenAI | 9.3% | 23.3% | $0.6855 |
| 13 | claude-opus-4-6@max | Anthropic | 0.3% | 23.3% | $1.0277 |
| 14 | claude-sonnet-4-6-1m | Anthropic | 0.3% | 23.3% | $15.4337 |
| 15 | kimi-k2.6 | Moonshot | 11.7% | 20.0% | $1.0582 |
| 16 | gemini-3-pro@high | Google | 3.3% | 16.7% | $1.2710 |
| 17 | claude-sonnet-4-6 | Anthropic | 0.3% | 16.7% | $0.8372 |
| 18 | gemini-3-pro | Google | 4.3% | 13.3% | $0.8876 |
| 19 | gemini-3-pro@minimal | Google | 4.0% | 10.0% | $1.2281 |
| 20 | gpt-5.2@low | OpenAI | 2.3% | 10.0% | $0.1186 |
| 21 | qwen3.6-plus | Qwen | 0.3% | 10.0% | $0.0000 |
| 22 | gpt-5.1@medium | OpenAI | 7.7% | 6.7% | $0.2779 |
| 23 | claude-opus-4-5@thinking | Anthropic | 6.0% | 6.7% | $1.1401 |
| 24 | gemini-3-flash@minimal | Google | 4.7% | 6.7% | $0.1748 |
| 25 | gemini-3-flash@high | Google | 3.0% | 6.7% | $0.1670 |
| 26 | grok-4.20-reasoning | xAI | 0.3% | 6.7% | $0.2821 |
| 27 | gpt-5@medium | OpenAI | 6.0% | 3.3% | $1.1862 |
| 28 | kimi-k2.5 | Moonshot | 6.0% | 3.3% | $0.3657 |
| 29 | grok-4-1-fast | xAI | 5.7% | 3.3% | $0.0619 |
| 30 | grok-4-1-fast-reasoning | xAI | 5.3% | 0.0% | $0.0789 |
| 31 | deepseek-v4-pro | DeepSeek | 4.0% | 0.0% | $0.6205 |
| 32 | grok-4.3@xhigh | xAI | 3.3% | 3.3% | $5.4048 |
| 33 | o3 | OpenAI | 3.0% | 3.3% | $0.3995 |
| 34 | minimax-m2.5 | Minimax | 0.7% | 3.3% | $0.2405 |
| 35 | claude-opus-4-5-high | Anthropic | 0.3% | 3.3% | $0.8425 |
| 36 | claude-sonnet-4-5 | Anthropic | 0.0% | 3.3% | $1.1334 |
| 37 | deepseek-v3.2-speciale | DeepSeek | 2.3% | -- | $0.0957 |
| 38 | claude-sonnet-4-5@thinking | Anthropic | 2.3% | 0.0% | $1.0436 |
| 39 | deepseek-v3.2 | DeepSeek | 2.0% | 0.0% | $0.1797 |
| 40 | grok-4.3 | xAI | 2.0% | 0.0% | $1.1933 |
| 41 | kimi-k2-thinking | Moonshot | 1.3% | 0.0% | $0.2710 |
| 42 | mimo-v2-pro | Xiaomi | 1.0% | 0.0% | $0.9533 |
| 43 | o1 | OpenAI | 0.7% | 0.0% | $0.8292 |
| 44 | minimax-m2.7 | Minimax | 0.7% | 0.0% | $0.4297 |
| 45 | qwen3.5-397b-a17b | Qwen | 0.7% | 0.0% | $0.7384 |
| 46 | glm-5 | Zhipu | 0.7% | 0.0% | $0.8609 |
| 47 | gemini-2.5-pro | Google | 0.3% | 0.0% | $0.4337 |
| 48 | gpt-5.2 | OpenAI | 0.3% | 0.0% | $0.0618 |
| 49 | gemma-4-31b-it | Other | 0.3% | 0.0% | $0.0022 |
| 50 | minimax-m2.1 | Minimax | 0.3% | 0.0% | $0.2290 |
| 51 | gpt-oss-120b | OpenAI | 0.3% | -- | $0.0021 |
| 52 | qwen3-235b-a22b-thinking-2507 | Qwen | 0.3% | 0.0% | $0.0780 |
| 53 | qwen3-next-80b-a3b-thinking | Qwen | 0.3% | 0.0% | $0.2464 |
| 54 | qwen3-vl-235b-a22b-thinking | Qwen | 0.3% | -- | $0.0612 |
| 55 | mimo-v2-flash | Xiaomi | 0.3% | 0.0% | $0.0822 |
| 56 | glm-4.7 | Zhipu | 0.3% | 0.0% | $0.2265 |
| 57 | grok-code-fast-1 | xAI | 0.3% | 0.0% | $0.2744 |
| 58 | gpt-3.5-turbo | OpenAI | 0.0% | 0.0% | $0.0015 |
| 59 | gpt-4.1 | OpenAI | 0.0% | 0.0% | $20.0511 |
| 60 | gpt-4o | OpenAI | 0.0% | 0.0% | $5.0630 |
| 61 | devstral-2512 | Mistral | 0.0% | 0.0% | $0.0107 |
| 62 | mistral-large-2512 | Mistral | 0.0% | 0.0% | $3.5163 |
| 63 | mistral-small-2603 | Mistral | 0.0% | 0.0% | $0.9143 |
| 64 | qwen3-coder | Qwen | 0.0% | 0.0% | $0.0632 |
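
The leaderboard does not state how rows are ranked. The order looks consistent with sorting by the higher of each model's two success rates (Direct Ask or Agentic), but that is an inference from the table, not a documented rule. The sketch below checks this reading on a handful of rows copied from the table; the helper names `best_score` and `is_ranked_by_best` are hypothetical, and `None` stands for a `--` cell.

```python
from typing import List, Optional, Tuple

# (model, direct_ask_pct, agentic_pct) copied from the leaderboard above.
# None represents a "--" cell, i.e. no result reported for that mode.
Row = Tuple[str, Optional[float], Optional[float]]

ROWS: List[Row] = [
    ("gpt-5.5@xhigh", None, 83.3),
    ("gpt-5.2@xhigh", 27.0, 56.0),
    ("gpt-5.1@medium", 7.7, 6.7),
    ("grok-4-1-fast-reasoning", 5.3, 0.0),
    ("grok-4.3@xhigh", 3.3, 3.3),
    ("deepseek-v3.2-speciale", 2.3, None),
]


def best_score(direct: Optional[float], agentic: Optional[float]) -> float:
    """Return the higher of the two success rates, skipping missing cells."""
    scores = [s for s in (direct, agentic) if s is not None]
    return max(scores) if scores else 0.0


def is_ranked_by_best(rows: List[Row]) -> bool:
    """True if the rows are in non-increasing order of best_score (assumed rule)."""
    bests = [best_score(direct, agentic) for _, direct, agentic in rows]
    return all(a >= b for a, b in zip(bests, bests[1:]))


print(is_ranked_by_best(ROWS))
```

Under this reading, a model with a strong Direct Ask score but a weak Agentic score (such as grok-4-1-fast-reasoning at 5.3% and 0.0%) can sit above a model with a higher Agentic score (grok-4.3@xhigh at 3.3%), which matches the table.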