Pencil Puzzle Bench
A Benchmark for Multi-Step Verifiable Reasoning
Can AI solve pencil puzzles? We evaluated 51 frontier models on a selection of 300 puzzles from our 62k puzzle dataset, spanning 20 types.
62,231 Dataset Puzzles
20 Puzzle Types
51 Models Tested
17k Eval Runs
Model Leaderboard
| # | Model | Provider | Direct Ask | Agentic | Cost/Agentic |
|---|---|---|---|---|---|
| 1 | gpt-5.5@xhigh | OpenAI | -- | 83.3% | $3.0703 |
| 2 | gpt-5.4@xhigh | OpenAI | -- | 70.2% | $8.0759 |
| 3 | gpt-5.2@xhigh | OpenAI | 27.0% | 56.0% | $9.7432 |
| 4 | claude-opus-4-7@thinking | Anthropic | -- | 50.0% | $9.6427 |
| 5 | gemini-3.5-flash@high | Google | 13.0% | 43.3% | $22.9637 |
| 6 | gpt-5.2@high | OpenAI | 20.7% | 36.7% | $7.3048 |
| 7 | claude-opus-4-6-1m | Anthropic | 0.0% | 36.7% | $14.1105 |
| 8 | claude-opus-4-6@thinking | Anthropic | 27.3% | 33.3% | $6.2425 |
| 9 | gemini-3.1-pro | Google | 20.0% | 33.3% | $3.6678 |
| 10 | claude-opus-4-6 | Anthropic | 0.3% | 30.0% | $10.8893 |
| 11 | claude-sonnet-4-6@thinking | Anthropic | 10.3% | 26.7% | $3.9361 |
| 12 | gpt-5.2-pro | OpenAI | 9.7% | 26.7% | $41.5243 |
| 13 | gpt-5.2@medium | OpenAI | 9.3% | 23.3% | $2.8149 |
| 14 | claude-opus-4-6@max | Anthropic | 0.3% | 23.3% | $10.8699 |
| 15 | claude-sonnet-4-6-1m | Anthropic | 0.3% | 23.3% | $169.2700 |
| 16 | kimi-k2.6 | Moonshot | 11.7% | 20.0% | $6.2137 |
| 17 | gemini-3-pro@high | Google | 3.3% | 16.7% | $3.0609 |
| 18 | claude-sonnet-4-6 | Anthropic | 0.3% | 16.7% | $8.9114 |
| 19 | gemini-3-pro | Google | 4.3% | 13.3% | $2.4814 |
| 20 | gemini-3-pro@minimal | Google | 4.0% | 10.0% | $3.3548 |
| 21 | gpt-5.2@low | OpenAI | 2.3% | 10.0% | $0.5968 |
| 22 | qwen3.6-plus | Qwen | 0.3% | 10.0% | $0.0000 |
| 23 | gpt-5.1@medium | OpenAI | 7.7% | 6.7% | $0.8125 |
| 24 | claude-opus-4-5@thinking | Anthropic | 6.0% | 6.7% | $4.7130 |
| 25 | gemini-3-flash@minimal | Google | 4.7% | 6.7% | $0.4156 |
| 26 | gemini-3-flash@high | Google | 3.0% | 6.7% | $0.3958 |
| 27 | grok-4.20-reasoning | xAI | 0.3% | 6.7% | $1.1939 |
| 28 | gpt-5@medium | OpenAI | 6.0% | 3.3% | $11.2062 |
| 29 | kimi-k2.5 | Moonshot | 6.0% | 3.3% | $1.7154 |
| 30 | grok-4-1-fast | xAI | 5.7% | 3.3% | $0.4391 |
| 31 | grok-4-1-fast-reasoning | xAI | 5.3% | 0.0% | $0.4880 |
| 32 | deepseek-v4-pro | DeepSeek | 4.0% | 0.0% | $3.5881 |
| 33 | grok-4.3@xhigh | xAI | 3.3% | 3.3% | $58.5513 |
| 34 | o3 | OpenAI | 3.0% | 3.3% | $2.4470 |
| 35 | minimax-m2.5 | Minimax | 0.7% | 3.3% | $1.9467 |
| 36 | claude-opus-4-5-high | Anthropic | 0.3% | 3.3% | $8.8238 |
| 37 | claude-sonnet-4-5 | Anthropic | 0.0% | 3.3% | $12.1957 |
| 38 | deepseek-v3.2-speciale | DeepSeek | 2.3% | -- | -- |
| 39 | claude-sonnet-4-5@thinking | Anthropic | 2.3% | 0.0% | $7.4948 |
| 40 | deepseek-v3.2 | DeepSeek | 2.0% | 0.0% | $1.5830 |
| 41 | grok-4.3 | xAI | 2.0% | 0.0% | $12.7441 |
| 42 | kimi-k2-thinking | Moonshot | 1.3% | 0.0% | $1.0019 |
| 43 | mimo-v2-pro | Xiaomi | 1.0% | 0.0% | $9.2958 |
| 44 | o1 | OpenAI | 0.7% | 0.0% | $4.0281 |
| 45 | minimax-m2.7 | Minimax | 0.7% | 0.0% | $3.7122 |
| 46 | qwen3.5-397b-a17b | Qwen | 0.7% | 0.0% | $7.3052 |
| 47 | glm-5 | Zhipu | 0.7% | 0.0% | $8.4639 |
| 48 | gemini-2.5-pro | Google | 0.3% | 0.0% | $1.1731 |
| 49 | gpt-5.2 | OpenAI | 0.3% | 0.0% | $0.6118 |
| 50 | gemma-4-31b-it | Other | 0.3% | 0.0% | $0.0145 |
| 51 | minimax-m2.1 | Minimax | 0.3% | 0.0% | $1.6709 |
| 52 | gpt-oss-120b | OpenAI | 0.3% | -- | -- |
| 53 | qwen3-235b-a22b-thinking-2507 | Qwen | 0.3% | 0.0% | $0.7058 |
| 54 | qwen3-next-80b-a3b-thinking | Qwen | 0.3% | 0.0% | $2.6243 |
| 55 | qwen3-vl-235b-a22b-thinking | Qwen | 0.3% | -- | -- |
| 56 | mimo-v2-flash | Xiaomi | 0.3% | 0.0% | $0.7595 |
| 57 | glm-4.7 | Zhipu | 0.3% | 0.0% | $1.7271 |
| 58 | grok-code-fast-1 | xAI | 0.3% | 0.0% | $1.0183 |
| 59 | gpt-3.5-turbo | OpenAI | 0.0% | 0.0% | $0.0000 |
| 60 | gpt-4.1 | OpenAI | 0.0% | 0.0% | $220.4705 |
| 61 | gpt-4o | OpenAI | 0.0% | 0.0% | $55.5660 |
| 62 | devstral-2512 | Mistral | 0.0% | 0.0% | $0.1113 |
| 63 | mistral-large-2512 | Mistral | 0.0% | 0.0% | $38.6499 |
| 64 | mistral-small-2603 | Mistral | 0.0% | 0.0% | $10.0508 |
| 65 | qwen3-coder | Qwen | 0.0% | 0.0% | $0.6826 |