Pencil Puzzle Bench

A Benchmark for Multi-Step Verifiable Reasoning

Can AI solve pencil puzzles? We evaluated 51 frontier models on 300 puzzles selected from our 62k-puzzle dataset, spanning 20 puzzle types.

Dataset Puzzles: 62,231
Puzzle Types: 20
Models Tested: 51
Eval Runs: 17k

[Figure: sample puzzle grid showing 9 puzzle types with AI solve attempts]

Model Leaderboard

| # | Model | Provider | Direct Ask | Agentic | Cost/Agentic |
|---|---|---|---|---|---|
| 1 | gpt-5.5@xhigh | OpenAI | -- | 83.3% | $3.0703 |
| 2 | gpt-5.4@xhigh | OpenAI | -- | 70.2% | $8.0759 |
| 3 | gpt-5.2@xhigh | OpenAI | 27.0% | 56.0% | $9.7432 |
| 4 | claude-opus-4-7@thinking | Anthropic | -- | 50.0% | $9.6427 |
| 5 | gemini-3.5-flash@high | Google | 13.0% | 43.3% | $22.9637 |
| 6 | gpt-5.2@high | OpenAI | 20.7% | 36.7% | $7.3048 |
| 7 | claude-opus-4-6-1m | Anthropic | 0.0% | 36.7% | $14.1105 |
| 8 | claude-opus-4-6@thinking | Anthropic | 27.3% | 33.3% | $6.2425 |
| 9 | gemini-3.1-pro | Google | 20.0% | 33.3% | $3.6678 |
| 10 | claude-opus-4-6 | Anthropic | 0.3% | 30.0% | $10.8893 |
| 11 | claude-sonnet-4-6@thinking | Anthropic | 10.3% | 26.7% | $3.9361 |
| 12 | gpt-5.2-pro | OpenAI | 9.7% | 26.7% | $41.5243 |
| 13 | gpt-5.2@medium | OpenAI | 9.3% | 23.3% | $2.8149 |
| 14 | claude-opus-4-6@max | Anthropic | 0.3% | 23.3% | $10.8699 |
| 15 | claude-sonnet-4-6-1m | Anthropic | 0.3% | 23.3% | $169.2700 |
| 16 | kimi-k2.6 | Moonshot | 11.7% | 20.0% | $6.2137 |
| 17 | gemini-3-pro@high | Google | 3.3% | 16.7% | $3.0609 |
| 18 | claude-sonnet-4-6 | Anthropic | 0.3% | 16.7% | $8.9114 |
| 19 | gemini-3-pro | Google | 4.3% | 13.3% | $2.4814 |
| 20 | gemini-3-pro@minimal | Google | 4.0% | 10.0% | $3.3548 |
| 21 | gpt-5.2@low | OpenAI | 2.3% | 10.0% | $0.5968 |
| 22 | qwen3.6-plus | Qwen | 0.3% | 10.0% | $0.0000 |
| 23 | gpt-5.1@medium | OpenAI | 7.7% | 6.7% | $0.8125 |
| 24 | claude-opus-4-5@thinking | Anthropic | 6.0% | 6.7% | $4.7130 |
| 25 | gemini-3-flash@minimal | Google | 4.7% | 6.7% | $0.4156 |
| 26 | gemini-3-flash@high | Google | 3.0% | 6.7% | $0.3958 |
| 27 | grok-4.20-reasoning | xAI | 0.3% | 6.7% | $1.1939 |
| 28 | gpt-5@medium | OpenAI | 6.0% | 3.3% | $11.2062 |
| 29 | kimi-k2.5 | Moonshot | 6.0% | 3.3% | $1.7154 |
| 30 | grok-4-1-fast | xAI | 5.7% | 3.3% | $0.4391 |
| 31 | grok-4-1-fast-reasoning | xAI | 5.3% | 0.0% | $0.4880 |
| 32 | deepseek-v4-pro | DeepSeek | 4.0% | 0.0% | $3.5881 |
| 33 | grok-4.3@xhigh | xAI | 3.3% | 3.3% | $58.5513 |
| 34 | o3 | OpenAI | 3.0% | 3.3% | $2.4470 |
| 35 | minimax-m2.5 | Minimax | 0.7% | 3.3% | $1.9467 |
| 36 | claude-opus-4-5-high | Anthropic | 0.3% | 3.3% | $8.8238 |
| 37 | claude-sonnet-4-5 | Anthropic | 0.0% | 3.3% | $12.1957 |
| 38 | deepseek-v3.2-speciale | DeepSeek | 2.3% | -- | -- |
| 39 | claude-sonnet-4-5@thinking | Anthropic | 2.3% | 0.0% | $7.4948 |
| 40 | deepseek-v3.2 | DeepSeek | 2.0% | 0.0% | $1.5830 |
| 41 | grok-4.3 | xAI | 2.0% | 0.0% | $12.7441 |
| 42 | kimi-k2-thinking | Moonshot | 1.3% | 0.0% | $1.0019 |
| 43 | mimo-v2-pro | Xiaomi | 1.0% | 0.0% | $9.2958 |
| 44 | o1 | OpenAI | 0.7% | 0.0% | $4.0281 |
| 45 | minimax-m2.7 | Minimax | 0.7% | 0.0% | $3.7122 |
| 46 | qwen3.5-397b-a17b | Qwen | 0.7% | 0.0% | $7.3052 |
| 47 | glm-5 | Zhipu | 0.7% | 0.0% | $8.4639 |
| 48 | gemini-2.5-pro | Google | 0.3% | 0.0% | $1.1731 |
| 49 | gpt-5.2 | OpenAI | 0.3% | 0.0% | $0.6118 |
| 50 | gemma-4-31b-it | Other | 0.3% | 0.0% | $0.0145 |
| 51 | minimax-m2.1 | Minimax | 0.3% | 0.0% | $1.6709 |
| 52 | gpt-oss-120b | OpenAI | 0.3% | -- | -- |
| 53 | qwen3-235b-a22b-thinking-2507 | Qwen | 0.3% | 0.0% | $0.7058 |
| 54 | qwen3-next-80b-a3b-thinking | Qwen | 0.3% | 0.0% | $2.6243 |
| 55 | qwen3-vl-235b-a22b-thinking | Qwen | 0.3% | -- | -- |
| 56 | mimo-v2-flash | Xiaomi | 0.3% | 0.0% | $0.7595 |
| 57 | glm-4.7 | Zhipu | 0.3% | 0.0% | $1.7271 |
| 58 | grok-code-fast-1 | xAI | 0.3% | 0.0% | $1.0183 |
| 59 | gpt-3.5-turbo | OpenAI | 0.0% | 0.0% | $0.0000 |
| 60 | gpt-4.1 | OpenAI | 0.0% | 0.0% | $220.4705 |
| 61 | gpt-4o | OpenAI | 0.0% | 0.0% | $55.5660 |
| 62 | devstral-2512 | Mistral | 0.0% | 0.0% | $0.1113 |
| 63 | mistral-large-2512 | Mistral | 0.0% | 0.0% | $38.6499 |
| 64 | mistral-small-2603 | Mistral | 0.0% | 0.0% | $10.0508 |
| 65 | qwen3-coder | Qwen | 0.0% | 0.0% | $0.6826 |
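
One way to read the two score columns side by side is to compute how many percentage points a model gains (or loses) when moving from Direct Ask to Agentic. The sketch below is only an illustration: the "uplift" figure is our own derived quantity, not a metric the benchmark reports, and the names `Row`, `agentic_uplift`, and `ranked_by_uplift` are hypothetical. The five rows are copied from the leaderboard above; a `--` entry is represented as `None` and excluded from the ranking.

```python
from typing import List, NamedTuple, Optional, Tuple


class Row(NamedTuple):
    model: str
    direct_pct: Optional[float]   # None where the leaderboard shows "--"
    agentic_pct: Optional[float]  # None where the leaderboard shows "--"


# A handful of rows copied from the leaderboard (not the full table).
ROWS = [
    Row("gpt-5.5@xhigh", None, 83.3),
    Row("gpt-5.2@xhigh", 27.0, 56.0),
    Row("claude-opus-4-6-1m", 0.0, 36.7),
    Row("claude-opus-4-6@thinking", 27.3, 33.3),
    Row("gpt-5.1@medium", 7.7, 6.7),
]


def agentic_uplift(row: Row) -> Optional[float]:
    """Agentic minus Direct Ask, in percentage points.

    Returns None when either score is missing. This is an illustrative
    derived figure, not something the benchmark itself publishes.
    """
    if row.direct_pct is None or row.agentic_pct is None:
        return None
    return round(row.agentic_pct - row.direct_pct, 1)


def ranked_by_uplift(rows: List[Row]) -> List[Tuple[str, float]]:
    """Return (model, uplift) pairs with both scores present, largest first."""
    scored = [(r.model, agentic_uplift(r)) for r in rows]
    present = [(m, u) for m, u in scored if u is not None]
    return sorted(present, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for model, uplift in ranked_by_uplift(ROWS):
        print(f"{model:28s} {uplift:+.1f} pp")
```

On these rows the largest gain belongs to claude-opus-4-6-1m (0.0% to 36.7%), while gpt-5.1@medium scores slightly lower in the Agentic setting than with Direct Ask.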