Pencil Puzzle Bench

A Benchmark for Multi-Step Verifiable Reasoning

Can AI solve pencil puzzles? We evaluated 51 frontier models on 300 puzzles selected from our 62k-puzzle dataset, spanning 20 puzzle types.

62,231 Dataset Puzzles
20 Puzzle Types
51 Models Tested
17k Eval Runs

Figure: Sample puzzle grid showing 9 puzzle types with AI solve attempts

Model Leaderboard

#   Model                          Provider   Direct Ask  Agentic  Cost/Attempt
1   gpt-5.5@xhigh                  OpenAI     --          83.3%    $3.0703
2   gpt-5.4@xhigh                  OpenAI     --          70.2%    $8.0758
3   gpt-5.2@xhigh                  OpenAI     27.0%       56.0%    $5.0702
4   claude-opus-4-7@thinking       Anthropic  --          50.0%    $9.6427
5   gpt-5.2@high                   OpenAI     20.7%       36.7%    $1.6400
6   claude-opus-4-6-1m             Anthropic  0.0%        36.7%    $1.3548
7   claude-opus-4-6@thinking       Anthropic  27.3%       33.3%    $3.8474
8   gemini-3.1-pro                 Google     20.0%       33.3%    $3.4593
9   claude-opus-4-6                Anthropic  0.3%        30.0%    $1.2089
10  claude-sonnet-4-6@thinking     Anthropic  10.3%       26.7%    $0.8668
11  gpt-5.2-pro                    OpenAI     9.7%        26.7%    $6.2333
12  gpt-5.2@medium                 OpenAI     9.3%        23.3%    $0.6855
13  claude-opus-4-6@max            Anthropic  0.3%        23.3%    $1.0277
14  claude-sonnet-4-6-1m           Anthropic  0.3%        23.3%    $15.4337
15  kimi-k2.6                      Moonshot   11.7%       20.0%    $1.0582
16  gemini-3-pro@high              Google     3.3%        16.7%    $1.2710
17  claude-sonnet-4-6              Anthropic  0.3%        16.7%    $0.8372
18  gemini-3-pro                   Google     4.3%        13.3%    $0.8876
19  gemini-3-pro@minimal           Google     4.0%        10.0%    $1.2281
20  gpt-5.2@low                    OpenAI     2.3%        10.0%    $0.1186
21  qwen3.6-plus                   Qwen       0.3%        10.0%    $0.0000
22  gpt-5.1@medium                 OpenAI     7.7%        6.7%     $0.2779
23  claude-opus-4-5@thinking       Anthropic  6.0%        6.7%     $1.1401
24  gemini-3-flash@minimal         Google     4.7%        6.7%     $0.1748
25  gemini-3-flash@high            Google     3.0%        6.7%     $0.1670
26  grok-4.20-reasoning            xAI        0.3%        6.7%     $0.2821
27  gpt-5@medium                   OpenAI     6.0%        3.3%     $1.1862
28  kimi-k2.5                      Moonshot   6.0%        3.3%     $0.3657
29  grok-4-1-fast                  xAI        5.7%        3.3%     $0.0619
30  grok-4-1-fast-reasoning        xAI        5.3%        0.0%     $0.0789
31  deepseek-v4-pro                DeepSeek   4.0%        0.0%     $0.6205
32  grok-4.3@xhigh                 xAI        3.3%        3.3%     $5.4048
33  o3                             OpenAI     3.0%        3.3%     $0.3995
34  minimax-m2.5                   Minimax    0.7%        3.3%     $0.2405
35  claude-opus-4-5-high           Anthropic  0.3%        3.3%     $0.8425
36  claude-sonnet-4-5              Anthropic  0.0%        3.3%     $1.1334
37  deepseek-v3.2-speciale         DeepSeek   2.3%        --       $0.0957
38  claude-sonnet-4-5@thinking     Anthropic  2.3%        0.0%     $1.0436
39  deepseek-v3.2                  DeepSeek   2.0%        0.0%     $0.1797
40  grok-4.3                       xAI        2.0%        0.0%     $1.1933
41  kimi-k2-thinking               Moonshot   1.3%        0.0%     $0.2710
42  mimo-v2-pro                    Xiaomi     1.0%        0.0%     $0.9533
43  o1                             OpenAI     0.7%        0.0%     $0.8292
44  minimax-m2.7                   Minimax    0.7%        0.0%     $0.4297
45  qwen3.5-397b-a17b              Qwen       0.7%        0.0%     $0.7384
46  glm-5                          Zhipu      0.7%        0.0%     $0.8609
47  gemini-2.5-pro                 Google     0.3%        0.0%     $0.4337
48  gpt-5.2                        OpenAI     0.3%        0.0%     $0.0618
49  gemma-4-31b-it                 Other      0.3%        0.0%     $0.0022
50  minimax-m2.1                   Minimax    0.3%        0.0%     $0.2290
51  gpt-oss-120b                   OpenAI     0.3%        --       $0.0021
52  qwen3-235b-a22b-thinking-2507  Qwen       0.3%        0.0%     $0.0780
53  qwen3-next-80b-a3b-thinking    Qwen       0.3%        0.0%     $0.2464
54  qwen3-vl-235b-a22b-thinking    Qwen       0.3%        --       $0.0612
55  mimo-v2-flash                  Xiaomi     0.3%        0.0%     $0.0822
56  glm-4.7                        Zhipu      0.3%        0.0%     $0.2265
57  grok-code-fast-1               xAI        0.3%        0.0%     $0.2744
58  gpt-3.5-turbo                  OpenAI     0.0%        0.0%     $0.0015
59  gpt-4.1                        OpenAI     0.0%        0.0%     $20.0511
60  gpt-4o                         OpenAI     0.0%        0.0%     $5.0630
61  devstral-2512                  Mistral    0.0%        0.0%     $0.0107
62  mistral-large-2512             Mistral    0.0%        0.0%     $3.5163
63  mistral-small-2603             Mistral    0.0%        0.0%     $0.9143
64  qwen3-coder                    Qwen       0.0%        0.0%     $0.0632
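
The page does not state how rows are ranked. The listed order appears consistent with sorting by the better of the Direct Ask and Agentic scores, with ties broken by the Direct Ask score. That rule is inferred from the rows, not documented. A minimal Python sketch under that assumption, using four rows from the leaderboard (the helper names `parse_pct` and `best_score` are hypothetical):

```python
from typing import Optional

def parse_pct(cell: str) -> Optional[float]:
    """Convert a cell such as '27.0%' to 27.0; '--' (no result listed) becomes None."""
    return None if cell == "--" else float(cell.rstrip("%"))

def best_score(direct: Optional[float], agentic: Optional[float]) -> float:
    """Return the higher of the two scores, ignoring missing ones (assumed ranking key)."""
    scores = [s for s in (direct, agentic) if s is not None]
    return max(scores) if scores else 0.0

# (model, Direct Ask, Agentic) copied from the leaderboard, deliberately out of order.
rows = [
    ("gpt-5.2@high", "20.7%", "36.7%"),
    ("gpt-5.5@xhigh", "--", "83.3%"),
    ("gpt-5.1@medium", "7.7%", "6.7%"),
    ("claude-opus-4-6-1m", "0.0%", "36.7%"),
]

def sort_key(row):
    direct, agentic = parse_pct(row[1]), parse_pct(row[2])
    # Assumed tie-break: higher Direct Ask score first; a missing score counts as 0.
    return (best_score(direct, agentic), direct or 0.0)

ranked = sorted(rows, key=sort_key, reverse=True)
ranked_names = [name for name, _, _ in ranked]
print(ranked_names)
```

Under this assumed key, gpt-5.1@medium is ranked on its 7.7% Direct Ask score rather than its 6.7% Agentic score, which matches its position above the models whose best score is 6.7%.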