Pencil Puzzle Bench

A Benchmark for Multi-Step Verifiable Reasoning

Can AI solve pencil puzzles? We evaluated 51 frontier models on 300 puzzles selected from our 62k-puzzle dataset, spanning 20 puzzle types.

62,231 Dataset Puzzles
20 Puzzle Types
51 Models Tested
17k Eval Runs
[Figure: sample puzzle grid showing 9 puzzle types with AI solve attempts]

Model Leaderboard

#   Model                          Provider   Direct Ask  Agentic  Cost/Attempt
1   gpt-5.4@xhigh                  OpenAI     --          70.2%    $8.0758
2   gpt-5.2@xhigh                  OpenAI     27.0%       56.0%    $5.0702
3   gpt-5.2@high                   OpenAI     20.7%       36.7%    $1.6400
4   claude-opus-4-6-1m             Anthropic  0.0%        36.7%    $1.3548
5   claude-opus-4-6@thinking       Anthropic  27.3%       33.3%    $3.8474
6   gemini-3.1-pro                 Google     20.0%       33.3%    $3.4593
7   claude-opus-4-6                Anthropic  0.3%        30.0%    $1.2089
8   claude-sonnet-4-6@thinking     Anthropic  10.3%       26.7%    $0.8668
9   gpt-5.2-pro                    OpenAI     9.7%        26.7%    $6.2333
10  gpt-5.2@medium                 OpenAI     9.3%        23.3%    $0.6855
11  claude-opus-4-6@max            Anthropic  0.3%        23.3%    $1.0277
12  claude-sonnet-4-6-1m           Anthropic  0.3%        23.3%    $15.4337
13  gemini-3-pro@high              Google     3.3%        16.7%    $1.2710
14  claude-sonnet-4-6              Anthropic  0.3%        16.7%    $0.8372
15  gemini-3-pro                   Google     4.3%        13.3%    $0.8876
16  gemini-3-pro@minimal           Google     4.0%        10.0%    $1.2281
17  gpt-5.2@low                    OpenAI     2.3%        10.0%    $0.1186
18  gpt-5.1@medium                 OpenAI     7.7%        6.7%     $0.2779
19  claude-opus-4-5@thinking       Anthropic  6.0%        6.7%     $1.1401
20  gemini-3-flash@minimal         Google     4.7%        6.7%     $0.1708
21  gemini-3-flash@high            Google     3.0%        6.7%     $0.1670
22  gpt-5@medium                   OpenAI     6.0%        3.3%     $1.1862
23  kimi-k2.5                      Moonshot   6.0%        3.3%     $0.3549
24  grok-4-1-fast                  xAI        5.7%        3.3%     $0.0415
25  grok-4-1-fast-reasoning        xAI        5.3%        0.0%     $0.0563
26  o3                             OpenAI     3.0%        3.3%     $0.3995
27  minimax-m2.5                   Minimax    0.7%        3.3%     $0.2405
28  claude-opus-4-5-high           Anthropic  0.3%        3.3%     $0.8425
29  claude-sonnet-4-5              Anthropic  0.0%        3.3%     $1.1334
30  claude-sonnet-4-5@thinking     Anthropic  2.3%        0.0%     $1.0436
31  deepseek-v3.2-speciale         DeepSeek   2.0%        --       $0.1012
32  deepseek-v3.2                  DeepSeek   2.0%        0.0%     $0.1815
33  kimi-k2-thinking               Moonshot   1.3%        0.0%     $0.2710
34  o1                             OpenAI     0.7%        0.0%     $0.8292
35  qwen3.5-397b-a17b              Qwen       0.7%        0.0%     $0.0741
36  glm-5                          Zhipu      0.7%        0.0%     $0.8609
37  gemini-2.5-pro                 Google     0.3%        0.0%     $0.4337
38  gpt-5.2                        OpenAI     0.3%        0.0%     $0.0618
39  minimax-m2.1                   Minimax    0.3%        0.0%     $0.2290
40  gpt-oss-120b                   OpenAI     0.3%        --       $0.0021
41  qwen3-235b-a22b-thinking-2507  Qwen       0.3%        0.0%     $0.0780
42  qwen3-next-80b-a3b-thinking    Qwen       0.3%        0.0%     $0.2464
43  qwen3-vl-235b-a22b-thinking    Qwen       0.3%        --       $0.0612
44  mimo-v2-flash                  Xiaomi     0.3%        0.0%     $0.0985
45  glm-4.7                        Zhipu      0.3%        0.0%     $0.2265
46  grok-code-fast-1               xAI        0.3%        0.0%     $0.2578
47  gpt-3.5-turbo                  OpenAI     0.0%        0.0%     $0.0015
48  gpt-4.1                        OpenAI     0.0%        0.0%     $19.5512
49  gpt-4o                         OpenAI     0.0%        0.0%     $5.0630
50  devstral-2512                  Mistral    0.0%        0.0%     $0.0188
51  mistral-large-2512             Mistral    0.0%        0.0%     $3.5163
52  qwen3-coder                    Qwen       0.0%        0.0%     $0.0632
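
As a rough aid to reading the leaderboard, the Python sketch below turns a Direct Ask percentage into an approximate solved-puzzle count and estimates a cost per agentic solve. It rests on two assumptions the page does not state: that Direct Ask rates are computed over all 300 evaluated puzzles, and that Cost/Attempt applies to agentic attempts. The names `approx_solved`, `cost_per_solve`, and `ROWS` are illustrative and not part of the benchmark's tooling; the sample rows are copied from the leaderboard.

```python
from typing import Optional

# From the intro: 300 puzzles evaluated. ASSUMPTION: this is the
# denominator behind the Direct Ask column.
PUZZLES_EVALUATED = 300

# (model, direct_ask_pct, agentic_pct, cost_per_attempt_usd)
# None mirrors the "--" entries in the leaderboard.
ROWS = [
    ("gpt-5.4@xhigh", None, 70.2, 8.0758),
    ("gpt-5.2@xhigh", 27.0, 56.0, 5.0702),
    ("claude-opus-4-6@thinking", 27.3, 33.3, 3.8474),
    ("gpt-5.2@low", 2.3, 10.0, 0.1186),
]


def approx_solved(pct: Optional[float], total: int = PUZZLES_EVALUATED) -> Optional[int]:
    """Convert a solve percentage into an approximate solved-puzzle count."""
    if pct is None:
        return None
    return round(pct / 100 * total)


def cost_per_solve(pct: Optional[float], cost_per_attempt: float) -> Optional[float]:
    """Estimate dollars spent per solved puzzle.

    ASSUMPTION: cost_per_attempt and pct describe the same set of runs.
    Returns None when the rate is missing or zero.
    """
    if not pct:
        return None
    return cost_per_attempt / (pct / 100)


if __name__ == "__main__":
    for model, direct, agentic, cost in ROWS:
        solved = approx_solved(direct)
        per_solve = cost_per_solve(agentic, cost)
        per_solve_text = "n/a" if per_solve is None else f"${per_solve:.2f}"
        print(f"{model}: direct-ask solved ~{solved}, cost per agentic solve ~{per_solve_text}")
```

Under these assumptions, a cheap configuration with a low solve rate can still cost less per solved puzzle than a stronger but more expensive one, which is why the two columns are worth reading together.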