Pencil Puzzle Bench

A Benchmark for Multi-Step Verifiable Reasoning

Can AI solve pencil puzzles? We evaluated 51 frontier models on a selection of 300 puzzles from our 62k-puzzle dataset, spanning 20 types.

Dataset Puzzles: 62,231
Puzzle Types: 20
Models Tested: 51
Eval Runs: 17k

Model Leaderboard

| # | Model | Provider | Direct Ask | Agentic | Cost/Attempt |
|---|-------|----------|------------|---------|--------------|
| 1 | gpt-5.4@xhigh | OpenAI | -- | 70.2% | $8.0758 |
| 2 | gpt-5.2@xhigh | OpenAI | 27.0% | 56.0% | $5.2145 |
| 3 | gpt-5.2@high | OpenAI | 20.7% | 36.7% | $1.6702 |
| 4 | claude-opus-4-6-1m | Anthropic | 0.0% | 36.7% | $1.3548 |
| 5 | claude-opus-4-6@thinking | Anthropic | 27.3% | 33.3% | $3.8474 |
| 6 | gemini-3.1-pro | Google | 20.0% | 33.3% | $3.4593 |
| 7 | claude-opus-4-6 | Anthropic | 0.3% | 30.0% | $1.2089 |
| 8 | claude-sonnet-4-6@thinking | Anthropic | 10.3% | 26.7% | $0.8668 |
| 9 | gpt-5.2-pro | OpenAI | 9.7% | 26.7% | $6.2532 |
| 10 | gpt-5.2@medium | OpenAI | 9.3% | 23.3% | $0.6898 |
| 11 | claude-opus-4-6@max | Anthropic | 0.3% | 23.3% | $1.0277 |
| 12 | claude-sonnet-4-6-1m | Anthropic | 0.3% | 23.3% | $15.4337 |
| 13 | gemini-3-pro@high | Google | 3.3% | 16.7% | $1.2710 |
| 14 | claude-sonnet-4-6 | Anthropic | 0.3% | 16.7% | $0.8372 |
| 15 | gemini-3-pro | Google | 4.3% | 13.3% | $0.8876 |
| 16 | gemini-3-pro@minimal | Google | 4.0% | 10.0% | $1.2281 |
| 17 | gpt-5.2@low | OpenAI | 2.3% | 10.0% | $0.1186 |
| 18 | gpt-5.1@medium | OpenAI | 7.7% | 6.7% | $0.2794 |
| 19 | claude-opus-4-5@thinking | Anthropic | 6.0% | 6.7% | $1.1401 |
| 20 | gemini-3-flash@minimal | Google | 4.7% | 6.7% | $0.1708 |
| 21 | gemini-3-flash@high | Google | 3.0% | 6.7% | $0.1670 |
| 22 | gpt-5@medium | OpenAI | 6.0% | 3.3% | $1.1868 |
| 23 | kimi-k2.5 | Moonshot | 6.0% | 3.3% | $0.3559 |
| 24 | grok-4-1-fast | xAI | 5.7% | 3.3% | $0.0619 |
| 25 | grok-4-1-fast-reasoning | xAI | 5.3% | 0.0% | $0.0789 |
| 26 | o3 | OpenAI | 3.0% | 3.3% | $0.3995 |
| 27 | minimax-m2.5 | Minimax | 0.7% | 3.3% | $0.2411 |
| 28 | claude-opus-4-5-high | Anthropic | 0.3% | 3.3% | $0.8425 |
| 29 | claude-sonnet-4-5 | Anthropic | 0.0% | 3.3% | $1.1334 |
| 30 | claude-sonnet-4-5@thinking | Anthropic | 2.3% | 0.0% | $1.0436 |
| 31 | deepseek-v3.2-speciale | DeepSeek | 2.0% | -- | $0.1012 |
| 32 | deepseek-v3.2 | DeepSeek | 2.0% | 0.0% | $0.1823 |
| 33 | kimi-k2-thinking | Moonshot | 1.3% | 0.0% | $0.2749 |
| 34 | o1 | OpenAI | 0.7% | 0.0% | $0.8292 |
| 35 | qwen3.5-397b-a17b | Qwen | 0.7% | 0.0% | $0.0741 |
| 36 | glm-5 | Zhipu | 0.7% | 0.0% | $0.8676 |
| 37 | gemini-2.5-pro | Google | 0.3% | 0.0% | $0.4337 |
| 38 | gpt-5.2 | OpenAI | 0.3% | 0.0% | $0.0618 |
| 39 | minimax-m2.1 | Minimax | 0.3% | 0.0% | $0.2484 |
| 40 | gpt-oss-120b | OpenAI | 0.3% | -- | $0.0022 |
| 41 | qwen3-235b-a22b-thinking-2507 | Qwen | 0.3% | 0.0% | $0.0782 |
| 42 | qwen3-next-80b-a3b-thinking | Qwen | 0.3% | 0.0% | $0.2465 |
| 43 | qwen3-vl-235b-a22b-thinking | Qwen | 0.3% | -- | $0.0612 |
| 44 | mimo-v2-flash | Xiaomi | 0.3% | 0.0% | $0.0992 |
| 45 | glm-4.7 | Zhipu | 0.3% | 0.0% | $0.2266 |
| 46 | grok-code-fast-1 | xAI | 0.3% | 0.0% | $0.2743 |
| 47 | gpt-3.5-turbo | OpenAI | 0.0% | 0.0% | $0.0015 |
| 48 | gpt-4.1 | OpenAI | 0.0% | 0.0% | $19.5512 |
| 49 | gpt-4o | OpenAI | 0.0% | 0.0% | $5.0630 |
| 50 | devstral-2512 | Mistral | 0.0% | 0.0% | $0.0188 |
| 51 | mistral-large-2512 | Mistral | 0.0% | 0.0% | $3.5163 |
| 52 | qwen3-coder | Qwen | 0.0% | 0.0% | $0.0632 |