Overall Leaderboard

47 tasks · 5 categories

Across 47 engineering tasks, the headline score is average rank (lower is better). On each task, models are sorted by best feasible score and assigned a rank; average rank is the mean of those ranks across all tasks, which lets models be compared without mixing incompatible units. We also report the peer-relative Medal Score, which credits a model only for reaching a task's gold, silver, or bronze podium.

Frontier Models

Frontier AI models evaluated directly on engineering problems — no extra search frameworks, just raw model capability.

Medal score

On each task, the three best scores in the v1 snapshot (2026-04-14) are frozen as the gold, silver, and bronze baselines. A model earns 1.00, 0.67, or 0.33 points for the highest baseline it reaches on that task, and 0 if it reaches none; the Medal Score is the mean of these points over the task set, so it lies in [0, 1]. It rewards only reaching each task's frontier and ignores negligible long-tail margins. We report it on both the full v1 set (47 tasks) and the v1-lite subset (10 tasks).

v1 · 47 tasks

#	Model	Medal	🥇	🥈	🥉
1	GPT-5.4	0.596	24	5	2
2	Claude Opus 4.6	0.490	9	18	6
3	GLM-5	0.312	4	10	12
4	DeepSeek V3.2	0.248	3	9	8
5	Gemini 3.1 Pro Preview	0.213	3	6	9
6	Seed 2.0 Pro	0.185	3	7	3
7	Grok 4.20	0.184	3	6	5
8	Qwen3 Coder Next	0.121	3	3	2

v1-lite · 10 tasks

#	Model	Medal	🥇	🥈	🥉
1	GPT-5.4	0.667	6	1	0
2	Claude Opus 4.6	0.501	2	4	1
3	GLM-5	0.233	0	2	3
4	Gemini 3.1 Pro Preview	0.200	0	2	2
5	DeepSeek V3.2	0.166	0	1	3
6	Grok 4.20	0.133	1	0	1
7	Seed 2.0 Pro	0.100	1	0	0
8	Qwen3 Coder Next	0.000	0	0	0
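
The Medal column in both tables can be reproduced from the medal counts alone. The snippet below is a minimal sketch of that arithmetic, using the 1.00 / 0.67 / 0.33 point values stated above; the function name `medal_score` is only illustrative and is not taken from the benchmark's code.

```python
# Point values per medal as stated in the text (gold, silver, bronze).
MEDAL_POINTS = (1.00, 0.67, 0.33)


def medal_score(gold: int, silver: int, bronze: int, n_tasks: int) -> float:
    """Mean per-task medal points; tasks without a medal contribute 0."""
    total = (
        gold * MEDAL_POINTS[0]
        + silver * MEDAL_POINTS[1]
        + bronze * MEDAL_POINTS[2]
    )
    return total / n_tasks


# Medal counts copied from the tables above.
print(round(medal_score(24, 5, 2, 47), 3))  # GPT-5.4 on v1 -> 0.596
print(round(medal_score(2, 4, 1, 10), 3))   # Claude Opus 4.6 on v1-lite -> 0.501
```

Because silver and bronze use the rounded values 0.67 and 0.33 rather than 2/3 and 1/3, a few entries differ in the third decimal from what exact thirds would give (for example 0.501 instead of 0.500).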

Average rank

Mean within-task rank over all 47 tasks (lower is better), across the paper's nine models.

#	Model	Avg. rank
1	GPT-5.4	3.54
2	Claude Opus 4.6	3.63
3	GLM-5	4.34
4	DeepSeek V3.2	4.76
5	gpt-oss-120b	4.81
6	Gemini 3.1 Pro Preview	5.53
7	Grok 4.20	5.82
8	Seed 2.0 Pro	5.86
9	Qwen3 Coder Next	6.71
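
As a rough illustration of how an average rank can be computed, the sketch below ranks models within each task and averages the ranks per model. Several details are assumptions rather than facts from the leaderboard: that higher scores are better on every task, that tied models share the mean of the positions they occupy, and that every model has a feasible score on every task. The model names `A`, `B`, `C` and all scores are made-up toy data.

```python
from collections import defaultdict


def average_ranks(scores_by_task, higher_is_better=True):
    """Map {task: {model: best feasible score}} to {model: mean rank}."""
    ranks = defaultdict(list)
    for scores in scores_by_task.values():
        ordered = sorted(
            scores.items(), key=lambda kv: kv[1], reverse=higher_is_better
        )
        i = 0
        while i < len(ordered):
            # Extend j over the run of models tied with position i.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            # Assumption: ties share the mean of 1-based positions i+1..j+1.
            shared = (i + 1 + j + 1) / 2
            for model, _ in ordered[i : j + 1]:
                ranks[model].append(shared)
            i = j + 1
    return {model: sum(r) / len(r) for model, r in ranks.items()}


# Hypothetical toy data, not benchmark results.
toy = {
    "task_1": {"A": 0.9, "B": 0.7, "C": 0.7},
    "task_2": {"A": 10.0, "B": 12.0, "C": 8.0},
}
print(average_ranks(toy))  # {'A': 1.5, 'B': 1.75, 'C': 2.75}
```

Ranks depend only on the ordering within a task, which is why tasks scored in different units can be averaged together.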

Performance profile

A Dolan–Moré performance profile: each curve shows the fraction of tasks on which a model's score falls within a given slack of the best score achieved on that task. A curve that sits higher on the left indicates a model that is stronger and more consistent across the 47 tasks.
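
For readers unfamiliar with the construction, here is a minimal sketch of a performance profile in its classic cost-based form: for each task, a model's ratio is its cost divided by the best cost on that task, and the profile value at slack `tau` is the fraction of tasks whose ratio is at most `tau`. How this page converts task scores into ratios is not stated, so the positive lower-is-better costs, the infinite ratio for missing results, and the toy numbers below are all assumptions.

```python
def performance_profile(costs, taus):
    """costs: {task: {model: positive cost, lower is better}}.

    Returns {model: [fraction of tasks with ratio <= tau for tau in taus]}.
    """
    models = sorted({m for row in costs.values() for m in row})
    ratios = {m: [] for m in models}
    for row in costs.values():
        best = min(row.values())
        for m in models:
            # Assumption: a model with no result on a task never counts
            # as "within slack", so it gets an infinite ratio.
            ratios[m].append(row[m] / best if m in row else float("inf"))
    n_tasks = len(costs)
    return {
        m: [sum(r <= tau for r in ratios[m]) / n_tasks for tau in taus]
        for m in models
    }


# Hypothetical toy data, not benchmark results.
toy_costs = {
    "task_1": {"A": 1.0, "B": 2.0},
    "task_2": {"A": 3.0, "B": 3.0},
    "task_3": {"A": 4.0},  # B has no feasible result here
}
profile = performance_profile(toy_costs, taus=[1.0, 1.5, 2.0])
print(profile["A"])  # [1.0, 1.0, 1.0]
print(profile["B"])  # fractions 1/3, 1/3, 2/3
```

The value at `tau = 1.0` is the share of tasks on which a model is the best (or tied for best), which is why height on the left of the curve signals strength.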

