Across 47 engineering tasks, the headline score is average rank (lower is better)—the same story we tell in the paper. The performance profile is the figure from the manuscript; the colour grid below is a quick, task-by-task snapshot of who leads where. GPT-5.4 sits first at 3.54, ahead of Claude Opus 4.6 (3.63), GLM-5 (4.34), DeepSeek V3.2 (4.76), Gemini 3.1 Pro Preview (5.53), Grok 4.20 (5.82), SEED 2.0 Pro (5.86), and Qwen3 Coder Next (6.71).
Evaluate frontier AI models directly on engineering problems — no extra search frameworks, just raw model capability.
On each task, models are sorted by best feasible score (higher is better; ties split evenly) and assigned a rank. A model's average rank is the mean of its ranks across all tasks, so lower is better. Ranking first lets us compare models without mixing incompatible units (throughput, cost, optics metrics, etc.).
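The ranking rule can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: the function names `task_ranks` and `average_rank` are hypothetical, "ties split evenly" is read here as tied models sharing the mean of the positions they occupy, and every model is assumed to have a feasible score on every task (how infeasible runs are ranked is not specified above).

```python
from typing import Dict, List


def task_ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank models on one task; higher score is better.

    Assumption: tied models share the mean of the 1-based positions
    they occupy (fractional ranking).
    """
    ordered = sorted(scores, key=lambda m: scores[m], reverse=True)
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover the whole group tied with ordered[i].
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # mean of positions i+1 .. j+1
        for model in ordered[i:j + 1]:
            ranks[model] = shared
        i = j + 1
    return ranks


def average_rank(per_task: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean of each model's per-task rank; lower is better."""
    collected: Dict[str, List[float]] = {}
    for scores in per_task:
        for model, rank in task_ranks(scores).items():
            collected.setdefault(model, []).append(rank)
    return {m: sum(rs) / len(rs) for m, rs in collected.items()}


if __name__ == "__main__":
    # Made-up scores for three placeholder models on two tasks.
    demo = [
        {"A": 0.9, "B": 0.9, "C": 0.5},     # A and B tie for first
        {"A": 10.0, "B": 30.0, "C": 20.0},  # different units, same procedure
    ]
    print(average_rank(demo))
```

Because only the ordering within each task matters, the two demo tasks can use completely different units without affecting the result.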
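| # | Model | Avg. Rank |
|---|---|---|
| 1 | GPT-5.4 | 3.54 |
| 2 | Claude Opus 4.6 | 3.63 |
| 3 | GLM-5 | 4.34 |
| 4 | DeepSeek V3.2 | 4.76 |
| 5 | Gemini 3.1 Pro Preview | 5.53 |
| 6 | Grok 4.20 | 5.82 |
| 7 | SEED 2.0 Pro | 5.86 |
| 8 | Qwen3 Coder Next | 6.71 |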
From the paper: a classic Dolan–Moré performance profile. Each curve shows the fraction of tasks on which a model stays within a given slack of the best score achieved on that task. A curve that sits higher on the left means the model is stronger and more consistent across the 47 tasks.
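For readers unfamiliar with performance profiles, the idea can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's exact procedure: the function name `performance_profile` is hypothetical, scores are assumed positive with higher being better, and the slack is expressed as the ratio `best / score` (the original Dolan–Moré formulation uses cost-like metrics, where the ratio is `cost / best_cost`). A missing or non-positive score is treated as never coming within any slack.

```python
from typing import Dict, List


def performance_profile(
    per_task: List[Dict[str, float]], taus: List[float]
) -> Dict[str, List[float]]:
    """For each model, the fraction of tasks with best/score <= tau.

    Assumptions: scores are positive and higher is better; a missing or
    non-positive score gets an infinite ratio.
    """
    models = sorted({m for scores in per_task for m in scores})
    ratios: Dict[str, List[float]] = {m: [] for m in models}
    for scores in per_task:
        best = max(scores.values())  # best score on this task
        for m in models:
            s = scores.get(m)
            ok = s is not None and s > 0
            ratios[m].append(best / s if ok else float("inf"))
    n_tasks = len(per_task)
    return {
        m: [sum(r <= tau for r in ratios[m]) / n_tasks for tau in taus]
        for m in models
    }


if __name__ == "__main__":
    # Made-up scores for three placeholder models on two tasks.
    demo = [
        {"A": 8.0, "B": 8.0, "C": 4.0},
        {"A": 10.0, "B": 30.0, "C": 20.0},
    ]
    for model, curve in performance_profile(demo, [1.0, 1.5, 2.0, 3.0]).items():
        print(model, curve)
```

The value at a ratio of 1.0 is the share of tasks on which a model matches the best score; how quickly the curve climbs toward 1.0 as the slack grows reflects consistency.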
Each column is one task (1–47). Hover a cell for a short readout, or pick a task to read more.