Overall Leaderboard

47 tasks · 12 domains

Across 47 engineering tasks, the headline score is average rank (lower is better), the same metric reported in the paper. The performance profile is the figure from the manuscript; the colour grid below is a quick, task-by-task snapshot of who leads where. GPT-5.4 ranks first at 3.54, ahead of Claude Opus 4.6 (3.63), GLM-5 (4.34), DeepSeek V3.2 (4.76), Gemini 3.1 Pro Preview (5.53), Grok 4.20 (5.82), SEED 2.0 Pro (5.86), and Qwen3 Coder Next (6.71).

Frontier Models

Evaluate frontier AI models directly on engineering problems — no extra search frameworks, just raw model capability.

On each task, models are sorted by best feasible score (higher is better) and assigned a rank, with ties split evenly. Average rank is the mean of those ranks across all tasks; lower is better. Because it uses ranks rather than raw scores, it compares models without mixing incompatible units (throughput, cost, optics metrics, etc.).
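The ranking rule can be sketched in a few lines of Python. This is an illustration, not the benchmark's actual code. It assumes "ties split evenly" means tied models share the mean of the rank positions they occupy, and that every model has a feasible score on every task. How infeasible or missing results are ranked is not stated here, so the sketch does not handle them. The function names and the toy scores are hypothetical.

```python
from collections import defaultdict


def rank_with_ties(scores):
    """Rank models on one task; higher score is better.

    Assumption: tied models share the mean of the rank positions
    they occupy (e.g. two models tied for 1st each get 1.5).
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover the whole run of equal scores starting at i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = shared
        i = j + 1
    return ranks


def average_rank(task_scores):
    """Mean per-task rank for each model across a list of tasks."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for scores in task_scores:
        for model, rank in rank_with_ties(scores).items():
            totals[model] += rank
            counts[model] += 1
    return {model: totals[model] / counts[model] for model in totals}


# Toy data with hypothetical models A, B, C (not leaderboard results).
toy_tasks = [
    {"A": 10.0, "B": 10.0, "C": 5.0},  # A and B tie for first
    {"A": 1.0, "B": 3.0, "C": 2.0},
]
toy_avg = average_rank(toy_tasks)
```

On the toy data, A and B each receive rank 1.5 on the first task, so only the ordering within a task matters and the scale of the underlying scores has no effect on the average.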

# Model Avg. Rank

From the paper: a classic Dolan–Moré view—each curve shows how often a model stays near the best score on the bench when you allow a little slack. Higher on the left means stronger and more consistent across the 47 tasks.

Dolan–Moré performance profile for frontier models on all 47 tasks (paper figure)
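For readers who want to see how such a curve is built, here is a minimal sketch of a Dolan–Moré-style profile. It rests on assumptions that are not confirmed by the page: scores are strictly positive, higher is better, and the slack is measured as the ratio best score / model score, which mirrors the classic cost-based definition. The paper may use a different ratio or handle infeasible runs differently. The function name and toy data are hypothetical.

```python
def performance_profile(task_scores, taus):
    """Fraction of tasks on which each model is within a factor tau of the best.

    Assumptions: scores are positive and higher is better, so the
    per-task ratio is best / score (>= 1, with 1 meaning "is the best").
    A missing or non-positive score is treated as never within any slack.
    """
    models = sorted({m for scores in task_scores for m in scores})
    ratios = {m: [] for m in models}
    for scores in task_scores:
        best = max(scores.values())
        for m in models:
            s = scores.get(m)
            if s is not None and s > 0:
                ratios[m].append(best / s)
            else:
                ratios[m].append(float("inf"))
    n_tasks = len(task_scores)
    return {
        m: [sum(r <= tau for r in ratios[m]) / n_tasks for tau in taus]
        for m in models
    }


# Toy data with hypothetical models A and B (not leaderboard results).
toy_scores = [
    {"A": 10.0, "B": 8.0},
    {"A": 5.0, "B": 10.0},
]
toy_profile = performance_profile(toy_scores, taus=[1.0, 1.25, 2.0])
```

In the toy data both models are best on one task each, so both curves start at 0.5. B reaches 1.0 with a slack factor of 1.25, while A needs a factor of 2.0, which is the sense in which a curve that is higher on the left indicates a more consistent model.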

Each column is one task (1–47).