Aerospace · Quantum Circuits · EDA · Fiber Optics · Battery Control · Robotics · Computational Fluid Dynamics · Chip Design · Portfolio Optimization · Structural Engineering · Job-Shop Scheduling · Chemical Reaction
Abstract
Engineering optimization (the systematic, iterative improvement of feasible solutions under domain-specific
constraints) is a core challenge on which AI systems have not yet been systematically evaluated at scale.
We introduce Frontier-Eng, a large-scale benchmark of 47 real-world engineering
tasks spanning five broad categories: computing and quantum information, operations research
and decision science, robotics and control, optics and communication systems, and physical sciences
and engineering design. Unlike binary pass/fail benchmarks, Frontier-Eng evaluates generative
optimization: agents that iteratively propose code edits, receive feedback from frozen
domain-specific verifiers, and improve under a fixed interaction budget.
We evaluate eight frontier language models and three search frameworks.
GPT-5.4 is the most robust overall, with the lowest
mean within-task rank (3.54), followed by Claude Opus 4.6 (3.63),
GLM-5 (4.34), DeepSeek V3.2 (4.76),
Gemini 3.1 Pro Preview (5.53), Grok 4.20 (5.82),
SEED 2.0 Pro (5.86), and Qwen3 Coder Next (6.71).
Analysis of 500-iteration trajectories reveals a dual power-law structure: improvement
frequency decays as t⁻¹ in the iteration index t, and the magnitude of the k-th improvement decays as k⁻¹,
with both fits achieving R² ≥ 0.83. Under a fixed budget, depth dominates width:
a single deep chain consistently outperforms restarting into multiple shorter chains.
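The headline metric above, mean within-task rank, can be illustrated with a minimal sketch. It assumes that higher scores are better and that tied models share the average of their positions; neither detail is stated here, and the task names, model names, and scores below are hypothetical placeholders rather than benchmark data.

```python
from statistics import mean


def within_task_ranks(scores, higher_is_better=True):
    """Rank models on one task (1 = best); tied models share the average rank.

    The tie rule is an assumption made for this sketch.
    """
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to cover every model tied with the one at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # average of the 1-based positions
        for model in ordered[i:j + 1]:
            ranks[model] = shared
        i = j + 1
    return ranks


def mean_within_task_rank(results, higher_is_better=True):
    """results maps task -> {model -> score}; returns model -> mean rank."""
    per_model = {}
    for task_scores in results.values():
        ranks = within_task_ranks(task_scores, higher_is_better)
        for model, rank in ranks.items():
            per_model.setdefault(model, []).append(rank)
    return {model: mean(r) for model, r in per_model.items()}


# Hypothetical scores for three placeholder models on three placeholder tasks.
toy_results = {
    "task_a": {"model_x": 0.90, "model_y": 0.80, "model_z": 0.70},
    "task_b": {"model_x": 0.50, "model_y": 0.60, "model_z": 0.60},
    "task_c": {"model_x": 0.75, "model_y": 0.40, "model_z": 0.90},
}
toy_mean_ranks = mean_within_task_rank(toy_results)
```

Ranking within each task before averaging keeps tasks with very different score scales from dominating the aggregate, which is presumably why a rank-based summary is used.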
Key Findings
freq ∝ t⁻¹
Improvement Frequency Decays as 1/t
When GPT-OSS-120B is run on all 47 tasks for 500 iterations,
improvement events become rarer over time, following a power law:
the majority occur within the first ~30 steps,
with a long tail extending to iteration 500.
The ∝ t⁻¹ fit achieves R² = 0.84, suggesting
a universal diminishing-return structure across engineering domains.
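One common way to check a power-law claim like this is a least-squares line fit in log-log space. The sketch below is illustrative only: the fitting procedure actually used is not described here (for instance, whether the exponent was fixed at −1 or estimated), the function name `fit_power_law` is hypothetical, and the data are synthetic.

```python
import math


def fit_power_law(xs, ys):
    """Fit y ≈ c * x**alpha by ordinary least squares on (log x, log y).

    Returns (alpha, c, r_squared), where r_squared is computed in log space.
    All xs and ys must be strictly positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    alpha = sxy / sxx
    intercept = mean_y - alpha * mean_x
    ss_res = sum((b - (intercept + alpha * a)) ** 2 for a, b in zip(log_x, log_y))
    ss_tot = sum((b - mean_y) ** 2 for b in log_y)
    r_squared = 1.0 - ss_res / ss_tot
    return alpha, math.exp(intercept), r_squared


# Synthetic, noise-free data that follows 3 / t exactly (not benchmark data).
iterations = list(range(1, 51))
frequencies = [3.0 / t for t in iterations]
alpha, prefactor, r_squared = fit_power_law(iterations, frequencies)
```

On real trajectories the improvement counts would first need to be binned by iteration, and empty bins handled, before taking logarithms.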
mag ∝ k⁻¹
Improvement Magnitude Decays as 1/k
The magnitude of the k-th improvement within each task's trajectory
obeys the same power-law form: the first improvement is a large structural rewrite,
while each subsequent one is a smaller incremental refinement.
The ∝ k⁻¹ fit achieves R² = 0.83.
Combined with the falling improvement frequency, this forms a double squeeze that drives marginal returns to near zero after ~50–100 iterations.
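The double squeeze can be made explicit with a back-of-the-envelope derivation. It assumes both fitted laws hold exactly, that magnitude depends only on the improvement index k, and that sums can be replaced by integrals; the constants a and b are placeholders, not fitted values.

```latex
\begin{align*}
p(t) &\approx \frac{a}{t}, \qquad m(k) \approx \frac{b}{k}
  && \text{(the two fitted laws; $a$, $b$ are assumed constants)} \\
K(t) &\approx \int_{1}^{t} \frac{a}{s}\,ds = a \ln t
  && \text{(expected number of improvements by iteration $t$)} \\
g(t) &\approx p(t)\, m\bigl(K(t)\bigr) = \frac{a}{t}\cdot\frac{b}{a \ln t} = \frac{b}{t \ln t}
  && \text{(expected gain at iteration $t$)} \\
G(t) &\approx \int \frac{b}{s \ln s}\,ds = b \ln \ln t + \text{const}
  && \text{(cumulative gain)}
\end{align*}
```

Under these assumptions the expected gain per iteration falls faster than 1/t, and the cumulative gain grows only like ln ln t, which is qualitatively consistent with trajectories that flatten early.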
depth > width
Depth Dominates Width at Fixed Budget
With the total budget fixed at B = n × d (n independent chains, each d iterations deep) and n varied over {1, 2, 4, 8, 16}
on a 10-task subset, the normalized score along the equal-budget diagonal
never increases with n:
1.00, 0.99, 0.99, 0.97, 0.91 for n = 1, 2, 4, 8, 16.
A single deep chain consistently outperforms spreading the budget across restarts.
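The equal-budget comparison can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's harness: aggregating n chains by taking the best final score is an assumption, `toy_chain` is a hypothetical stand-in built to have diminishing returns, and the budget of 16 is chosen only to keep the example small.

```python
def equal_budget_configs(total_budget, chain_counts=(1, 2, 4, 8, 16)):
    """Return (n, d) pairs with n * d == total_budget."""
    return [(n, total_budget // n) for n in chain_counts if total_budget % n == 0]


def best_of_restarts(run_chain, n, d):
    """Run n independent chains of depth d and keep the best final score."""
    return max(run_chain(depth=d, seed=seed) for seed in range(n))


def toy_chain(depth, seed):
    """Hypothetical chain: harmonic-sum progress plus a small per-seed offset.

    This toy is constructed to show diminishing returns with depth; it does
    not model any real task or reproduce the reported scores.
    """
    return sum(1.0 / k for k in range(1, depth + 1)) + 0.05 * seed


configs = equal_budget_configs(16)
raw_scores = [best_of_restarts(toy_chain, n, d) for n, d in configs]
# Normalize by the single-chain score so that n = 1 maps to 1.0.
normalized = [score / raw_scores[0] for score in raw_scores]
```

Replacing `toy_chain` with a function that runs one real optimization chain for `depth` iterations would turn this into the comparison described above, with every configuration spending the same total number of iterations.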
News
2026-04-15
v1 Release
It’s live. Forty-seven real engineering tasks across twelve domains: your launchpad for pushing frontier models and search frameworks to the limit on the Leaderboard.