Open benchmark

Frontier-Engineering Bench

A Large-Scale Benchmark for Evaluating AI Agents on
Generative Optimization of Real-World Engineering Tasks

Navers lab · Einsia AI

2026

Abstract
Engineering optimization (the systematic, iterative improvement of feasible solutions under domain-specific constraints) is a core challenge on which AI has not yet been systematically evaluated at scale. We introduce Frontier-Eng, a large-scale benchmark of 47 real-world engineering tasks spanning five broad categories: computing and quantum information, operations research and decision science, robotics and control, optics and communication systems, and physical sciences and engineering design. Unlike binary pass/fail benchmarks, Frontier-Eng evaluates generative optimization: agents iteratively propose code edits, receive feedback from frozen domain-specific verifiers, and improve under a fixed interaction budget. We evaluate eight frontier language models and three search frameworks. GPT-5.4 is the most robust overall, with the lowest mean within-task rank (3.54), followed by Claude Opus 4.6 (3.63), GLM-5 (4.34), DeepSeek V3.2 (4.76), Gemini 3.1 Pro Preview (5.53), Grok 4.20 (5.82), SEED 2.0 Pro (5.86), and Qwen3 Coder Next (6.71). Analysis of 500-iteration trajectories reveals a dual power-law structure: improvement frequency decays as t⁻¹ in the iteration index t, and per-improvement magnitude decays as k⁻¹ in the improvement rank k, with both fits achieving R² > 0.83. Under a fixed budget, depth dominates width: a single deep chain consistently outperforms restarting into multiple shorter chains.
Overview
Frontier-Eng method overview: agent, task code, and frozen domain verifiers in an iterative optimization loop
A generative agent iteratively edits task code under a capped interaction budget; each step is compiled or executed by a frozen, read-only domain verifier (numerical kernels, physics or FEM backends, cryptographic checks, emulators, etc.) that returns objectives and constraint signals. The leaderboard on this site aggregates only verifier outputs across 47 real engineering tasks, with no judge model in the scoring loop.
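The loop described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the names `optimize`, `propose`, and `verify` are hypothetical, candidates are assumed to be plain strings, higher scores are assumed to be better, and the initial verification is assumed not to count against the budget.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (not from the benchmark): the verifier maps a
# candidate to (feasible, score); the proposer maps the current best
# candidate and its score to a new candidate.
Verifier = Callable[[str], Tuple[bool, float]]
Proposer = Callable[[str, Optional[float]], str]


def optimize(initial: str, propose: Proposer, verify: Verifier, budget: int):
    """Run one chain of `budget` propose/verify steps, keeping the best feasible candidate."""
    best_code: str = initial
    best_score: Optional[float] = None
    feasible, score = verify(initial)
    if feasible:
        best_score = score
    history: List[Optional[float]] = [best_score]
    for _ in range(budget):
        candidate = propose(best_code, best_score)
        feasible, score = verify(candidate)  # the verifier is never modified
        if feasible and (best_score is None or score > best_score):
            best_code, best_score = candidate, score
        history.append(best_score)
    return best_code, best_score, history


# Toy stand-ins: the "code" is a number written as text, the constraint is
# 0 <= x <= 10, and the objective peaks at x = 3.
def toy_verify(code: str) -> Tuple[bool, float]:
    x = float(code)
    return (0.0 <= x <= 10.0, -(x - 3.0) ** 2)


rng = random.Random(0)


def toy_propose(code: str, score: Optional[float]) -> str:
    return repr(float(code) + rng.uniform(-1.0, 1.0))


best_code, best_score, history = optimize("8.0", toy_propose, toy_verify, budget=50)
```

In this sketch the score history can never decrease, because a candidate replaces the incumbent only when it is feasible and strictly better.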

Key Findings
freq ∝ t⁻¹

Improvement Frequency Decays as 1/t

When GPT-OSS-120B is run on all 47 tasks for 500 iterations, improvement events become rarer over time, following a power law: the majority occur within the first ~30 steps, with a long tail extending to iteration 500. The t⁻¹ fit achieves R² = 0.84, suggesting a universal diminishing-return structure across engineering domains.

Histogram of improvement events by iteration showing 1/t decay (paper Fig., left panel)
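One common way to check a power-law claim of this kind is a least-squares line fit in log-log space, where a t⁻¹ law shows up as a slope near −1. The sketch below is an assumption about methodology, not the paper's fitting procedure: the function name `fit_power_law` is hypothetical, R² is computed in log space, and the data are synthetic, generated from an exact 1/t law purely to exercise the code.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = scale * x**exponent by ordinary least squares on (log x, log y).

    Returns (exponent, scale, r_squared), with R^2 measured in log space.
    All xs and ys must be strictly positive, so empty histogram bins have
    to be dropped or merged before calling this.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    exponent = sxy / sxx
    intercept = mean_y - exponent * mean_x
    ss_res = sum((v - (intercept + exponent * u)) ** 2 for u, v in zip(lx, ly))
    ss_tot = sum((v - mean_y) ** 2 for v in ly)
    return exponent, math.exp(intercept), 1.0 - ss_res / ss_tot


# Synthetic data only: an exact 100 / t curve over 500 iterations.
iterations = list(range(1, 501))
synthetic_counts = [100.0 / t for t in iterations]
exponent, scale, r_squared = fit_power_law(iterations, synthetic_counts)
```

On real trajectories the counts are noisy and many late bins are empty, so the bin width and the treatment of zero counts would both affect the reported R².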
mag ∝ k⁻¹

Improvement Magnitude Decays as 1/k

The magnitude of the k-th improvement within each task's trajectory follows the same power-law form: the first improvement is typically a large structural rewrite, while each subsequent one is a smaller incremental refinement. The k⁻¹ fit achieves R² = 0.83. Together with the decay in improvement frequency, this forms a double squeeze that drives marginal returns to near zero after ~50–100 iterations.

Median normalized improvement magnitude vs. improvement rank k (paper Fig., right panel)
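A back-of-the-envelope reading of the double squeeze, under assumptions the source does not state: treat the frequency law as a per-iteration improvement probability p(t) ≈ c/t and the magnitude law as Δ_k ≈ a/k, where c and a are hypothetical constants. Then the expected number of improvements K(T) and the cumulative gain G(T) after T iterations behave roughly as follows.

```latex
\begin{align*}
p(t) &\approx \frac{c}{t}, & \Delta_k &\approx \frac{a}{k}, \\
K(T) &= \sum_{t=1}^{T} p(t) \approx c \ln T, \\
G(T) &= \sum_{k=1}^{K(T)} \Delta_k \approx a \ln K(T) \approx a \ln\!\left(c \ln T\right), \\
\frac{dG}{dT} &\approx \frac{a}{T \ln T}.
\end{align*}
```

Under these assumptions the cumulative gain grows only doubly logarithmically in T, and the marginal gain per iteration falls slightly faster than 1/T, which is consistent with returns flattening early in a 500-iteration run.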
depth > width

Depth Dominates Width at Fixed Budget

We fix the total budget B = n × d, where n is the number of independent chains and d is the depth of each, and vary n ∈ {1, 2, 4, 8, 16} on a 10-task subset. Along the equal-budget diagonal, the normalized score decreases monotonically with n: 1.00, 0.99, 0.99, 0.97, and 0.91 for n = 1, 2, 4, 8, and 16. A single deep chain consistently outperforms spreading the budget across restarts.

Normalized best-of-n score across width n and depth d under fixed budget (paper Fig.)
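The equal-budget comparison can be sketched as follows. This is an illustrative protocol only: the total budget of 64 is a hypothetical value (the source does not state B), `equal_budget_splits`, `best_of_n`, and `toy_chain` are made-up names, and the toy chain is a placeholder hill-climb that is not expected to reproduce the reported scores.

```python
import random


def equal_budget_splits(total_budget, widths=(1, 2, 4, 8, 16)):
    """Return (n, d) pairs with n * d == total_budget for each width n that divides it."""
    return [(n, total_budget // n) for n in widths if total_budget % n == 0]


def best_of_n(run_chain, n, d, base_seed=0):
    """Run n independent chains of depth d and keep the best final score."""
    return max(run_chain(depth=d, seed=base_seed + i) for i in range(n))


def toy_chain(depth, seed):
    """Placeholder chain: a seeded random hill-climb toward x = 3, starting at x = 0."""
    rng = random.Random(seed)
    x, best = 0.0, -9.0  # initial score is -(0 - 3)^2
    for _ in range(depth):
        candidate = x + rng.uniform(-0.5, 0.5)
        score = -(candidate - 3.0) ** 2
        if score > best:
            x, best = candidate, score
    return best


TOTAL_BUDGET = 64  # hypothetical; chosen so every width divides it evenly
splits = equal_budget_splits(TOTAL_BUDGET)
results = {(n, d): best_of_n(toy_chain, n, d) for n, d in splits}
```

Each entry of `results` spends the same number of verifier calls, so differences between entries isolate how the budget is split between restarts and depth.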

Motivating Example
Battery fast-charging optimization trajectory
Battery Fast-Charging (EnergyStorage). Score trajectory for the BatteryFastChargingProfile task. The agent discovers a multi-stage current profile navigating the trade-off between charging speed, thermal safety, and battery longevity.
MallocLab kernel optimization trajectory
MallocLab (ComputerSystems). Score trajectory for memory allocator optimization, illustrating iterative improvement in low-level systems code under a fixed evaluation budget.

Model Leaderboard

Full Leaderboard →
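#   Model                    Avg. Rank
1   GPT-5.4                  3.54
2   Claude Opus 4.6          3.63
3   GLM-5                    4.34
4   DeepSeek V3.2            4.76
5   Gemini 3.1 Pro Preview   5.53
6   Grok 4.20                5.82
7   SEED 2.0 Pro             5.86
8   Qwen3 Coder Next         6.71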
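For readers unfamiliar with the metric, a mean within-task rank can be computed as sketched below. This is one plausible reading, not the site's scoring code: the function name and the model and task names are hypothetical, higher scores are assumed to be better, and ties are assumed to share the average of the ranks they span.

```python
from collections import defaultdict


def mean_within_task_rank(scores):
    """scores: {task: {model: score}}, higher is better.

    Ranks models within each task (1 = best, ties share the average rank),
    then averages each model's rank over the tasks it appears in.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for by_model in scores.values():
        ordered = sorted(by_model.items(), key=lambda kv: -kv[1])
        i = 0
        while i < len(ordered):
            j = i
            # Extend j over the run of models tied with position i.
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            shared_rank = ((i + 1) + (j + 1)) / 2
            for model, _ in ordered[i:j + 1]:
                totals[model] += shared_rank
                counts[model] += 1
            i = j + 1
    return {model: totals[model] / counts[model] for model in totals}


# Made-up scores for three hypothetical models on two hypothetical tasks.
toy_scores = {
    "task_a": {"model_x": 0.9, "model_y": 0.7, "model_z": 0.7},
    "task_b": {"model_x": 0.2, "model_y": 0.5, "model_z": 0.1},
}
ranks = mean_within_task_rank(toy_scores)
```

Because ranks are taken within each task, the metric is insensitive to the differing score scales of individual verifiers.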


News
2026-04-15

v1 Release

It’s live. Forty-seven real engineering tasks across twelve domains—your launchpad to push frontier models and search to the limit on the Leaderboard.