Evaluate frontier AI models directly on engineering problems — no extra search frameworks, just raw model capability.
Average within-task rank across 47 tasks · lower is better
Each cell is one task — hover for score. Explore individual tasks →
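As a rough illustration of the "average within-task rank" metric, the sketch below ranks models within each task and then averages each model's ranks across tasks. The page does not say how ties, score direction, or missing entries are handled. This sketch assumes tied models share the mean of their positions, higher scores are better, and a model is averaged only over the tasks it has a score for. The function name `average_within_task_rank` and the sample data are hypothetical.

```python
from collections import defaultdict


def average_within_task_rank(scores, higher_is_better=True):
    """Map {task: {model: score}} to {model: mean within-task rank}.

    Assumptions (not stated by the leaderboard): ties share the mean
    of their positions, and a model is averaged only over the tasks
    for which it has a score.
    """
    ranks = defaultdict(list)
    for by_model in scores.values():
        # Best score first, so position 0 corresponds to rank 1.
        ordered = sorted(by_model.items(), key=lambda kv: kv[1],
                         reverse=higher_is_better)
        i = 0
        while i < len(ordered):
            # Extend j to cover the run of entries tied with position i.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            # 1-based ranks i+1..j+1 are shared; their mean is (i + j) / 2 + 1.
            tied_rank = (i + j) / 2 + 1
            for model, _ in ordered[i:j + 1]:
                ranks[model].append(tied_rank)
            i = j + 1
    return {model: sum(r) / len(r) for model, r in ranks.items()}


# Hypothetical scores for two tasks and three models.
example = {
    "task_1": {"model_a": 0.9, "model_b": 0.5, "model_c": 0.5},
    "task_2": {"model_a": 0.2, "model_b": 0.8, "model_c": 0.4},
}
print(average_within_task_rank(example))
```

Averaging ranks rather than raw scores keeps tasks with different score scales comparable, at the cost of discarding how large the gap between models is on any single task.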
How much do search and evolution frameworks boost performance on top of the base model? Results split by base model to isolate framework contribution.
All framework entries across all tasks — hover for score.