Overall Leaderboard

47 tasks · 12 domains
Frontier Models

Frontier AI models are evaluated directly on engineering problems, with no extra search frameworks: just raw model capability.

Average within-task rank across 47 tasks · lower is better
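
The ranking metric above can be sketched as follows. This is a minimal illustration, not the leaderboard's actual implementation: it assumes that a higher task score is better, that tied models share the mean of their positions, and that every model is averaged only over the tasks it has a score for. The function names and the example scores are hypothetical.

```python
from collections import defaultdict


def within_task_ranks(scores, higher_is_better=True):
    """Rank models on one task (1 = best). Tied scores share the mean of their positions.

    Assumption: the tie-handling rule is illustrative; the source does not specify one.
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover the run of models tied with ordered[i].
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = shared
        i = j + 1
    return ranks


def average_rank(task_scores, higher_is_better=True):
    """Average each model's within-task rank over all tasks it appears in."""
    per_model = defaultdict(list)
    for scores in task_scores.values():
        for model, rank in within_task_ranks(scores, higher_is_better).items():
            per_model[model].append(rank)
    return {model: sum(r) / len(r) for model, r in per_model.items()}


# Hypothetical scores for two tasks and three models (not real leaderboard data).
example = {
    "task_a": {"model_x": 0.9, "model_y": 0.7, "model_z": 0.7},
    "task_b": {"model_x": 0.4, "model_y": 0.8, "model_z": 0.6},
}
leaderboard = average_rank(example)
```

Because ranks are computed inside each task before averaging, tasks with very different score scales contribute equally to the final number, which is the usual reason for preferring rank averaging over averaging raw scores.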

Each cell is one task; hover for score.

Framework Effects

How much do search and evolution frameworks boost performance on top of the base model? Results are split by base model to isolate each framework's contribution.

Anthropic
Claude Opus 4.6 · Closed-source
Frameworks applied on top of Claude Opus 4.6

OpenAI
GPT-OSS · Open-source weights
Frameworks applied on top of GPT-OSS open-source weights

All framework entries across all tasks — hover for score.