EngDesign
Task 9 / 47
A bundle of seven engineering cases sourced from EngDesign (CY_03, WJ_01, XY_05, AM_02, AM_03, YJ_02, YJ_03), evaluated with Docker-backed scripts and per-case rubrics that cover structural, topology, and related design tasks. The v1 pool counts it as a single benchmark family, even though the cases come from multiple distinct problem providers; select it with `task=engdesign`, as described in the domain README.
Model leaderboard
| # | Participant | Score |
|---|---|---|
| 1 | Grok 4.20 | 100.0 |
| 2 | Gemini 3.1 Pro Preview | 100.0 |
| 3 | SEED 2.0 Pro | 100.0 |
| 4 | GLM-5 | 94.4 |
| 5 | Qwen3 Coder Next | 94.4 |
| 6 | DeepSeek V3.2 | 79.4 |
| 7 | GPT-5.4 | 0.0 |
| 8 | Claude Opus 4.6 | 0.0 |
Framework leaderboard
| # | Participant | Score |
|---|---|---|
| 1 | Claude Opus 4.6 + OpenEvolve | 100.0 |
| 2 | GPT-OSS + OpenEvolve | 53.2 |
| 3 | GPT-OSS + ShinkaiEvolve | 7.1 |
| 4 | Claude Opus 4.6 + ABMCTS | 7.1 |
| 5 | Claude Opus 4.6 + ShinkaiEvolve | 0.4 |
| 6 | GPT-OSS + ABMCTS | 0.0 |
Scores are normalized for this task to a 0–100 scale; higher is better.