Fugu Ultra Benchmarks: SWE, GPQA, LiveCode
Review Sakana AI's reported Fugu Ultra benchmark scores for coding, reasoning, science, and agentic tasks, with methodology and validation notes.
Last updated: 2026-06-24
The Fugu Ultra Orchestrator is evaluated across a variety of industry-standard benchmarks to measure its reasoning, coding, and mathematical capabilities against other frontier AI systems.
Key Performance Indicators
Fugu Ultra dynamically routes each request to specialized expert agents. Sakana AI reports that this design outperforms standard monolithic models on domain-specific tasks, with the scores listed below.
Coding and Software Engineering
- SWE-Bench Pro: 73.7%
- LiveCodeBench (Pass@1): 93.2%
- HumanEval: 93.1%
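For readers unfamiliar with the Pass@1 label on the LiveCodeBench score, the sketch below shows the unbiased pass@k estimator commonly used for code-generation benchmarks. It is an illustration only: this page does not state how Sakana AI computed its figure, and the function names and the sample data are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all tests
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k failing samples, any k draws must include a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: three problems, ten samples each.
sample_results = [(10, 10), (10, 5), (10, 0)]
print(f"pass@1 = {benchmark_pass_at_k(sample_results, k=1):.1%}")
```

With k = 1 the estimator reduces to the fraction of samples that pass, averaged over problems, so a Pass@1 score can be read as the chance that a single generated solution is correct.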
Mathematics and Logic
- MATH: 78.2%
- GSM8K: 95.8%
General Knowledge and Reasoning
- GPQA-Diamond: 95.5%
- MMLU: 86.7%
- HellaSwag: 91.2%
- ARC-Challenge: 94.5%
Methodology
These benchmarks are run with zero-shot and few-shot prompting, and each benchmark is evaluated independently. The orchestration layer adds latency overhead, typically 15-20% compared to querying a single foundation model. On complex tasks, the accuracy gains are reported to outweigh this cost.
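To make the overhead figure concrete, the sketch below applies the 15-20% range to a baseline latency. It assumes the overhead is a simple multiplier on single-model response time, which this page does not confirm. The function names and the 2.0-second baseline are hypothetical.

```python
def orchestrated_latency(baseline_s: float, overhead: float) -> float:
    """Latency after applying a fractional overhead (0.15 means 15%)."""
    if baseline_s < 0 or overhead < 0:
        raise ValueError("baseline and overhead must be non-negative")
    return baseline_s * (1.0 + overhead)


def latency_range(
    baseline_s: float, low: float = 0.15, high: float = 0.20
) -> tuple[float, float]:
    """Lower and upper latency estimates for the stated overhead range."""
    return (
        orchestrated_latency(baseline_s, low),
        orchestrated_latency(baseline_s, high),
    )


# Hypothetical baseline: a single foundation model answering in 2.0 seconds.
low_s, high_s = latency_range(2.0)
print(f"estimated orchestrated latency: {low_s:.2f}s to {high_s:.2f}s")
```

Under these assumptions, a request that takes 2.0 seconds on a single model would take roughly 2.3 to 2.4 seconds through the orchestrator.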
Note: Benchmark numbers are subject to change as the underlying expert models in the Fugu Ultra orchestrator are continuously updated and refined by Sakana AI.