Fugu Ultra Benchmarks: SWE, GPQA, LiveCode

Review Sakana AI's reported Fugu Ultra benchmark scores for coding, reasoning, science, and agentic tasks, with methodology and validation notes.

Last updated: 2026-06-24

The Fugu Ultra Orchestrator is evaluated on industry-standard benchmarks that measure its coding, mathematical, and reasoning capabilities against other frontier AI systems.

Key Performance Indicators

Fugu Ultra dynamically routes each request to specialized expert agents. Sakana AI reports that this routing lets it outperform single monolithic models on domain-specific tasks; the reported scores are grouped by domain here.

Coding and Software Engineering

SWE-Bench Pro: 73.7%
LiveCodeBench (Pass@1): 93.2%
HumanEval: 93.1%
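
This page does not state how the Pass@1 figure was computed. As a rough illustration only, the sketch below implements the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a problem and c is the number that pass all tests. With k = 1 it reduces to c / n. The function names and the sample counts are hypothetical and are not part of any Fugu Ultra or Sakana AI tooling.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: attempt budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: any draw of k must include a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; each entry is (n, c) for one problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: four problems, one sample each, three solved.
print(benchmark_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 1)]))  # 0.75
```

With one sample per problem, Pass@1 is simply the fraction of problems solved on the first attempt, which is how a single percentage such as the one listed above is usually read.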

Mathematics and Logic

MATH: 78.2%
GSM8K: 95.8%

General Knowledge and Reasoning

GPQA-Diamond: 95.5%
MMLU: 86.7%
HellaSwag: 91.2%
ARC-Challenge: 94.5%

Methodology

These benchmarks are run with zero-shot and few-shot prompting and are evaluated independently. The orchestration layer adds a latency overhead of typically 15-20% compared with querying a single foundation model. The accuracy gains on complex tasks are reported to outweigh this cost.
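
To make the 15-20% overhead figure concrete, the sketch below converts a single-model latency into the range expected with orchestration. It assumes the overhead applies multiplicatively to end-to-end latency, which this page does not confirm. The function name and the 2.0 second baseline are hypothetical.

```python
def orchestrated_latency_range(
    baseline_s: float,
    overhead_low: float = 0.15,
    overhead_high: float = 0.20,
) -> tuple[float, float]:
    """Return (low, high) expected latency in seconds with orchestration.

    Assumes the overhead scales the single-model latency multiplicatively.
    """
    if baseline_s < 0:
        raise ValueError("baseline_s must be non-negative")
    if not 0 <= overhead_low <= overhead_high:
        raise ValueError("require 0 <= overhead_low <= overhead_high")
    return baseline_s * (1 + overhead_low), baseline_s * (1 + overhead_high)


# Hypothetical baseline: a single foundation model answers in 2.0 seconds.
low, high = orchestrated_latency_range(2.0)
print(f"Expected orchestrated latency: {low:.2f} s to {high:.2f} s")
```

Whether that extra time is worthwhile depends on the workload, so it is best checked against latency measured on your own requests.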

Note: Benchmark numbers are subject to change as the underlying expert models in the Fugu Ultra orchestrator are continuously updated and refined by Sakana AI.