03 // Benchmarks

Evaluation datasets for frontier AI.

Six benchmark suites, designed by working specialists, used by leading AI labs. Tags below show what each suite is built to stress.

CLI Benchmarks

CLI Benchmarking

Multi-step command-line reasoning tasks with Docker environments and comprehensive test suites.

Tasks are containerised — each comes with a reproducible Docker image, a frozen test suite, and explicit pass/fail criteria. Models are graded both on whether the final state matches expectations and on the trajectory taken to reach it.
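As a rough illustration of grading on both axes, the sketch below scores a run on its final state (did the frozen test suite pass?) and on its trajectory (did the run stay within the rules?). Everything here is an assumption made for illustration: the `TaskRun` fields, the `PROTECTED_PATH` rule, and the `grade_cli_run` name are hypothetical, not the suite's actual schema or grader.

```python
from dataclasses import dataclass, field

# Hypothetical trajectory rule: a run must never touch the frozen test directory.
PROTECTED_PATH = "/task/tests"


@dataclass
class TaskRun:
    """Outcome of one containerised task run (illustrative schema only)."""
    tests_passed: int
    tests_total: int
    commands: list = field(default_factory=list)  # shell commands the model issued


def grade_cli_run(run: TaskRun) -> dict:
    """Grade the final state and the trajectory separately, then combine them."""
    # Final state: every test in the frozen suite must pass.
    final_state_ok = run.tests_total > 0 and run.tests_passed == run.tests_total
    # Trajectory: no command may reference the protected test directory.
    trajectory_ok = not any(PROTECTED_PATH in cmd for cmd in run.commands)
    return {
        "final_state_ok": final_state_ok,
        "trajectory_ok": trajectory_ok,
        "passed": final_state_ok and trajectory_ok,
    }
```

Keeping the two checks separate means a run that reaches the right end state by tampering with the tests is still reported as a failure, with the reason visible.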

Coverage spans data wrangling, environment setup, package management, debugging, and multi-tool orchestration. We add new tasks as new shell-based failure modes appear in the wild.

Multi-Lang
Test Suite
Verifiable

Math Olympiad

Mathematical Reasoning

Competition-level mathematics requiring multi-step reasoning and creative problem-solving.

Authored by olympiad medalists and competition coaches. Problems are narrow and deep: each one targets a specific reasoning capability that smaller models tend to skip past.

Every example ships with the canonical solution, a verifiable final answer, and a notes field explaining what the problem is actually testing.
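A record with those three parts could be sketched as follows. The field names and the exact-rational comparison are assumptions for illustration, not the published format; real answers may need a richer checker (symbolic expressions, sets, intervals).

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class MathExample:
    """Illustrative record layout; field names are hypothetical."""
    problem: str
    canonical_solution: str
    final_answer: str  # assumed here to be an exact rational, e.g. "3/4"
    notes: str         # what the problem is actually testing


def is_correct(example: MathExample, model_answer: str) -> bool:
    """Compare answers as exact rationals so "0.75" and "6/8" both match "3/4"."""
    try:
        return Fraction(model_answer.strip()) == Fraction(example.final_answer)
    except (ValueError, ZeroDivisionError):
        # Unparseable answers (prose, malformed fractions) count as wrong.
        return False
```

Comparing parsed values rather than raw strings avoids marking a correct answer wrong because of formatting alone.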

Mathematics
Verifiable
Reasoning

PhD Reasoning

PhD-Level Reasoning

Verifiable problems requiring PhD-level expertise with deep reasoning across research sources.

Problems are authored by PhDs in the relevant field and draw on multiple research sources, with answers that can be verified against the literature. Designed to defeat surface-level retrieval and lookup heuristics.

Covers domains where the answer is unambiguous if you know the field, and almost unguessable if you don't.

Specialized
Verifiable
Research

Scientific QA

Expert-validated scientific questions spanning physics, chemistry, and biology.

Each question is written by a working scientist and validated by a second independent reviewer. Distractors are designed to look right to a model that's just pattern-matching surface vocabulary.

Tightly scoped across physics, chemistry, and biology — with explicit answer rationales and source citations.

STEM
Expert-validated
Closed-form

Code Generation

Complex programming challenges with test cases, edge cases, and performance benchmarks.

Realistic programming tasks across multiple languages. Each task ships with a public test suite, a hidden test suite, edge cases, and where relevant a performance benchmark.

Authored and reviewed by senior engineers — the same people who would grade the output if it landed in a code review at their day job.

Programming
Multi-Lang
Test Suite

Safety & Alignment

Red-teaming datasets and alignment benchmarks for testing model safety boundaries.

Adversarial prompts that probe the seams of a model's safety policy, paired with the policy-correct response. Built to defeat refusal-pattern shortcuts and surface real failure modes.

Calibrated against helpfulness so the resulting models don't over-refuse benign requests — we measure both axes.
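Measuring both axes might look like the sketch below. It assumes a simplification: each prompt's policy-correct response is reduced to a binary "comply" or "refuse" label, and the model's behaviour has already been classified the same way. The function name and record shape are hypothetical.

```python
def refusal_axes(records):
    """Return over-refusal and under-refusal rates.

    records: iterable of (expected, actual) pairs, each "comply" or "refuse".
    """
    records = list(records)
    benign = [actual for expected, actual in records if expected == "comply"]
    unsafe = [actual for expected, actual in records if expected == "refuse"]
    # Over-refusal: the model refused a request it should have helped with.
    over = sum(a == "refuse" for a in benign) / len(benign) if benign else 0.0
    # Under-refusal: the model complied with a request it should have refused.
    under = sum(a == "comply" for a in unsafe) / len(unsafe) if unsafe else 0.0
    return {"over_refusal_rate": over, "under_refusal_rate": under}
```

Reporting the two rates side by side makes it harder for a model to look safe simply by refusing everything.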

Adversarial
Safety
Red-teaming

07 // Methodology

How every suite is built.

Authored by working specialists

Every example is written by someone who would be qualified to grade it — not a generalist crowd-worker.

Independent two-tier review

Every example is reviewed by an independent second annotator. Disagreements are escalated to a senior reviewer for arbitration.

Verifiable where possible

Wherever the domain allows, examples carry a deterministic grader. Where it doesn't, we ship a calibrated rubric and an LLM-judge ensemble.
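One way to picture that fallback, as a sketch only: use an exact-match check when a reference answer exists, otherwise take a strict majority vote across several judges. The judges here are plain callables standing in for LLM calls, and the signature is an assumption, not an actual grading API.

```python
from typing import Callable, Optional, Sequence

# A judge takes (response, rubric) and returns pass/fail.
# In practice each would wrap an LLM call; here they are stand-ins.
Judge = Callable[[str, str], bool]


def grade_example(
    response: str,
    *,
    expected: Optional[str] = None,
    rubric: str = "",
    judges: Sequence[Judge] = (),
) -> bool:
    """Prefer a deterministic check; fall back to a judge ensemble."""
    if expected is not None:
        # Deterministic path: exact match after trimming whitespace.
        return response.strip() == expected.strip()
    if not judges:
        raise ValueError("no reference answer and no judges supplied")
    votes = [judge(response, rubric) for judge in judges]
    # Strict majority: a tie counts as a fail.
    return sum(votes) * 2 > len(votes)
```

Treating ties as failures is a conservative choice; an odd number of judges avoids the question entirely.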

Versioned and dated

Suites carry version stamps so your team can compare model runs against the same frozen evaluation across time.
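A small guard like the following shows why the stamp matters: it refuses to compare two runs unless they were scored against the same suite and version. The `Run` fields and the version string format are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One scored model run (illustrative fields only)."""
    model: str
    suite: str
    suite_version: str  # hypothetical stamp format, e.g. "1.2.0"
    score: float


def score_delta(baseline: Run, candidate: Run) -> float:
    """Return candidate minus baseline, but only for the same frozen suite."""
    same_eval = (baseline.suite, baseline.suite_version) == (
        candidate.suite,
        candidate.suite_version,
    )
    if not same_eval:
        raise ValueError("runs were scored against different suite versions")
    return candidate.score - baseline.score
```

Failing loudly on a version mismatch keeps a score change caused by a changed evaluation from being read as a change in the model.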

08 // Custom

Need something custom?

We build closed benchmarks tailored to the specific failure modes your team is working to close. Tell us about the model and the gap.

Start a custom benchmark