Evaluation datasets for frontier AI.
Six benchmark suites, designed by working specialists, used by leading AI labs. Tags below show what each suite is built to stress.
CLI Benchmarking
Multi-step command-line reasoning tasks with Docker environments and comprehensive test suites.
Tasks are containerised — each comes with a reproducible Docker image, a frozen test suite, and explicit pass/fail criteria. Models are graded both on whether the final state matches expectations and on the trajectory taken to reach it.
Coverage spans data wrangling, environment setup, package management, debugging, and multi-tool orchestration. We add new tasks as new shell-based failure modes appear in the wild.
- Multi-Lang
- Test Suite
- Verifiable
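A minimal sketch of the two-part grading described above, assuming a run can be summarised as a final-state snapshot plus the list of commands the model issued. The names `Grade`, `grade_run`, and `flagged_prefixes` are hypothetical, and the trajectory check shown here (the share of commands that do not start with a flagged prefix) is only a stand-in for whatever trajectory criteria a real task defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grade:
    final_state_ok: bool     # did the environment end in the expected state?
    trajectory_score: float  # share of commands that were not flagged


def grade_run(
    final_state: dict[str, str],
    expected_state: dict[str, str],
    commands: list[str],
    flagged_prefixes: tuple[str, ...] = ("rm -rf /", "sudo "),  # illustrative only
) -> Grade:
    """Grade a run on both its final state and the path taken to reach it."""
    # Final-state check: every expected key must be present with the expected value.
    final_ok = all(final_state.get(k) == v for k, v in expected_state.items())

    # Trajectory check: an empty trajectory earns no trajectory credit.
    if not commands:
        return Grade(final_ok, 0.0)
    clean = sum(1 for c in commands if not c.strip().startswith(flagged_prefixes))
    return Grade(final_ok, clean / len(commands))
```

Keeping the two results separate, rather than folding them into one number, lets a report distinguish a model that reached the right state by a questionable route from one that failed outright.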
Mathematical Reasoning
Competition-level mathematics requiring multi-step reasoning and creative problem-solving.
Authored by olympiad medalists and competition coaches. Problems go deep rather than broad: each one targets a single reasoning capability that smaller models tend to skip past.
Every example ships with the canonical solution, a verifiable final answer, and a notes field explaining what the problem is actually testing.
- Mathematics
- Verifiable
- Reasoning
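As an illustration of what such a record might look like, here is a sketch with one field per item listed above. The class name, field names, and the `check_answer` helper are assumptions, not the suite's actual schema. The checker assumes final answers are rational numbers written as strings and falls back to exact string comparison for anything else.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class MathExample:
    problem: str
    canonical_solution: str  # full worked solution
    final_answer: str        # the verifiable answer, e.g. "1/2"
    notes: str               # what the problem is actually testing


def check_answer(example: MathExample, submitted: str) -> bool:
    """Compare a submitted answer with the reference answer."""
    try:
        # Exact rational comparison, so "2/4" and "0.5" both match "1/2".
        return Fraction(submitted.strip()) == Fraction(example.final_answer.strip())
    except (ValueError, ZeroDivisionError):
        # Not parseable as a rational number: fall back to exact string match.
        return submitted.strip() == example.final_answer.strip()
```

Exact rational arithmetic avoids the false negatives a floating-point comparison would produce for equivalent forms of the same answer.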
PhD-Level Reasoning
Verifiable problems requiring PhD-level expertise with deep reasoning across research sources.
PhD holders author problems in their own fields, drawing on multiple research sources, with answers that can be verified against the literature. Designed to defeat surface-level retrieval and lookup heuristics.
Covers domains where the answer is unambiguous if you know the field, and almost impossible to guess if you don't.
- Specialized
- Verifiable
- Research
Scientific QA
Expert-validated scientific questions spanning physics, chemistry, and biology.
Each question is written by a working scientist and validated by a second independent reviewer. Distractors are designed to look right to a model that's just pattern-matching surface vocabulary.
Tightly scoped across physics, chemistry, and biology — with explicit answer rationales and source citations.
- STEM
- Expert-validated
- Closed-form
Code Generation
Complex programming challenges with test cases, edge cases, and performance benchmarks.
Realistic programming tasks across multiple languages. Each task ships with a public test suite, a hidden test suite, edge cases, and where relevant a performance benchmark.
Authored and reviewed by senior engineers — the same people who would grade the output if it landed in a code review at their day job.
- Programming
- Multi-Lang
- Test Suite
Safety & Alignment
Red-teaming datasets and alignment benchmarks for testing model safety boundaries.
Adversarial prompts that probe the seams of a model's safety policy, paired with the policy-correct response. Built to defeat refusal-pattern shortcuts and surface real failure modes.
Calibrated against helpfulness so that models optimized against the suite don't learn to over-refuse benign requests. We measure both axes.
- Adversarial
- Safety
- Red-teaming
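Measuring both axes can be sketched as two separate rates computed over one result set. Everything here is an assumption made for illustration: the `SafetyResult` fields, the "adversarial" and "benign" labels, and the idea that a benign prompt counts against the model only when it is refused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyResult:
    kind: str             # "adversarial" or "benign" (hypothetical labels)
    policy_correct: bool  # response matched the paired policy-correct response
    refused: bool         # model declined to answer


def two_axis_report(results: list[SafetyResult]) -> dict[str, float]:
    """Report safety and over-refusal as two independent rates."""
    adversarial = [r for r in results if r.kind == "adversarial"]
    benign = [r for r in results if r.kind == "benign"]

    def rate(hits: int, total: int) -> float:
        return hits / total if total else 0.0

    return {
        # Safety axis: how often adversarial prompts got the policy-correct response.
        "policy_correct_rate": rate(
            sum(r.policy_correct for r in adversarial), len(adversarial)
        ),
        # Helpfulness axis: how often benign prompts were refused anyway.
        "over_refusal_rate": rate(sum(r.refused for r in benign), len(benign)),
    }
```

Reporting the rates side by side makes it visible when a gain on one axis was bought with a loss on the other.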
How every suite is built.
Authored by working specialists
Every example is written by someone who would be qualified to grade it — not a generalist crowd-worker.
Independent two-tier review
Every example is reviewed by an independent second annotator. Disagreements are escalated to a senior reviewer for arbitration.
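The review flow can be sketched as a single function, assuming each annotator produces one label per example. `resolve_label` and the `arbitrate` callback are hypothetical names; the callback stands in for the senior reviewer's decision.

```python
from typing import Callable


def resolve_label(
    first: str,
    second: str,
    arbitrate: Callable[[str, str], str],
) -> str:
    """Return the agreed label, or the senior reviewer's ruling on a disagreement."""
    if first == second:
        # Both annotators agree, so no arbitration is needed.
        return first
    # Disagreement: hand both labels to the senior reviewer.
    return arbitrate(first, second)
```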
Verifiable where possible
Wherever the domain allows, examples carry a deterministic grader. Where it doesn't, we ship a calibrated rubric and an LLM-judge ensemble.
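One way to picture the split between the two grading paths, as a sketch under stated assumptions: an example with a reference answer is graded by exact match, and an example without one is scored by the median of several judges. The `Judge` signature, the `grade` function, and the choice of median as the aggregation rule are all hypothetical; no real LLM API is called here.

```python
from statistics import median
from typing import Callable, Optional, Sequence

# Hypothetical judge interface: (response, rubric) -> score in [0, 1].
Judge = Callable[[str, str], float]


def grade(
    response: str,
    reference: Optional[str],
    rubric: str,
    judges: Sequence[Judge],
) -> float:
    """Use a deterministic check when a reference exists, else a judge ensemble."""
    if reference is not None:
        # Deterministic path: exact match after trimming whitespace.
        return 1.0 if response.strip() == reference.strip() else 0.0
    if not judges:
        raise ValueError("no reference answer and no judges supplied")
    # Rubric path: the median damps the effect of a single outlier judge.
    return median([judge(response, rubric) for judge in judges])
```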
Versioned and dated
Suites carry version stamps so your team can compare model runs against the same frozen evaluation across time.
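A short sketch of how a version stamp can guard a comparison, assuming each run is stored with the suite name and version it was scored against. `RunRecord`, `score_delta`, and the stamp format are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    model: str
    suite: str
    suite_version: str  # hypothetical stamp format, e.g. "v1"
    score: float


def score_delta(baseline: RunRecord, candidate: RunRecord) -> float:
    """Return candidate minus baseline, refusing to compare across versions."""
    same_eval = (baseline.suite, baseline.suite_version) == (
        candidate.suite,
        candidate.suite_version,
    )
    if not same_eval:
        raise ValueError("runs were scored on different suites or suite versions")
    return candidate.score - baseline.score
```

Failing loudly on a version mismatch keeps a change in the benchmark from being misread as a change in the model.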
Need something custom?
We build closed benchmarks tailored to the specific failure modes your team is working to eliminate. Tell us about the model and the gap.