PostTrain Arena

The open arena for post‑training

Contribute an RL environment. We post-train a model on it and score what generalizes across eight domains.

The idea

Most competitions fix the environment and ask you to submit an agent. We invert that contract: you contribute environments; we post-train a model on them and evaluate what holds up everywhere else.

Frontier labs post-train on more than a million RL environments. The open community has roughly a thousand — at far lower quality, and largely confined to coding. PostTrain Arena closes that gap across eight under-served domains.

  1. Contribute. A self-contained task package: task.md plus an environment, verifier, and oracle.
  2. Post-train. We run a Qwen3-8B model through a managed pipeline on your environment — no GPUs or API keys from you.
  3. Score. We evaluate the trained model on BenchFlow Signals, a private held-out suite where no single domain exceeds 20% of the tasks.
  4. Rank. By held-out generalization over a baseline, on tasks your environment can't have memorized.
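The 20% cap in step 3 can be illustrated with a short check. This is a minimal sketch, not the organizers' tooling: the function names and the representation of a suite as a list of domain labels are assumptions made for illustration.

```python
from collections import Counter


def max_domain_share(task_domains):
    """Return (domain, share) for the most common domain in a suite.

    `task_domains` holds one domain label per task (a hypothetical format).
    """
    if not task_domains:
        raise ValueError("suite has no tasks")
    counts = Counter(task_domains)
    domain, count = counts.most_common(1)[0]
    return domain, count / len(task_domains)


def respects_cap(task_domains, cap=0.20):
    """True when no single domain exceeds `cap` of the tasks."""
    _, share = max_domain_share(task_domains)
    return share <= cap


# Eight placeholder domains with five tasks each: every share is 12.5%.
balanced = [d for d in "abcdefgh" for _ in range(5)]
# A suite dominated by one domain: 7 of 10 tasks.
skewed = ["a"] * 7 + ["b"] * 3
```

Here respects_cap(balanced) holds and respects_cap(skewed) does not. The cap matters for the ranking because a corpus that only helps in one domain can move at most a fifth of the held-out tasks.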

Organizers

Built by the team behind SkillsBench — the most-cited new agentic, diverse-domain benchmark of 2026 — and the CAIS Agent Skills workshop.

SkillsBench drew roughly 100 citations in four months; the workshop drew 103 submissions. Co-organizer Kyoung Whan Choe authored PufferLib, the most-used non-LLM RL library. A 1,100-member community is already building in the open.

Organizer Dawn Song co-directs Berkeley RDI, whose AgentBeats hub runs live, reproducible agent-benchmark leaderboards — its benchmarks are surfaced in the catalog.

Submit

A submission is a self-contained task package: a task.md describing the goal and sandbox limits, an environment Docker image that runs it, a verifier that scores attempts, and an oracle that proves the task is solvable. Everything accepted is released openly.

What a submission contains

task.md
YAML frontmatter for limits and metadata; Markdown body for the prompt, with optional multi-scene / multi-role structure.
environment/
Sealed Dockerfile and any seed data. Built on a shared base image so a task only adds what is task-specific.
verifier/
Scoring logic that runs at the end of every trial and emits a numeric reward plus side info.
oracle/
A reference solution that achieves a passing reward, so reviewers can confirm the task is solvable.
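A quick pre-flight check of that layout might look like the sketch below. It is not scripts/check_submission.py and does not reproduce the validation contract. The helper names are hypothetical, and the use of `---` lines to delimit the YAML frontmatter is an assumption based on common convention; the spec is the authority on both.

```python
from pathlib import Path

# The three directories named above; task.md is checked separately.
REQUIRED_DIRS = ("environment", "verifier", "oracle")


def missing_parts(package_dir):
    """List the required pieces that are absent from a task package."""
    root = Path(package_dir)
    missing = []
    if not (root / "task.md").is_file():
        missing.append("task.md")
    for name in REQUIRED_DIRS:
        if not (root / name).is_dir():
            missing.append(name + "/")
    return missing


def split_frontmatter(text):
    """Split task.md text into (frontmatter, body).

    Assumes the frontmatter sits between two '---' lines at the top of
    the file. Returns ("", text) when no such block is found.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", text
```

An empty list from missing_parts means the four top-level pieces exist; it says nothing about whether the environment builds or the oracle passes.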

How it works

  1. Read the spec. The full task.md reference covers the frontmatter schema, body sections, and the validation contract.
  2. Validate locally. Run scripts/run_local.sh <env> until the environment builds and your oracle scores 1.0. It needs only python3 and docker; there is nothing else to install.
  3. Open a pull request. Submit your team entry against the posttrainarena repo. We review, run the managed pipeline on your corpus, and score the checkpoint on BenchFlow Signals.

Eight under-served domains

  • Sciences
  • Industrial & Energy Operations
  • Cybersecurity
  • Finance & Economics
  • Office & Knowledge Work
  • Media & Multimodal Content
  • AI/ML & Agentic Systems
  • Software Engineering

Timeline

  1. Phase 0: Warm-up, late Jun 2026
  2. Phase 1: Full submissions, Sep 1–Oct 21
  3. Awards: Winners announced, Nov 7
  4. Workshop: NeurIPS showcase, Dec 2026

Questions

Who can participate?

Anyone — researchers, engineers, students, and hobbyists. There's no affiliation requirement, and you can enter solo or as a team.

Do I need GPUs or API keys?

No. You contribute the environment; we run the post-training and evaluation on our compute.

What exactly do I submit?

A task package: a task.md (YAML frontmatter + Markdown prompt), an environment/ Docker build, a verifier/ that scores attempts, and an oracle/ that proves the task is solvable. The full schema is in the spec.
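To make the verifier and oracle roles concrete, here is a toy sketch. The task (sorting a list), the function names, and the shape of the returned dictionary are all assumptions for illustration; the real interface is defined in the spec.

```python
def verify(expected, attempt):
    """Toy verifier: exact match earns reward 1.0, anything else 0.0.

    Returns a numeric reward plus side info, in a hypothetical format.
    """
    matched = attempt == expected
    return {"reward": 1.0 if matched else 0.0, "info": {"matched": matched}}


def oracle(numbers):
    """Toy reference solution for a hypothetical 'sort these numbers' task."""
    return sorted(numbers)


task_input = [3, 1, 2]
# The oracle's answer should earn a passing reward from the verifier.
result = verify(expected=[1, 2, 3], attempt=oracle(task_input))
```

The pairing is the point: if the oracle cannot earn a passing reward from the verifier, reviewers have no evidence that the task is solvable.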

How do I submit?

Open a pull request against the posttrainarena repo with your team entry under submissions/. Submissions are scoped to teams: each entry is a corpus of environments, not a single task. There's no submission form; the PR is the submission. Run scripts/check_submission.py and scripts/run_local.sh locally first so the review loop is short.

How is my submission scored?

We run the managed SFT→GRPO pipeline on your team's environments and evaluate the resulting Qwen3-8B checkpoint on BenchFlow Signals, a private held-out suite. Your score is the delta over a fixed reference baseline trained with the identical recipe — on tasks your environments can't have memorized.
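The delta can be sketched as follows. The source states only that the score is the difference from the baseline; averaging per-task rewards with a simple mean, and the function and team names used here, are assumptions for illustration.

```python
from statistics import mean


def generalization_delta(trained_rewards, baseline_rewards):
    """Mean held-out reward of the trained checkpoint minus the baseline's.

    Both lists hold one reward per held-out task, in the same task order.
    A simple mean is assumed; the actual aggregation may differ.
    """
    if not trained_rewards:
        raise ValueError("no held-out rewards supplied")
    if len(trained_rewards) != len(baseline_rewards):
        raise ValueError("both checkpoints must be scored on the same tasks")
    return mean(trained_rewards) - mean(baseline_rewards)


def rank_teams(deltas):
    """Order team names from largest to smallest delta."""
    return sorted(deltas, key=deltas.get, reverse=True)


# Hypothetical rewards on four held-out tasks.
delta = generalization_delta([1.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0])
order = rank_teams({"team_a": 0.10, "team_b": 0.25, "team_c": -0.05})
```

In this example the trained checkpoint averages 0.75 against the baseline's 0.25, a delta of 0.5. A negative delta means training on the submitted environments hurt held-out performance relative to the baseline.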

What happens to what I submit?

Everything accepted is released openly. The point is to grow the commons of high-quality, diverse RL environments, not to lock anything away.

When does it run?

Phase 0 opens in late June 2026, Phase 1 runs September 1 to October 21, awards are announced November 7, and the workshop is in December 2026.