Catalog
Browse the open RL environment ecosystem
A working directory of 345 open-source environments — pulled from the hubs where they actually live, and mapped to where the open community is crowded, thin, or wide open.
Sourced from the Prime Intellect Environments Hub (109), Harbor (200 datasets), and a curated set of 36 anchor environments. A snapshot, not a live mirror — counts and links reflect each project at review time.
Where environments live
The center of gravity is no longer any single trainer — it is the registries and standards that let an environment be published once and reused anywhere.
- Prime Intellect Environments Hub · 2,500+ envs · Prime Intellect · Community registry of RL/agentic environments as versioned Python packages; the de facto verifiers format.
- OpenReward · open standard · General Reasoning · An open standard (ORS) connecting agents to environments via tool-calling; NVIDIA, Nebius, Eigent as launch contributors.
- Harbor · 200+ datasets · Stanford + Laude · Hub of Dockerized evaluation datasets run by the Harbor harness, which also generates RL rollouts.
- OpenEnv · Gym-over-HTTP · Meta PyTorch + Hugging Face · BSD-licensed interface running isolated Dockerized environments behind a Gymnasium HTTP API.
- verifiers · 4.2k★ library · Prime Intellect · The standard Python library for building, training and evaluating LLMs in RL environments.
- Inspect Evals · 100+ evals · UK AISI · Community LLM evals on Inspect AI: cyber, coding, knowledge, multimodal, agentic.
- Gymnasium registry · 70+ third-party · Farama Foundation · Curated registry of Gymnasium-compatible environments across robotics, games, energy, finance.
The directory
Search and filter the catalogue by source or capability. Every card links to the environment’s home — a Hub package, a repo, or a task suite.
- AgentBench · Notable · 8 containerized tasks: OS, DB, knowledge graph, card game, web, household. · tags: agent, tool-use
- AgentGym · Notable · 14 unified interactive envs: web, games, household, embodied, tools, coding. · tags: agent, multi-turn
- AndroidWorld · Notable · 116 mobile-GUI tasks across 20 apps with millions of variants. · tags: mobile, gui-agent
- AppWorld · Notable · Stateful world of 9 simulated apps and 457 APIs with programmatic checks. · tags: office, tool-use
- Aviary · Notable · Scientific agent environments: biology, chemistry, literature search. · tags: sciences, tool-use
- BountyBench · Notable · Detect/exploit/patch across 25 real systems and 40 bug bounties. · tags: cybersecurity
- BrowserGym · Notable · Gymnasium browser env unifying MiniWoB, WebArena, WorkArena, VisualWebArena. · tags: web, gui-agent, gym
- CityLearn · Notable · Building and district energy management with demand response. · tags: energy, control
- Commit0 · Notable · Rebuild 57 Python libraries from scratch to pass full test/lint/type suites. · tags: coding, swe
- Craftax · Notable · JAX survival-crafting; fast open-ended exploration benchmark. · tags: embodied, jax
- CTF-Dojo · Notable · Auto-builds Dockerized CTF environments (~0.5s/container) for training. · tags: cybersecurity, procedural, docker
- CyberGym · Notable · 1,507 real vulnerabilities; agents generate proof-of-concept tests. · tags: cybersecurity, docker
- FinRL · Notable · Leading financial-RL library; hundreds of stock/crypto/portfolio markets. · tags: finance, gym
- GEM · Notable · Gym-style suite for agentic LLMs across games, math, code, QA with tools. · tags: tool-use, gym, reasoning
- Grid2Op · Notable · Power-grid operation environment behind the L2RPN competitions. · tags: energy, control
- Gymnasium · Notable · The maintained successor to OpenAI Gym; the reset()/step() standard. · tags: classic-control, gym
- Isaac Lab · Notable · NVIDIA's dominant high-throughput robot-learning stack. · tags: robotics, sim
- Jumanji · Notable · 22 JAX environments for NP-hard combinatorial optimization. · tags: industrial, jax, optimization
- KernelBench · Notable · 250 GPU-kernel generation tasks scored on correctness and speedup. · tags: kernels, ml-engineering
- ManiSkill · Notable · GPU-parallel robot manipulation on the SAPIEN simulator. · tags: robotics, manipulation
- Melting Pot · Notable · 80+ multi-agent social-dilemma and cooperation scenarios. · tags: multi-agent, social
- MineDojo · Notable · Open-ended Minecraft with thousands of language-specified tasks. · tags: embodied, open-ended
- MLE-bench · Notable · End-to-end ML engineering on 75 Kaggle competitions, leaderboard-graded. · tags: ml-engineering, agent
- MLGym · Notable · Gym-style framework for AI-research agents across 13 open-ended ML tasks. · tags: ml-engineering, gym
- OpenHands · Notable · Dockerized agent runtime + eval harnesses for external SWE benchmarks. · tags: coding, agent, runtime
- OpenSpiel · Notable · 70+ games + algorithms for general RL, search and planning. · tags: games, multi-agent
- OSWorld · Notable · 369 real cross-app desktop tasks with 134 execution-based evaluators. · tags: computer-use, gui-agent, multimodal
- PettingZoo · Notable · Multi-agent counterpart to Gymnasium: ~70+ environments. · tags: multi-agent, games, gym
- R2E-Gym · Notable · 8,100+ Gym-like executable SWE environments with unit-test rewards. · tags: coding, swe, gym
- Reasoning Gym · Notable · 100+ procedurally generated, algorithmically verifiable reasoning tasks. · tags: reasoning, math, procedural
- ScienceWorld · Notable · Text science-lab environment, 30 task types, for grounded reasoning. · tags: sciences, text
- SkyRL-Gym · Notable · Gymnasium-style tool-use envs (math, code, search, SQL) in the SkyRL stack. · tags: tool-use, gym, coding
- SWE-bench · Notable · 2,294 real GitHub issue-resolution tasks with Dockerized per-instance evaluation. · tags: coding, swe, docker
- SWE-smith · Notable · Turns any repo into unlimited Dockerized SWE tasks; 52k-instance dataset. · tags: coding, swe, procedural
- tau2-bench · Notable · Tool-agent-with-simulated-user (airline/retail/telecom) with a Gym RL API. · tags: tool-use, customer-service, gym
- WebArena · Notable · Self-hostable realistic web (shopping, forum, GitLab, CMS), 812 tasks. · tags: web, agent
- Agency Bench · Prime · HumanAgencyBench: Benchmark measuring AI assistants' support for human agency across 6 dimensions (3000 prompts, LLM-as-judge) · tags: benchmark, llm-as-judge, human-agency
- Agent Dojo · Prime · Benchmark for agent robustness against prompt injection attacks in tool-use scenarios · tags: security, prompt-injection, tool-use, adversarial
- Agentharm · Prime · AgentHarm environment to evaluate agentic reasoning and safety · tags: train, safety, agent, tool-use
- Agentic Misalignment · Prime · This is a port of Anthropic's Agentic Misalignment framework to PI env hub · tags: llm-as-judge
- Aidanbench · Prime · AidanBench multi-turn environment for Verifiers · tags: aidanbench, multi-turn, judge, novelty
- Aider Polyglot · Prime · Multi-turn environment for testing coding abilities across multiple programming languages using Exercism exercises · tags: coding, multi-turn, polyglot
- Allenai Ifeval · Prime · IFEval single-turn environment using AllenAI RLVR-IFeval · tags: ifeval, single-turn, chat, constraints
- Androidworld · Prime · AndroidWorld benchmark for evaluating autonomous agents on real Android apps with 116 tasks across 20 apps · tags: mobile, android, multi-turn, tool-use
- Antislop · Prime · Rank model on anti-slop score · tags: single-turn, creative-writing, llm-judge
- Arc · Prime · ARC: benchmark of grade-school-level, multiple-choice science questions, assembled to encourage research in advanced question answering · tags: arc, arc-challenge, arc-easy, benchmark
- Arc Agi · Prime · ARC-AGI 1 + 2 (Abstraction and Reasoning Corpus) · tags: arc-agi, single-turn, reasoning, puzzles
- Arc Agi Tool · Prime · ARC-AGI 1 + 2 with tool calling (Abstraction and Reasoning Corpus) · tags: arc-agi, tool-use, multi-turn, reasoning
- Art E · Prime · ART-E: a tool-using email research RL environment for Verifiers · tags: email, research, tool-use, llm-judge
- Ascii Tree · Prime · Single-turn evaluation where the model generates ASCII tree diagrams from prompts. · tags: formatting, ascii, single-turn, xml
- Autodiff · Prime · Autodifferentiation puzzles in JAX by Sasha Rush
- Backend Bench · Prime · BackendBench environment for LLM kernel benchmarking · tags: kernels, single-turn
- Balrog Bench · Prime · BALROG benchmark integration for verifiers: unified RL evaluation across game environments.
- Bigbench Bbh · Prime · Big Bench + BBH implementation · tags: bigbench, bbh, evaluation, nlp
- Bixbench · Prime · BixBench scientific reasoning evaluation environment · tags: scientific-reasoning, mcq, open-answer, single-turn
- Boolq · Prime · Binary question-answering task from BoolQ, where models predict True or False from a passage. · tags: reasoning, qa
- Browsecomp Plus · Prime · Verifiers environment for BrowseComp-Plus Deep-Research Agent Benchmark. Controlled agent/retriever evaluation on the fixed human-verified corpus. · tags: search-agent, deep-research, retriever, primeintellect
- Clockbench · Prime · ClockBench: multimodal clock reading and reasoning benchmark implemented for verifiers. · tags: clockbench, multimodal, vision
- Coconot · Prime · Contextual noncompliance evaluation using the AllenAI CoCoNot dataset · tags: coconot, safety, single-turn, llm-judge
- Colf · Prime · Run the Colf eval · tags: code-golf, javascript, prompt-engineering
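For readers who want to reproduce the directory's source-or-capability filtering offline, here is a minimal Python sketch. It is an illustration only: the page's actual search logic is not shown, and the `Entry` record and `filter_entries` helper are hypothetical names. The sample entries and their tags are copied from the cards above and cover only a handful of the listings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One directory card (hypothetical record; fields mirror what a card shows)."""
    name: str
    source: str                 # source label on the card, e.g. "Notable" or "Prime"
    tags: tuple = ()            # capability tags; some cards list none


# A few entries copied from the directory above, not the full catalogue.
CATALOGUE = [
    Entry("BrowserGym", "Notable", ("web", "gui-agent", "gym")),
    Entry("SWE-bench", "Notable", ("coding", "swe", "docker")),
    Entry("tau2-bench", "Notable", ("tool-use", "customer-service", "gym")),
    Entry("Agent Dojo", "Prime", ("security", "prompt-injection", "tool-use", "adversarial")),
    Entry("Aider Polyglot", "Prime", ("coding", "multi-turn", "polyglot")),
    Entry("Autodiff", "Prime"),
]


def filter_entries(entries, source=None, tag=None, query=None):
    """Return entries matching every criterion that is given.

    source: case-insensitive match on the source label
    tag:    exact match against one of the entry's tags
    query:  case-insensitive substring match on the entry name
    """
    result = []
    for entry in entries:
        if source is not None and entry.source.lower() != source.lower():
            continue
        if tag is not None and tag not in entry.tags:
            continue
        if query is not None and query.lower() not in entry.name.lower():
            continue
        result.append(entry)
    return result


if __name__ == "__main__":
    for entry in filter_entries(CATALOGUE, source="Notable", tag="gym"):
        print(entry.name, "|", ", ".join(entry.tags))
```

Criteria combine with AND, so passing both a source and a tag narrows the result; leaving a criterion as `None` skips it.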
Mapped to the eight domains
How that catalogue lands across PostTrain Arena’s under-served domains. Four are effectively greenfield, two are thin, and two are mature — that asymmetry is the opportunity.