From benchmarks to environments
In our last blog post, we showed how a combination of architectural choices and prompting strategies allowed us to beat state‑of‑the‑art models on the Online‑Mind2Web (Online‑M2W) benchmark. That post focused primarily on agent‑level decisions: how the planner reasoned, how tools were invoked, and how the browser was perceived.
This post zooms out one level.
Here, we describe the agentic environment that made those results possible in the first place. More importantly, we explain why this environment is now central to how we evaluate agents, reduce variance in online benchmarks, and generate labelled trajectory data that can be reused to train smaller, cheaper computer‑use agents.
Specifically, we’ll cover how our Agentic Environment playground lets us:
Agent benchmarks are not static question‑answer tasks. They are long‑horizon, multi‑step interactions with systems that are:
A single browser timeout or tool hiccup can invalidate an otherwise correct trajectory. Without durability, these failures don’t just reduce accuracy—they erase signal.
Early on, we found ourselves throwing away large numbers of partially successful runs simply because the environment failed near the end. That was unacceptable: our goal was to measure and improve agents, not just demo them.
Durability is what turns agent evaluation from best‑effort execution into repeatable measurement.
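To make that concrete, here is a minimal sketch of what durable step execution can look like: each completed step is persisted to a checkpoint before the next one runs, and a transient environment error triggers a bounded retry instead of discarding the trajectory. The names here (`run_step`, `TransientEnvError`, the JSONL checkpoint layout) are illustrative assumptions, not our actual implementation.

```python
import json
import time
from pathlib import Path


class TransientEnvError(Exception):
    """Stand-in for a browser timeout or tool hiccup (hypothetical)."""


def run_step(step_index: int) -> dict:
    """Placeholder for one agent action against the environment."""
    # A real run would drive the browser or invoke a tool here.
    return {"step": step_index, "observation": f"result of step {step_index}"}


def run_trajectory(task_id: str, num_steps: int, checkpoint_dir: Path,
                   max_retries: int = 3) -> list[dict]:
    """Execute a trajectory, persisting each step and retrying transient failures."""
    checkpoint = checkpoint_dir / f"{task_id}.jsonl"
    completed: list[dict] = []

    # Resume from previously persisted steps instead of starting over.
    if checkpoint.exists():
        completed = [json.loads(line) for line in checkpoint.read_text().splitlines()]

    with checkpoint.open("a") as log:
        for step_index in range(len(completed), num_steps):
            for attempt in range(max_retries):
                try:
                    record = run_step(step_index)
                    break
                except TransientEnvError:
                    # A timeout near the end no longer erases earlier signal:
                    # everything up to this step is already on disk.
                    time.sleep(2 ** attempt)
            else:
                raise RuntimeError(f"step {step_index} failed after {max_retries} retries")
            log.write(json.dumps(record) + "\n")
            log.flush()
            completed.append(record)

    return completed


if __name__ == "__main__":
    steps = run_trajectory("demo-task", num_steps=5, checkpoint_dir=Path("."))
    print(f"completed {len(steps)} steps")
```

Even in this toy form, the design choice is the same one we care about in the playground: a failure late in a run costs at most one retried step, not the whole trajectory.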