From benchmarks to environments
In our last blog post, we showed how a combination of architectural choices and prompting strategies allowed us to beat state‑of‑the‑art models on the Online‑Mind2Web (Online‑M2W) benchmark. That post focused primarily on agent‑level decisions: how the planner reasoned, how tools were invoked, and how the browser was perceived.
This post zooms out one level.
Here, we describe the agentic environment that made those results possible in the first place. More importantly, we explain why this environment is now central to how we evaluate agents, reduce variance in online benchmarks, and generate labelled trajectory data that can be reused to train smaller, cheaper computer‑use agents.
Specifically, we’ll cover how our Agentic Environment playground lets us:
Agent benchmarks are not static question‑answer tasks. They are long‑horizon, multi‑step interactions with systems that are:
A single browser timeout or tool hiccup can invalidate an otherwise correct trajectory. Without durability, these failures don’t just reduce accuracy—they erase signal.
Early on, we found ourselves throwing away large numbers of partially successful runs simply because the environment failed near the end. That was unacceptable: our goal was to measure and improve agents, not just demo them.
Durability is what turns agent evaluation from best‑effort execution into repeatable measurement.
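To make that concrete, here is a minimal sketch of what durable step execution can look like: each completed step is persisted to a checkpoint before the next one runs, and a transient environment error triggers a bounded retry instead of discarding the trajectory. The names here (`run_step`, `TransientEnvError`, the JSONL checkpoint layout) are illustrative assumptions, not our actual implementation.

```python
import json
import time
from pathlib import Path


class TransientEnvError(Exception):
    """Stand-in for a browser timeout or tool hiccup (hypothetical)."""


def run_step(step_index: int) -> dict:
    """Placeholder for one agent action against the environment."""
    # A real run would drive the browser or invoke a tool here.
    return {"step": step_index, "observation": f"result of step {step_index}"}


def run_trajectory(task_id: str, num_steps: int, checkpoint_dir: Path,
                   max_retries: int = 3) -> list[dict]:
    """Execute a trajectory, persisting each step and retrying transient failures."""
    checkpoint = checkpoint_dir / f"{task_id}.jsonl"
    completed: list[dict] = []

    # Resume from previously persisted steps instead of starting over.
    if checkpoint.exists():
        completed = [json.loads(line) for line in checkpoint.read_text().splitlines()]

    with checkpoint.open("a") as log:
        for step_index in range(len(completed), num_steps):
            for attempt in range(max_retries):
                try:
                    record = run_step(step_index)
                    break
                except TransientEnvError:
                    # A timeout near the end no longer erases earlier signal:
                    # everything up to this step is already on disk.
                    time.sleep(2 ** attempt)
            else:
                raise RuntimeError(f"step {step_index} failed after {max_retries} retries")
            log.write(json.dumps(record) + "\n")
            log.flush()
            completed.append(record)

    return completed


if __name__ == "__main__":
    steps = run_trajectory("demo-task", num_steps=5, checkpoint_dir=Path("."))
    print(f"completed {len(steps)} steps")
```

Even in this toy form, the design choice is the same one we care about in the playground: a failure late in a run costs at most one retried step, not the whole trajectory.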