In our last blog, we showed how a multi-tier browser agent architecture helped us beat state-of-the-art models on the Online‑Mind2Web (Online‑M2W) benchmark. To make that work relevant for enterprises, the agents need to be durable, fault-tolerant and resilient. That post focused on agent‑specific decisions: how the planner reasoned, which tools were invoked and why, and how the browser was perceived.

Here we describe the agentic environment that made those results possible, and we explain why this environment is now central to how we evaluate agents, reduce variance in online benchmarks, and generate labelled trajectory data that can be reused to train smaller, cheaper computer‑use agents. Using durable agents to generate training data for computer-use workloads has broad implications for building other custom fine-tuned models.

[Figure: accuracy_bar_chart.png]

LLM inference accuracy on the Online‑Mind2Web benchmark. Hyde’s results are highlighted on the right-hand side; introducing durability into the agents alone provides a 7% lift in accuracy.


Why durability matters for AI agents

Agent benchmarks are not static question‑answer tasks. They are long‑horizon, multi‑step interactions with live systems that can fail at any step: browsers time out and tools hiccup mid-trajectory.

A single browser timeout or tool hiccup can invalidate an otherwise correct trajectory. Without durability, these failures don’t just reduce accuracy; they erase signal. Durability turns agent evaluation from best‑effort execution into repeatable task automation.
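To make the idea concrete, here is a minimal sketch (not our production code; the names `durable_step`, `flaky_click`, and `TransientBrowserError` are hypothetical) of what durability means at the level of a single agent step: a transient failure triggers a bounded retry with backoff instead of discarding the whole trajectory.

```python
import random
import time


class TransientBrowserError(Exception):
    """Stand-in for a browser timeout or flaky-tool failure."""


def flaky_click(selector: str) -> str:
    # Simulate a browser action that fails transiently ~30% of the time.
    if random.random() < 0.3:
        raise TransientBrowserError(f"timeout clicking {selector}")
    return f"clicked {selector}"


def durable_step(action, *args, max_attempts: int = 3, base_delay: float = 0.5):
    """Retry a single step with exponential backoff.

    Only after the retry budget is exhausted does the failure propagate,
    at which point it is genuine signal rather than transient noise.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action(*args)
        except TransientBrowserError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    # One flaky step no longer erases an otherwise correct trajectory.
    print(durable_step(flaky_click, "#submit-button"))
```

A real orchestration layer would also checkpoint completed steps so a restarted run can resume mid-trajectory, but the retry boundary above is the core of the accuracy lift: transient errors stop being scored as task failures.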


The core analogy: agents as probabilistic state machines

We can treat an agent as a probabilistic finite state machine (PFSM), and the platform that runs it as a state‑machine orchestration engine. Formally, a probabilistic finite state machine can be defined as:

$$ M = (S, \Sigma, \delta, s_0, F) $$

Where: