For enterprises, trust in a browser-use agent centers on its ability to deliver accuracy at scale, repeatedly. Whether the task is integrating with SaaS software that offers no APIs or automating a mission-critical legacy workflow built on a 20-year-old ERP system, an enterprise-ready agent that integrates company data, speeds up decision making, and saves time on repetitive tasks is finding widespread adoption among some of the largest enterprises we work with.

In this blog, we trace the evolution of browser agents, survey the benchmarks available today, and discuss some of the core optimizations we've delivered for our partners to make their production-grade browser-use agents faster and more accurate.

LLM inference latency vs. accuracy on the OnlineMind2Web benchmark. Hyde's results are highlighted in purple, showing higher accuracy and lower latency than out-of-the-box SOTA models.

Evolution of Browser Agents Using LLMs

Evaluation

Browser-agent benchmarks must resemble real user workflows, not demo tasks, and they must exercise a diverse set of skills so that high scores translate into real-world usefulness.

Popular Benchmarks

Evaluating agents on these benchmarks is also an open-ended problem. For most tasks there are many valid trajectories, so comparing runs to a single reference trajectory is suboptimal.
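
One common alternative is outcome-based scoring: judge a run by whether the final state satisfies the task's success conditions, rather than by whether it reproduced a reference action sequence. Below is a minimal sketch of the idea; the AgentRun structure, field names, and checks are hypothetical and not tied to any particular benchmark or agent framework.

```python
# Sketch of outcome-based evaluation vs. strict trajectory matching.
# Strict matching fails any run that deviates from the reference actions,
# even when it reaches the correct end state; outcome checks only ask
# whether the task's success conditions hold at the end of the run.
from dataclasses import dataclass


@dataclass
class AgentRun:
    task: str
    final_url: str
    final_dom_text: str  # text content of the page the agent ended on


def trajectory_match(run_actions: list[str], reference_actions: list[str]) -> bool:
    """Strict matching against a single reference trajectory."""
    return run_actions == reference_actions


def outcome_match(run: AgentRun,
                  required_url_fragment: str,
                  required_phrases: list[str]) -> bool:
    """Pass if the final state satisfies the task's success conditions,
    regardless of which valid trajectory the agent took to get there."""
    on_right_page = required_url_fragment in run.final_url
    evidence_present = all(
        phrase.lower() in run.final_dom_text.lower() for phrase in required_phrases
    )
    return on_right_page and evidence_present


if __name__ == "__main__":
    run = AgentRun(
        task="Find the cheapest direct flight from SFO to JFK on May 3",
        final_url="https://example-travel.com/results?from=SFO&to=JFK",
        final_dom_text="Direct · $212 · SFO → JFK · May 3",
    )
    # Two runs with different but equally valid action sequences would
    # both pass this check, whereas trajectory_match would fail one of them.
    print(outcome_match(run, "example-travel.com/results", ["Direct", "SFO", "JFK"]))
```

In practice, outcome checks like these are often combined with an LLM-as-judge pass over the final state, since not every task's success condition reduces to a URL fragment and a few phrases.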