Faster and more accurate browser-use agents

For enterprises, trust in a browser-use agent rests on the agent’s ability to deliver accurate results repeatedly and at scale. Whether the task is integrating with a SaaS product that offers no API or automating a mission-critical legacy workflow on a 20-year-old ERP system, enterprise-ready agents that integrate company data, speed up decision-making, and save time on repetitive tasks are finding widespread adoption in some of the largest enterprises we work with.

In this post, we trace the evolution of browser agents, survey the benchmarks currently available, and describe some of the core optimizations we’ve delivered for our partners to make their production-grade browser-use agents faster and more accurate.

LLM inference latency vs. accuracy on the OnlineMind2Web benchmark. Hyde’s results are highlighted in purple, showing higher accuracy and lower latency than out-of-the-box SOTA models.

Evolution of browser agents using LLMs

Oct 2022 - Gur et al. show that LLMs can understand HTML and identify, from a page's HTML text, the elements to click or type into. [1]
Jul 2023 - WebAgent combines two LLMs for planning and acting while browsing: one fine-tuned on HTML, which interprets the current page state and plans ahead, and another that generates Selenium code to carry out actions. [2]
Oct 2023 - Set-of-Mark prompting is proposed for multimodal LLMs: regions of an image are marked using an off-the-shelf segmentation model to improve visual grounding without fine-tuning. [3]
Jan 2024 - The WebVoyager agent uses an LLM that consumes screenshots annotated in the style of Set-of-Mark. It completes tasks by interacting with websites through tools for primitive GUI actions such as click, input, and scroll. [4]
Oct 2024 - The SeeAct-V agent uses a visual grounding model to perceive GUIs entirely visually and perform pixel-level operations on the screen. [5]
Oct 2024 - Claude 3.5 Sonnet adds Computer Use, inferring on-screen coordinates from images. It was the first frontier model with this capability. [6]
Dec 2024 - BrowserUse achieves the highest score on the WebVoyager benchmark using a DOM-first approach, i.e. the agent receives the current browser state as DOM rather than as a screenshot. [7]
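
To make the DOM-first idea concrete, here is a minimal sketch of how a page's HTML could be reduced to an indexed list of interactive elements that an LLM can refer to by number. It is an illustration only, not the implementation used by BrowserUse or any other agent listed above. The names `INTERACTIVE_TAGS`, `InteractiveElementIndexer`, and `describe_page`, as well as the choice of tags and labels, are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags treated as interactive in this sketch (an assumption, not a standard list).
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveElementIndexer(HTMLParser):
    """Collects interactive elements and assigns each a numeric index."""

    def __init__(self):
        super().__init__()
        self.elements = []   # one dict per interactive element, in document order
        self._open = None    # element currently collecting inner text, if any

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            element = {"index": len(self.elements), "tag": tag,
                       "attrs": dict(attrs), "text": ""}
            self.elements.append(element)
            # <input> is a void element, so it never has inner text.
            self._open = None if tag == "input" else element

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open["tag"]:
            self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data.strip()


def describe_page(html):
    """Return a compact text listing of interactive elements for an LLM prompt."""
    parser = InteractiveElementIndexer()
    parser.feed(html)
    parser.close()
    lines = []
    for el in parser.elements:
        # Prefer visible text, then fall back to placeholder or name attributes.
        label = el["text"] or el["attrs"].get("placeholder") or el["attrs"].get("name") or ""
        lines.append(f'[{el["index"]}] <{el["tag"]}> {label}'.rstrip())
    return "\n".join(lines)


page = ('<form><input name="q" placeholder="Search"><button>Go</button></form>'
        '<a href="/help">Help</a>')
print(describe_page(page))
```

With a listing like this in the prompt, the model can answer with an action such as "click element 1" instead of predicting pixel coordinates, which is the main contrast with the screenshot-based agents in the timeline.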

Evaluation

Browser-agent benchmarks must resemble real user workflows, not demo tasks, and they must exercise a diverse set of skills so that high scores translate to real-world usefulness.

Popular benchmarks

WebArena - A standalone, self-hostable web environment spanning websites from popular categories, with a set of tasks to perform in it. This approach avoids open-web challenges such as pop-ups, ads, and CAPTCHAs. [8]
WebVoyager - A task suite covering 15 representative websites. It lacks website diversity, and recent agents score very high on it. [9]
OnlineMind2Web - A task suite spanning 136 websites. Its authors note that many WebVoyager tasks are solvable via Google Search alone, so they propose harder tasks. [10]

Evaluating agents on these benchmarks is itself an open problem. Most tasks admit many valid trajectories, so comparing a run against a single reference trajectory is suboptimal.
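
The toy sketch below illustrates why single-trajectory matching undercounts success. It compares a strict step-for-step check against a check on the final outcome. This is not the scoring method of any benchmark named above; the `Run` structure, the action strings, and the `required` state are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    actions: list                                    # ordered action strings taken by the agent
    final_state: dict = field(default_factory=dict)  # facts observed when the run ends


def matches_reference(run, reference_actions):
    """Strict check: the run must repeat the reference trajectory step for step."""
    return run.actions == reference_actions


def satisfies_outcome(run, required_state):
    """Outcome check: every required key/value must hold in the final state."""
    return all(run.final_state.get(k) == v for k, v in required_state.items())


# Hypothetical task: add a USB-C cable to the shopping cart.
reference = ["click:search_box", "type:usb-c cable", "click:first_result", "click:add_to_cart"]
required = {"cart_item": "usb-c cable", "cart_count": 1}

# Hypothetical run that reaches the same goal through a category menu instead of search.
run = Run(
    actions=["click:menu", "click:cables", "click:usb-c cable", "click:add_to_cart"],
    final_state={"cart_item": "usb-c cable", "cart_count": 1},
)

print(matches_reference(run, reference))  # False: the path differs from the reference
print(satisfies_outcome(run, required))   # True: the task goal was still achieved
```

The run is a valid solution, yet the strict check marks it as a failure. Real web tasks rarely expose such a clean final state, which is part of why evaluation remains hard.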