Hacker News

Web Bench: a new way to compare AI browser agents

by suchintan on 5/29/2025, 2:57:25 PM with 6 comments

by helsinki on 5/29/2025, 10:57:57 PM
Does anyone use Skyvern to build their websites? I’m wondering how I might benefit from using an agentic browser workflow instead of a Playwright MCP server for building a web UI.
by neveroddoreven on 5/29/2025, 3:47:07 PM
I had no idea WebVoyager only spanned 15 websites lol... the 452 figure you have still seems a little low though - do you have plans to expand it? It seems like you'd want as many sites as possible to improve the real-world accuracy of agents due to the long tail nature of website traffic
by vasusen on 5/29/2025, 10:18:01 PM
Thank you so much for creating this, folks! A browser navigation agent is a key part of our AI QA setup at Donobu (https://donobu.com/). We found the WebVoyager benchmarks severely lacking for complex e2e test cases like logged-in dashboards, onboarding forms, etc.
While the extraction/2FA flows aren't super relevant to us, this saves us the time of building our own set of benchmarks. Really appreciate it and hope we can contribute to make this a really large set.
by gitmagic on 5/29/2025, 9:21:43 PM
Would love to see how Nelly [0] performs on this benchmark.
[0] https://nelly.is
by pants2 on 5/29/2025, 11:25:10 PM
Great work! Big fan of Skyvern.
Looking forward to the benchmarks on Claude 4 (and o3 CUA when that's released).
by wm2 on 5/29/2025, 3:41:35 PM
super cool!