add webbench, chrome-based OS world, and ground truth to web voyager #1057

filip-michalsky · 2025-09-07T18:49:54Z

why

We want to build a best-in-class agent in Stagehand.
Therefore, we need more eval benchmarks.

what changed

Added Web-bench evals dataset
Added a subset of OS World evals - those that can be run in a chrome browser (desktop-based tasks omitted)
Added LICENSE notices to the copied eval tasks
Added ground truth / expected result to some WebVoyager tasks using reference_answer.json from Browser Use public evals repo.

Improved the output of pnpm run evals -man so it better describes how to run evals.
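To illustrate the ground truth idea, here is a minimal sketch of how an agent's final answer could be compared against reference answers. It is not the checker added in this PR: the `ReferenceAnswer` shape, the `normalize` helper, and the `matchesGroundTruth` function are all hypothetical, and the real layout of reference_answer.json may differ.

```typescript
// Hypothetical shape of one reference entry; the real reference_answer.json may differ.
interface ReferenceAnswer {
  taskId: string;
  answers: string[]; // acceptable answers for the task
}

// Lowercase, drop most punctuation, and collapse whitespace so that
// trivial formatting differences do not cause a mismatch.
function normalize(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s.]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Returns true when any normalized reference answer appears inside the
// normalized agent output. Empty reference answers never match.
function matchesGroundTruth(
  agentOutput: string,
  reference: ReferenceAnswer,
): boolean {
  const output = normalize(agentOutput);
  return reference.answers.some((answer) => {
    const expected = normalize(answer);
    return expected.length > 0 && output.includes(expected);
  });
}

// Illustrative data only, not taken from the dataset.
const reference: ReferenceAnswer = { taskId: "example--0", answers: ["$12.99"] };
console.log(matchesGroundTruth("The price is $12.99 today.", reference));
```

Substring matching is deliberately simple here. It misses paraphrased answers, so a real checker would likely need something more tolerant.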

test plan

Evals should run both locally and on Browserbase (bb) for these new benchmarks.

changeset-bot · 2025-09-07T18:49:57Z

🦋 Changeset detected

Latest commit: 0b93932

The changes in this PR will be included in the next version bump.

filip-michalsky · 2025-09-07T19:59:09Z

evals/datasets/webbench/results/anthropicfinal.csv

IDK if we want to include these here or just refer to it when we build the evals graphs? cc @miguelg719

like dragging random CSVs in the repo is not great

Resolved: will remove results.

greptile-apps

Greptile Summary

This PR successfully integrates three major evaluation benchmarks into Stagehand: WebBench, OS World (Chrome tasks only), and WebVoyager with ground truth reference answers. The implementation includes proper licensing, well-structured adapters for data conversion, and comprehensive evaluation logic for each benchmark type.

Key improvements include:

Added WebBench evaluation dataset with 1000+ web automation tasks across 5 categories (READ, CREATE, UPDATE, DELETE, FILE_MANIPULATION)
Integrated 47 Chrome-compatible OS World tasks with proper evaluation criteria mapping
Enhanced WebVoyager with ground truth checker using reference answers from Browser Use public eval repository
Comprehensive configuration options for filtering and sampling tasks across all benchmarks
Proper Apache 2.0 and MIT licensing with attribution for external datasets
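The filtering and sampling options mentioned above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the `WebBenchTask` fields, the `SelectionOptions` names, and `selectTasks` are hypothetical. Only the five category names come from the summary.

```typescript
type WebBenchCategory =
  | "READ"
  | "CREATE"
  | "UPDATE"
  | "DELETE"
  | "FILE_MANIPULATION";

// Hypothetical task record; the real dataset columns may differ.
interface WebBenchTask {
  id: string;
  category: WebBenchCategory;
  task: string;
}

// Hypothetical option names for filtering and sampling.
interface SelectionOptions {
  categories?: WebBenchCategory[]; // keep only these categories
  limit?: number; // sample at most this many tasks
  seed?: number; // makes sampling reproducible
}

// Small seeded pseudo-random generator (mulberry32-style) returning values in [0, 1).
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function selectTasks(
  tasks: WebBenchTask[],
  options: SelectionOptions = {},
): WebBenchTask[] {
  const { categories, limit, seed = 0 } = options;
  // Copy so the caller's array is never mutated.
  const selected = categories
    ? tasks.filter((t) => categories.includes(t.category))
    : [...tasks];
  if (limit === undefined) return selected;

  // Fisher-Yates shuffle driven by the seeded generator, then take the first `limit`.
  const random = seededRandom(seed);
  for (let i = selected.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = selected[i]!;
    selected[i] = selected[j]!;
    selected[j] = tmp;
  }
  return selected.slice(0, limit);
}

// Illustrative tasks only, not taken from the dataset.
const sampleTasks: WebBenchTask[] = [
  { id: "t1", category: "READ", task: "Find the return policy" },
  { id: "t2", category: "READ", task: "Look up store hours" },
  { id: "t3", category: "CREATE", task: "Create a wishlist" },
  { id: "t4", category: "UPDATE", task: "Change the shipping address" },
  { id: "t5", category: "DELETE", task: "Remove an item from the cart" },
];
console.log(selectTasks(sampleTasks, { categories: ["READ"] }).map((t) => t.id));
```

Seeding the sampler is a design choice that keeps a sampled subset stable across runs, which makes scores from different models comparable.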

Confidence score: 4/5

This PR is safe to merge with minimal risk.
The implementation is well-structured, with proper error handling and licensing compliance, and it follows existing patterns. One minor duplicate-configuration issue was found, but it doesn't affect functionality.
evals/evals.config.json has a duplicate task entry that should be corrected.

65 files reviewed, no comments

evals/datasets/webvoyager/groundTruthChecker.ts

evals/datasets/osworld/raw/2ad9387a-65d8-4e33-ad5b-7580065a27ca.json

evals/suites/webbench.ts

evals/tasks/agent/webbench.ts

evals/tasks/agent/webvoyager.ts

evals/tasks/agent/osworld.ts

tkattkat · 2025-09-15T19:00:20Z

evals/index.eval.ts

+          console.log(
+            `🏃 Running eval: ${input.name} with model: ${input.modelName}`,
+          );


evals/tasks/agent/webbench.ts

merge main

@tkattkat

This PR was opened by the [Changesets release](https://github.com/changesets/action) GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.

# Releases

## @browserbasehq/stagehand@2.5.1

### Patch Changes

- [#1082](#1082) [`8c0fd01`](8c0fd01) Thanks [@tkattkat](https://github.com/tkattkat)! - Pass stagehand object to agent instead of stagehand page
- [#1104](#1104) [`a1ad06c`](a1ad06c) Thanks [@miguelg719](https://github.com/miguelg719)! - Fix logging for stagehand agent
- [#1066](#1066) [`9daa584`](9daa584) Thanks [@tkattkat](https://github.com/tkattkat)! - Add playwright arguments to agent execute response
- [#1077](#1077) [`7f38b3a`](7f38b3a) Thanks [@tkattkat](https://github.com/tkattkat)! - adds support for stagehand agent in the api
- [#1032](#1032) [`bf2d0e7`](bf2d0e7) Thanks [@miguelg719](https://github.com/miguelg719)! - Fix for zod peer dependency support
- [#1014](#1014) [`6966201`](6966201) Thanks [@tkattkat](https://github.com/tkattkat)! - Replace operator handler with base of new agent
- [#1089](#1089) [`536f366`](536f366) Thanks [@miguelg719](https://github.com/miguelg719)! - Fixed info logs on api session create
- [#1103](#1103) [`889cb6c`](889cb6c) Thanks [@tkattkat](https://github.com/tkattkat)! - patch custom tool support in anthropic cua client
- [#1056](#1056) [`6a002b2`](6a002b2) Thanks [@chrisreadsf](https://github.com/chrisreadsf)! - remove need for duplicate project id if already passed to Stagehand
- [#1090](#1090) [`8ff5c5a`](8ff5c5a) Thanks [@miguelg719](https://github.com/miguelg719)! - Improve failed act error logs
- [#1014](#1014) [`6966201`](6966201) Thanks [@tkattkat](https://github.com/tkattkat)! - replace operator agent with scaffold for new stagehand agent
- [#1107](#1107) [`3ccf335`](3ccf335) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: url extraction not working inside an array
- [#1102](#1102) [`a99aa48`](a99aa48) Thanks [@miguelg719](https://github.com/miguelg719)! - Add current page and date context to agent
- [#1110](#1110) [`dda52f1`](dda52f1) Thanks [@miguelg719](https://github.com/miguelg719)! - Add support for new Gemini Computer Use models

## @browserbasehq/stagehand-evals@1.1.0

### Minor Changes

- [#1057](#1057) [`b7be89e`](b7be89e) Thanks [@filip-michalsky](https://github.com/filip-michalsky)! - added web voyager ground truth (optional), added web bench, and subset of OSWorld evals which run on a browser

### Patch Changes

- [#1072](#1072) [`dc2d420`](dc2d420) Thanks [@filip-michalsky](https://github.com/filip-michalsky)! - improve evals screenshot service - add img hashing diff to add screenshots and change to screenshot intercepts from the agent
- Updated dependencies \[[`8c0fd01`](8c0fd01), [`a1ad06c`](a1ad06c), [`9daa584`](9daa584), [`7f38b3a`](7f38b3a), [`bf2d0e7`](bf2d0e7), [`6966201`](6966201), [`536f366`](536f366), [`889cb6c`](889cb6c), [`6a002b2`](6a002b2), [`8ff5c5a`](8ff5c5a), [`6966201`](6966201), [`3ccf335`](3ccf335), [`a99aa48`](a99aa48), [`dda52f1`](dda52f1)]:
  - @browserbasehq/stagehand@2.5.1

## @browserbasehq/stagehand-examples@1.0.10

### Patch Changes

- Updated dependencies \[[`8c0fd01`](8c0fd01), [`a1ad06c`](a1ad06c), [`9daa584`](9daa584), [`7f38b3a`](7f38b3a), [`bf2d0e7`](bf2d0e7), [`6966201`](6966201), [`536f366`](536f366), [`889cb6c`](889cb6c), [`6a002b2`](6a002b2), [`8ff5c5a`](8ff5c5a), [`6966201`](6966201), [`3ccf335`](3ccf335), [`a99aa48`](a99aa48), [`dda52f1`](dda52f1)]:
  - @browserbasehq/stagehand@2.5.1

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

add webbench

280324b

filip-michalsky added 2 commits September 7, 2025 15:14

improve evals manual

dc02448

update manual for evals

d0447fc

filip-michalsky commented Sep 7, 2025

add chrome based OS world evals and licenses

9064586

filip-michalsky changed the title ~~add webbench~~ add webbench, chrome-based OS world, and ground truth to web voyager Sep 8, 2025

filip-michalsky added 2 commits September 7, 2025 22:16

add os world chrome tasks and web voyager ground truth

91cc95d

remove logging

a0007da

filip-michalsky requested a review from tkattkat September 8, 2025 16:30

filip-michalsky marked this pull request as ready for review September 8, 2025 16:30

greptile-apps bot reviewed Sep 8, 2025

remove results, make gt checking in webvoyager optional. default false

f66f7fa

filip-michalsky requested a review from miguelg719 September 10, 2025 17:18

filip-michalsky added 2 commits September 10, 2025 13:28

merge main

3ab5e56

update evals

90ade18

tkattkat reviewed Sep 12, 2025

evals/datasets/webvoyager/groundTruthChecker.ts (outdated, resolved)

tkattkat reviewed Sep 12, 2025

evals/datasets/osworld/raw/2ad9387a-65d8-4e33-ad5b-7580065a27ca.json (resolved)

tkattkat reviewed Sep 12, 2025

evals/suites/webbench.ts (outdated, resolved)

filip-michalsky added 2 commits September 12, 2025 22:23

use csv parser lib, simplify ground truth checker

150c69f

merge main

067d013

tkattkat reviewed Sep 15, 2025

evals/tasks/agent/webbench.ts (outdated, resolved)

tkattkat reviewed Sep 15, 2025

evals/tasks/agent/webvoyager.ts (outdated, resolved)

tkattkat reviewed Sep 15, 2025

evals/tasks/agent/osworld.ts (outdated, resolved)

tkattkat reviewed Sep 15, 2025

evals/tasks/agent/webbench.ts (resolved)

filip-michalsky added 4 commits September 18, 2025 18:14

update of passing agent to benchmarks

8a18457

remove not needed param

0bb0476

Merge branch 'main' into fm/stg-743-add-more-eval-benchmarks

cf6a832

merge main

update

1199fb0

update screenshot collector

b76b3ec

filip-michalsky requested a review from tkattkat September 18, 2025 17:00

filip-michalsky added 2 commits September 18, 2025 19:22

update os worls

3cd03c1

fix lint

0b93932

tkattkat approved these changes Sep 18, 2025

filip-michalsky merged commit b7be89e into main Sep 18, 2025
14 of 15 checks passed

This was referenced Sep 17, 2025

Version Packages #1062

Merged

Version Packages Malumbo21/stagehand#114

Open

Version Packages pchaganti/gx-stage-hand#1

Open

Version Packages CloudEngineHub/stagehand#1

Open

Version Packages erickirt/stagehand#72

Open

github-actions bot mentioned this pull request Aug 13, 2025

Version Packages aaag1980/stagehand#1

Open

3 participants
