@filip-michalsky filip-michalsky commented Sep 7, 2025

why

We want to build a best-in-class agent in Stagehand.
To measure progress toward that, we need more eval benchmarks.

what changed

  • Added Web-bench evals dataset
  • Added a subset of OS World evals: those that can run in a Chrome browser (desktop-based tasks omitted)
  • Added LICENSE notices to the copied eval tasks
  • Added ground truth / expected result to some WebVoyager tasks using reference_answer.json from Browser Use public evals repo.
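
As a rough illustration of the last bullet, attaching expected results to tasks might look like the sketch below. The `Task` and `ReferenceAnswer` shapes and the `attachGroundTruth` helper are hypothetical names chosen for this example; the actual layout of reference_answer.json and the adapter code in this PR are not shown here and may differ.

```typescript
// Hypothetical shapes; the real reference_answer.json layout may differ.
interface Task {
  id: string;
  question: string;
  expected?: string; // ground truth, present only for some tasks
}

interface ReferenceAnswer {
  id: string;
  answer: string;
}

// Returns new task objects, adding `expected` only where a reference answer exists.
function attachGroundTruth(tasks: Task[], refs: ReferenceAnswer[]): Task[] {
  const byId = new Map(refs.map((r) => [r.id, r.answer] as const));
  return tasks.map((t) => {
    const answer = byId.get(t.id);
    return answer === undefined ? t : { ...t, expected: answer };
  });
}
```

Tasks without a matching reference are returned unchanged, which matches the "some WebVoyager tasks" wording: ground truth stays optional.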

Also improved the output of pnpm run evals -man so that it better describes how to run evals.

test plan

Evals for these new benchmarks should run both locally and on Browserbase (bb).

changeset-bot bot commented Sep 7, 2025

🦋 Changeset detected

Latest commit: 0b93932

The changes in this PR will be included in the next version bump.


Collaborator Author


IDK if we want to include these here or just refer to it when we build the evals graphs? cc @miguelg719

Collaborator Author


like dragging random CSVs in the repo is not great

Collaborator Author


Resolved: will remove results.

@filip-michalsky filip-michalsky changed the title from "add webbench" to "add webbench, chrome-based OS world, and ground truth to web voyager" Sep 8, 2025
@filip-michalsky filip-michalsky marked this pull request as ready for review September 8, 2025 16:30
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR successfully integrates three major evaluation benchmarks into Stagehand: WebBench, OS World (Chrome tasks only), and WebVoyager with ground truth reference answers. The implementation includes proper licensing, well-structured adapters for data conversion, and comprehensive evaluation logic for each benchmark type.

Key improvements include:

  • Added WebBench evaluation dataset with 1000+ web automation tasks across 5 categories (READ, CREATE, UPDATE, DELETE, FILE_MANIPULATION)
  • Integrated 47 Chrome-compatible OS World tasks with proper evaluation criteria mapping
  • Enhanced WebVoyager with ground truth checker using reference answers from Browser Use public eval repository
  • Comprehensive configuration options for filtering and sampling tasks across all benchmarks
  • Proper Apache 2.0 and MIT licensing with attribution for external datasets
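
To make the filtering and sampling bullet concrete, here is a minimal sketch of how tasks could be narrowed by category and capped at a limit. Only the five category names come from the summary above; `BenchTask`, `SelectOptions`, and `selectTasks` are hypothetical and are not the configuration options this PR actually adds.

```typescript
// Category names taken from the summary; everything else is illustrative.
type Category = "READ" | "CREATE" | "UPDATE" | "DELETE" | "FILE_MANIPULATION";

interface BenchTask {
  id: string;
  category: Category;
}

interface SelectOptions {
  categories?: Category[]; // keep only these categories (all if omitted)
  limit?: number; // maximum number of tasks to return
  sample?: boolean; // shuffle before applying the limit
  random?: () => number; // injectable RNG so runs can be reproduced
}

function selectTasks(tasks: BenchTask[], opts: SelectOptions = {}): BenchTask[] {
  const { categories, limit, sample = false, random = Math.random } = opts;
  // Always work on a copy so the caller's array is never mutated.
  const selected = categories
    ? tasks.filter((t) => categories.includes(t.category))
    : [...tasks];
  if (sample) {
    // Fisher-Yates shuffle
    for (let i = selected.length - 1; i > 0; i--) {
      const j = Math.floor(random() * (i + 1));
      [selected[i], selected[j]] = [selected[j], selected[i]];
    }
  }
  return limit === undefined ? selected : selected.slice(0, limit);
}
```

Passing a seeded `random` function keeps a sampled subset stable between runs, which matters when comparing scores across models.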

Confidence score: 4/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-structured with proper error handling, licensing compliance, and follows existing patterns. One minor duplicate configuration issue was found but doesn't affect functionality
  • evals/evals.config.json has a duplicate task entry that should be corrected

65 files reviewed, no comments


Comment on lines +363 to +365
console.log(
  `🏃 Running eval: ${input.name} with model: ${input.modelName}`,
);
Collaborator


Use the logger here instead of console.log.
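
One way to act on this comment is to route the message through an injected logger rather than calling console.log directly. The sketch below is an assumption for illustration: `LogLine`, `Logger`, `EvalInput`, and `logEvalStart` are hypothetical names, not Stagehand's actual logger API.

```typescript
// Hypothetical structured logger; Stagehand's real logger API may differ.
type LogLevel = "info" | "warn" | "error";

interface LogLine {
  level: LogLevel;
  message: string;
  context?: Record<string, string>;
}

type Logger = (line: LogLine) => void;

interface EvalInput {
  name: string;
  modelName: string;
}

// Emits the same information as the console.log call, as a structured line.
function logEvalStart(logger: Logger, input: EvalInput): void {
  logger({
    level: "info",
    message: `Running eval: ${input.name} with model: ${input.modelName}`,
    context: { eval: input.name, model: input.modelName },
  });
}
```

Because the logger is passed in, the same call can write to the console locally or to a collector in CI without changing the eval code.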

@filip-michalsky filip-michalsky merged commit b7be89e into main Sep 18, 2025
14 of 15 checks passed
miguelg719 pushed a commit that referenced this pull request Oct 7, 2025
This PR was opened by the [Changesets
release](https://github.com/changesets/action) GitHub action. When
you're ready to do a release, you can merge this and the packages will
be published to npm automatically. If you're not ready to do a release
yet, that's fine, whenever you add more changesets to main, this PR will
be updated.


# Releases
## @browserbasehq/stagehand@2.5.1

### Patch Changes

- [#1082](#1082)
[`8c0fd01`](8c0fd01)
Thanks [@tkattkat](https://github.com/tkattkat)! - Pass stagehand object
to agent instead of stagehand page

- [#1104](#1104)
[`a1ad06c`](a1ad06c)
Thanks [@miguelg719](https://github.com/miguelg719)! - Fix logging for
stagehand agent

- [#1066](#1066)
[`9daa584`](9daa584)
Thanks [@tkattkat](https://github.com/tkattkat)! - Add playwright
arguments to agent execute response

- [#1077](#1077)
[`7f38b3a`](7f38b3a)
Thanks [@tkattkat](https://github.com/tkattkat)! - adds support for
stagehand agent in the api

- [#1032](#1032)
[`bf2d0e7`](bf2d0e7)
Thanks [@miguelg719](https://github.com/miguelg719)! - Fix for zod peer
dependency support

- [#1014](#1014)
[`6966201`](6966201)
Thanks [@tkattkat](https://github.com/tkattkat)! - Replace operator
handler with base of new agent

- [#1089](#1089)
[`536f366`](536f366)
Thanks [@miguelg719](https://github.com/miguelg719)! - Fixed info logs
on api session create

- [#1103](#1103)
[`889cb6c`](889cb6c)
Thanks [@tkattkat](https://github.com/tkattkat)! - patch custom tool
support in anthropic cua client

- [#1056](#1056)
[`6a002b2`](6a002b2)
Thanks [@chrisreadsf](https://github.com/chrisreadsf)! - remove need for
duplicate project id if already passed to Stagehand

- [#1090](#1090)
[`8ff5c5a`](8ff5c5a)
Thanks [@miguelg719](https://github.com/miguelg719)! - Improve failed
act error logs

- [#1014](#1014)
[`6966201`](6966201)
Thanks [@tkattkat](https://github.com/tkattkat)! - replace operator
agent with scaffold for new stagehand agent

- [#1107](#1107)
[`3ccf335`](3ccf335)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: url
extraction not working inside an array

- [#1102](#1102)
[`a99aa48`](a99aa48)
Thanks [@miguelg719](https://github.com/miguelg719)! - Add current page
and date context to agent

- [#1110](#1110)
[`dda52f1`](dda52f1)
Thanks [@miguelg719](https://github.com/miguelg719)! - Add support for
new Gemini Computer Use models

## @browserbasehq/stagehand-evals@1.1.0

### Minor Changes

- [#1057](#1057)
[`b7be89e`](b7be89e)
Thanks [@filip-michalsky](https://github.com/filip-michalsky)! - added
web voyager ground truth (optional), added web bench, and subset of
OSWorld evals which run on a browser

### Patch Changes

- [#1072](#1072)
[`dc2d420`](dc2d420)
Thanks [@filip-michalsky](https://github.com/filip-michalsky)! - improve
evals screenshot service - add img hashing diff to add screenshots and
change to screenshot intercepts from the agent

- Updated dependencies
\[[`8c0fd01`](8c0fd01),
[`a1ad06c`](a1ad06c),
[`9daa584`](9daa584),
[`7f38b3a`](7f38b3a),
[`bf2d0e7`](bf2d0e7),
[`6966201`](6966201),
[`536f366`](536f366),
[`889cb6c`](889cb6c),
[`6a002b2`](6a002b2),
[`8ff5c5a`](8ff5c5a),
[`6966201`](6966201),
[`3ccf335`](3ccf335),
[`a99aa48`](a99aa48),
[`dda52f1`](dda52f1)]:
    -   @browserbasehq/stagehand@2.5.1

## @browserbasehq/stagehand-examples@1.0.10

### Patch Changes

- Updated dependencies
\[[`8c0fd01`](8c0fd01),
[`a1ad06c`](a1ad06c),
[`9daa584`](9daa584),
[`7f38b3a`](7f38b3a),
[`bf2d0e7`](bf2d0e7),
[`6966201`](6966201),
[`536f366`](536f366),
[`889cb6c`](889cb6c),
[`6a002b2`](6a002b2),
[`8ff5c5a`](8ff5c5a),
[`6966201`](6966201),
[`3ccf335`](3ccf335),
[`a99aa48`](a99aa48),
[`dda52f1`](dda52f1)]:
    -   @browserbasehq/stagehand@2.5.1

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>