diff --git a/docs/configuration/evals.mdx b/docs/configuration/evals.mdx
index b717ce9f2..af88eb336 100644
--- a/docs/configuration/evals.mdx
+++ b/docs/configuration/evals.mdx
@@ -25,33 +25,114 @@ Evaluations help you understand how well your automation performs, which models
 Evaluations help you systematically test and improve your automation workflows. Stagehand provides both built-in evaluations and tools to create your own.
-
-To run evals, you'll need to clone the [Stagehand repo](https://github.com/browserbase/stagehand) and run `npm install` to install the dependencies.
-
-
-We have three types of evals:
-1. **Deterministic Evals** - These are evals that are deterministic and can be run without any LLM inference.
+We have two types of evals:
+1. **Deterministic Evals** - These include unit tests, integration tests, and E2E tests that can be run without any LLM inference.
 2. **LLM-based Evals** - These are evals that test the underlying functionality of Stagehand's AI primitives.
-### LLM-based Evals
+### Evals CLI
+![Evals CLI](/media/evals-cli.png)
-To run LLM evals, you'll need a [Braintrust account](https://www.braintrust.dev/docs/).
+To run evals, you'll need to clone the [Stagehand repo](https://github.com/browserbase/stagehand) and set up the CLI.
+
+We recommend using [Braintrust](https://www.braintrust.dev/docs/) to help visualize eval results and metrics.
-To run LLM-based evals, you can run `npm run evals` from within the Stagehand repo. This will test the functionality of the LLM primitives within Stagehand to make sure they're working as expected.
+The Stagehand CLI provides a powerful interface for running evaluations. You can run specific evals, categories, or external benchmarks with customizable settings.
-Evals are grouped into three categories:
+Evals are grouped into the following categories:
 1. **Act Evals** - These are evals that test the functionality of the `act` method.
 2. **Extract Evals** - These are evals that test the functionality of the `extract` method.
 3. **Observe Evals** - These are evals that test the functionality of the `observe` method.
 4. **Combination Evals** - These are evals that test the functionality of the `act`, `extract`, and `observe` methods together.
+5. **Experimental Evals** - These are experimental custom evals that test the functionality of the Stagehand primitives.
+6. **Agent Evals** - These are evals that test the functionality of `agent`.
+7. **(NEW) External Benchmarks** - Run external benchmarks like WebBench, GAIA, WebVoyager, OnlineMind2Web, and OSWorld.
+
+#### Installation
+
+
+
+```bash
+# From the stagehand root directory
+pnpm install
+```
+
-#### Configuring and Running Evals
-You can view the specific evals in [`evals/tasks`](https://github.com/browserbase/stagehand/tree/main/evals/tasks). Each eval is grouped into eval categories based on [`evals/evals.config.json`](https://github.com/browserbase/stagehand/blob/main/evals/evals.config.json). You can specify models to run and other general task config in [`evals/taskConfig.ts`](https://github.com/browserbase/stagehand/blob/main/evals/taskConfig.ts).
+
+```bash
+pnpm run build:cli
+```
+
-To run a specific eval, you can run `npm run evals <eval_name>`, or run all evals in a category with `npm run evals category <category>`.
+
+```bash
+evals help
+```
+
+
+
+#### CLI Commands and Options
+
+##### Basic Commands
+
+```bash
+# Run all evals
+evals run all
+
+# Run specific category
+evals run act
+evals run extract
+evals run observe
+evals run agent
+
+# Run specific eval
+evals run extract/extract_text
+
+# List available evals
+evals list
+evals list --detailed
+
+# Configure defaults
+evals config
+evals config set env browserbase
+evals config set trials 5
+```
+
+##### Command Options
+
+- **`-e, --env`**: Environment (`local` or `browserbase`)
+- **`-t, --trials`**: Number of trials per eval (default: 3)
+- **`-c, --concurrency`**: Max parallel sessions (default: 10)
+- **`-m, --model`**: Model override
+- **`-p, --provider`**: Provider override
+- **`--api`**: Use Stagehand API instead of SDK
+
+##### Running External Benchmarks
+
+The CLI supports several industry-standard benchmarks:
+
+```bash
+# WebBench with filters
+evals run benchmark:webbench -l 10 -f difficulty=easy -f category=READ
+
+# GAIA benchmark
+evals run b:gaia -s 100 -l 25 -f level=1
+
+# WebVoyager
+evals run b:webvoyager -l 50
+
+# OnlineMind2Web
+evals run b:onlineMind2Web
+
+# OSWorld
+evals run b:osworld -f source=Mind2Web
+```
+
+#### Configuration Files
+
+You can view the specific evals in [`evals/tasks`](https://github.com/browserbase/stagehand/tree/main/evals/tasks). Each eval is grouped into eval categories based on [`evals/evals.config.json`](https://github.com/browserbase/stagehand/blob/main/evals/evals.config.json).
 #### Viewing eval results
@@ -65,7 +146,7 @@ You can use the Braintrust UI to filter by model/eval and aggregate results acro
 ### Deterministic Evals
-To run deterministic evals, you can just run `npm run e2e` from within the Stagehand repo. This will test the functionality of Playwright within Stagehand to make sure it's working as expected.
+To run deterministic evals, you can run `npm run e2e` from within the Stagehand repo. This will test the functionality of Playwright within Stagehand to make sure it's working as expected.
 These tests are in [`evals/deterministic`](https://github.com/browserbase/stagehand/tree/main/evals/deterministic) and test on both Browserbase browsers and local headless Chromium browsers.
@@ -139,10 +220,13 @@ Update `evals/evals.config.json`:
 ```bash
 # Test your custom evaluation
-npm run evals custom_task_name
+evals run custom_task_name
 # Run the entire custom category
-npm run evals category custom
+evals run custom
+
+# Run with specific settings
+evals run custom_task_name -e browserbase -t 5 -m gpt-4o
 ```
diff --git a/docs/media/evals-cli.png b/docs/media/evals-cli.png
new file mode 100644
index 000000000..221596e2a
Binary files /dev/null and b/docs/media/evals-cli.png differ
diff --git a/evals/README.md b/evals/README.md
index 6cedd0e99..ded087285 100644
--- a/evals/README.md
+++ b/evals/README.md
@@ -15,7 +15,7 @@ pnpm run build:cli
 The evals CLI provides a clean, intuitive interface for running evaluations:
 ```bash
-pnpm evals [options]
+evals [options]
 ```
 ## Commands
@@ -26,18 +26,18 @@ Run custom evals or external benchmarks.
 ```bash
 # Run all custom evals
-pnpm evals run all
+evals run all
 # Run specific category
-pnpm evals run act
-pnpm evals run extract
-pnpm evals run observe
+evals run act
+evals run extract
+evals run observe
 # Run specific eval by name
-pnpm evals run extract/extract_text
+evals run extract/extract_text
 # Run external benchmarks
-pnpm evals run benchmark:gaia
+evals run benchmark:gaia
 ```
 ### `list` - View available evals
@@ -46,10 +46,10 @@ List all available evaluations and benchmarks.
 ```bash
 # List all categories and benchmarks
-pnpm evals list
+evals list
 # Show detailed task list
-pnpm evals list --detailed
+evals list --detailed
 ```
 ### `config` - Manage defaults
@@ -58,22 +58,22 @@ Configure default settings for all eval runs.
 ```bash
 # View current configuration
-pnpm evals config
+evals config
 # Set default values
-pnpm evals config set env browserbase
-pnpm evals config set trials 5
-pnpm evals config set concurrency 10
+evals config set env browserbase
+evals config set trials 5
+evals config set concurrency 10
 # Reset to defaults
-pnpm evals config reset
-pnpm evals config reset trials # Reset specific key
+evals config reset
+evals config reset trials # Reset specific key
 ```
 ### `help` - Show help
 ```bash
-pnpm evals help
+evals help
 ```
 ## Options
@@ -99,26 +99,26 @@ pnpm evals help
 ```bash
 # Run with custom settings
-pnpm evals run act -e browserbase -t 5 -c 10
+evals run act -e browserbase -t 5 -c 10
 # Run with specific model
-pnpm evals run observe -m gpt-4o -p openai
+evals run observe -m gpt-4o -p openai
 # Run using API
-pnpm evals run extract --api
+evals run extract --api
 ```
 ### Running Benchmarks
 ```bash
 # WebBench with filters
-pnpm evals run b:webbench -l 10 -f difficulty=easy -f category=READ
+evals run b:webbench -l 10 -f difficulty=easy -f category=READ
 # GAIA with sampling
-pnpm evals run b:gaia -s 100 -l 25 -f level=1
+evals run b:gaia -s 100 -l 25 -f level=1
 # WebVoyager with limit
-pnpm evals run b:webvoyager -l 50
+evals run b:webvoyager -l 50
 ```
 ## Available Benchmarks
@@ -176,7 +176,7 @@ While the CLI reduces the need for environment variables, some are still support
 1. Create your eval file in `evals/tasks/<category>/`
 2. Add it to `evals.config.json` under the `tasks` array
-3. Run with: `pnpm evals run <category>/<eval_name>`
+3. Run with: `evals run <category>/<eval_name>`
 ## Troubleshooting
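
The documentation in this diff describes an install, configure, run flow but shows each command in isolation. Below is a minimal end-to-end sketch that chains only commands already documented above; the category, trial, model, and filter values are the illustrative ones from those examples, not required settings.

```bash
# A possible end-to-end workflow, using only commands documented above.
# Category, trial, model, and filter values are illustrative.

# 1. Install dependencies and build the CLI from the stagehand root directory
pnpm install
pnpm run build:cli
evals help

# 2. Set defaults once, then see what is available
evals config set env browserbase
evals config set trials 5
evals list --detailed

# 3. Run a category with per-run overrides
evals run act -e browserbase -t 5 -c 10
evals run observe -m gpt-4o -p openai

# 4. Run an external benchmark with filters
evals run benchmark:webbench -l 10 -f difficulty=easy -f category=READ
```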
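
The custom-eval steps in the diff (create a task file under `evals/tasks/<category>/`, register it in `evals.config.json`, then run it) can be strung together the same way. In this sketch, `custom` and `my_new_eval` are hypothetical placeholder names, and the TypeScript file extension is an assumption rather than something stated in the diff.

```bash
# Hypothetical example of adding and running a new eval; "custom" and
# "my_new_eval" are placeholders, and the .ts extension is an assumption.

# 1. Create the task file in its category directory
mkdir -p evals/tasks/custom
touch evals/tasks/custom/my_new_eval.ts

# 2. Register the new task in evals.config.json under the "tasks" array
#    (edit the file by hand; the exact entry format is defined by the repo)

# 3. Run the single eval, the whole category, or the eval with specific settings
evals run custom/my_new_eval
evals run custom
evals run custom/my_new_eval -e browserbase -t 5 -m gpt-4o
```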