116 changes: 100 additions & 16 deletions docs/configuration/evals.mdx
Evaluations help you understand how well your automation performs, which models...

Evaluations help you systematically test and improve your automation workflows. Stagehand provides both built-in evaluations and tools to create your own.

<Tip>
To run evals, you'll need to clone the [Stagehand repo](https://github.com/browserbase/stagehand) and run `npm install` to install the dependencies.
</Tip>

We have three types of evals:
1. **Deterministic Evals** - These are evals that are deterministic and can be run without any LLM inference.
We have two types of evals:
1. **Deterministic Evals** - These include unit tests, integration tests, and E2E tests that can be run without any LLM inference.
2. **LLM-based Evals** - These test the functionality of Stagehand's AI primitives and require LLM inference.


### LLM-based Evals
### Evals CLI
![Evals CLI](/media/evals-cli.png)

<Tip>
To run LLM evals, you'll need a [Braintrust account](https://www.braintrust.dev/docs/).
To run evals, you'll need to clone the [Stagehand repo](https://github.com/browserbase/stagehand) and set up the CLI.

We recommend using [Braintrust](https://www.braintrust.dev/docs/) to help visualize eval results and metrics.
</Tip>

To run LLM-based evals, you can run `npm run evals` from within the Stagehand repo. This will test the functionality of the LLM primitives within Stagehand to make sure they're working as expected.
The Stagehand CLI provides a powerful interface for running evaluations. You can run specific evals, categories, or external benchmarks with customizable settings.

Evals are grouped into three categories:
Evals are grouped into the following categories:
1. **Act Evals** - Test the functionality of the `act` method.
2. **Extract Evals** - Test the functionality of the `extract` method.
3. **Observe Evals** - Test the functionality of the `observe` method.
4. **Combination Evals** - Test `act`, `extract`, and `observe` working together.
5. **Experimental Evals** - Experimental custom evals that exercise Stagehand's primitives.
6. **Agent Evals** - Test the functionality of `agent`.
7. **(NEW) External Benchmarks** - Run external benchmarks like WebBench, GAIA, WebVoyager, OnlineMind2Web, and OSWorld.

#### Installation

<Steps>
<Step title="Install Dependencies">
```bash
# From the stagehand root directory
pnpm install
```
</Step>

#### Configuring and Running Evals
You can view the specific evals in [`evals/tasks`](https://github.com/browserbase/stagehand/tree/main/evals/tasks). Each eval is grouped into eval categories based on [`evals/evals.config.json`](https://github.com/browserbase/stagehand/blob/main/evals/evals.config.json). You can specify models to run and other general task config in [`evals/taskConfig.ts`](https://github.com/browserbase/stagehand/blob/main/evals/taskConfig.ts).
<Step title="Build the CLI">
```bash
pnpm run build:cli
```
</Step>

To run a specific eval, you can run `npm run evals <eval>`, or run all evals in a category with `npm run evals category <category>`.
<Step title="Verify Installation">
```bash
evals help
```
</Step>
</Steps>
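
If the `evals` command isn't on your `PATH` after the build, it can usually still be invoked through the package manager from the repo root (the older `pnpm evals ...` form shown in `evals/README.md`). A quick sanity check, assuming one of these two entry points is wired up:

```bash
# Standalone CLI, as set up by the steps above
evals help

# Fallback: the same CLI invoked through pnpm from the stagehand root
pnpm evals help
```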

#### CLI Commands and Options

##### Basic Commands

```bash
# Run all evals
evals run all

# Run specific category
evals run act
evals run extract
evals run observe
evals run agent

# Run specific eval
evals run extract/extract_text

# List available evals
evals list
evals list --detailed

# Configure defaults
evals config
evals config set env browserbase
evals config set trials 5
```

##### Command Options

- **`-e, --env`**: Environment (`local` or `browserbase`)
- **`-t, --trials`**: Number of trials per eval (default: 3)
- **`-c, --concurrency`**: Max parallel sessions (default: 10)
- **`-m, --model`**: Model override
- **`-p, --provider`**: Provider override
- **`--api`**: Use Stagehand API instead of SDK
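
These options can be combined on a single `run` invocation. A few sketches, mirroring the examples in `evals/README.md`:

```bash
# Run the act category on Browserbase with 5 trials and 10 parallel sessions
evals run act -e browserbase -t 5 -c 10

# Run the observe category with a specific model and provider
evals run observe -m gpt-4o -p openai

# Run the extract category through the Stagehand API instead of the SDK
evals run extract --api
```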

##### Running External Benchmarks

The CLI supports several industry-standard benchmarks:

```bash
# WebBench with filters
evals run benchmark:webbench -l 10 -f difficulty=easy -f category=READ

# GAIA benchmark
evals run b:gaia -s 100 -l 25 -f level=1

# WebVoyager
evals run b:webvoyager -l 50

# OnlineMind2Web
evals run b:onlineMind2Web

# OSWorld
evals run b:osworld -f source=Mind2Web
```
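
The general `run` options can likely be combined with benchmark targets as well; the sketch below assumes benchmark runs honor the same `-e` and `-c` flags as regular evals:

```bash
# Assumption: benchmark runs accept the general options listed above
evals run b:webvoyager -l 50 -e browserbase -c 5
```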

#### Configuration Files

You can view the specific evals in [`evals/tasks`](https://github.com/browserbase/stagehand/tree/main/evals/tasks). Each eval is grouped into eval categories based on [`evals/evals.config.json`](https://github.com/browserbase/stagehand/blob/main/evals/evals.config.json).
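
To see which tasks belong to a category without opening the file, something like the following works; it assumes each entry in the config's `tasks` array carries a `name` and a list of categories (the exact field names may differ in the actual file):

```bash
# Hypothetical sketch: list the task names assigned to the "extract" category
jq -r '.tasks[] | select(.categories[]? == "extract") | .name' evals/evals.config.json
```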


#### Viewing eval results
You can use the Braintrust UI to filter by model/eval and aggregate results across...
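
For results to show up in Braintrust, the eval runner needs Braintrust credentials in the environment. A minimal sketch, assuming the harness reads the standard `BRAINTRUST_API_KEY` variable:

```bash
# Export your Braintrust API key before running evals
export BRAINTRUST_API_KEY="<your-braintrust-api-key>"
evals run act
```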

### Deterministic Evals

To run deterministic evals, you can just run `npm run e2e` from within the Stagehand repo. This will test the functionality of Playwright within Stagehand to make sure it's working as expected.
To run deterministic evals, you can run `npm run e2e` from within the Stagehand repo. This will test the functionality of Playwright within Stagehand to make sure it's working as expected.

These tests are in [`evals/deterministic`](https://github.com/browserbase/stagehand/tree/main/evals/deterministic) and test on both Browserbase browsers and local headless Chromium browsers.
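
A minimal way to run the suite, assuming the Browserbase-backed tests read the standard `BROWSERBASE_API_KEY` and `BROWSERBASE_PROJECT_ID` variables from your environment:

```bash
# Assumed: Browserbase credentials for the remote-browser tests
export BROWSERBASE_API_KEY="<your-api-key>"
export BROWSERBASE_PROJECT_ID="<your-project-id>"

# Run the deterministic Playwright suite from the stagehand root
npm run e2e
```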

Update `evals/evals.config.json`:
<Step title="Run Your Evaluation">
```bash
# Test your custom evaluation
npm run evals custom_task_name
evals run custom_task_name

# Run the entire custom category
npm run evals category custom
evals run custom

# Run with specific settings
evals run custom_task_name -e browserbase -t 5 -m gpt-4o
```
</Step>
</Steps>
Binary file added docs/media/evals-cli.png
46 changes: 23 additions & 23 deletions evals/README.md
The evals CLI provides a clean, intuitive interface for running evaluations:

```bash
pnpm evals <command> <target> [options]
evals <command> <target> [options]
```

## Commands
Run custom evals or external benchmarks.

```bash
# Run all custom evals
pnpm evals run all
evals run all

# Run specific category
pnpm evals run act
pnpm evals run extract
pnpm evals run observe
evals run act
evals run extract
evals run observe

# Run specific eval by name
pnpm evals run extract/extract_text
evals run extract/extract_text

# Run external benchmarks
pnpm evals run benchmark:gaia
evals run benchmark:gaia
```

### `list` - View available evals
List all available evaluations and benchmarks.

```bash
# List all categories and benchmarks
pnpm evals list
evals list

# Show detailed task list
pnpm evals list --detailed
evals list --detailed
```

### `config` - Manage defaults
Configure default settings for all eval runs.

```bash
# View current configuration
pnpm evals config
evals config

# Set default values
pnpm evals config set env browserbase
pnpm evals config set trials 5
pnpm evals config set concurrency 10
evals config set env browserbase
evals config set trials 5
evals config set concurrency 10

# Reset to defaults
pnpm evals config reset
pnpm evals config reset trials # Reset specific key
evals config reset
evals config reset trials # Reset specific key
```

### `help` - Show help

```bash
pnpm evals help
evals help
```

## Options

```bash
# Run with custom settings
pnpm evals run act -e browserbase -t 5 -c 10
evals run act -e browserbase -t 5 -c 10

# Run with specific model
pnpm evals run observe -m gpt-4o -p openai
evals run observe -m gpt-4o -p openai

# Run using API
pnpm evals run extract --api
evals run extract --api
```

### Running Benchmarks

```bash
# WebBench with filters
pnpm evals run b:webbench -l 10 -f difficulty=easy -f category=READ
evals run b:webbench -l 10 -f difficulty=easy -f category=READ

# GAIA with sampling
pnpm evals run b:gaia -s 100 -l 25 -f level=1
evals run b:gaia -s 100 -l 25 -f level=1

# WebVoyager with limit
pnpm evals run b:webvoyager -l 50
evals run b:webvoyager -l 50
```

## Available Benchmarks
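
The full list of supported benchmarks can be printed straight from the CLI:

```bash
# List all categories and benchmarks
evals list
```
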
While the CLI reduces the need for environment variables, some are still supported.

1. Create your eval file in `evals/tasks/<category>/`
2. Add it to `evals.config.json` under the `tasks` array
3. Run with: `pnpm evals run <category>/<eval_name>`
3. Run with: `evals run <category>/<eval_name>`
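
A rough end-to-end sketch of those three steps; the file name and category here are hypothetical, and it assumes task files are TypeScript modules like the existing ones in `evals/tasks`:

```bash
# 1. Create the eval task file under its category directory
touch evals/tasks/custom/my_new_eval.ts   # implement the eval here

# 2. Register it in evals/evals.config.json under the "tasks" array

# 3. Run it through the CLI
evals run custom/my_new_eval
```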

## Troubleshooting
