diff --git a/docs/configuration/evals.mdx b/docs/configuration/evals.mdx
index b717ce9f2..af88eb336 100644
--- a/docs/configuration/evals.mdx
+++ b/docs/configuration/evals.mdx
@@ -25,33 +25,114 @@ Evaluations help you understand how well your automation performs, which models
Evaluations help you systematically test and improve your automation workflows. Stagehand provides both built-in evaluations and tools to create your own.
-
-To run evals, you'll need to clone the [Stagehand repo](https://github.com/browserbase/stagehand) and run `npm install` to install the dependencies.
-
-
-We have three types of evals:
-1. **Deterministic Evals** - These are evals that are deterministic and can be run without any LLM inference.
+We have two types of evals:
+1. **Deterministic Evals** - These include unit tests, integration tests, and E2E tests that can be run without any LLM inference.
2. **LLM-based Evals** - These are evals that test the underlying functionality of Stagehand's AI primitives.
-### LLM-based Evals
+### Evals CLI
+
-To run LLM evals, you'll need a [Braintrust account](https://www.braintrust.dev/docs/).
+To run evals, you'll need to clone the [Stagehand repo](https://github.com/browserbase/stagehand) and set up the CLI.
+
+We recommend using [Braintrust](https://www.braintrust.dev/docs/) to help visualize eval results and metrics.
-To run LLM-based evals, you can run `npm run evals` from within the Stagehand repo. This will test the functionality of the LLM primitives within Stagehand to make sure they're working as expected.
+The Stagehand CLI provides a powerful interface for running evaluations. You can run specific evals, categories, or external benchmarks with customizable settings.
-Evals are grouped into three categories:
+Evals are grouped into the following categories:
1. **Act Evals** - These are evals that test the functionality of the `act` method.
2. **Extract Evals** - These are evals that test the functionality of the `extract` method.
3. **Observe Evals** - These are evals that test the functionality of the `observe` method.
4. **Combination Evals** - These are evals that test the functionality of the `act`, `extract`, and `observe` methods together.
+5. **Experimental Evals** - These are experimental custom evals that test the functionality of the Stagehand primitives.
+6. **Agent Evals** - These are evals that test the functionality of the `agent` method.
+7. **(NEW) External Benchmarks** - Run external benchmarks like WebBench, GAIA, WebVoyager, OnlineMind2Web, and OSWorld.
+
+#### Installation
+
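+First, clone the repository if you haven't already:
+
+```bash
+git clone https://github.com/browserbase/stagehand.git
+cd stagehand
+```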
+
+Then install the dependencies:
+```bash
+# From the stagehand root directory
+pnpm install
+```
+
-#### Configuring and Running Evals
-You can view the specific evals in [`evals/tasks`](https://github.com/browserbase/stagehand/tree/main/evals/tasks). Each eval is grouped into eval categories based on [`evals/evals.config.json`](https://github.com/browserbase/stagehand/blob/main/evals/evals.config.json). You can specify models to run and other general task config in [`evals/taskConfig.ts`](https://github.com/browserbase/stagehand/blob/main/evals/taskConfig.ts).
+
+Next, build the evals CLI:
+
+```bash
+pnpm run build:cli
+```
+
-To run a specific eval, you can run `npm run evals `, or run all evals in a category with `npm run evals category `.
+
+Once built, verify that the `evals` command is available:
+
+```bash
+evals help
+```
+
+
+
+#### CLI Commands and Options
+
+##### Basic Commands
+
+```bash
+# Run all evals
+evals run all
+
+# Run specific category
+evals run act
+evals run extract
+evals run observe
+evals run agent
+
+# Run specific eval
+evals run extract/extract_text
+
+# List available evals
+evals list
+evals list --detailed
+
+# Configure defaults
+evals config
+evals config set env browserbase
+evals config set trials 5
+```
+
+##### Command Options
+
+- **`-e, --env`**: Environment (`local` or `browserbase`)
+- **`-t, --trials`**: Number of trials per eval (default: 3)
+- **`-c, --concurrency`**: Max parallel sessions (default: 10)
+- **`-m, --model`**: Model override
+- **`-p, --provider`**: Provider override
+- **`--api`**: Use Stagehand API instead of SDK
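+
+For example, these options can be combined on a single run (this assumes your Browserbase and model provider credentials are already configured):
+
+```bash
+# Run the extract evals on Browserbase with 5 trials and an explicit model/provider
+evals run extract -e browserbase -t 5 -m gpt-4o -p openai
+```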
+
+##### Running External Benchmarks
+
+The CLI supports several industry-standard benchmarks:
+
+```bash
+# WebBench with filters
+evals run benchmark:webbench -l 10 -f difficulty=easy -f category=READ
+
+# GAIA benchmark
+evals run b:gaia -s 100 -l 25 -f level=1
+
+# WebVoyager
+evals run b:webvoyager -l 50
+
+# OnlineMind2Web
+evals run b:onlineMind2Web
+
+# OSWorld
+evals run b:osworld -f source=Mind2Web
+```
+
+#### Configuration Files
+
+You can view the specific evals in [`evals/tasks`](https://github.com/browserbase/stagehand/tree/main/evals/tasks). Each eval is assigned to a category in [`evals/evals.config.json`](https://github.com/browserbase/stagehand/blob/main/evals/evals.config.json).
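+
+To see which tasks fall into each category without opening the config file, you can list them from the CLI:
+
+```bash
+# Show every category along with its individual tasks
+evals list --detailed
+```
+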
#### Viewing eval results
@@ -65,7 +146,7 @@ You can use the Braintrust UI to filter by model/eval and aggregate results acro
### Deterministic Evals
-To run deterministic evals, you can just run `npm run e2e` from within the Stagehand repo. This will test the functionality of Playwright within Stagehand to make sure it's working as expected.
+To run deterministic evals, you can run `npm run e2e` from within the Stagehand repo. This will test the functionality of Playwright within Stagehand to make sure it's working as expected.
These tests are in [`evals/deterministic`](https://github.com/browserbase/stagehand/tree/main/evals/deterministic) and test on both Browserbase browsers and local headless Chromium browsers.
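+
+For example, from the Stagehand repo root:
+
+```bash
+# Runs the deterministic Playwright test suite
+npm run e2e
+```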
@@ -139,10 +220,13 @@ Update `evals/evals.config.json`:
```bash
# Test your custom evaluation
-npm run evals custom_task_name
+evals run custom_task_name
# Run the entire custom category
-npm run evals category custom
+evals run custom
+
+# Run with specific settings
+evals run custom_task_name -e browserbase -t 5 -m gpt-4o
```
diff --git a/docs/media/evals-cli.png b/docs/media/evals-cli.png
new file mode 100644
index 000000000..221596e2a
Binary files /dev/null and b/docs/media/evals-cli.png differ
diff --git a/evals/README.md b/evals/README.md
index 6cedd0e99..ded087285 100644
--- a/evals/README.md
+++ b/evals/README.md
@@ -15,7 +15,7 @@ pnpm run build:cli
The evals CLI provides a clean, intuitive interface for running evaluations:
```bash
-pnpm evals [options]
+evals [options]
```
## Commands
@@ -26,18 +26,18 @@ Run custom evals or external benchmarks.
```bash
# Run all custom evals
-pnpm evals run all
+evals run all
# Run specific category
-pnpm evals run act
-pnpm evals run extract
-pnpm evals run observe
+evals run act
+evals run extract
+evals run observe
# Run specific eval by name
-pnpm evals run extract/extract_text
+evals run extract/extract_text
# Run external benchmarks
-pnpm evals run benchmark:gaia
+evals run benchmark:gaia
```
### `list` - View available evals
@@ -46,10 +46,10 @@ List all available evaluations and benchmarks.
```bash
# List all categories and benchmarks
-pnpm evals list
+evals list
# Show detailed task list
-pnpm evals list --detailed
+evals list --detailed
```
### `config` - Manage defaults
@@ -58,22 +58,22 @@ Configure default settings for all eval runs.
```bash
# View current configuration
-pnpm evals config
+evals config
# Set default values
-pnpm evals config set env browserbase
-pnpm evals config set trials 5
-pnpm evals config set concurrency 10
+evals config set env browserbase
+evals config set trials 5
+evals config set concurrency 10
# Reset to defaults
-pnpm evals config reset
-pnpm evals config reset trials # Reset specific key
+evals config reset
+evals config reset trials # Reset specific key
```
### `help` - Show help
```bash
-pnpm evals help
+evals help
```
## Options
@@ -99,26 +99,26 @@ pnpm evals help
```bash
# Run with custom settings
-pnpm evals run act -e browserbase -t 5 -c 10
+evals run act -e browserbase -t 5 -c 10
# Run with specific model
-pnpm evals run observe -m gpt-4o -p openai
+evals run observe -m gpt-4o -p openai
# Run using API
-pnpm evals run extract --api
+evals run extract --api
```
### Running Benchmarks
```bash
# WebBench with filters
-pnpm evals run b:webbench -l 10 -f difficulty=easy -f category=READ
+evals run b:webbench -l 10 -f difficulty=easy -f category=READ
# GAIA with sampling
-pnpm evals run b:gaia -s 100 -l 25 -f level=1
+evals run b:gaia -s 100 -l 25 -f level=1
# WebVoyager with limit
-pnpm evals run b:webvoyager -l 50
+evals run b:webvoyager -l 50
```
## Available Benchmarks
@@ -176,7 +176,7 @@ While the CLI reduces the need for environment variables, some are still support
1. Create your eval file in `evals/tasks//`
2. Add it to `evals.config.json` under the `tasks` array
-3. Run with: `pnpm evals run /`
+3. Run with: `evals run /`
## Troubleshooting