[Integration]: add mongodb integration #2

Open
wants to merge 1 commit into
base: master
11 changes: 11 additions & 0 deletions README.md
@@ -66,6 +66,16 @@ Enhance your Vercel applications with web-browsing capabilities. Build Generativ
#### [**Braintrust Integration**](./examples/integrations/braintrust/)
Integrate Browserbase with Braintrust for evaluation and testing of AI agent performance in web environments. Monitor, measure, and improve your browser automation workflows.

#### [**MongoDB Integration**](./examples/integrations/mongodb/README.md)
**Intelligent Web Scraping & Data Storage** - Extract structured data from e-commerce websites using Stagehand and store it in MongoDB for analysis. Perfect for building data pipelines, market research, and competitive analysis workflows.

**Capabilities:**
- AI-powered web scraping with Stagehand
- Structured data extraction with schema validation
- MongoDB storage for persistence and querying
- Built-in data analysis and reporting
- Robust error handling for production use

## 🏗️ Monorepo Structure

```
@@ -80,6 +90,7 @@ integrations/
│ ├── langchain/ # LangChain framework integration
│ ├── browser-use/ # Simplified browser automation
│ ├── braintrust/ # Evaluation and testing tools
│ ├── mongodb/ # MongoDB data extraction & storage
│ └── agentkit/ # AgentKit implementations
└── README.md # This file
```
140 changes: 140 additions & 0 deletions examples/integrations/mongodb/.cursorrules
@@ -0,0 +1,140 @@
# Stagehand Project

This is a project that uses Stagehand, which extends Playwright by adding `act`, `extract`, and `observe` to the Page class.

`Stagehand` is a class that provides config, a `StagehandPage` object via `stagehand.page`, and a `StagehandContext` object via `stagehand.context`.

`Page` is a class that extends the Playwright `Page` class and adds `act`, `extract`, and `observe` methods.
`Context` is a class that extends the Playwright `BrowserContext` class.

Use the following rules to write code for this project.

- To take an action on the page like "click the sign in button", use Stagehand `act` like this:

```typescript
await page.act("Click the sign in button");
```

- To plan an instruction before taking an action, use Stagehand `observe` to get the action to execute.

```typescript
const [action] = await page.observe("Click the sign in button");
```

- The result of `observe` is an array of `ObserveResult` objects that can directly be used as params for `act` like this:

```typescript
const [action] = await page.observe("Click the sign in button");
await page.act(action);
```

- When writing code that needs to extract data from the page, use Stagehand `extract`. Explicitly pass the following params by default:

```typescript
import { z } from "zod";

const { someValue } = await page.extract({
  instruction: "the instruction to execute", // Natural-language description of what to extract
  schema: z.object({
    someValue: z.string(),
  }), // The schema to extract
});
```

## Initialize

```typescript
import { Stagehand } from "@browserbasehq/stagehand";
import StagehandConfig from "./stagehand.config";

const stagehand = new Stagehand(StagehandConfig);
await stagehand.init();

const page = stagehand.page; // Playwright Page with act, extract, and observe methods
const context = stagehand.context; // Playwright BrowserContext
```

## Act

You can cache the results of `observe` and use them as params for `act` like this:

```typescript
const instruction = "Click the sign in button";
const cachedAction = await getCache(instruction);

if (cachedAction) {
  // Cache hit: replay the previously observed action
  await page.act(cachedAction);
} else {
  try {
    const results = await page.observe(instruction);
    await setCache(instruction, results);
    await page.act(results[0]);
  } catch (error) {
    await page.act(instruction); // If observing or caching fails, execute the instruction directly
  }
}
```

Be sure to cache the results of `observe` and use them as params for `act` to avoid unexpected DOM changes. Using `act` without caching will result in more unpredictable behavior.
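
The caching snippet in this section calls `getCache` and `setCache` without defining them. A minimal in-memory sketch follows, assuming a `Map` keyed by the instruction string is enough for a single run. The `CachedAction` type is a hypothetical stand-in for Stagehand's `ObserveResult`, and `setCache` here keeps only the first observed action so that `getCache` returns something `act` can take directly.

```typescript
// Hypothetical stand-in for Stagehand's ObserveResult; the real type may carry more fields.
type CachedAction = { selector: string; description: string };

// In-memory store keyed by the natural-language instruction.
const actionCache = new Map<string, CachedAction>();

// Returns the cached action for an instruction, or undefined on a miss.
async function getCache(instruction: string): Promise<CachedAction | undefined> {
  return actionCache.get(instruction);
}

// Stores the first observed action; an empty observe result leaves the cache untouched.
async function setCache(instruction: string, results: CachedAction[]): Promise<void> {
  const [first] = results;
  if (first !== undefined) {
    actionCache.set(instruction, first);
  }
}
```

A longer-lived project would probably persist these entries between runs, for example to a JSON file; that part is left out here.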

Each `act` instruction should be as atomic and specific as possible, e.g. "Click the sign in button" or "Type 'hello' into the search input".
AVOID instructions that take more than one step, e.g. "Order me pizza" or "Type in the search bar and hit enter".
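
One way to follow this rule is to split a multi-step goal into a list of atomic instructions and run them in order. The sketch below assumes only that the page object exposes `act(instruction)`; the `ActCapable` interface and the `actInSequence` helper are illustrative names, not part of Stagehand.

```typescript
// Minimal interface for the one method this sketch needs; Stagehand's page offers far more.
interface ActCapable {
  act(instruction: string): Promise<unknown>;
}

// Runs atomic instructions one at a time, in order, and reports which step failed.
async function actInSequence(page: ActCapable, steps: string[]): Promise<void> {
  for (let index = 0; index < steps.length; index++) {
    const step = steps[index];
    try {
      await page.act(String(step));
    } catch (error) {
      throw new Error(`Step ${index + 1} failed ("${step}"): ${String(error)}`);
    }
  }
}
```

For example, "Type in the search bar and hit enter" could become `actInSequence(page, ["Type 'hello' into the search input", "Press Enter in the search input"])`.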

## Extract

If you are writing code that needs to extract data from the page, use Stagehand `extract`.

```typescript
const signInButtonText = await page.extract("extract the sign in button text");
```

You can also pass in params like an output schema in Zod, and a flag to use text extraction:

```typescript
const data = await page.extract({
  instruction: "extract the sign in button text",
  schema: z.object({
    text: z.string(),
  }),
});
```

`schema` is a Zod schema that describes the data you want to extract. To extract an array, make sure to pass in a single object that contains the array, as follows:

```typescript
const data = await page.extract({
  instruction: "extract the text inside all buttons",
  schema: z.object({
    text: z.array(z.string()),
  }),
  // Set true for larger-scale extractions (multiple paragraphs),
  // or false for small extractions (name, birthday, etc.)
  useTextExtract: true,
});
```

## Agent

Use the `agent` method to autonomously execute larger tasks like "Get the stock price of NVDA":

```typescript
// Navigate to a website
await stagehand.page.goto("https://www.google.com");

const agent = stagehand.agent({
// You can use either OpenAI or Anthropic
provider: "openai",
// The model to use (claude-3-7-sonnet-20250219 or claude-3-5-sonnet-20240620 for Anthropic)
model: "computer-use-preview",

// Customize the system prompt
instructions: `You are a helpful assistant that can use a web browser.
Do not ask follow up questions, the user will trust your judgement.`,

// Customize the API key
options: {
apiKey: process.env.OPENAI_API_KEY,
},
});

// Execute the agent
await agent.execute(
"Apply for a library card at the San Francisco Public Library"
);
```
15 changes: 15 additions & 0 deletions examples/integrations/mongodb/.env.example
@@ -0,0 +1,15 @@
# MongoDB Connection

# Local MongoDB instance
# MONGO_URI=mongodb://localhost:27017

# MongoDB Atlas connection string format:
# MONGO_URI=mongodb+srv://<username>:<password>@<cluster>.mongodb.net/<database>?retryWrites=true&w=majority

# Database name
DB_NAME=scraper_db

BROWSERBASE_PROJECT_ID="YOUR_BROWSERBASE_PROJECT_ID"
BROWSERBASE_API_KEY="YOUR_BROWSERBASE_API_KEY"
OPENAI_API_KEY="THIS_IS_OPTIONAL_WITH_ANTHROPIC_KEY"
ANTHROPIC_API_KEY="THIS_IS_OPTIONAL_WITH_OPENAI_KEY"
7 changes: 7 additions & 0 deletions examples/integrations/mongodb/.gitignore
@@ -0,0 +1,7 @@
.env
node_modules
tmp
downloads
.DS_Store
dist
cache.json
7 changes: 7 additions & 0 deletions examples/integrations/mongodb/LICENSE
@@ -0,0 +1,7 @@
Copyright 2025 Browserbase, Inc

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
99 changes: 99 additions & 0 deletions examples/integrations/mongodb/README.md
@@ -0,0 +1,99 @@
# Stagehand MongoDB Scraper

A web scraping project that uses Stagehand to extract structured data from e-commerce websites and store it in MongoDB for analysis.

## Features

- **Web Scraping**: Uses Stagehand (built on Playwright) for intelligent web scraping
- **Data Extraction**: Extracts structured product data using AI-powered instructions
- **MongoDB Storage**: Stores scraped data in MongoDB for persistence and querying
- **Schema Validation**: Uses Zod for schema validation and TypeScript interfaces
- **Error Handling**: Robust error handling to prevent crashes during scraping
- **Data Analysis**: Built-in MongoDB queries for data analysis
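
The feature list does not say how scraping errors are handled. One common pattern that would fit is to retry a flaky step a few times before giving up; the sketch below shows that idea under stated assumptions. The `withRetry` helper and its default values are hypothetical and are not taken from `index.ts`.

```typescript
// Retries an async operation with a fixed delay between attempts.
// `attempts` and `delayMs` are illustrative defaults, not values from the project.
async function withRetry<T>(
  operation: () => Promise<T>,
  attempts = 3,
  delayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Wait before the next attempt, but not after the final one.
      if (attempt < attempts) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

A scraping step could then be wrapped as `withRetry(() => page.goto(url))`, so a transient failure does not end the whole run.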

## Prerequisites

- Node.js 16 or higher
- MongoDB installed locally or MongoDB Atlas account
- Browserbase API key and project ID, plus an OpenAI or Anthropic API key (see `.env.example`)

## Installation

1. Clone the repository:
```
git clone <repository-url>
cd stagehand-mongodb-scraper
```

2. Install dependencies:
```
npm install
```

3. Set up environment variables:
```
# Copy .env.example to .env, then set the following variables
MONGO_URI=mongodb://localhost:27017
DB_NAME=scraper_db
```
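
Missing environment variables tend to surface late, in the middle of a scrape. A minimal sketch of validating them up front is shown below; the key names mirror `.env.example`, while `ScraperConfig` and `loadConfig` are hypothetical names, and the `scraper_db` fallback is an assumption taken from the example file.

```typescript
// Shape of the settings the scraper needs; key names mirror .env.example.
interface ScraperConfig {
  mongoUri: string;
  dbName: string;
  browserbaseProjectId: string;
  browserbaseApiKey: string;
  llmApiKey: string;
}

// Reads settings from an env-like record and fails fast on missing values.
function loadConfig(env: Record<string, string | undefined>): ScraperConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) {
      throw new Error(`Missing required environment variable: ${name}`);
    }
    return value;
  };

  // .env.example marks each LLM key as optional when the other one is set.
  const llmApiKey = env["OPENAI_API_KEY"] || env["ANTHROPIC_API_KEY"];
  if (!llmApiKey) {
    throw new Error("Set either OPENAI_API_KEY or ANTHROPIC_API_KEY");
  }

  return {
    mongoUri: required("MONGO_URI"),
    dbName: env["DB_NAME"] || "scraper_db", // assumed fallback, matching .env.example
    browserbaseProjectId: required("BROWSERBASE_PROJECT_ID"),
    browserbaseApiKey: required("BROWSERBASE_API_KEY"),
    llmApiKey,
  };
}
```

Calling `loadConfig(process.env)` at startup, after the `.env` file has been loaded, would report a missing key before any browser session is opened.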

## Usage

1. Start MongoDB locally:
```
mongod
```

2. Run the scraper:
```
npm start
```

3. The script will:
   - Scrape product listings from Amazon
   - Extract detailed information for the first 3 products
   - Extract reviews for each product
   - Store all data in MongoDB
   - Run analysis queries on the collected data showing:
     - Collection counts
     - Products by category
     - Top-rated products
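
The analysis queries listed in the last step could be expressed as MongoDB aggregation pipelines. The sketch below builds two of them as plain data, assuming each product document has a `category` string and a numeric `rating`; both field names are guesses for illustration and may differ from what `index.ts` stores.

```typescript
// Assumed document fields: `category` (string) and `rating` (number) on each product.
// Both names are illustrative; adjust them to the schema index.ts actually uses.

// Products by category: count documents per category, largest group first.
const productsByCategory = [
  { $group: { _id: "$category", count: { $sum: 1 } } },
  { $sort: { count: -1 } },
];

// Top-rated products: keep rated documents, highest rating first, capped at `limit`.
function topRatedProducts(limit: number) {
  return [
    { $match: { rating: { $exists: true } } },
    { $sort: { rating: -1 } },
    { $limit: limit },
  ];
}
```

With the MongoDB Node.js driver, a pipeline like this is passed to `collection.aggregate(pipeline)`, and collection counts are available through `collection.countDocuments()`.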

## Project Structure

The project keeps all functionality in a single source file, alongside its configuration:

- `index.ts`: Contains the complete implementation, including:
  - MongoDB connection and data operations
  - Schema definitions
  - Scraping functions
  - Data analysis
  - Main execution logic
- `stagehand.config.js`: Stagehand configuration
- `.env.example`: Example environment variables

## Data Models

The project uses the following data models:

- **Product**: Individual product information
- **ProductList**: List of products from a category page
- **Review**: Product reviews
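
The README names these models without listing their fields. The sketch below shows what such shapes might look like in TypeScript; every field name is an assumption made for illustration, and the hand-written `isProduct` guard is a dependency-free stand-in for the Zod validation the project uses.

```typescript
// Illustrative shapes only; the field names are assumptions, not the PR's actual schemas.
interface Review {
  productUrl: string;
  rating: number;
  text: string;
}

interface Product {
  url: string;
  name: string;
  price?: string;
  category?: string;
  rating?: number;
}

interface ProductList {
  category: string;
  products: Product[];
}

// Runtime guard for untrusted scraped values: checks only the required fields.
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return typeof candidate["url"] === "string" && typeof candidate["name"] === "string";
}

// Keeps only the entries of a scraped list that pass the guard.
function toProductList(category: string, items: unknown[]): ProductList {
  return { category, products: items.filter(isProduct) };
}

// Mean rating across reviews, or undefined when there are none.
function averageRating(reviews: Review[]): number | undefined {
  if (reviews.length === 0) return undefined;
  const total = reviews.reduce((sum, review) => sum + review.rating, 0);
  return total / reviews.length;
}
```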

## MongoDB Collections

Data is stored in the following MongoDB collections:

- **products**: Individual product information
- **product_lists**: Lists of products from category pages
- **reviews**: Product reviews
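
Re-running a scraper against the same pages can insert duplicate documents. One way to avoid that is to write with upserts; the sketch below builds operations in the shape the MongoDB driver's `bulkWrite` accepts. The `url` key, the `scrapedAt` timestamp, and the `StoredProduct` type are assumptions for illustration, not details confirmed by this PR.

```typescript
// Names of the collections listed in this section.
const COLLECTIONS = {
  products: "products",
  productLists: "product_lists",
  reviews: "reviews",
} as const;

// Assumed minimal product shape; `url` is a hypothetical unique key.
type StoredProduct = { url: string; name: string };

// Builds one upsert operation per product so that re-running the scraper
// updates existing documents instead of inserting duplicates.
function toUpsertOps(products: StoredProduct[]) {
  return products.map((product) => ({
    updateOne: {
      filter: { url: product.url },
      update: { $set: { ...product, scrapedAt: new Date() } },
      upsert: true,
    },
  }));
}
```

The result would be passed along as `db.collection(COLLECTIONS.products).bulkWrite(toUpsertOps(products))`.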

## License

MIT

## Acknowledgements

- [Stagehand](https://docs.stagehand.dev/) for the powerful web scraping capabilities
- [MongoDB](https://www.mongodb.com/) for the flexible document database
- [Zod](https://zod.dev/) for runtime schema validation