⭐ Teracrawl

High-performance web crawler & scraper API optimized for LLMs.

Powered by Browser.cash remote browsers.

Features • Quick Start • API Reference • Configuration • Docker



⚠️ Important: Search functionality (`/crawl`) requires a running instance of browser-serp.


📊 Benchmarks


Teracrawl achieves #1 coverage (84.2%) across 14 scraping providers on the scrape-evals benchmark, an open evaluation framework that tests web scrapers against 1,000 diverse URLs for success rate and content quality.


🚀 What is Teracrawl?

Teracrawl is a production-ready API designed to turn websites into clean, LLM-ready Markdown. It handles the complexity of JavaScript rendering, anti-bot measures, and parallel execution, allowing AI systems to access real-time data quickly.

Unlike simple HTML scrapers, Teracrawl uses real managed Chrome browsers, ensuring high success rates even on protected sites.

Why use Teracrawl?

  • 🤖 LLM-Optimized Output: Converts complex HTML into clean, semantic Markdown perfect for RAG and context windows.
  • ⚡ Smart Two-Phase Crawling (see the sketch after this list):
    • Fast Mode: Optimized for static/SSR pages (reuses contexts, blocks heavy assets).
    • Dynamic Mode: Automatic fallback for complex SPAs (waits for hydration/rendering).
  • 🔍 Search & Scrape: Single endpoint to query Google and scrape the top results in parallel.
  • 🏎️ High Concurrency: Built on a robust session pool to handle multiple pages simultaneously.
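
The two-phase flow can be pictured roughly as follows. This is a conceptual sketch only: scrapeFast and scrapeDynamic are hypothetical stand-ins for the internals, with the thresholds taken from the defaults in the Configuration section below (CRAWL_NAVIGATION_TIMEOUT_MS, CRAWL_SLOW_TIMEOUT_MS, CRAWL_MIN_CONTENT_LENGTH).

// Conceptual sketch only: scrapeFast/scrapeDynamic are hypothetical names,
// not Teracrawl's actual API. Defaults mirror the Configuration section.
type ScrapeResult = { url: string; markdown: string; status: "success" | "error" };

declare function scrapeFast(url: string, opts: { timeoutMs: number }): Promise<ScrapeResult>;
declare function scrapeDynamic(url: string, opts: { timeoutMs: number }): Promise<ScrapeResult>;

async function scrapeTwoPhase(url: string): Promise<ScrapeResult> {
  // Phase 1 (Fast Mode): short timeout, heavy assets blocked; wins on static/SSR pages.
  const fast = await scrapeFast(url, { timeoutMs: 10_000 });
  if (fast.status === "success" && fast.markdown.length >= 200) return fast;

  // Phase 2 (Dynamic Mode): longer timeout, waits for SPA hydration/rendering.
  return scrapeDynamic(url, { timeoutMs: 20_000 });
}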

✨ Features

  • Search + Scrape: Query Google and scrape top N results in a single API call.
  • Direct Scraping: Convert any specific URL to Markdown.
  • Smart Content Extraction: Automatically detects main content areas (article, main, etc.) and strips clutter (scripts, styles, navigation); see the sketch after this list.
  • Safety & Performance:
    • Blocks ads, trackers, and analytics.
    • Removes base64 images to save token count.
    • Automatic timeout handling and error recovery.
  • Docker Ready: Deploy anywhere with a lightweight container.
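
To illustrate the extraction idea (this is not the project's actual implementation), here is the same concept in TypeScript using the cheerio library:

import * as cheerio from "cheerio";

// Rough illustration of main-content selection (not Teracrawl's actual code).
function extractMainHtml(html: string): string {
  const $ = cheerio.load(html);
  // Strip clutter that wastes tokens: scripts, styles, navigation chrome.
  $("script, style, nav, header, footer, aside").remove();
  // Prefer a semantic main-content container; fall back to the whole body.
  const main = $("article, main").first();
  return main.length ? $.html(main) : $.html($("body"));
}

The cleaned HTML would then be converted to Markdown in a separate step.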

πŸ› οΈ Quick Start

Prerequisites

  1. Node.js 18+ installed.
  2. A Browser.cash API Key.
  3. A running SERP service such as browser-serp on port 8080 (optional; only needed for the /crawl and /serp/search endpoints).

Installation

# Clone the repository
git clone https://github.com/BrowserCash/teracrawl.git
cd teracrawl

# Install dependencies
npm install

Configuration

Copy the example environment file and configure your settings:

cp .env.example .env

Open .env and set your BROWSER_API_KEY:

BROWSER_API_KEY=your_browser_cash_api_key_here

Running the Server

# Development mode
npm run dev

# Production build & start
npm run build
npm start

The server will start at http://0.0.0.0:8085.
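
Once it is running, a quick way to verify the server is up is the health endpoint (documented under API Reference below):

curl http://localhost:8085/health
# => {"ok":true}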

📚 API Reference

1. Search & Crawl

Performs a Google search and scrapes the content of the top results.

Endpoint: POST /crawl

CURL Request:

curl -X POST http://localhost:8085/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "q": "What is the capital of France?",
    "count": 3
  }'

Field  Type    Default   Description
q      string  Required  The search query.
count  number  3         Number of results to scrape (max 20).

Response:

{
  "query": "What is the capital of France?",
  "results": [
    {
      "url": "https://en.wikipedia.org/wiki/Paris",
      "title": "Paris - Wikipedia",
      "markdown": "# Paris\n\nParis is the capital and most populous city of France...",
      "status": "success"
    },
    {
      "url": "https://...",
      "status": "error",
      "error": "Timeout exceeded"
    }
  ]
}
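
If you are calling the API from Node.js 18+ (which ships a global fetch), an equivalent request might look like the sketch below; the response typing mirrors the example above.

// Minimal /crawl client sketch for Node.js 18+ using the built-in fetch.
interface CrawlResult {
  url: string;
  title?: string;
  markdown?: string;
  status: "success" | "error";
  error?: string;
}

async function crawl(q: string, count = 3): Promise<CrawlResult[]> {
  const res = await fetch("http://localhost:8085/crawl", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ q, count }),
  });
  if (!res.ok) throw new Error(`/crawl failed: HTTP ${res.status}`);
  const data = (await res.json()) as { query: string; results: CrawlResult[] };
  return data.results;
}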

2. Single Page Scrape

Scrapes a specific URL and converts it to Markdown.

Endpoint: POST /scrape

CURL Request:

curl -X POST http://localhost:8085/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/post-1"
  }'

Response:

{
  "url": "https://example.com/blog/post-1",
  "title": "My Blog Post",
  "markdown": "# My Blog Post\n\nContent of the post...",
  "status": "success"
}
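
Assuming /scrape reports failures the same way the per-result objects in /crawl do (via the status field), a client might handle it like this sketch:

// Sketch: fetch one page as Markdown, treating a non-"success" status as a miss.
async function scrapeMarkdown(url: string): Promise<string | null> {
  const res = await fetch("http://localhost:8085/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url }),
  });
  const page = (await res.json()) as { markdown?: string; status: string; error?: string };
  return page.status === "success" ? page.markdown ?? null : null;
}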

3. SERP Search Only

Proxies a search request to the underlying SERP service without scraping content.

Endpoint: POST /serp/search

CURL Request:

curl -X POST http://localhost:8085/serp/search \
  -H "Content-Type: application/json" \
  -d '{
    "q": "browser automation",
    "count": 5
  }'

Response:

{
  "results": [
    {
      "url": "https://...",
      "title": "Result Title",
      "description": "Result description..."
    }
  ]
}

4. Health Check

Endpoint: GET /health

CURL Request:

curl http://localhost:8085/health

Response:

{
  "ok": true
}

βš™οΈ Configuration

Server & Infrastructure

Variable          Default                Description
BROWSER_API_KEY   Required               Your Browser.cash API key.
PORT              8085                   Port for the API server.
HOST              0.0.0.0                Host to bind to.
SERP_SERVICE_URL  http://localhost:8080  URL of the upstream SERP/Search service.
POOL_SIZE         1                      Number of concurrent browser sessions to maintain.
DEBUG_LOG         false                  Enable verbose logging for debugging.
DATALAB_API_KEY   Optional               Datalab API key for PDF-to-Markdown conversion.

Crawler Tuning

Variable                     Default  Description
CRAWL_TABS_PER_SESSION       8        Max concurrent tabs per browser session.
CRAWL_MIN_CONTENT_LENGTH     200      Minimum Markdown length (characters) for a scrape to count as successful.
CRAWL_NAVIGATION_TIMEOUT_MS  10000    Timeout for "Fast" scraping mode (ms).
CRAWL_SLOW_TIMEOUT_MS        20000    Timeout for "Slow" scraping mode (ms).
CRAWL_JITTER_MS              0        Max random delay (ms) between requests to avoid a thundering herd.
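
For reference, a fully spelled-out .env using the documented defaults (only BROWSER_API_KEY has no default and must be set):

BROWSER_API_KEY=your_browser_cash_api_key_here
PORT=8085
HOST=0.0.0.0
SERP_SERVICE_URL=http://localhost:8080
POOL_SIZE=1
DEBUG_LOG=false
CRAWL_TABS_PER_SESSION=8
CRAWL_MIN_CONTENT_LENGTH=200
CRAWL_NAVIGATION_TIMEOUT_MS=10000
CRAWL_SLOW_TIMEOUT_MS=20000
CRAWL_JITTER_MS=0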

🐳 Docker

You can run Teracrawl easily using Docker.

Build & Run

# Build the image
docker build -t teracrawl .

# Run with env file
docker run -p 8085:8085 --env-file .env teracrawl

Docker Compose

version: "3.8"
services:
  teracrawl:
    build: .
    ports:
      - "8085:8085"
    environment:
      - BROWSER_API_KEY=${BROWSER_API_KEY}
      - SERP_SERVICE_URL=http://serp:8080
    depends_on:
      - serp

  serp:
    image: ghcr.io/mega-tera/browser-serp:latest
    ports:
      - "8080:8080"

🤝 Contributing

Contributions are welcome! We appreciate your help in making Teracrawl better.

How to Contribute

  1. Fork the Project: Click the 'Fork' button at the top right of this page.
  2. Create your Feature Branch: git checkout -b feature/AmazingFeature
  3. Commit your Changes: git commit -m 'Add some AmazingFeature'
  4. Push to the Branch: git push origin feature/AmazingFeature
  5. Open a Pull Request: Submit your changes for review.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.