⭐ Teracrawl

High-performance web crawler & scraper API optimized for LLMs.

Powered by Browser.cash remote browsers.

Features • Quick Start • API Reference • Configuration • Docker



⚠️ Important: Search functionality (`/crawl`) requires a running instance of browser-serp.


📊 Benchmarks


Teracrawl achieves #1 coverage (84.2%) across 14 scraping providers on the scrape-evals benchmark, an open evaluation framework that tests web scrapers against 1,000 diverse URLs for success rate and content quality.


🚀 What is Teracrawl?

Teracrawl is a production-ready API designed to turn websites into clean, LLM-ready Markdown. It handles the complexity of JavaScript rendering, anti-bot measures, and parallel execution, allowing AI systems to access real-time data quickly.

Unlike simple HTML scrapers, Teracrawl uses real managed Chrome browsers, ensuring high success rates even on protected sites.

Why use Teracrawl?

  • 🤖 LLM-Optimized Output: Converts complex HTML into clean, semantic Markdown perfect for RAG and context windows.
  • ⚡ Smart Two-Phase Crawling (see the sketch after this list):
    • Fast Mode: Optimized for static/SSR pages (reuses contexts, blocks heavy assets).
    • Dynamic Mode: Automatic fallback for complex SPAs (waits for hydration/rendering).
  • 🔍 Search & Scrape: Single endpoint to query Google and scrape the top results in parallel.
  • 🏎️ High Concurrency: Built on a robust session pool to handle multiple pages simultaneously.
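
The two-phase flow can be pictured roughly as follows. This is a conceptual sketch only: scrapeFast and scrapeDynamic are hypothetical stand-ins for the internals, with the thresholds taken from the defaults in the Configuration section below (CRAWL_NAVIGATION_TIMEOUT_MS, CRAWL_SLOW_TIMEOUT_MS, CRAWL_MIN_CONTENT_LENGTH).

// Conceptual sketch only: scrapeFast/scrapeDynamic are hypothetical names,
// not Teracrawl's actual API. Defaults mirror the Configuration section.
type ScrapeResult = { url: string; markdown: string; status: "success" | "error" };

declare function scrapeFast(url: string, opts: { timeoutMs: number }): Promise<ScrapeResult>;
declare function scrapeDynamic(url: string, opts: { timeoutMs: number }): Promise<ScrapeResult>;

async function scrapeTwoPhase(url: string): Promise<ScrapeResult> {
  // Phase 1 (Fast Mode): short timeout, heavy assets blocked; wins on static/SSR pages.
  const fast = await scrapeFast(url, { timeoutMs: 10_000 });
  if (fast.status === "success" && fast.markdown.length >= 200) return fast;

  // Phase 2 (Dynamic Mode): longer timeout, waits for SPA hydration/rendering.
  return scrapeDynamic(url, { timeoutMs: 20_000 });
}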

✨ Features

  • Search + Scrape: Query Google and scrape top N results in a single API call.
  • Direct Scraping: Convert any specific URL to Markdown.
  • Smart Content Extraction: Automatically detects main content areas (article, main, etc.) and strips clutter (scripts, styles, navigation); see the sketch after this list.
  • Safety & Performance:
    • Blocks ads, trackers, and analytics.
    • Removes base64 images to save token count.
    • Automatic timeout handling and error recovery.
  • Docker Ready: Deploy anywhere with a lightweight container.
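
To illustrate the extraction idea (this is not the project's actual implementation), here is the same concept in TypeScript using the cheerio library:

import * as cheerio from "cheerio";

// Rough illustration of main-content selection (not Teracrawl's actual code).
function extractMainHtml(html: string): string {
  const $ = cheerio.load(html);
  // Strip clutter that wastes tokens: scripts, styles, navigation chrome.
  $("script, style, nav, header, footer, aside").remove();
  // Prefer a semantic main-content container; fall back to the whole body.
  const main = $("article, main").first();
  return main.length ? $.html(main) : $.html($("body"));
}

The cleaned HTML would then be converted to Markdown in a separate step.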

πŸ› οΈ Quick Start

Prerequisites

  1. Node.js 18+ installed.
  2. A Browser.cash API Key.
  3. A running SERP service such as browser-serp on port 8080 (optional; only needed for the /crawl and /serp/search endpoints).

Installation

# Clone the repository
git clone https://github.com/BrowserCash/teracrawl.git
cd teracrawl

# Install dependencies
npm install

Configuration

Copy the example environment file and configure your settings:

cp .env.example .env

Open .env and set your BROWSER_API_KEY:

BROWSER_API_KEY=your_browser_cash_api_key_here

Running the Server

# Development mode
npm run dev

# Production build & start
npm run build
npm start

The server will start at http://0.0.0.0:8085.
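
Once it is running, a quick way to verify the server is up is the health endpoint (documented under API Reference below):

curl http://localhost:8085/health
# => {"ok":true}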

📚 API Reference

1. Search & Crawl

Performs a Google search and scrapes the content of the top results.

Endpoint: POST /crawl

CURL Request:

curl -X POST http://localhost:8085/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "q": "What is the capital of France?",
    "count": 3
  }'

Field  Type    Default   Description
q      string  Required  The search query.
count  number  3         Number of results to scrape (max 20).

Response:

{
  "query": "What is the capital of France?",
  "results": [
    {
      "url": "https://en.wikipedia.org/wiki/Paris",
      "title": "Paris - Wikipedia",
      "markdown": "# Paris\n\nParis is the capital and most populous city of France...",
      "status": "success"
    },
    {
      "url": "https://...",
      "status": "error",
      "error": "Timeout exceeded"
    }
  ]
}
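
If you are calling the API from Node.js 18+ (which ships a global fetch), an equivalent request might look like the sketch below; the response typing mirrors the example above.

// Minimal /crawl client sketch for Node.js 18+ using the built-in fetch.
interface CrawlResult {
  url: string;
  title?: string;
  markdown?: string;
  status: "success" | "error";
  error?: string;
}

async function crawl(q: string, count = 3): Promise<CrawlResult[]> {
  const res = await fetch("http://localhost:8085/crawl", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ q, count }),
  });
  if (!res.ok) throw new Error(`/crawl failed: HTTP ${res.status}`);
  const data = (await res.json()) as { query: string; results: CrawlResult[] };
  return data.results;
}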

2. Single Page Scrape

Scrapes a specific URL and converts it to Markdown.

Endpoint: POST /scrape

CURL Request:

curl -X POST http://localhost:8085/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/post-1"
  }'

Response:

{
  "url": "https://example.com/blog/post-1",
  "title": "My Blog Post",
  "markdown": "# My Blog Post\n\nContent of the post...",
  "status": "success"
}
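
Assuming /scrape reports failures the same way the per-result objects in /crawl do (via the status field), a client might handle it like this sketch:

// Sketch: fetch one page as Markdown, treating a non-"success" status as a miss.
async function scrapeMarkdown(url: string): Promise<string | null> {
  const res = await fetch("http://localhost:8085/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url }),
  });
  const page = (await res.json()) as { markdown?: string; status: string; error?: string };
  return page.status === "success" ? page.markdown ?? null : null;
}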

3. SERP Search Only

Proxies a search request to the underlying SERP service without scraping content.

Endpoint: POST /serp/search

CURL Request:

curl -X POST http://localhost:8085/serp/search \
  -H "Content-Type: application/json" \
  -d '{
    "q": "browser automation",
    "count": 5
  }'

Response:

{
  "results": [
    {
      "url": "https://...",
      "title": "Result Title",
      "description": "Result description..."
    }
  ]
}

4. Health Check

Endpoint: GET /health

CURL Request:

curl http://localhost:8085/health

Response:

{
  "ok": true
}

βš™οΈ Configuration

Server & Infrastructure

Variable          Default                Description
BROWSER_API_KEY   Required               Your Browser.cash API key.
PORT              8085                   Port for the API server.
HOST              0.0.0.0                Host to bind to.
SERP_SERVICE_URL  http://localhost:8080  URL of the upstream SERP/Search service.
POOL_SIZE         1                      Number of concurrent browser sessions to maintain.
DEBUG_LOG         false                  Enable verbose logging for debugging.
DATALAB_API_KEY   Optional               Datalab API key for PDF-to-Markdown conversion.

Crawler Tuning

Variable                     Default  Description
CRAWL_TABS_PER_SESSION       8        Max concurrent tabs per browser session.
CRAWL_MIN_CONTENT_LENGTH     200      Minimum Markdown length (characters) for a scrape to count as successful.
CRAWL_NAVIGATION_TIMEOUT_MS  10000    Timeout for "Fast" scraping mode (ms).
CRAWL_SLOW_TIMEOUT_MS        20000    Timeout for "Slow" scraping mode (ms).
CRAWL_JITTER_MS              0        Max random delay (ms) between requests to avoid a thundering herd.
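
For reference, a fully spelled-out .env using the documented defaults (only BROWSER_API_KEY has no default and must be set):

BROWSER_API_KEY=your_browser_cash_api_key_here
PORT=8085
HOST=0.0.0.0
SERP_SERVICE_URL=http://localhost:8080
POOL_SIZE=1
DEBUG_LOG=false
CRAWL_TABS_PER_SESSION=8
CRAWL_MIN_CONTENT_LENGTH=200
CRAWL_NAVIGATION_TIMEOUT_MS=10000
CRAWL_SLOW_TIMEOUT_MS=20000
CRAWL_JITTER_MS=0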

🐳 Docker

You can run Teracrawl easily using Docker.

Build & Run

# Build the image
docker build -t teracrawl .

# Run with env file
docker run -p 8085:8085 --env-file .env teracrawl

Docker Compose

version: "3.8"
services:
  teracrawl:
    build: .
    ports:
      - "8085:8085"
    environment:
      - BROWSER_API_KEY=${BROWSER_API_KEY}
      - SERP_SERVICE_URL=http://serp:8080
    depends_on:
      - serp

  serp:
    image: ghcr.io/mega-tera/browser-serp:latest
    ports:
      - "8080:8080"

🤝 Contributing

Contributions are welcome! We appreciate your help in making Teracrawl better.

How to Contribute

  1. Fork the Project: Click the 'Fork' button at the top right of this page.
  2. Create your Feature Branch: git checkout -b feature/AmazingFeature
  3. Commit your Changes: git commit -m 'Add some AmazingFeature'
  4. Push to the Branch: git push origin feature/AmazingFeature
  5. Open a Pull Request: Submit your changes for review.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.