High-performance web crawler & scraper API optimized for LLMs.
Powered by Browser.cash remote browsers.
Features • Quick Start • API Reference • Configuration • Docker
Teracrawl achieves #1 coverage (84.2%) across 14 scraping providers on the scrape-evals benchmark, an open evaluation framework that tests web scrapers against 1,000 diverse URLs for success rate and content quality.
Teracrawl is a production-ready API designed to turn websites into clean, LLM-ready Markdown. It handles the complexity of JavaScript rendering, anti-bot measures, and parallel execution, allowing AI systems to access real-time data quickly.
Unlike simple HTML scrapers, Teracrawl uses real, managed Chrome browsers, ensuring high success rates even on protected sites.
- 🤖 LLM-Optimized Output: Converts complex HTML into clean, semantic Markdown perfect for RAG and context windows.
- ⚡ Smart Two-Phase Crawling:
  - Fast Mode: Optimized for static/SSR pages (reuses contexts, blocks heavy assets).
  - Dynamic Mode: Automatic fallback for complex SPAs (waits for hydration/rendering).
- 🔍 Search & Scrape: Single endpoint to query Google and scrape the top results in parallel.
- 🏎️ High Concurrency: Built on a robust session pool to handle multiple pages simultaneously.
- Search + Scrape: Query Google and scrape top N results in a single API call.
- Direct Scraping: Convert any specific URL to Markdown.
- Smart Content Extraction: Automatically detects main content areas (article, main, etc.) and removes clutter (scripts, styles, navs); see the sketch after this list.
- Safety & Performance:
  - Blocks ads, trackers, and analytics.
  - Removes base64 images to save token count.
  - Automatic timeout handling and error recovery.
- Docker Ready: Deploy anywhere with a lightweight container.
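
The Smart Content Extraction step can be pictured as a small DOM pass: pick the most likely main-content container, then strip scripts, styles, and navigation before converting to Markdown. The TypeScript sketch below illustrates the idea only; the selector list and the `extractMainContent` helper are illustrative assumptions, not Teracrawl's actual implementation.

```typescript
// Illustrative sketch only (not Teracrawl's internal code).
// Shows the general idea of "detect main content, remove clutter".
function extractMainContent(doc: Document): string {
  // Prefer a dedicated content container when the page provides one
  // (selector list is an assumption for illustration).
  const candidates = ["article", "main", '[role="main"]', "#content"];
  const root =
    candidates.map((sel) => doc.querySelector<HTMLElement>(sel)).find(Boolean) ??
    doc.body;

  // Work on a clone so the live document is left untouched.
  const clone = root.cloneNode(true) as HTMLElement;

  // Drop non-content elements: scripts, styles, navigation, page chrome.
  clone
    .querySelectorAll("script, style, nav, header, footer, aside, iframe")
    .forEach((el) => el.remove());

  // The cleaned HTML would then be handed to an HTML-to-Markdown converter.
  return clone.innerHTML;
}
```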
- Node.js 18+ installed.
- A Browser.cash API Key.
- A running SERP service like browser-serp on port 8080 (optional, only needed for the /crawl endpoint).
```bash
# Clone the repository
git clone https://github.com/BrowserCash/teracrawl.git
cd teracrawl

# Install dependencies
npm install
```

Copy the example environment file and configure your settings:

```bash
cp .env.example .env
```

Open .env and set your BROWSER_API_KEY:

```env
BROWSER_API_KEY=your_browser_cash_api_key_here
```

```bash
# Development mode
npm run dev

# Production build & start
npm run build
npm start
```

The server will start at http://0.0.0.0:8085.
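
Once the server is running, any HTTP client can talk to it. The TypeScript sketch below is a minimal smoke test (an illustrative assumption, not part of the repo) that uses the global fetch available in Node 18+ and the default host/port; it calls the /health and /scrape endpoints documented in the API Reference below.

```typescript
// Minimal smoke test against a locally running Teracrawl instance (sketch only).
const BASE_URL = "http://localhost:8085"; // default PORT from this README

async function main(): Promise<void> {
  // 1. Check that the service is up.
  const health = await fetch(`${BASE_URL}/health`);
  console.log("health:", await health.json()); // expected: { "ok": true }

  // 2. Scrape a single URL to Markdown.
  const res = await fetch(`${BASE_URL}/scrape`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url: "https://example.com" }),
  });
  const page = await res.json();
  console.log(page.title);
  console.log(page.markdown?.slice(0, 200));
}

main().catch(console.error);
```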
Performs a Google search and scrapes the content of the top results.
Endpoint: POST /crawl
CURL Request:
```bash
curl -X POST http://localhost:8085/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "q": "What is the capital of France?",
    "count": 3
  }'
```

| Field | Type | Default | Description |
|---|---|---|---|
| `q` | string | Required | The search query. |
| `count` | number | 3 | Number of results to scrape (max 20). |
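
For programmatic use, the same request can be issued from Node. The sketch below is an illustrative assumption (the `CrawlResult` type and `crawl` helper are not part of the repo): it posts to /crawl with the global fetch from Node 18+ and keeps only results whose status is "success". The full response shape is shown under Response below.

```typescript
// Sketch: call POST /crawl and collect successfully scraped results.
// Field names follow the response example in this README.
interface CrawlResult {
  url: string;
  title?: string;
  markdown?: string;
  status: "success" | "error";
  error?: string;
}

async function crawl(q: string, count = 3): Promise<CrawlResult[]> {
  const res = await fetch("http://localhost:8085/crawl", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ q, count }),
  });
  if (!res.ok) throw new Error(`crawl request failed: ${res.status}`);

  const body: { query: string; results: CrawlResult[] } = await res.json();
  // Errored URLs carry status "error" and an "error" message instead of markdown.
  return body.results.filter((r) => r.status === "success");
}

// Example usage:
crawl("What is the capital of France?", 3)
  .then((pages) => pages.forEach((p) => console.log(p.url, p.markdown?.length)))
  .catch(console.error);
```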
Response:

```json
{
  "query": "What is the capital of France?",
  "results": [
    {
      "url": "https://en.wikipedia.org/wiki/Paris",
      "title": "Paris - Wikipedia",
      "markdown": "# Paris\n\nParis is the capital and most populous city of France...",
      "status": "success"
    },
    {
      "url": "https://...",
      "status": "error",
      "error": "Timeout exceeded"
    }
  ]
}
```

Scrapes a specific URL and converts it to Markdown.
Endpoint: POST /scrape
CURL Request:
```bash
curl -X POST http://localhost:8085/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/post-1"
  }'
```

Response:
```json
{
  "url": "https://example.com/blog/post-1",
  "title": "My Blog Post",
  "markdown": "# My Blog Post\n\nContent of the post...",
  "status": "success"
}
```

Proxies a search request to the underlying SERP service without scraping content.
Endpoint: POST /serp/search
CURL Request:
```bash
curl -X POST http://localhost:8085/serp/search \
  -H "Content-Type: application/json" \
  -d '{
    "q": "browser automation",
    "count": 5
  }'
```

Response:
```json
{
  "results": [
    {
      "url": "https://...",
      "title": "Result Title",
      "description": "Result description..."
    }
  ]
}
```

Endpoint: GET /health
CURL Request:
```bash
curl http://localhost:8085/health
```

Response:
```json
{
  "ok": true
}
```

| Variable | Default | Description |
|---|---|---|
| `BROWSER_API_KEY` | Required | Your Browser.cash API key. |
| `PORT` | 8085 | Port for the API server. |
| `HOST` | 0.0.0.0 | Host to bind to. |
| `SERP_SERVICE_URL` | http://localhost:8080 | URL of the upstream SERP/Search service. |
| `POOL_SIZE` | 1 | Number of concurrent browser sessions to maintain. |
| `DEBUG_LOG` | false | Enable verbose logging for debugging. |
| `DATALAB_API_KEY` | Optional | Datalab API key for PDF-to-Markdown conversion. |
| Variable | Default | Description |
|---|---|---|
| `CRAWL_TABS_PER_SESSION` | 8 | Max concurrent tabs per browser session. |
| `CRAWL_MIN_CONTENT_LENGTH` | 200 | Minimum markdown char length to consider a scrape successful. |
| `CRAWL_NAVIGATION_TIMEOUT_MS` | 10000 | Timeout for "Fast" scraping mode (ms). |
| `CRAWL_SLOW_TIMEOUT_MS` | 20000 | Timeout for "Slow" scraping mode (ms). |
| `CRAWL_JITTER_MS` | 0 | Max random delay (ms) between requests to avoid thundering herd. |
You can run Teracrawl easily using Docker.
```bash
# Build the image
docker build -t teracrawl .

# Run with env file
docker run -p 8085:8085 --env-file .env teracrawl
```

Or with Docker Compose:

```yaml
version: "3.8"
services:
  teracrawl:
    build: .
    ports:
      - "8085:8085"
    environment:
      - BROWSER_API_KEY=${BROWSER_API_KEY}
      - SERP_SERVICE_URL=http://serp:8080
    depends_on:
      - serp
  serp:
    image: ghcr.io/mega-tera/browser-serp:latest
    ports:
      - "8080:8080"
```

Contributions are welcome! We appreciate your help in making Teracrawl better.
- Fork the Project: click the 'Fork' button at the top right of this page.
- Create your Feature Branch: `git checkout -b feature/AmazingFeature`
- Commit your Changes: `git commit -m 'Add some AmazingFeature'`
- Push to the Branch: `git push origin feature/AmazingFeature`
- Open a Pull Request: Submit your changes for review.
This project is licensed under the MIT License - see the LICENSE file for details.