@Brenden2008

Added benchmarks for Teracrawl, an open-source web scraping API powered by Browser.cash.

The Browser Cash team is happy to collaborate on maintaining datasets and creating new evals. Reach us at alex@megatera.ai

We achieved a success score of 84.2% and an F1 score of 62.7%.
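
For readers unfamiliar with the second metric: F1 is the harmonic mean of precision and recall. The sketch below shows one common way to compute it over whitespace-separated tokens of scraped text against a reference. The function name `token_f1` and the tokenization are assumptions for illustration only; the scoring in scrape-evals may well differ.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between two strings (illustrative, not the benchmark's scorer)."""
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```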

Added:

  • Teracrawl engine benchmarks
  • Edited README to include the benchmark results

Reproduction guide:

Here's the env config we used for Teracrawl:

# Browser.cash API Key (Required)
# Get one at https://browser.cash
BROWSER_API_KEY=

# Datalab.to API Key for PDF processing (Enabled for the benchmark)
DATALAB_API_KEY=

# Server Configuration
PORT=8085
HOST=0.0.0.0
DEBUG_LOG=false

# Services
SERP_SERVICE_URL=http://localhost:8080

# Session Pool Config
POOL_SIZE=10

# Crawler Tuning
CRAWL_TABS_PER_SESSION=5
CRAWL_MIN_CONTENT_LENGTH=200
CRAWL_NAVIGATION_TIMEOUT_MS=10000
CRAWL_SLOW_TIMEOUT_MS=20000
CRAWL_JITTER_MS=0

MAX_CONCURRENT_BATCHES=10

And the command we used to run the scrape evals:

uv run run_eval.py --scrape_engine teracrawl_api --suite quality --output-dir runs/results --dataset datasets/1-0-0.csv --max-workers 1

We ran Teracrawl and the benchmark on the same node, so you don't need to make any config changes to the env on the scrape-evals side.
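
Before reproducing, it may help to check that the two API keys in the env config are actually filled in, since both are left blank in the template above. This is a minimal sketch, not part of Teracrawl: the helper names `parse_env` and `missing_keys` are hypothetical, and the `.env` path is an assumption about where the config lives.

```python
from pathlib import Path

# Keys the config above marks as required or enabled for the benchmark.
REQUIRED_KEYS = ("BROWSER_API_KEY", "DATALAB_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return required keys that are absent or set to an empty value."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")  # assumed location of the Teracrawl env file
    if env_path.exists():
        missing = missing_keys(parse_env(env_path.read_text()))
        print("missing keys:", ", ".join(missing) if missing else "none")
```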
