Before we head deeper into the features, let's recap how Crawlee scrapers work. In short:
---
title: Basic web scraping flow
---
flowchart TB
URL -- turned into --> Request -- URL loaded --> Webpage --> data[Extracted data]
Webpage -- Add URLs to scrape --> URL
1. You define a list of URLs to be scraped, and configure the scraper.
2. You then start a scraper `Run`. Your URLs are turned into `Requests`, and put into the `RequestQueue`.
3. A single `Run` may start, in parallel, several scraper `Instances`.
4. Each `Instance` takes a `Request` from the `RequestQueue`, and loads the URL. Based on the URL loaded:
   - The data is scraped from the page and sent to the `Dataset`.
   - New URLs are found, turned into `Requests`, and sent to the `RequestQueue`.
5. Step 4. continues for as long as there are `Requests` in the `RequestQueue`.
6. In CrawleeOne, the `Instance` processes `Requests` in `Batches`. `Requests` in a single `Batch` share the same state and browser instance.
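To make these steps concrete, here is a minimal sketch of what such a scraper looks like in plain Crawlee. It assumes `CheerioCrawler`; the start URL, selector, and scraped fields are placeholders.

```ts
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  // Called once per Request taken from the RequestQueue.
  async requestHandler({ request, $, enqueueLinks }) {
    // Scrape data from the loaded page and send it to the Dataset.
    await Dataset.pushData({
      url: request.loadedUrl,
      title: $('title').text(),
    });

    // Find new URLs on the page, turn them into Requests,
    // and add them to the RequestQueue.
    await enqueueLinks({ selector: 'a' });
  },
});

// The start URLs are turned into Requests and put into the RequestQueue.
// The run then continues until the queue is empty.
await crawler.run(['https://example.com']);
```

The diagram below summarizes how these pieces fit together in a single scraper run.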
---
title: Scraper run
---
flowchart TB
URLs --> reqq[RequestQueue]
reqq --> run[Scraper run]
run --> instance
run --> instance
run --> instance
subgraph instance[Scraper instance]
instance_note["Single run may include several instances.<br/>Each instance has separate state."]
end
subgraph batch[Batch of Requests]
batch_note[Each instance processes a batch of URLs.<br/>Then it restarts its state.]
end
instance --> batch
batch --> urlflow
batch --> urlflow
batch --> urlflow
subgraph urlflow[For each URL]
direction TB
URL -- turned into --> Request -- URL loaded --> Webpage --> data[Extracted data]
Webpage -- Add URLs to scrape<br/>to RequestQueue --> URL
end
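The batching described above can be pictured roughly as the loop below. This is only an illustrative sketch, not CrawleeOne's actual API; the types and helpers (`launchBrowser`, `handleRequest`) are hypothetical stand-ins.

```ts
// Illustrative sketch only - not CrawleeOne's actual API.
// The types and helpers below are hypothetical stand-ins.
interface Request { url: string }
interface Browser { close(): Promise<void> }

declare function launchBrowser(): Promise<Browser>;
declare function handleRequest(
  req: Request,
  ctx: { browser: Browser; state: Record<string, unknown> },
): Promise<void>;

// One scraper Instance: it works through the queue one Batch at a time.
async function runInstance(queue: Request[], batchSize: number) {
  while (queue.length > 0) {
    // Take the next Batch of Requests from the queue.
    const batch = queue.splice(0, batchSize);

    // Requests in a single Batch share one browser instance and one state object.
    const browser = await launchBrowser();
    const state: Record<string, unknown> = {};

    for (const request of batch) {
      await handleRequest(request, { browser, state });
    }

    // After the Batch, the Instance restarts - state and browser are discarded.
    await browser.close();
  }
}
```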