Feature request: Add middleware for crawling websites with headless browser #16

Open
HQarroum opened this issue Jan 30, 2024 · 5 comments

Use case

Make it possible for customers to crawl one or more websites with a headless browser and forward the HTML of the crawled pages to other middlewares.

Solution/User Experience

No response

Alternative solutions

No response

@HQarroum HQarroum added the triage and new-middleware labels Jan 30, 2024
@HQarroum HQarroum self-assigned this Jan 30, 2024
@HQarroum HQarroum moved this to Backlog in Project Lakechain Jan 30, 2024
@HQarroum HQarroum moved this from Backlog to In review in Project Lakechain Jan 30, 2024
moltar commented Feb 17, 2024

A good candidate could be the Crawlee framework.
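
For reference, a minimal sketch of what a Crawlee crawler looks like with its PlaywrightCrawler; the start URL and handler body are placeholders, not part of any proposal here.

import { PlaywrightCrawler } from 'crawlee';

// Minimal sketch: crawl pages with a headless browser, capture the HTML,
// and enqueue same-domain links for further crawling.
const crawler = new PlaywrightCrawler({
  maxRequestsPerCrawl: 100,
  async requestHandler({ request, page, enqueueLinks, log }) {
    log.info(`Crawling ${request.url}`);
    const html = await page.content();
    // Forward `html` to the next processing step here.
    await enqueueLinks({ strategy: 'same-domain' });
  },
});

await crawler.run(['https://example.com']);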

@HQarroum (Contributor, Author)

Hey @moltar.

Yes, Crawlee, along with the PlaywrightCrawler, is definitely our target. Our plan is to release the Web Crawler middleware as a Fargate cluster that can spawn headless browsers to crawl websites using user-defined strategies.

I've come up with a draft API for this future middleware. Feel free to give us feedback on it, and on your use cases, so we can make sure we cover them.

Middleware API

const crawler = new WebCrawler.Builder()
  .withScope(this)
  .withIdentifier('WebCrawler')
  .withCacheStorage(cacheStorage)
  // Browser engine options (optional).
  .withEngineOptions(new EngineOptions.Builder()
    .withUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/58.0.3029.110')
    .withUseIncognitoPages(true)
    .withUseExperimentalContainers(true)
    .build()
  )
  // Crawler options (optional).
  .withCrawlerOptions(new CrawlerOptions.Builder()
    .withRequestHandlerTimeoutSecs(30)
    .withHandleRequestTimeoutSecs(30)
    .withMaxRequestsPerCrawl(100)
    .withMaxRequestRetries(5)
    .withSameDomainDelaySecs(1)
    .withMaxSessionRotations(5)
    .withMinConcurrency(1)
    .withMaxConcurrency(5)
    .withMaxRequestsPerMinute(100)
    .withKeepAlive(true)
    .withUseSessionPool(true)
    .withStatusMessageLoggingInterval(10)
    .withRetryOnBlocked(true)
    // One of 'same-domain', 'all', 'same-origin' or 'none'.
    .withEnqueuePolicy('same-domain')
    // By default, the `Web Crawler` will only crawl HTML documents, however
    // customers may opt to crawl additional data types and send them to other
    // middlewares in their document processing pipelines.
    .withCapturedDocumentTypes('html', 'pdf', 'docx')
    .build()
  )
  .build();
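
For context, most of these crawler options map fairly directly onto Crawlee's own PlaywrightCrawler configuration. The sketch below shows an assumed mapping for a subset of them; the option names on the Crawlee side are real, but the mapping itself is illustrative, not the middleware's actual implementation.

import { PlaywrightCrawler } from 'crawlee';

// Illustrative mapping of some builder options above to Crawlee options.
const crawler = new PlaywrightCrawler({
  requestHandlerTimeoutSecs: 30,    // withRequestHandlerTimeoutSecs(30)
  maxRequestsPerCrawl: 100,         // withMaxRequestsPerCrawl(100)
  maxRequestRetries: 5,             // withMaxRequestRetries(5)
  sameDomainDelaySecs: 1,           // withSameDomainDelaySecs(1)
  maxSessionRotations: 5,           // withMaxSessionRotations(5)
  minConcurrency: 1,                // withMinConcurrency(1)
  maxConcurrency: 5,                // withMaxConcurrency(5)
  maxRequestsPerMinute: 100,        // withMaxRequestsPerMinute(100)
  keepAlive: true,                  // withKeepAlive(true)
  useSessionPool: true,             // withUseSessionPool(true)
  retryOnBlocked: true,             // withRetryOnBlocked(true)
  launchContext: {
    useIncognitoPages: true,        // withUseIncognitoPages(true)
  },
  async requestHandler({ page, enqueueLinks }) {
    const html = await page.content(); // HTML would be forwarded to other middlewares.
    await enqueueLinks({ strategy: 'same-domain' }); // withEnqueuePolicy('same-domain')
  },
});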

moltar commented Feb 17, 2024

I'd prefer an alternative to an ECS cluster.

Unless we are talking about tasks that scale to zero.

It is possible to run Playwright in a Lambda.

@HQarroum (Contributor, Author)

Got it! And yes, the tasks will scale to zero, as they do for every existing middleware :).

Playwright can run in Lambda, but we're afraid that the 15-minute timeout might not be enough to crawl bigger websites.

moltar commented Feb 19, 2024

"15 minutes might not be enough time to crawl bigger websites"

Shouldn't the crawling process be distributed anyway?

I'd crawl the entry page and then put all of the discovered sub-pages back into the queue, letting the crawler handler process one fetch at a time.

Also, unbounded crawling can lead to runaway costs; in a way, a timeout could be the forcing function that prevents foot guns :)
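
A minimal sketch of that queue-driven pattern, assuming an SQS queue feeding a Playwright-based Lambda handler. The queue URL, link extraction, and forwarding step are hypothetical placeholders; a real deployment would also need URL de-duplication and a Lambda-compatible Chromium build.

import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import { chromium } from 'playwright';
import type { SQSEvent } from 'aws-lambda';

// Sketch: each invocation fetches one page, captures its HTML, and pushes
// the links it discovers back onto the queue for later invocations.
const sqs = new SQSClient({});
const QUEUE_URL = process.env.QUEUE_URL!; // hypothetical environment variable

export const handler = async (event: SQSEvent) => {
  const browser = await chromium.launch();
  try {
    for (const record of event.Records) {
      const { url } = JSON.parse(record.body);
      const page = await browser.newPage();
      await page.goto(url);
      const html = await page.content();
      // Forward `html` to the next middleware here (e.g. write it to S3).

      // Re-enqueue discovered links; de-duplication (e.g. DynamoDB) omitted.
      const links = await page.$$eval('a[href]', (anchors) =>
        anchors.map((a) => (a as HTMLAnchorElement).href));
      for (const link of links) {
        await sqs.send(new SendMessageCommand({
          QueueUrl: QUEUE_URL,
          MessageBody: JSON.stringify({ url: link }),
        }));
      }
      await page.close();
    }
  } finally {
    await browser.close();
  }
};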

@HQarroum HQarroum moved this from In review to Planned in Project Lakechain Feb 22, 2024