Feature request: Add middleware for crawling websites with headless browser #16

Open
HQarroum opened this issue Jan 30, 2024 · 5 comments

Use case

Make it possible for customers to crawl one or more websites with a headless browser and forward the HTML of the crawled pages to other middlewares.

Solution/User Experience

No response

Alternative solutions

No response

@HQarroum HQarroum added the triage and new-middleware labels Jan 30, 2024
@HQarroum HQarroum self-assigned this Jan 30, 2024
@HQarroum HQarroum moved this to Backlog in Project Lakechain Jan 30, 2024
@HQarroum HQarroum moved this from Backlog to In review in Project Lakechain Jan 30, 2024
moltar commented Feb 17, 2024

A good candidate could be the Crawlee framework.
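
For reference, a minimal sketch of what a Crawlee crawler looks like with its PlaywrightCrawler; the start URL and handler body are placeholders, not part of any proposal here.

import { PlaywrightCrawler } from 'crawlee';

// Minimal sketch: crawl pages with a headless browser, capture the HTML,
// and enqueue same-domain links for further crawling.
const crawler = new PlaywrightCrawler({
  maxRequestsPerCrawl: 100,
  async requestHandler({ request, page, enqueueLinks, log }) {
    log.info(`Crawling ${request.url}`);
    const html = await page.content();
    // Forward `html` to the next processing step here.
    await enqueueLinks({ strategy: 'same-domain' });
  },
});

await crawler.run(['https://example.com']);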

@HQarroum (Contributor, Author)

Hey @moltar.

Yes, Crawlee, along with the PlaywrightCrawler, is definitely our target. Our plan is to release the Web Crawler middleware as a Fargate cluster that can spawn headless browsers to crawl websites using user-defined strategies.

I've come up with a draft API for this future middleware. Feel free to give us feedback on it, and on your use cases, so we can make sure we cover them.

Middleware API

const crawler = new WebCrawler.Builder()
  .withScope(this)
  .withIdentifier('WebCrawler')
  .withCacheStorage(cacheStorage)
  // Browser engine options (optional).
  .withEngineOptions(new EngineOptions.Builder()
    .withUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/58.0.3029.110')
    .withUseIncognitoPages(true)
    .withUseExperimentalContainers(true)
    .build()
  )
  // Crawler options (optional).
  .withCrawlerOptions(new CrawlerOptions.Builder()
    .withRequestHandlerTimeoutSecs(30)
    .withHandleRequestTimeoutSecs(30)
    .withMaxRequestsPerCrawl(100)
    .withMaxRequestRetries(5)
    .withSameDomainDelaySecs(1)
    .withMaxSessionRotations(5)
    .withMinConcurrency(1)
    .withMaxConcurrency(5)
    .withMaxRequestsPerMinute(100)
    .withKeepAlive(true)
    .withUseSessionPool(true)
    .withStatusMessageLoggingInterval(10)
    .withRetryOnBlocked(true)
    // One of 'same-domain', 'all', 'same-origin' or 'none'.
    .withEnqueuePolicy('same-domain')
    // By default, the `Web Crawler` will only crawl HTML documents, however
    // customers may opt to crawl additional data types and send them to other
    // middlewares in their document processing pipelines.
    .withCapturedDocumentTypes('html', 'pdf', 'docx')
    .build()
  )
  .build();
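
For context, most of these crawler options map fairly directly onto Crawlee's own PlaywrightCrawler configuration. The sketch below shows an assumed mapping for a subset of them; the option names on the Crawlee side are real, but the mapping itself is illustrative, not the middleware's actual implementation.

import { PlaywrightCrawler } from 'crawlee';

// Illustrative mapping of some builder options above to Crawlee options.
const crawler = new PlaywrightCrawler({
  requestHandlerTimeoutSecs: 30,    // withRequestHandlerTimeoutSecs(30)
  maxRequestsPerCrawl: 100,         // withMaxRequestsPerCrawl(100)
  maxRequestRetries: 5,             // withMaxRequestRetries(5)
  sameDomainDelaySecs: 1,           // withSameDomainDelaySecs(1)
  maxSessionRotations: 5,           // withMaxSessionRotations(5)
  minConcurrency: 1,                // withMinConcurrency(1)
  maxConcurrency: 5,                // withMaxConcurrency(5)
  maxRequestsPerMinute: 100,        // withMaxRequestsPerMinute(100)
  keepAlive: true,                  // withKeepAlive(true)
  useSessionPool: true,             // withUseSessionPool(true)
  retryOnBlocked: true,             // withRetryOnBlocked(true)
  launchContext: {
    useIncognitoPages: true,        // withUseIncognitoPages(true)
  },
  async requestHandler({ page, enqueueLinks }) {
    const html = await page.content(); // HTML would be forwarded to other middlewares.
    await enqueueLinks({ strategy: 'same-domain' }); // withEnqueuePolicy('same-domain')
  },
});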

moltar commented Feb 17, 2024

I'd prefer an alternative to an ECS cluster.

Unless we are talking about tasks that scale to zero.

It is possible to run Playwright in a Lambda.

@HQarroum (Contributor, Author)

Got it! And yes, the tasks will scale to zero, as they do for every existing middleware :).

Playwright can run in Lambda, but we're afraid that the 15-minute timeout might not be enough to crawl bigger websites.

moltar commented Feb 19, 2024

"15 minutes might not be enough time to crawl bigger websites"

Shouldn't the crawling process be distributed anyway?

I'd crawl the entry page and then put all of the discovered sub-pages back into the queue, letting the crawler handler process one fetch at a time.

Also, unbounded crawling can lead to runaway costs; in a way, a timeout could be the forcing function that prevents foot guns :)
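
A minimal sketch of that queue-driven pattern, assuming an SQS queue feeding a Playwright-based Lambda handler. The queue URL, link extraction, and forwarding step are hypothetical placeholders; a real deployment would also need URL de-duplication and a Lambda-compatible Chromium build.

import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import { chromium } from 'playwright';
import type { SQSEvent } from 'aws-lambda';

// Sketch: each invocation fetches one page, captures its HTML, and pushes
// the links it discovers back onto the queue for later invocations.
const sqs = new SQSClient({});
const QUEUE_URL = process.env.QUEUE_URL!; // hypothetical environment variable

export const handler = async (event: SQSEvent) => {
  const browser = await chromium.launch();
  try {
    for (const record of event.Records) {
      const { url } = JSON.parse(record.body);
      const page = await browser.newPage();
      await page.goto(url);
      const html = await page.content();
      // Forward `html` to the next middleware here (e.g. write it to S3).

      // Re-enqueue discovered links; de-duplication (e.g. DynamoDB) omitted.
      const links = await page.$$eval('a[href]', (anchors) =>
        anchors.map((a) => (a as HTMLAnchorElement).href));
      for (const link of links) {
        await sqs.send(new SendMessageCommand({
          QueueUrl: QUEUE_URL,
          MessageBody: JSON.stringify({ url: link }),
        }));
      }
      await page.close();
    }
  } finally {
    await browser.close();
  }
};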

@HQarroum HQarroum moved this from In review to Planned in Project Lakechain Feb 22, 2024