Feature request: Add middleware for crawling websites with headless browser #16
Comments
A good candidate could be the Crawlee framework.
Hey @moltar. Yes, Crawlee is a good candidate. Along those lines, I've come up with a draft API for this future middleware. Feel free to give us feedback on it and on your use-cases to ensure we cover them.

Middleware API

```ts
const crawler = new WebCrawler.Builder()
  .withScope(this)
  .withIdentifier('WebCrawler')
  .withCacheStorage(cacheStorage)
  // Browser engine options (optional).
  .withEngineOptions(new EngineOptions.Builder()
    .withUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/58.0.3029.110')
    .withUseIncognitoPages(true)
    .withUseExperimentalContainers(true)
    .build()
  )
  // Crawler options (optional).
  .withCrawlerOptions(new CrawlerOptions.Builder()
    .withRequestHandlerTimeoutSecs(30)
    .withHandleRequestTimeoutSecs(30)
    .withMaxRequestsPerCrawl(100)
    .withMaxRequestRetries(5)
    .withSameDomainDelaySecs(1)
    .withMaxSessionRotations(5)
    .withMinConcurrency(1)
    .withMaxConcurrency(5)
    .withMaxRequestsPerMinute(100)
    .withKeepAlive(true)
    .withUseSessionPool(true)
    .withStatusMessageLoggingInterval(10)
    .withRetryOnBlocked(true)
    // One of 'same-domain', 'same-origin', 'all', or 'none'.
    .withEnqueuePolicy('same-domain')
    // By default, the `Web Crawler` will only crawl HTML documents; however,
    // customers may opt to crawl additional data types and send them to other
    // middlewares in their document processing pipelines.
    .withCapturedDocumentTypes('html', 'pdf', 'docx')
    .build()
  )
  .build();
```
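The fluent builder style used by the draft API above can be sketched as follows. This is an illustrative sketch only: `EngineOptions` and its option names are borrowed from the draft for demonstration, and none of this is the actual middleware implementation.

```typescript
// Minimal sketch of the immutable-options + fluent-builder pattern.
// All names here are assumptions for illustration, not the real API.
class EngineOptions {
  constructor(
    public readonly userAgent?: string,
    public readonly useIncognitoPages: boolean = false,
  ) {}

  static Builder = class {
    private userAgent?: string;
    private useIncognitoPages = false;

    withUserAgent(ua: string) { this.userAgent = ua; return this; }
    withUseIncognitoPages(v: boolean) { this.useIncognitoPages = v; return this; }

    // `build()` freezes the accumulated settings into an immutable value.
    build(): EngineOptions {
      return new EngineOptions(this.userAgent, this.useIncognitoPages);
    }
  };
}

const opts = new EngineOptions.Builder()
  .withUserAgent('Mozilla/5.0')
  .withUseIncognitoPages(true)
  .build();
```

One nice property of this pattern is that each `with*` call returns the builder itself, so option chains read declaratively while the final `build()` yields a read-only configuration object.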
I'd prefer an alternative to an ECS cluster, unless we are talking about tasks that scale to zero. It is possible to run Playwright in a Lambda.
Got it! And yes, the tasks will scale to zero, as they do for every existing middleware :). Playwright will run in Lambda, but we're afraid that 15 minutes might not be enough time to crawl bigger websites.
Shouldn't the crawling process be distributed anyway? I'd crawl the entry page and then put all of the discovered sub-pages back into the queue to be crawled. The crawler handler processes one fetch at a time. Also, unbounded crawling can lead to runaway costs; in a way, a timeout could be the forcing function to prevent foot-guns :)
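The queue-driven approach described above can be sketched as a simple loop. This is a sketch under stated assumptions: the in-memory array stands in for a real message queue (e.g. SQS), and `fetchPage` is a hypothetical function that fetches a page and returns the links it contains.

```typescript
// Returns the links discovered on a page; hypothetical, stands in for a
// real Playwright-backed fetch.
type FetchPage = (url: string) => Promise<string[]>;

async function crawl(
  entryUrl: string,
  fetchPage: FetchPage,
  // A request budget is the forcing function against runaway costs.
  maxRequests = 100,
): Promise<string[]> {
  const queue: string[] = [entryUrl];
  const visited = new Set<string>();

  while (queue.length > 0 && visited.size < maxRequests) {
    const url = queue.shift()!;
    if (visited.has(url)) continue;
    visited.add(url);
    // Each iteration models one handler invocation: fetch a single page,
    // then enqueue the sub-pages it links to.
    for (const link of await fetchPage(url)) {
      if (!visited.has(link)) queue.push(link);
    }
  }
  return [...visited];
}
```

With a real distributed queue, each loop iteration would be a separate short-lived handler invocation, so no single invocation needs to approach the Lambda 15-minute limit regardless of site size.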
Use case
Make it possible for customers to crawl one or multiple websites using a headless browser to forward the HTML associated with web pages to other middlewares.
Solution/User Experience
No response
Alternative solutions
No response