How to stop crawling a web site when a goal is reached #506

sebpiq · 2024-09-06T14:45:42Z

sebpiq
Sep 6, 2024

I have a list of web sites from which I am trying to scrape a given piece of info. For each site, once I have found that info, I want to stop and move on to the next (with several sites being scraped concurrently).

I have tried the following approach (emptying the request queue once my goal is found):

request_queue = await RequestQueue.open()
crawler = PlaywrightCrawler(
    request_provider=request_queue,
    headless=True,  # Run without showing the browser window.
    browser_type='firefox',  # Use the Firefox browser.
)
    
await crawler.add_requests([root_url])

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # ...
    if found:
        await request_queue.drop()

But that actually raises an error:

 ValueError: Request queue with id "default" does not exist.

Any idea how I should proceed to get finer control over the request queue? Thanks!

Answered by vdusek

Nov 11, 2024

We will implement this, see #651 for more information.


sebpiq · 2024-09-06T16:40:41Z

sebpiq
Sep 6, 2024
Author

Also, should RequestQueue have a delete_request function to allow removing requests from the queue when they're not needed anymore?

1 reply

janbuchar Sep 9, 2024
Maintainer

The short answer is that the typical use case does not require this.

janbuchar · 2024-09-09T08:12:01Z

janbuchar
Sep 9, 2024
Maintainer

Let me get the requirements straight. There are several websites, and you want to crawl each of them until you find something specific. Do you use the same request handler for each page? I'm asking because it would make a lot of sense to keep the same crawler instance so that parallelism works well.

4 replies

sebpiq Sep 9, 2024
Author

it would make a lot of sense to keep the same crawler instance so that parallelism works well

Yes, I was also thinking that it would make sense to use the same crawler.

There are several websites, and you want to crawl each of them until you find something specific

Yes, that's it. On each of these websites, I am looking for a legal document. Once I have found it, I can just move on. So yes, this is the exact same handler for each URL.

janbuchar Sep 12, 2024
Maintainer

So, a simple solution that might be good enough would be to keep a set of processed websites outside of the crawler and then do something like this in your request handler:

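from urllib.parse import urlparse

# Import path assumed from the Crawlee version of that period;
# adjust it for your installed version.
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

# Hostnames of the websites where the target has already been found.
finished_hostnames: set[str] = set()
crawler = PlaywrightCrawler()

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    hostname = urlparse(context.request.url).hostname
    # Skip pages that belong to a website that is already finished.
    if hostname in finished_hostnames:
        return

    # ... keep crawling ...

    if found:  # `found` is set by your own detection logic above
        finished_hostnames.add(hostname)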

Of course, pages from finished websites that were enqueued before you marked them as finished will be opened anyway, but at least the crawler won't go deeper from them.
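
To make that behaviour concrete, here is a minimal simulation of the same idea that runs without Crawlee. It is only a sketch under stated assumptions: the link graph `LINKS`, the `TARGETS` set, the `crawl` function and the `deque` standing in for the request queue are all hypothetical names made up for illustration, not part of the Crawlee API.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages (not Crawlee API).
LINKS = {
    "https://a.example/": ["https://a.example/legal", "https://a.example/about"],
    "https://a.example/legal": ["https://a.example/legal/archive"],
    "https://a.example/about": ["https://a.example/team"],
    "https://b.example/": ["https://b.example/terms"],
    "https://b.example/terms": [],
}
# Hypothetical pages on which the goal (the legal document) is found.
TARGETS = {"https://a.example/legal", "https://b.example/terms"}


def crawl(start_urls):
    """Breadth-first crawl that stops expanding a site once its goal is found."""
    finished_hostnames = set()
    queue = deque(start_urls)  # stand-in for the request queue
    seen = set(start_urls)
    opened = []    # every page taken from the queue
    expanded = []  # pages that were actually processed by the handler logic

    while queue:
        url = queue.popleft()
        opened.append(url)
        hostname = urlparse(url).hostname

        # Same early return as in the request handler above.
        if hostname in finished_hostnames:
            continue

        expanded.append(url)
        if url in TARGETS:
            # Goal reached: mark the site and enqueue nothing more for it.
            finished_hostnames.add(hostname)
            continue

        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)

    return opened, expanded, finished_hostnames


opened, expanded, finished = crawl(["https://a.example/", "https://b.example/"])
```

In this run, `https://a.example/about` is still opened because it was queued before `a.example` was marked as finished, but it is not expanded, so `https://a.example/team` is never reached.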

EngineerKhan Sep 13, 2024

but at least the crawler won't go deeper from them.

I have a somewhat similar question: is there a way to specify the maximum crawl depth? (I am using BeautifulSoupCrawler.)

janbuchar Sep 13, 2024
Maintainer

@EngineerKhan I don't see how this is similar, but see #460

vdusek · 2024-11-11T10:36:52Z

vdusek
Nov 11, 2024
Maintainer

We will implement this, see #651 for more information.

0 replies
