Cleanup of PlaywrightCrawler keeps zombie processes #1072

@ROYOSTI

I use PlaywrightCrawler with headless=True.
The package I use is crawlee[playwright]==0.6.3.

My code has its own batching system, and I noticed that memory usage slowly increases with each batch.
After some investigation, I saw that ps -fC headless_shell lists many headless_shell processes with CMD <defunct> (zombie processes).
I had hoped that the fix "Remove tmp folder for PlaywrightCrawler in non-headless mode" would also resolve this problem automatically, but unfortunately the issue still exists.
When I look up the parent PID of these processes, I just see the entry point of my code: python al_crawler/crawler.py run_mongo_crawler --crawler_name=blabla
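
For anyone who wants to reproduce the check described above from Python, here is a minimal sketch. It assumes a Linux system where ps accepts -eo pid,ppid,stat,comm; the helper names find_zombies and list_zombies are hypothetical and not part of crawlee or Playwright.

```python
import subprocess


def find_zombies(ps_output: str, name: str = "headless_shell") -> list[tuple[int, int]]:
    """Return (pid, ppid) pairs for zombie processes whose command contains `name`.

    Expects the text produced by `ps -eo pid,ppid,stat,comm` (header row included).
    """
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, ppid, stat, comm = parts
        # A STAT value starting with "Z" marks a zombie (defunct) process.
        if stat.startswith("Z") and name in comm:
            zombies.append((int(pid), int(ppid)))
    return zombies


def list_zombies(name: str = "headless_shell") -> list[tuple[int, int]]:
    """Run ps and report zombies by name (assumes a procps-style ps on Linux)."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return find_zombies(out, name)
```

Calling list_zombies() after each batch and logging the length of the result would show whether the number of defunct processes grows per batch, and the ppid column confirms which process has not reaped them.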

Below you can see my code for the batching system:

    # Create key values stores for batches
    scheduled_batches = await prepare_requests_from_mongo(crawler_name)
    processed_batches = await KeyValueStore.open(
        name=f'{crawler_name}-processed_batches'
    )

    # Create crawler
    crawler = await create_playwright_crawler(crawler_name)

    # Iterate over the batches
    async for key_info in scheduled_batches.iterate_keys():
        urls: List[str] = await scheduled_batches.get_value(key_info.key)
        requests = [
            Request.from_url(
                url,
                user_data={
                    'page_tags': [PageTag.HOME.value],
                    'chosen_page_tag': PageTag.HOME.value,
                    'label': PageTag.HOME.value,
                },
            )
            for url in urls
        ]
        LOGGER.info(f'Processing batch {key_info.key}')
        await crawler.run(requests)
        await scheduled_batches.set_value(key_info.key, None)
        await processed_batches.set_value(key_info.key, urls)

I assume a possible fix would be to create a new crawler for each batch, but I would prefer to reuse the same crawler.
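
The workaround I have in mind would look roughly like the sketch below. It is only an illustration of the "new crawler per batch" idea: run_batches_with_fresh_crawler and make_crawler are hypothetical names, and I have not confirmed that dropping the crawler actually makes the zombie processes go away.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def run_batches_with_fresh_crawler(
    make_crawler: Callable[[], Awaitable[Any]],
    batches: Iterable[tuple[str, list[Any]]],
) -> list[str]:
    """Run every batch on its own crawler instance and return the processed keys.

    `make_crawler` is a hypothetical async factory; the only assumption about the
    object it returns is that it has an async `run(requests)` method.
    """
    processed = []
    for key, requests in batches:
        # Build a new crawler for this batch instead of reusing one instance.
        crawler = await make_crawler()
        await crawler.run(requests)
        processed.append(key)
        # Drop the reference so nothing from this batch is kept alive by our code.
        del crawler
    return processed
```

In my setup the factory would be something like lambda: create_playwright_crawler(crawler_name), and the batches would come from the scheduled_batches store shown above.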

Labels: t-tooling (Issues with this label are in the ownership of the tooling team.)