Description
I use PlaywrightCrawler with headless=True.
The package that I use is: crawlee[playwright]==0.6.3
In my code I have my own batching system in place.
But I noticed that my memory slowly starts to increase on each batch.
After some investigation I saw that ps -fC headless_shell listed a lot of headless_shell entries with CMD <defunct> (zombie processes).
I had hoped this was related to the fix Remove tmp folder for PlaywrightCrawler in non-headless mode and would be resolved automatically, but unfortunately the issue still exists.
When I look up the parent PID, I just see the entry point of my code: python al_crawler/crawler.py run_mongo_crawler --crawler_name=blabla
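For anyone trying to reproduce this, here is a small sketch I use to count the zombies from inside the process instead of eyeballing ps output. It scans /proc directly, so it is Linux-only; the function name is just my own helper, not part of crawlee:

```python
import os


def count_zombie_processes(name: str = 'headless_shell') -> int:
    """Count zombie processes with the given comm name by scanning /proc (Linux only)."""
    count = 0
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            with open(f'/proc/{pid}/stat') as f:
                stat = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # /proc/<pid>/stat format: "pid (comm) state ..."; comm may contain spaces
        comm = stat[stat.index('(') + 1:stat.rindex(')')]
        state = stat[stat.rindex(')') + 2]
        if comm == name and state == 'Z':
            count += 1
    return count
```

Calling this before and after each batch makes the leak easy to chart over time.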
Below you can see my code for the batching system:
```python
from typing import List

from crawlee import Request
from crawlee.storages import KeyValueStore

# Create key-value stores for batches
scheduled_batches = await prepare_requests_from_mongo(crawler_name)
processed_batches = await KeyValueStore.open(
    name=f'{crawler_name}-processed_batches'
)

# Create crawler
crawler = await create_playwright_crawler(crawler_name)

# Iterate over the batches
async for key_info in scheduled_batches.iterate_keys():
    urls: List[str] = await scheduled_batches.get_value(key_info.key)
    requests = [
        Request.from_url(
            url,
            user_data={
                'page_tags': [PageTag.HOME.value],
                'chosen_page_tag': PageTag.HOME.value,
                'label': PageTag.HOME.value,
            },
        )
        for url in urls
    ]
    LOGGER.info(f'Processing batch {key_info.key}')
    await crawler.run(requests)
    await scheduled_batches.set_value(key_info.key, None)
    await processed_batches.set_value(key_info.key, urls)
```
I assume a possible fix would be to create a new crawler for each batch, but I would like to reuse the same crawler.
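As a stop-gap between batches (not a real fix for the leak), the zombies can be reaped manually, since the defunct entries belong to the Python parent process. This is a hedged sketch, POSIX-only, and the helper name is my own; note that calling os.waitpid(-1, ...) yourself can race with asyncio's own child handling, so I would treat it as a diagnostic aid rather than something to ship:

```python
import os


def reap_zombie_children() -> int:
    """Reap any exited child processes (zombies) of the current process.

    Returns the number of children reaped. Safe to call between batches:
    it stops as soon as there are no reapable children left.
    """
    reaped = 0
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none have exited yet
        reaped += 1
    return reaped
```

Calling reap_zombie_children() right after each crawler.run(requests) keeps the process table from filling up while the underlying cause is investigated.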