Closed
Description
I am trying to write a simple function to crawl a website, and I don't want Crawlee to cache anything (each time I call this function it should do everything from scratch).
Here is my attempt so far. I tried with persist_storage=False and purge_on_start=True in the configuration, and also with removing the storage directory entirely, but I keep getting either a concatenated result of all previous requests, or an empty result when I delete the storage directory.
# Import paths assume Crawlee for Python; they may differ between versions.
from crawlee import Glob
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.configuration import Configuration
from crawlee.storages import Dataset


async def main(
    website: str,
    include_links: list[str],
    exclude_links: list[str],
    depth: int = 5,
) -> str:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=depth,
    )
    dataset = await Dataset.open(
        configuration=Configuration(
            persist_storage=False,
            purge_on_start=True,
        ),
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:  # type: ignore
        # Extract data from the page.
        text = context.soup.get_text()
        await dataset.push_data({"content": text})
        # Enqueue all links found on the page.
        await context.enqueue_links(
            include=[Glob(url) for url in include_links],
            exclude=[Glob(url) for url in exclude_links],
        )

    # Run the crawler with the initial list of URLs.
    await crawler.run([website])

    data = await dataset.get_data()
    content = "\n".join([item["content"] for item in data.items])  # type: ignore
    return content
Also, is there a way to simply get the result of the crawl as a string, without using a Dataset?
Any help is appreciated 🤗 thank you in advance!
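(One possible approach, not from the thread: since the request handler is just a closure, it can append each page's text to a plain Python list and the surrounding function can join it afterwards, with no Dataset involved. A minimal sketch, assuming the same Crawlee-for-Python imports as in the snippet above:)

```python
async def crawl_to_string(website: str, depth: int = 5) -> str:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=depth)
    pages: list[str] = []  # plain in-memory accumulator, nothing is persisted

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Collect the page text locally instead of pushing it to a Dataset.
        pages.append(context.soup.get_text())
        await context.enqueue_links()

    await crawler.run([website])
    return "\n".join(pages)
```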
Activity
janbuchar commented on Jul 29, 2024
Hello and thank you for your interest in Crawlee! This seems closely related to #351. Could you please re-check that you get an empty string if you run this after removing the storage directory? I can imagine getting an empty string on a second run without deleting the storage (because of both persist_storage=False and purge_on_start functioning incorrectly), but what you're describing sounds strange.

tlinhart commented on Sep 24, 2024
After some debugging I found a workaround to avoid re-using the cache. Basically we have to ensure that each time the crawler runs it uses a different request queue, e.g. like this:
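(The original snippet is not preserved in this copy; below is a minimal sketch of that idea, assuming RequestQueue.open() is given a unique name per run and the queue is handed to the crawler. The request_provider keyword and the import paths are assumptions and may differ between Crawlee versions.)

```python
import uuid

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.storages import RequestQueue


async def make_crawler(depth: int) -> BeautifulSoupCrawler:
    # Open a request queue under a fresh, unique name so nothing from a
    # previous run can be reused.
    request_queue = await RequestQueue.open(name=str(uuid.uuid4()))
    return BeautifulSoupCrawler(
        request_provider=request_queue,  # keyword name is an assumption; check your Crawlee version
        max_requests_per_crawl=depth,
    )
```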
It would be great if we could actually disable caching at all but this works for now.
vdusek commented on Sep 25, 2024
@tlinhart Thanks. I will also link #541 here as it provides additional context.
tlinhart commented on Sep 25, 2024
Thanks. If it helps, I found out during debugging that the problem seems to be that the same instance of MemoryStorageClient is used across runs. There must be some reference left over after the first run.

janbuchar commented on Sep 25, 2024
Yes, that is the case. We're carrying a lot of historical baggage here, and maybe this mechanism won't even be necessary in the end. Until then, I'm happy that you found a workaround.