
How can I disable cache completely? #369

Closed

Description

@1hachem

I am trying to write a simple function to crawl a website and I don't want crawlee to cache anything (each time I call this function it will do everything from scratch).

Here is my attempt so far. I tried with persist_storage=False and purge_on_start=True in the configuration, and also tried removing the storage directory entirely, but I keep getting either a concatenated result of all previous requests, or an empty result in the case where I delete the storage directory.

```python
# Imports added for completeness; module paths may differ across crawlee versions.
from crawlee import Glob
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.configuration import Configuration
from crawlee.storages import Dataset


async def main(
    website: str,
    include_links: list[str],
    exclude_links: list[str],
    depth: int = 5,
) -> str:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=depth,
    )
    dataset = await Dataset.open(
        configuration=Configuration(
            persist_storage=False,
            purge_on_start=True,
        ),
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:  # type: ignore
        # Extract the visible text from the page.
        text = context.soup.get_text()

        await dataset.push_data({"content": text})

        # Enqueue all links found on the page.
        await context.enqueue_links(
            include=[Glob(url) for url in include_links],
            exclude=[Glob(url) for url in exclude_links],
        )

    # Run the crawler with the initial list of URLs.
    await crawler.run([website])
    data = await dataset.get_data()

    content = "\n".join([item["content"] for item in data.items])  # type: ignore

    return content
```

Also, is there a way to simply get the result of the crawl as a string, without using a Dataset?
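One pattern that sidesteps Dataset entirely is to accumulate page text in a plain list captured by the handler's closure and join it at the end. A minimal, crawlee-free sketch of that pattern (`collect_texts` and `handler` are illustrative names; a real handler would append `context.soup.get_text()`):

```python
import asyncio


async def collect_texts(urls: list[str]) -> str:
    # Accumulate page text in a local list instead of a Dataset.
    pages: list[str] = []

    # Stand-in for a crawler request handler; in real code this would be
    # registered on the router and read from the crawling context.
    async def handler(url: str) -> None:
        pages.append(f"text of {url}")

    for url in urls:
        await handler(url)

    # The joined string lives only in memory, so nothing touches disk storage.
    return "\n".join(pages)


result = asyncio.run(collect_texts(["https://a.example", "https://b.example"]))
```

Because the list is local to each call, every invocation starts from scratch, which is exactly the no-caching behavior asked about above.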

any help is appreciated 🤗 thank you in advance !

Activity

janbuchar (Collaborator) commented on Jul 29, 2024

Hello and thank you for your interest in Crawlee! This seems closely related to #351. Could you please re-check that you get an empty string if you run this after removing the storage directory? I can imagine getting an empty string on a second run without deleting the storage (because of both persist_storage=False and purge_on_start functioning incorrectly), but what you're describing sounds strange.

Label t-tooling (Issues with this label are in the ownership of the tooling team) added on Jul 31, 2024.
tlinhart commented on Sep 24, 2024

After some debugging I found a workaround to avoid re-using the cache. Basically we have to ensure that each time the crawler runs it uses a different request queue, e.g. like this:

```python
import uuid

...
config = Configuration.get_global_configuration()
config.default_request_queue_id = uuid.uuid4().hex
...
```

It would be great if we could actually disable caching altogether, but this works for now.
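The workaround above relies on uuid.uuid4().hex producing a fresh identifier on every call, so no two runs can ever resolve to the same default request queue. A stand-alone sketch of that guarantee (the helper name is illustrative):

```python
import uuid


def fresh_queue_id() -> str:
    # uuid4() is random, so every call yields a new 32-character hex
    # string; successive crawler runs therefore get distinct queues
    # and never see each other's cached requests.
    return uuid.uuid4().hex


first = fresh_queue_id()
second = fresh_queue_id()
```

Setting this id before each run effectively turns the persistent queue into a throwaway one.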

vdusek (Collaborator) commented on Sep 25, 2024

@tlinhart Thanks. I will also link #541 here as it provides additional context.

self-assigned this on Sep 25, 2024
tlinhart commented on Sep 25, 2024

Thanks. If it helps, I found out during debugging that the problem seems to be that the same instance of MemoryStorageClient is used across runs. There must be some reference left over after the first run.

janbuchar (Collaborator) commented on Sep 25, 2024

> Thanks. If it helps, I found out during debugging that the problem seems to be that the same instance of MemoryStorageClient is used across runs. There must be some reference left over after the first run.

Yes, that is the case. We're carrying a lot of historical baggage here, and maybe this mechanism won't even be necessary in the end. Until then, I'm happy that you found a workaround.

removed this from the 100th sprint - Tooling team milestone on Oct 21, 2024
removed their assignment on Nov 1, 2024

13 remaining items


Metadata

Labels: bug (Something isn't working), t-tooling (Issues with this label are in the ownership of the tooling team)

Participants: @janbuchar, @fnesveda, @tlinhart, @vdusek, @amindadgar