
Question: How to remove a specific URL from the dupefilter set on Redis? #302

matheus-rossi opened this issue Nov 18, 2024 · 4 comments

@matheus-rossi

Description

Today, I use web scraping and LLMs to extract data from URLs. If, for some reason, I encounter an error with the LLM, my plan is to retry processing during the next scheduled run (I schedule URLs every 15 minutes).

The dupefilter is very helpful, but when the LLM fails on a URL that Scrapy has already parsed, I need to download just that specific URL again.

Is there a way to remove a specific URL from the dupefilter Redis set?

@rmax (Owner) commented Nov 18, 2024

Did you try the dont_filter request parameter?
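
For reference, that would look roughly like this (a minimal sketch; the spider name, URL, and callback are placeholders):

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"  # placeholder spider

    def start_requests(self):
        # dont_filter=True makes the scheduler skip the dupefilter check
        # for this single request, so it is downloaded even if its
        # fingerprint is already in the Redis set.
        yield scrapy.Request(
            "https://example.com/page",
            callback=self.parse,
            dont_filter=True,
        )

    def parse(self, response):
        self.logger.info("fetched %s", response.url)
```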

@matheus-rossi (Author)

It works...

But as I schedule the same site to be crawled every hour, it's easier for me to capture any errors and remove them from the set. This way, on the next run, it will automatically try again.

Otherwise, I would need to track each previously crawled link's status somewhere else and decide whether to pass dont_filter on the request.
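
For reference, dropping a single failed URL from the set could look roughly like this (a minimal sketch, assuming the default scrapy-redis key "<spider name>:dupefilter" and Scrapy's default request fingerprint; forget_url and the connection details are placeholders, and a custom fingerprinter or key template would need matching changes):

```python
import redis
from scrapy import Request
from scrapy.utils.request import request_fingerprint


def forget_url(redis_url: str, spider_name: str, url: str) -> bool:
    """Remove one URL's fingerprint so it will be crawled again."""
    server = redis.from_url(redis_url)
    # Same hex SHA1 fingerprint the scrapy-redis dupefilter stores.
    fp = request_fingerprint(Request(url))
    # Default scrapy-redis dupefilter key: "<spider name>:dupefilter".
    return bool(server.srem(f"{spider_name}:dupefilter", fp))


# Hypothetical usage after a failed LLM run:
# forget_url("redis://localhost:6379/0", "general_spider", "https://example.com/article")
```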

@rmax (Owner) commented Nov 20, 2024

Looks like you need a slightly different implementation.

The dupefilter assumes a single crawl session and, by default, it uses the spider name as the Redis key prefix.

But it seems you want to start a new crawl session every hour, so you will need to customize the dupefilter key for each new crawl session, or delete the entire dupefilter key from Redis before starting a new crawl.
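
A per-session key could be set up roughly like this (a minimal sketch, assuming scrapy-redis's SCHEDULER_DUPEFILTER_KEY setting, whose default is "%(spider)s:dupefilter"; the timestamp suffix is just one way to vary the key):

```python
# settings.py (sketch)
import time

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# "%(spider)s" is still expanded by scrapy-redis with the spider name;
# the suffix gives each run its own dupefilter set.
SCHEDULER_DUPEFILTER_KEY = "%(spider)s:dupefilter:" + str(int(time.time()))
```

The alternative is to keep the default key and wipe it before each run, e.g. `redis-cli DEL <spider name>:dupefilter`.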

Does it make sense?

@matheus-rossi (Author)

When I use a specific spider for a website, I generally just delete the entire dupefilter, and it works fine.

However, we have two 'general spiders' that are shared across more than 1000 URLs. In this case, if I delete the entire dupefilter, the impact on the other URLs will be huge. This is where it would be helpful to delete only the specific URLs.

Perhaps the solution is to customize the dupefilter logic...
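
One possible direction (a minimal sketch, assuming scrapy-redis's RFPDupeFilter; ForgettableDupeFilter and forget are hypothetical names, not part of the scrapy-redis API):

```python
from scrapy_redis.dupefilter import RFPDupeFilter


class ForgettableDupeFilter(RFPDupeFilter):
    """RFPDupeFilter with a helper to 'un-see' a single request."""

    def forget(self, request):
        # Recompute the fingerprint the filter stored for this request
        # and SREM it, so the URL is eligible again on the next run.
        fp = self.request_fingerprint(request)
        self.server.srem(self.key, fp)
```

With DUPEFILTER_CLASS pointed at this class, an errback or item pipeline could call forget() for each URL whose LLM step failed, instead of wiping the whole set.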
