Question: How to remove specific url from dupefilter set on redis? #302
Comments
Did you try the `dont_filter` request option?
It works, but since I schedule the same site to be crawled every hour, it's easier for me to capture any errors and remove those URLs from the set. That way, the next run will automatically retry them. Otherwise, I would need to track the status of each previous link somewhere else and pass it to the request.
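For reference, a minimal sketch of the `dont_filter` approach discussed above. The spider name and the `failed_urls` set are hypothetical placeholders for whatever bookkeeping the project keeps about previously failed URLs:

```python
import scrapy

class HourlySpider(scrapy.Spider):
    name = "hourly"  # hypothetical spider name
    start_urls = ["https://example.com/"]
    # Hypothetical bookkeeping: URLs whose previous run failed and should be retried.
    failed_urls = {"https://example.com/page-that-errored"}

    def start_requests(self):
        for url in set(self.start_urls) | self.failed_urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                # Bypass the dupefilter only for URLs we explicitly want to retry.
                dont_filter=url in self.failed_urls,
            )

    def parse(self, response):
        ...  # extraction / LLM handoff goes here
```

The drawback, as noted above, is that the failed URLs have to be tracked somewhere outside the dupefilter.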
Looks like you need a slightly different implementation. The dupefilter assumes a single crawl session, and by default it uses the spider name as the Redis key prefix. But it seems you want to start a new crawl session every hour, so you will need to customize the dupefilter key for each new crawl session, or delete the entire dupefilter key set from Redis before starting a new crawl. Does that make sense?
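A minimal sketch of the "reset before each session" option, assuming scrapy-redis's default dupefilter key pattern of `<spider name>:dupefilter`; confirm the actual key name in your Redis instance (e.g. `redis-cli KEYS "*dupefilter*"`) before deleting anything:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Wipe every stored fingerprint for this spider before scheduling a new run.
# "myspider" is a placeholder for the real spider name.
r.delete("myspider:dupefilter")
```

The equivalent from the shell is `redis-cli DEL myspider:dupefilter`. Alternatively, a per-session key (for example by changing the dupefilter key setting between runs) keeps each hourly run isolated, at the cost of leaving old keys behind to clean up.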
When I use a dedicated spider for a single website, I generally just delete the entire dupefilter, and it works fine. However, we have two "general spiders" that are shared across more than 1000 URLs. In that case, deleting the entire dupefilter would have a huge impact on all the other URLs. This is where it would be helpful to delete only the specific URLs. Perhaps the solution is to customize the dupefilter logic...
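One way to customize the dupefilter logic is a small subclass that knows how to forget a single request. This is only a sketch and assumes the attribute names exposed by `scrapy_redis.dupefilter.RFPDupeFilter` (`server`, `key`, `request_fingerprint`) in the installed version; verify them before relying on it:

```python
from scrapy_redis.dupefilter import RFPDupeFilter

class ForgettableDupeFilter(RFPDupeFilter):
    """RFPDupeFilter with a helper to drop a single request's fingerprint."""

    def forget(self, request):
        # Compute the same fingerprint that request_seen() stored, then remove
        # it from the Redis set. Returns 1 if it was present, 0 otherwise.
        fp = self.request_fingerprint(request)
        return self.server.srem(self.key, fp)
```

It would be enabled with `DUPEFILTER_CLASS = "myproject.dupefilters.ForgettableDupeFilter"` (a hypothetical module path) and called from whatever code detects the failure, for example a pipeline that reaches the dupefilter through the crawler's scheduler; the exact access path depends on the Scrapy version.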
Description
Today, I use web scraping together with LLMs to extract data from URLs. If, for some reason, the LLM step fails, my plan is to retry processing during the next scheduled run (I schedule URLs every 15 minutes).
The dupefilter is very helpful, but when the LLM fails on a URL that Scrapy has already parsed, I need to download just that specific URL again.
Is there a way to remove a specific URL from the dupefilter Redis set?
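For context, the dupefilter set stores request fingerprints rather than raw URLs, so removing one entry means recomputing the fingerprint for that URL and issuing SREM. A minimal, hedged sketch, assuming an older Scrapy where `scrapy.utils.request.request_fingerprint` is the function scrapy-redis uses (newer Scrapy versions replace it with a `RequestFingerprinter` component, so match whatever your installed versions actually use) and assuming the default `<spider name>:dupefilter` key pattern:

```python
import redis
from scrapy import Request
from scrapy.utils.request import request_fingerprint

r = redis.Redis(host="localhost", port=6379, db=0)

def forget_url(url, key="myspider:dupefilter"):
    """Remove one URL's fingerprint so the next run will download it again."""
    fp = request_fingerprint(Request(url))
    # SREM returns 1 if the fingerprint was in the set and got removed, 0 otherwise.
    return r.srem(key, fp)

forget_url("https://example.com/page-where-the-llm-failed")
```

The fingerprint must be computed exactly the way the running crawler computes it (same Scrapy version, same fingerprinter settings), otherwise SREM will silently remove nothing.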