
Question: How to remove a specific URL from the dupefilter set on Redis? #302

matheus-rossi opened this issue Nov 18, 2024 · 4 comments

@matheus-rossi

Description

Today, I use web scraping and LLMs to extract data from URLs. If, for some reason, I encounter an error with the LLM, my plan is to retry processing during the next scheduled run (I schedule URLs every 15 minutes).

The dupefilter is very helpful, but when the LLM fails on a URL that Scrapy has already parsed, I need to download just that specific URL again.

Is there a way to remove a specific URL from the dupefilter Redis set?

@rmax (Owner) commented Nov 18, 2024

Did you try the dont_filter request parameter?
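
For reference, that would look roughly like this (a minimal sketch; the spider name, URL, and callback are placeholders):

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"  # placeholder spider

    def start_requests(self):
        # dont_filter=True makes the scheduler skip the dupefilter check
        # for this single request, so it is downloaded even if its
        # fingerprint is already in the Redis set.
        yield scrapy.Request(
            "https://example.com/page",
            callback=self.parse,
            dont_filter=True,
        )

    def parse(self, response):
        self.logger.info("fetched %s", response.url)
```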

@matheus-rossi (Author)

It works...

But as I schedule the same site to be crawled every hour, it's easier for me to capture any errors and remove them from the set. This way, on the next run, it will automatically try again.

Otherwise, I would need to track each previously crawled link's status somewhere else and decide whether to pass dont_filter on the request.
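
For reference, dropping a single failed URL from the set could look roughly like this (a minimal sketch, assuming the default scrapy-redis key "<spider name>:dupefilter" and Scrapy's default request fingerprint; forget_url and the connection details are placeholders, and a custom fingerprinter or key template would need matching changes):

```python
import redis
from scrapy import Request
from scrapy.utils.request import request_fingerprint


def forget_url(redis_url: str, spider_name: str, url: str) -> bool:
    """Remove one URL's fingerprint so it will be crawled again."""
    server = redis.from_url(redis_url)
    # Same hex SHA1 fingerprint the scrapy-redis dupefilter stores.
    fp = request_fingerprint(Request(url))
    # Default scrapy-redis dupefilter key: "<spider name>:dupefilter".
    return bool(server.srem(f"{spider_name}:dupefilter", fp))


# Hypothetical usage after a failed LLM run:
# forget_url("redis://localhost:6379/0", "general_spider", "https://example.com/article")
```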

@rmax (Owner) commented Nov 20, 2024

Looks like you need a slightly different implementation.

The dupefilter assumes a single crawl session and, by default, it uses the spider name as the Redis key prefix.

But it seems you want to start a new crawl session every hour, so you will need to customize the dupefilter key for each new crawl session, or delete the entire dupefilter key from Redis before starting a new crawl.
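
A per-session key could be set up roughly like this (a minimal sketch, assuming scrapy-redis's SCHEDULER_DUPEFILTER_KEY setting, whose default is "%(spider)s:dupefilter"; the timestamp suffix is just one way to vary the key):

```python
# settings.py (sketch)
import time

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# "%(spider)s" is still expanded by scrapy-redis with the spider name;
# the suffix gives each run its own dupefilter set.
SCHEDULER_DUPEFILTER_KEY = "%(spider)s:dupefilter:" + str(int(time.time()))
```

The alternative is to keep the default key and wipe it before each run, e.g. `redis-cli DEL <spider name>:dupefilter`.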

Does it make sense?

@matheus-rossi (Author)

When I use a specific spider for a website, I generally just delete the entire dupefilter, and it works fine.

However, we have two 'general spiders' that are shared across more than 1000 URLs. In this case, if I delete the entire dupefilter, the impact on the other URLs will be huge. This is where it would be helpful to delete only the specific URLs.

Perhaps the solution is to customize the dupefilter logic...
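
One possible direction (a minimal sketch, assuming scrapy-redis's RFPDupeFilter; ForgettableDupeFilter and forget are hypothetical names, not part of the scrapy-redis API):

```python
from scrapy_redis.dupefilter import RFPDupeFilter


class ForgettableDupeFilter(RFPDupeFilter):
    """RFPDupeFilter with a helper to 'un-see' a single request."""

    def forget(self, request):
        # Recompute the fingerprint the filter stored for this request
        # and SREM it, so the URL is eligible again on the next run.
        fp = self.request_fingerprint(request)
        self.server.srem(self.key, fp)
```

With DUPEFILTER_CLASS pointed at this class, an errback or item pipeline could call forget() for each URL whose LLM step failed, instead of wiping the whole set.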
