Queueing same url on multiple workers in cluster with Redis cache results in duplicates #293
Comments
I have the same problem, I confirm.
Have you tried enabling the `skipDuplicates` and `skipRequestedRedirect` options? I believe the current behavior is that it crawls duplicate urls because it sees them as different request/response pairs, but if you enable those options it shouldn't make duplicate requests.
@BubuAnabelas I set it up with `skipDuplicates` and `skipRequestedRedirect`, but the issue is still reproducible for me. I have a feeling it might be because of differing `extraHeaders`? Any guidance here would be appreciated; I'm a Redis noob, and I mostly just want to make my crawler more efficient and not hit the same pages once per worker.
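For reference, wiring a shared Redis cache together with those options looks roughly like this. It's a minimal sketch assuming this is headless-chrome-crawler (which the `skipDuplicates` / `skipRequestedRedirect` options and Redis cache suggest); the host, port and example url are placeholders:

```js
const HCCrawler = require('headless-chrome-crawler');
const RedisCache = require('headless-chrome-crawler/cache/redis');

// One Redis cache shared by every worker process.
const cache = new RedisCache({ host: '127.0.0.1', port: 6379 });

(async () => {
  const crawler = await HCCrawler.launch({
    cache,
    persistCache: true,          // keep cached urls after close() so other workers can see them
    skipDuplicates: true,        // skip urls that have already been requested
    skipRequestedRedirect: true, // also skip urls already reached through a redirect
    onSuccess: result => console.log(`crawled ${result.options.url}`),
  });
  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
})();
```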
Just posting here hoping this will help someone. It's true that it crawls duplicate URLs when concurrency > 1, so here is what I did.
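One common way to get cross-worker deduplication, independent of the crawler's own cache, is to guard each queue call with an atomic Redis `SET ... NX`. This is an illustrative sketch; the `crawled:` key prefix, 24-hour TTL and ioredis client are assumptions, not details from this thread:

```js
// Illustrative only: only the first worker to see a url gets to queue it.
const Redis = require('ioredis');
const redis = new Redis({ host: '127.0.0.1', port: 6379 });

async function queueIfNew(crawler, url) {
  // SET key value EX <seconds> NX is a single atomic command, so concurrent
  // workers racing on the same url cannot both win.
  const firstSeen = await redis.set(`crawled:${url}`, '1', 'EX', 60 * 60 * 24, 'NX');
  if (firstSeen === 'OK') {
    await crawler.queue(url);
  }
}
```

Because the check-and-set happens in one Redis command, the guard holds even when several worker processes discover the same url at the same time.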
What is the current behavior?
When using a Redis cache for the queue and a cluster of processes crawling, the crawler repeats requests that other workers have already made.
If the current behavior is a bug, please provide the steps to reproduce
Create a cluster in which each worker process starts crawling the same url (on a crawler using the Redis cache).
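A rough reproduction is to fork a few workers that each queue the same start url against the shared cache. This is a sketch under the same assumptions as above (headless-chrome-crawler API, Node's built-in cluster module; the worker count and url are placeholders):

```js
const cluster = require('cluster');
const HCCrawler = require('headless-chrome-crawler');
const RedisCache = require('headless-chrome-crawler/cache/redis');

const START_URL = 'https://example.com/'; // placeholder

if (cluster.isMaster) {
  // Fork several workers; each one runs its own crawler process.
  for (let i = 0; i < 4; i++) cluster.fork();
} else {
  (async () => {
    const crawler = await HCCrawler.launch({
      cache: new RedisCache({ host: '127.0.0.1', port: 6379 }), // shared across workers
      persistCache: true,
      skipDuplicates: true,
      onSuccess: result => console.log(`[worker ${process.pid}] ${result.options.url}`),
    });
    // Every worker queues the same url; ideally it should be fetched only once,
    // but in practice each worker ends up requesting it.
    await crawler.queue(START_URL);
    await crawler.onIdle();
    await crawler.close();
  })();
}
```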
What is the expected behavior?
Even if the same url is added multiple times, I would expect there to be no duplicate requests. Is that the intended behavior?
Please tell us about your environment: