Queueing same url on multiple workers in cluster with Redis cache results in duplicates #293
Comments
I have the same problem, I confirm.
Have you tried enabling the `skipDuplicates` and `skipRequestedRedirect` options? I believe the current behavior is that it crawls duplicate urls because it sees them as different request/response pairs, but if you enable those options it shouldn't make duplicate requests.
@BubuAnabelas I set it up with `skipDuplicates` and `skipRequestedRedirect`, but the issue is still reproducible for me. I have a feeling it might be because of differing `extraHeaders`? Any guidance here would be appreciated; I'm a Redis noob, and I mostly just want to make my crawler more efficient and not hit the same pages once per worker.
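For reference, wiring a shared Redis cache together with those options looks roughly like this. It's a minimal sketch assuming this is headless-chrome-crawler (which the `skipDuplicates` / `skipRequestedRedirect` options and Redis cache suggest); the host, port and example url are placeholders:

```js
const HCCrawler = require('headless-chrome-crawler');
const RedisCache = require('headless-chrome-crawler/cache/redis');

// One Redis cache shared by every worker process.
const cache = new RedisCache({ host: '127.0.0.1', port: 6379 });

(async () => {
  const crawler = await HCCrawler.launch({
    cache,
    persistCache: true,          // keep cached urls after close() so other workers can see them
    skipDuplicates: true,        // skip urls that have already been requested
    skipRequestedRedirect: true, // also skip urls already reached through a redirect
    onSuccess: result => console.log(`crawled ${result.options.url}`),
  });
  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
})();
```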
Just posting here hoping this will help someone. It's true that it crawls duplicate URLs when concurrency > 1, so here is what I did.
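One common way to get cross-worker deduplication, independent of the crawler's own cache, is to guard each queue call with an atomic Redis `SET ... NX`. This is an illustrative sketch; the `crawled:` key prefix, 24-hour TTL and ioredis client are assumptions, not details from this thread:

```js
// Illustrative only: only the first worker to see a url gets to queue it.
const Redis = require('ioredis');
const redis = new Redis({ host: '127.0.0.1', port: 6379 });

async function queueIfNew(crawler, url) {
  // SET key value EX <seconds> NX is a single atomic command, so concurrent
  // workers racing on the same url cannot both win.
  const firstSeen = await redis.set(`crawled:${url}`, '1', 'EX', 60 * 60 * 24, 'NX');
  if (firstSeen === 'OK') {
    await crawler.queue(url);
  }
}
```

Because the check-and-set happens in one Redis command, the guard holds even when several worker processes discover the same url at the same time.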
What is the current behavior?
When using a Redis cache for the queue and a cluster of processes crawling, the crawler repeats requests that other workers have already made.
If the current behavior is a bug, please provide the steps to reproduce
Create a cluster in which each worker process starts crawling the same url (on a crawler using the Redis cache).
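A rough reproduction is to fork a few workers that each queue the same start url against the shared cache. This is a sketch under the same assumptions as above (headless-chrome-crawler API, Node's built-in cluster module; the worker count and url are placeholders):

```js
const cluster = require('cluster');
const HCCrawler = require('headless-chrome-crawler');
const RedisCache = require('headless-chrome-crawler/cache/redis');

const START_URL = 'https://example.com/'; // placeholder

if (cluster.isMaster) {
  // Fork several workers; each one runs its own crawler process.
  for (let i = 0; i < 4; i++) cluster.fork();
} else {
  (async () => {
    const crawler = await HCCrawler.launch({
      cache: new RedisCache({ host: '127.0.0.1', port: 6379 }), // shared across workers
      persistCache: true,
      skipDuplicates: true,
      onSuccess: result => console.log(`[worker ${process.pid}] ${result.options.url}`),
    });
    // Every worker queues the same url; ideally it should be fetched only once,
    // but in practice each worker ends up requesting it.
    await crawler.queue(START_URL);
    await crawler.onIdle();
    await crawler.close();
  })();
}
```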
What is the expected behavior?
Even if the same url is added multiple times, I would expect there to be no duplicate requests. Is that the intended behavior?
Please tell us about your environment: