deduplicate based on url #299

ghost · 2018-07-25T14:14:16Z

When I set crawler.queue({skipDuplicates: true}) (which is supposed to be the default), the docs say: "The request is considered to be the same if URL, userAgent, device, and extraHeaders are strictly the same."
Is there a way to deduplicate based only on the URL?
I only need to export the URLs of a certain domain.
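
To make the request concrete, here is a minimal sketch of URL-only deduplication done outside the crawler's own skipDuplicates logic. The names createUrlFilter and shouldCrawl are hypothetical and not part of the library; the sketch assumes each request is described by an options object with a url field, and that dropping the fragment is acceptable when comparing URLs.

```javascript
// Hypothetical helper (not part of the crawler library): remembers every
// URL it has seen and reports whether a request is new.
function createUrlFilter() {
  const seen = new Set();
  // Returns true the first time a URL is seen, false on every repeat.
  return function shouldCrawl(options) {
    // Normalize with the WHATWG URL parser and drop the fragment, so
    // "https://example.com/a#top" and "https://example.com/a" compare equal.
    const url = new URL(options.url);
    url.hash = '';
    const key = url.href;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  };
}

const shouldCrawl = createUrlFilter();
const results = [
  shouldCrawl({ url: 'https://example.com/a', userAgent: 'UA-1' }), // new URL
  shouldCrawl({ url: 'https://example.com/a', userAgent: 'UA-2' }), // same URL, other UA
  shouldCrawl({ url: 'https://example.com/a#top' }),                // same URL plus fragment
  shouldCrawl({ url: 'https://example.com/b' }),                    // new URL
];
console.log(results);
```

If the crawler exposes a per-request hook that can veto a request (the project's README appears to describe a preRequest option that may return false to skip a request, but please check that against your installed version), shouldCrawl could be called from there.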

Minyar2004 · 2018-10-01T12:49:26Z

Same issue for me.

kulikalov · 2020-10-17T07:23:23Z

Hey @Devhercule! Could you elaborate a little?

userAgent, device, and extraHeaders don't change randomly, so the only thing that differs from page to page is the URL. So I don't see an issue here.
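
A rough illustration of that point, assuming the duplicate check behaves like a comparison of the four fields quoted from the docs. requestKey is a hypothetical function written for this example and is not the library's actual implementation; the user agent, device, and header values are made up.

```javascript
// Hypothetical key builder, NOT the library's real code: combines the four
// fields the docs mention into one comparable string.
function requestKey({ url, userAgent = null, device = null, extraHeaders = {} }) {
  // Sort header names so that property order does not affect the key.
  const headers = Object.keys(extraHeaders)
    .sort()
    .map((name) => [name, extraHeaders[name]]);
  return JSON.stringify([url, userAgent, device, headers]);
}

// Settings that stay the same for the whole crawl (illustrative values).
const shared = { userAgent: 'UA-1', device: 'Nexus 6', extraHeaders: { 'X-Test': '1' } };

const keyA = requestKey({ ...shared, url: 'https://example.com/a' });
const keyAAgain = requestKey({ ...shared, url: 'https://example.com/a' });
const keyB = requestKey({ ...shared, url: 'https://example.com/b' });
const keyOtherUa = requestKey({ ...shared, userAgent: 'UA-2', url: 'https://example.com/a' });

// With fixed settings, equality depends on the URL alone;
// changing any other field makes the same URL count as a new request.
console.log(keyA === keyAAgain, keyA === keyB, keyA === keyOtherUa);
```

Under this assumption, a composite key only behaves differently from a URL-only key when userAgent, device, or extraHeaders vary between requests.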

kulikalov · 2020-10-26T04:49:44Z

Closing due to inactivity.

BubuAnabelas mentioned this issue on Oct 20, 2018: "duplicated url are crawled twice" #302 (open)

kulikalov closed this as completed Oct 26, 2020
