Compare similar ads to check for duplicates #631

dhowe · 2016-11-18T23:23:59Z

Some ads, like these two:

http://pagead2.googlesyndication.com/pagead/imgad?id=CICAgKDj2__aTRABGAEyCG3qbJztYSV0

https://tpc.googlesyndication.com/pagead/imgad?id=CICAgKDj2__aTRABGAEyCG3qbJztYSV0

are actually the same image served from different sources (with different URLs).

We want them stacked (as duplicates) to prevent cases like this:

We might compare them first by dimensions, then by file size, possibly even by their ID (in this example the subdomain differs but the ID is identical), and finally, if these are not reliable enough, compare them bitwise to make sure they are one and the same.
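The order of checks described above could look roughly like the following. This is only a sketch: the `Ad` record, its field names, and the `trust_id` flag are hypothetical and not taken from the project's code, and whether a shared ID is reliable enough to skip the byte comparison is still an open question.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlparse


@dataclass
class Ad:
    """Hypothetical ad record; these field names are assumptions."""
    url: str
    width: int
    height: int
    data: bytes  # raw image bytes as downloaded


def image_id(url):
    """Return the `id` query parameter of an ad URL, or None if it has none."""
    values = parse_qs(urlparse(url).query).get("id")
    return values[0] if values else None


def is_duplicate(a, b, trust_id=False):
    """Apply the checks in order of cost, cheapest first."""
    # Different dimensions or file size rule out an identical image.
    if (a.width, a.height) != (b.width, b.height):
        return False
    if len(a.data) != len(b.data):
        return False
    if trust_id:
        id_a, id_b = image_id(a.url), image_id(b.url)
        if id_a is not None and id_a == id_b:
            return True  # assumes a shared ID implies the same image
    # Byte-for-byte comparison is the definitive check.
    return a.data == b.data
```

With `trust_id` left off, two ads only count as duplicates when their bytes match exactly, so the ID check is purely an optional shortcut.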


speedstyle · 2017-06-21T21:19:23Z

If we are going to store all of the ads locally anyway, why not hash each image when it is stored and add the hash to the metadata, along with the site the ad links to and the site it was found on? Then we can check each new ad against the existing ones when it is added to the database. Alternatively, we could have users report duplicates and keep a server-side list of URLs known to be equivalent, plus auto-generated rules (e.g. [http://*]=[https://*], [pagead2.googlesyndication.*]=[tpc.googlesyndication.*]).
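Both ideas can be sketched in a few lines. Everything named here is an assumption for illustration: `AdStore` is an in-memory stand-in for the ad database, the metadata keys are made up, SHA-256 is just one reasonable choice of hash, and `HOST_ALIASES` covers only the `.com` form of the example rule rather than the wildcard.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

# Hypothetical alias table built from the example rule in this thread.
HOST_ALIASES = {"pagead2.googlesyndication.com": "tpc.googlesyndication.com"}


def canonical_url(url):
    """Rewrite a URL so that known-equivalent forms compare equal."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    host = HOST_ALIASES.get(host, host)
    # Treat http and https as the same source by always using https.
    return urlunsplit(("https", host, parts.path, parts.query, parts.fragment))


class AdStore:
    """In-memory stand-in for the ad database, keyed by image hash."""

    def __init__(self):
        self.by_hash = {}

    def add(self, data, links_to, found_on):
        """Store an ad and return (entry, is_new)."""
        digest = hashlib.sha256(data).hexdigest()
        entry = self.by_hash.get(digest)
        if entry is None:
            entry = {"hash": digest, "links_to": links_to, "found_on": [found_on]}
            self.by_hash[digest] = entry
            return entry, True
        # Same bytes seen before: stack it onto the existing entry.
        if found_on not in entry["found_on"]:
            entry["found_on"].append(found_on)
        return entry, False
```

Hashing catches identical files regardless of URL, while the URL rules catch duplicates before the image is even downloaded. Neither helps with ads that are re-encoded or resized.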

dhowe · 2017-06-22T11:13:06Z

Interesting -- you are suggesting storing image hashes in JSON files?

speedstyle · 2017-06-26T09:31:55Z

I have little to no actual programming experience. I can write solutions to simple logical tasks in Python, but not a fully fledged application, so I was just suggesting a number of ways to accomplish what you proposed. Given that some ads are really the same ad in different shapes and sizes, or even with different text or pictures (see attachments), I think a button that lets users mark adverts as duplicates would be the most useful solution. It would save the processing power used to hash the images, while letting those who view their vault frequently (and so actually care about it) tidy it up. Many users are no doubt just users, with no interest in viewing ads, whether on websites or in their own free time.

dhowe added this to the Possible Futures milestone Nov 18, 2016

dhowe added PRIORITY: Low Question labels Nov 18, 2016
