
Compare similar ads to check for duplicates #631

Open
dhowe opened this issue Nov 18, 2016 · 3 comments

@dhowe
Owner

dhowe commented Nov 18, 2016

Some ads, like these two:

http://pagead2.googlesyndication.com/pagead/imgad?id=CICAgKDj2__aTRABGAEyCG3qbJztYSV0

https://tpc.googlesyndication.com/pagead/imgad?id=CICAgKDj2__aTRABGAEyCG3qbJztYSV0

are actually the same image served from different sources (with different URLs).

We want them stacked (as duplicates) to prevent cases like this:
[screenshot: screen shot 2015-01-09 at 10 17 45 am]

We might compare them first by dimensions, then by file size, and possibly by their ID (in this example the subdomains differ but the ID is identical). If those checks are not reliable enough, we can finally compare the images byte for byte to confirm they are one and the same.
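A minimal sketch of that tiered comparison, in Python for illustration only (it is not the project's code). The function names and the ad record layout (`width`, `height`, `data`) are assumptions made for this example; only the `id` query parameter comes from the URLs above.

```python
from urllib.parse import urlparse, parse_qs


def image_id(url):
    """Return the `id` query parameter of an ad image URL, or None if absent."""
    values = parse_qs(urlparse(url).query).get("id")
    return values[0] if values else None


def same_image_id(url_a, url_b):
    """True when both URLs carry the same `id`, whatever the host or scheme."""
    id_a, id_b = image_id(url_a), image_id(url_b)
    return id_a is not None and id_a == id_b


def is_duplicate(ad_a, ad_b):
    """Run the cheapest checks first; the byte comparison makes the final call.

    Each ad is a dict with hypothetical keys: width, height, data (bytes).
    """
    if (ad_a["width"], ad_a["height"]) != (ad_b["width"], ad_b["height"]):
        return False
    if len(ad_a["data"]) != len(ad_b["data"]):
        return False
    return ad_a["data"] == ad_b["data"]
```

Here `same_image_id` is only a hint that two ads are worth comparing; whether a matching ID is reliable enough on its own is exactly the open question, so the sketch leaves the decision to the byte comparison.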

@speedstyle

speedstyle commented Jun 21, 2017

If we are going to store all of the ads locally anyway, why not hash each image when it is stored and add the hash to its metadata, along with the site it links to and the site it was found on? Then each new ad can be checked against the existing ones as it is added to the database. Alternatively, users could report duplicates, and a server could keep a list of URLs known to be equivalent, together with auto-generated rules (e.g. [http://*]=[https://*], [pagead2.googlesyndication.*]=[tpc.googlesyndication.*]).
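A rough sketch of both ideas, in Python for illustration only. Everything named here (`AdStore`, `normalize_url`, `HOST_ALIASES`, the choice of SHA-256, the metadata keys) is an assumption made for this example, and the alias table covers only the `.com` hosts from the issue rather than the wildcard rules.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical alias table standing in for the "auto-generated rules".
HOST_ALIASES = {"pagead2.googlesyndication.com": "tpc.googlesyndication.com"}


def normalize_url(url):
    """Drop the scheme and map known alias hosts to one canonical host."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    host = HOST_ALIASES.get(host, host)
    query = "?" + parts.query if parts.query else ""
    return host + parts.path + query


class AdStore:
    """In-memory stand-in for the ad database, keyed by image hash."""

    def __init__(self):
        self.by_hash = {}

    def add(self, data, url, linked_to, found_on):
        """Store an ad; return True if new, False if stacked on a duplicate."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.by_hash
        entry = self.by_hash.setdefault(
            digest,
            {"hash": digest, "urls": set(), "linked_to": linked_to, "found_on": set()},
        )
        entry["urls"].add(normalize_url(url))
        entry["found_on"].add(found_on)
        return is_new
```

Hashing catches byte-identical images regardless of URL, while the URL rules can flag a likely duplicate before the image is even fetched. Neither catches the same ad served at a different size or with different artwork.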

@dhowe
Owner Author

dhowe commented Jun 22, 2017

Interesting -- you are suggesting storing image hashes in JSON files?

@speedstyle

I have little to no actual programming experience. I can write solutions to simple logical tasks in Python, but not a fully fledged application, so I was just suggesting a few ways to accomplish what you described. Given that some ads are really the same but come in different shapes and sizes, or even with different text or pictures (see attachments), I think a button that lets users mark adverts as duplicates would be the most useful solution. It would save the processing power needed to hash the images, while letting those who view their vault frequently tidy it up. Those are the people who actually care about it; many users are no doubt just users, with no interest in viewing ads on websites or in their own free time.
[two attached images]
