[FEATURE] Detect duplicate samples when adding new data to tensors (images) #1757

Open
michelemoretti opened this issue Jun 25, 2022 · 12 comments
Labels
enhancement New feature or request

Comments

@michelemoretti

🚨🚨 Feature Request

  • A new implementation (Improvement, Extension)

Is it possible to discard samples that are already present in the dataset? If not, would this be worth implementing? I feel this would make the dataset extension pipeline much easier to use and implement.

@michelemoretti michelemoretti added the enhancement New feature or request label Jun 25, 2022
@davidbuniat
Member

Hi @michelemoretti, thanks for the feature request! Yes, this is a great idea, but it would add some overhead for computing hashes of the data during ingestion (assuming the duplicates are exactly the same images).
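
To make the hashing overhead concrete, here is a minimal sketch of a content hash computed at ingestion time. It uses only the Python standard library; the function name `content_digest` and the sample bytes are hypothetical, and nothing here is part of the project's API. It only detects byte-identical data.

```python
import hashlib
import io


def content_digest(stream, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    hasher = hashlib.sha256()
    # Reading in fixed-size chunks keeps memory use flat for large files.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


# Placeholder bytes standing in for image files (assumption for illustration).
first = content_digest(io.BytesIO(b"\x89PNG fake image bytes"))
second = content_digest(io.BytesIO(b"\x89PNG fake image bytes"))
changed = content_digest(io.BytesIO(b"\x89PNG fake image bytez"))

print(first == second)   # byte-identical inputs share a digest
print(first == changed)  # a single changed byte yields a different digest
```

The cost is one pass over every ingested file, which is the overhead mentioned above.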

Can you tell us more about how this would simplify the dataset extension pipeline for you (maybe just illustrating by an example)? The answer would help us with prioritizing.

@michelemoretti
Author

Hi David,
I was thinking of my most frequent use case, in which I receive the dataset in batches as it is collected.
Currently, updating the dataset requires me to change the script that appends samples to the dataset so that it targets only the newly received samples.
With a syntax for updating the dataset by feeding it all of the samples, I could use the same script both to create the original dataset and to update it: the script would pass in every sample, and on the backend only the new ones would be appended.
Hope that was clear enough.
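
The workflow described above could look roughly like the sketch below. It is an illustration under stated assumptions, not an existing feature: `DedupIngest` is a hypothetical wrapper, a plain list stands in for the image tensor, and the samples are placeholder byte strings.

```python
import hashlib


class DedupIngest:
    """Hypothetical wrapper: a list standing in for an image tensor,
    plus a set of digests for the samples already stored."""

    def __init__(self):
        self.samples = []
        self.seen = set()

    def extend(self, batch):
        """Append only the samples whose bytes have not been seen before.

        Returns the number of samples actually appended.
        """
        appended = 0
        for data in batch:
            digest = hashlib.sha256(data).hexdigest()
            if digest in self.seen:
                continue  # exact duplicate, skip it
            self.seen.add(digest)
            self.samples.append(data)
            appended += 1
        return appended


store = DedupIngest()

# The same call is used for the initial load and for every later update.
first_delivery = [b"img-a", b"img-b"]
second_delivery = [b"img-a", b"img-b", b"img-c"]  # full set, one new sample

added_first = store.extend(first_delivery)
added_second = store.extend(second_delivery)
```

Feeding the full second delivery appends only the one sample that was not already stored, so the script never needs to know which samples are new.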

@davidbuniat
Member

davidbuniat commented Jul 3, 2022

Got it, @michelemoretti. Just to make sure we are on the same page: are the repeated images pixel-perfect copies, or could there be minor differences between them?

@michelemoretti
Author

Absolutely. We're talking about identical files/images.

@protocolog

I want to work on this issue. Please assign me this issue. Thanks.

@protocolog

I want to work on this issue. Please assign me this issue. Thanks. @michelemoretti @davidbuniat @sgrove @jraman

@mikayelh
Collaborator

mikayelh commented Sep 8, 2022

Hey @protocolog, thanks a lot for your contribution, and apologies for the late reply. Assigned the issue! You can join the Activeloop community Slack (slack.activeloop.ai) to ask questions. :)

@protocolog

Please assign issue #1757 to me. You said you assigned it, but my profile shows it as not assigned. Your Slack link is also not working; please give me an alternative way to contact you. @davidbuniat @mikayelh @michelemoretti @sgrove

@mikayelh
Collaborator

mikayelh commented Sep 8, 2022

@protocolog apologies, fixed the link. Please refrain from tagging people who are not involved in this conversation to spare their inboxes. Thanks. :)

@protocolog

I am unable to join the workspace on slack. Please help me out. My slack ID is h20220047@goa.bits-pilani.ac.in , Thanks
@davidbuniat @mikayelh @michelemoretti

@nmichlo

nmichlo commented Oct 23, 2022

This is an interesting problem that I also face, not just for identical images but for near-identical ones. I have not actually tested this workflow, but I imagine it could be done by generating a perceptual hash (or a normal hash) of each image (e.g. with the imagehash lib) and storing it in a separate tensor that corresponds to your main image tensor. Then, on ingest, you can query against this tensor to skip duplicates or near-duplicates, depending on the hash approach you choose. You could even adapt this to KNN matching, or use embeddings instead of hashes.
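
A rough sketch of the near-duplicate idea follows. To stay dependency-free it uses a simple average hash written in plain Python as a stand-in for a perceptual hash from a library such as imagehash; the function names, the tiny 2x2 grayscale grids, and the `max_distance` threshold are all assumptions for illustration. A real pipeline would first downscale each image to a small grayscale grid.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming(a, b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def is_near_duplicate(candidate, stored_hashes, max_distance=1):
    """True if any stored hash is within max_distance bits of the candidate."""
    return any(hamming(candidate, h) <= max_distance for h in stored_hashes)


# Tiny grayscale grids standing in for downscaled images (assumption).
original = [[10, 200], [220, 30]]
brighter = [[14, 205], [224, 33]]   # same picture, slightly brighter
different = [[200, 10], [30, 220]]  # inverted pattern

# The "separate tensor" of hashes is modeled here as a plain list.
stored = [average_hash(original)]

skip_brighter = is_near_duplicate(average_hash(brighter), stored)
skip_different = is_near_duplicate(average_hash(different), stored)
```

Setting `max_distance=0` reduces this to exact matching on the hash, while larger values tolerate more variation at the risk of discarding distinct images.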

@mikayelh
Collaborator

mikayelh commented Nov 1, 2022

Thanks a lot, @nmichlo, for chiming in here and for the suggestion! @protocolog, I re-sent you an invite to our Slack, but I see that you have already joined. Let me know if you have other questions. :)

I'm also tagging @istranic here in case he thinks this can be included on the roadmap. :)

5 participants