Dear all,
I have a document set that contains a duplicate according to unique():
len(docset) -> 1014
len(docset.unique()) -> 1013
However, len(docset-docset.unique()) -> 0
I noticed this when I tried to print the title of the duplicate that unique() supposedly removes. I get nothing, because the difference contains zero documents.
Best, Lars.
larsgrobe changed the title from "Dj" to "Different results from unique() and difference of deduplicated set" on Dec 8, 2023
I've given this some careful thought, and although this behavior might seem counter-intuitive, it is indeed correct.
The - operator relies on "fuzzy" matching to determine which documents from the left-hand set should be excluded, based on the right-hand set. In the case you described, where there are two identical documents, docset-docset.unique() results in an empty set. This happens because the "fuzzy" matching finds a counterpart in docset.unique() for every document in docset, including the duplicate (likely due to a matching DOI).
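To make the explanation above concrete, here is a small toy model in plain Python. It is not the library's implementation: documents are plain dicts, and the hypothetical same_doc() helper stands in for the fuzzy matching by comparing DOIs only. The names unique() and difference() below are illustrative functions, not the library's methods.

```python
# Toy model only: two documents "match" when their DOIs are equal.
# This is an assumed stand-in for the library's fuzzy matching.
def same_doc(a, b):
    return a["doi"] == b["doi"]

def unique(docs):
    # Keep the first document of every group of matching documents.
    kept = []
    for d in docs:
        if not any(same_doc(d, k) for k in kept):
            kept.append(d)
    return kept

def difference(left, right):
    # Drop every left-hand document that matches anything on the right.
    return [d for d in left if not any(same_doc(d, r) for r in right)]

docs = [
    {"doi": "10.1000/a", "title": "Paper A"},
    {"doi": "10.1000/b", "title": "Paper B"},
    {"doi": "10.1000/a", "title": "Paper A (second copy)"},
]
deduped = unique(docs)
leftover = difference(docs, deduped)
print(len(docs), len(deduped), len(leftover))  # 3 2 0
```

The second copy of Paper A is dropped by unique(), but it still matches the first copy that remains in the deduplicated list, so the difference removes it as well and ends up empty.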
Nonetheless, I can see how it is odd that there is no way to retrieve which documents were removed by unique.
Would it work for you if we were to add a duplicates() method? This method would specifically return the duplicate documents, ensuring that len(docset) == len(docset.unique()) + len(docset.duplicates()).
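As a rough sketch of what such a method could guarantee, the snippet below splits a list of documents into the ones that are kept and the ones that are dropped as duplicates. It is only an illustration under stated assumptions: documents are plain dicts, matching is by exact DOI, and split_duplicates() is a hypothetical helper, not an existing or planned API.

```python
# Hypothetical helper, assuming documents are dicts matched by exact DOI.
def split_duplicates(docs, key=lambda d: d["doi"]):
    seen = set()
    kept, dropped = [], []
    for d in docs:
        k = key(d)
        if k in seen:
            dropped.append(d)   # a later copy of something already kept
        else:
            seen.add(k)
            kept.append(d)      # first occurrence of this key
    return kept, dropped

docs = [
    {"doi": "10.1000/a", "title": "Paper A"},
    {"doi": "10.1000/b", "title": "Paper B"},
    {"doi": "10.1000/a", "title": "Paper A (second copy)"},
]
kept, dropped = split_duplicates(docs)

# The invariant proposed above: every document is either kept or dropped.
assert len(docs) == len(kept) + len(dropped)
print([d["title"] for d in dropped])  # ['Paper A (second copy)']
```

Because each document lands in exactly one of the two lists, the titles of the removed documents can be printed directly, which is what the original question was after.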