ALA not recognising duplicate records #96

Open
nickdos opened this issue Dec 7, 2015 · 2 comments

Comments

@nickdos
Contributor

nickdos commented Dec 7, 2015

From @Mesibov on December 7, 2015 3:7

This issue was reported to ALA staff by email on 21 September 2015 but I post it here for general comment.

While auditing ALA's Lepidoptera records I found thousands of duplicate record pairs from ANIC, Australian Museum, Queensland Museum and South Australian Museum. By 'duplicate' I mean that the same specimen lot in the same repository with the same catalog number is listed twice in ALA, not that the two records are absolutely identical. The duplicates are still online, and for an example see

http://biocache.ala.org.au/occurrences/2e87583d-7240-4c67-adde-b23c8e509921
http://biocache.ala.org.au/occurrences/55c237a0-1e7b-44b2-9678-048b4b4d1c45

Both records were provided to ALA by QM through OZCAM.

In September I sent to ALA as text files the 1266 duplicate pairs I found from ANIC, 3791 pairs from AM, 491 pairs from QM and 113 pairs from SAM.

Most of the duplication is apparently the result of ALA adding a second version of a record without checking for and deleting the previous version. There is no flag advising the user as to which of the records in a duplicate pair is the more correct or recent version. At the time of download (17 July 2015) there was a data quality field 'Inferred duplicate record'. Only a tiny fraction of the duplicated records were flagged 'true'.
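
The definition of 'duplicate' used above (same repository, same catalog number) can be sketched as a simple grouping step. This is only an illustration, not ALA's duplicate detection code; the field names `record_id`, `institution_code` and `catalog_number` are assumptions and would need to be mapped to the columns in an actual download.

```python
from collections import defaultdict


def find_duplicate_groups(records):
    """Group record IDs that share the same (institution code, catalog number).

    `records` is an iterable of dicts. The key names used here are
    hypothetical and are not taken from the ALA download schema.
    """
    groups = defaultdict(list)
    for rec in records:
        # Normalise lightly so stray whitespace or case does not hide a match.
        inst = (rec.get("institution_code") or "").strip().upper()
        cat = (rec.get("catalog_number") or "").strip()
        if not inst or not cat:
            continue  # cannot judge a record that lacks either key field
        groups[(inst, cat)].append(rec["record_id"])
    # Keep only keys that occur more than once.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Grouping on the pair rather than on the catalog number alone avoids flagging records from different institutions that happen to reuse the same number.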

Copied from original issue: AtlasOfLivingAustralia/biocache-service#78

@nickdos
Contributor Author

nickdos commented Dec 7, 2015

This will be useful in enhancing our duplicate detection code, which has not caught these records.

I note that one of those records is missing its record date, which is probably why our duplicate detection failed.
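
To illustrate how a missing date could hide a duplicate, here is a minimal sketch that contrasts a strict comparison with one that treats a missing date as compatible. It is not the biocache duplicate detection code; the function names and the `institution_code`, `catalog_number` and `event_date` keys are all assumptions made for the example.

```python
def strict_match(a, b):
    """Hypothetical strict rule: every compared field must be equal."""
    fields = ("institution_code", "catalog_number", "event_date")
    return all(a.get(f) == b.get(f) for f in fields)


def lenient_match(a, b):
    """Hypothetical lenient rule: a missing date does not block a match."""
    if a.get("institution_code") != b.get("institution_code"):
        return False
    if a.get("catalog_number") != b.get("catalog_number"):
        return False
    date_a, date_b = a.get("event_date"), b.get("event_date")
    if not date_a or not date_b:
        return True  # one side has no date, so dates cannot disagree
    return date_a == date_b
```

Under the strict rule, a pair that differs only because one record lacks a date is not reported; under the lenient rule it is, at the cost of possibly pairing records that a date would have told apart.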

@Mesibov

Mesibov commented Dec 7, 2015

The beetles dataset I downloaded on 2 December 2015 contains 9895 duplicate pairs from GBIF, Australian Museum, Queensland Museum and South Australian Museum. Example:

| Record ID | Catalog Number | Matched Scientific Name | Institution Code |
| --- | --- | --- | --- |
| 06dba16a-f635-4cfd-950f-98381a486722 | 1005575 | Clivina sellata | EME |
| c0c38827-ec1d-45b2-8cd5-1b1b8d10f831 | 1005575 | Clivina sellata | EME |

The full list of 19790 records is attached.
dupes.txt
