GBIF occurrences duplicates #3459

Closed
Dina-Sharafeldeen opened this issue Jun 10, 2021 · 3 comments

Comments

@Dina-Sharafeldeen

Hi,
We are working on one occurrences dataset downloaded from GBIF.
We noticed that some species records with identical attribute values have different gbifIDs.
Kindly find a sample of this case under the following link:
https://drive.google.com/file/d/1D7WfjvKbBCzRj2uBz_f0_qkDnHinhRUs/view?usp=sharing
Should we consider these duplicates? If so, which ones should we keep and which ones should we delete?
Thanks
Dina Sharafeldeen

@MortenHofft MortenHofft transferred this issue from gbif/portal16 Jun 11, 2021
@MortenHofft
Member

MortenHofft commented Jun 11, 2021

That is how the data is published: the catalog number and the organismID are different (for the example I looked at). So according to the dataset publisher, these are different organisms of the same species that happen to have been collected on the same day at the same time. At least that is how it appears to me.

@rukayaj you can perhaps answer this? Better than I can at least.

@rukayaj

rukayaj commented Jun 11, 2021

@Dina-Sharafeldeen I think the records you refer to are from this dataset, right? https://www.gbif.org/dataset/264e6a66-9c9e-4115-9aec-29d694c68097#description
The data curator added some explanatory text in the dataset description, here: https://www.gbif.org/dataset/264e6a66-9c9e-4115-9aec-29d694c68097#description

An accession generally refers to a given individual at a specific place and time; for each accession, one or more items may exist. Examples of item types include preserved skins, feathers or eggshells, blood, tissue or sperm samples, footage of live sperm motility, microscope slides prepared for sperm morphometric analyses, testes and extracted DNA.

I think it would be clearer if the text included some of the information we have in our internal wiki - I will talk to the data curators and reword/add things, but here I have pasted verbatim from the wiki for the meantime:

  • The records in the DNA datasets (* apart from the Mammal and Bird datasets) may possibly have some duplicate records in the non-DNA UiO NHM datasets. This is because the DNA collections are partially made up of DNA samples from the specimens in the main collections, which may sometimes be published separately on GBIF. Additionally, some DNA samples are taken from living organisms in the wild, which are then released and do not become museum specimens.

  • Multiple tissue samples are often taken from a specimen:

    • A great tit captured in a mist net at Tøyen on 19.06.2020 will be registered as one accession (=occurrence record) in our collection database; this record will hold info about species, locality, date, ring number, etc.

    • Let’s say I took a blood sample, a sperm sample and a feather sample from the bird before I released it; each of these samples will then be registered as “sub-records” on the main record created above (what we call “items”)

    • This one bird will then appear as three points on the map in Artskart/GBIF

  • Tissue samples are curated as separate collection objects from the specimen they are taken from (kept in different locations etc), and so are assigned individual occurrenceIDs - i.e. identifiers for each individual item (tissue sample, DNA extract, etc) belonging to a record/accession.

    • This means that each sample has a row in the occurrence table. Think of occurrenceID more as identifying a piece of evidence for a species occurrence than the actual species occurrence itself.
  • Occurrence records from the "same individual" (i.e. the same occurrence of an organism in space and time) have the same organismID in the simpledwc file and are also linked together through the resource relationship file.

  • The resource relationship file may seem to sometimes contain discrepancies. However, these can usually be explained:

    • AccNo NHMO-DFH-782 has a total of 7 items registered in Corema: 6 different tissue samples and 1 DNA extract
    • The DNA extract, NHMO-DFH-782/7-D, has been extracted from NHMO-DFH-782/6-T, and hence this relationship appears in the resourceRelationship; however, as there is no link between any of the other items, apart from them all belonging to the same accession/individual, no further entries appear in the resourceRelationship
    • Further, item 6 (.../6-T) was consumed during the extraction and therefore has status = Empty. As the data exported to GBIF (and others) only contains information on existing items, this one item is not included in the dataset – hence the apparent non-existence of this item
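
To make the NHMO-DFH-782 case concrete, here is a minimal Python sketch of how such an apparent discrepancy could be spotted. It is only an illustration under assumptions: the item numbers are used as stand-ins for the real occurrenceIDs, the IDs `.../1-T` to `.../5-T` and the label "extracted from" are made up (only `.../6-T` and `.../7-D` are named above), and `dangling_relationships` is a hypothetical helper, not part of any GBIF tool.

```python
# Items of accession NHMO-DFH-782 that are present in the export.
# ".../6-T" is absent because it was consumed (status = Empty).
# ".../1-T" to ".../5-T" are made-up IDs for the other tissue samples.
exported_items = {
    "NHMO-DFH-782/1-T",
    "NHMO-DFH-782/2-T",
    "NHMO-DFH-782/3-T",
    "NHMO-DFH-782/4-T",
    "NHMO-DFH-782/5-T",
    "NHMO-DFH-782/7-D",
}

# One resource relationship row, keyed by Darwin Core term names.
# The relationship label is an assumption for illustration.
relationships = [
    {
        "resourceID": "NHMO-DFH-782/7-D",
        "relatedResourceID": "NHMO-DFH-782/6-T",
        "relationshipOfResource": "extracted from",
    },
]


def dangling_relationships(relationships, exported_items):
    """Return relationships where either end is missing from the export."""
    return [
        rel
        for rel in relationships
        if rel["resourceID"] not in exported_items
        or rel["relatedResourceID"] not in exported_items
    ]


missing = dangling_relationships(relationships, exported_items)
for rel in missing:
    print(rel["resourceID"], "->", rel["relatedResourceID"])
```

A row reported here points at an item that is not in the export. In this example that is the expected result of the consumed tissue sample, not an error in the data.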

Also worth noting is this massive thread tdwg/dwc#314 on the nature of MaterialSamples vs Occurrences - this is something GBIF Norway is following closely and we will of course adapt to what the community decides are the best practices.

So anyway, the records you are seeing may be from the same individual. Can you look at the organismID and see what that says? I will have time to look into this more deeply later next week.
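
A minimal sketch of that check in Python, assuming the download is a tab-separated file with `gbifID` and `organismID` columns (the exact columns depend on the download format you chose). `group_by_organism` is a hypothetical helper and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def group_by_organism(rows):
    """Map each organismID shared by several records to their gbifIDs."""
    groups = defaultdict(list)
    for row in rows:
        organism_id = (row.get("organismID") or "").strip()
        if organism_id:  # skip records without an organismID
            groups[organism_id].append(row["gbifID"])
    # Keep only organisms that appear in more than one record.
    return {org: ids for org, ids in groups.items() if len(ids) > 1}


# Made-up sample standing in for the downloaded occurrence file.
sample = (
    "gbifID\torganismID\tcatalogNumber\n"
    "1001\torg-A\tX/1\n"
    "1002\torg-A\tX/2\n"
    "1003\torg-B\tY/1\n"
    "1004\t\tZ/1\n"
)

shared = group_by_organism(csv.DictReader(io.StringIO(sample), delimiter="\t"))
print(shared)  # {'org-A': ['1001', '1002']}
```

For a real download, replace the `io.StringIO(sample)` argument with an opened file handle. Records grouped under one organismID are different pieces of evidence from the same individual, so whether to collapse them depends on whether you are counting samples or individuals.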

@jhnwllr jhnwllr pinned this issue Jun 15, 2021
@dnoesgaard dnoesgaard unpinned this issue Jul 12, 2021
@ManonGros

The question seems to have been answered. Let us know if the issue needs to be reopened.
