Export identification status (validated, dubious, predicted) to DWCA #764

Open
jiho opened this issue Jan 21, 2022 · 7 comments
Labels
dwca DarwinCore archive related

Comments

@jiho
Contributor

jiho commented Jan 21, 2022

Currently, we export only validated objects in DWCA (@grololo06, can you confirm?)

A proposal is underway (by @PatriciaCabrera) to use the DarwinCore field identificationVerificationStatus to indicate the status: "Verified by human", "Dubious according to human", "Predicted by machine".

This maps directly to the statuses in EcoTaxa. 🥳

But an occurrence in the occurrence.txt file of a DWCA (i.e. a line) can only have one identificationVerificationStatus. This means that, to use this field, the abundances/concentrations/biovolumes would need to be summed by sample + taxon + status. For a taxon that has objects of all three statuses, there would then be three lines in occurrence.txt and three lines in emof.txt, each with the concentration corresponding to the objects with the given status. It would then be the responsibility of the user of the data to decide whether to sum all three (and risk mistakes), keep only the validated ones (and risk underestimating the concentration), etc.
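
The summing by sample + taxon + status described above could be sketched as follows. This is only an illustration, not the EcoTaxa export code: the record layout, the one-letter status codes (`V`, `D`, `P`) and the `summarize` helper are assumptions made for the example.

```python
from collections import defaultdict

# Assumed mapping from one-letter object statuses to the proposed
# identificationVerificationStatus values (codes are an assumption).
STATUS_LABELS = {
    "V": "Verified by human",
    "D": "Dubious according to human",
    "P": "Predicted by machine",
}

# Hypothetical per-object records: (sample, taxon, status, concentration).
objects = [
    ("s1", "Copepoda", "V", 2.0),
    ("s1", "Copepoda", "P", 1.5),
    ("s1", "Copepoda", "D", 0.5),
    ("s1", "Copepoda", "V", 1.0),
    ("s1", "Salpida", "V", 0.25),
]


def summarize(records):
    """Sum concentrations per (sample, taxon, status label)."""
    totals = defaultdict(float)
    for sample, taxon, status, concentration in records:
        totals[(sample, taxon, STATUS_LABELS[status])] += concentration
    return dict(totals)


# One key per future occurrence line (and matching emof line).
rows = summarize(objects)
for (sample, taxon, status), concentration in sorted(rows.items()):
    print(sample, taxon, status, concentration, sep="\t")
```

With these made-up records, Copepoda in sample s1 yields three lines (one per status), which is exactly the situation where the data user has to choose what to sum.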

@jiho jiho added page-export Everything related to export functionality dwca DarwinCore archive related labels Jan 21, 2022
@jiho
Contributor Author

jiho commented Jan 21, 2022

Also tagging @rubenpp7

@PatriciaCabrera

Update: in EurOBIS, to indicate the status of the identification in the DarwinCore field identificationVerificationStatus, we will not use "Dubious according to human"; we will only use "Predicted by machine" and "Verified by human".

@grololo06 grololo06 removed their assignment Apr 20, 2022
@grololo06 grololo06 removed the page-export Everything related to export functionality label May 8, 2022
@grololo06
Member

Indeed, as of today, what is not Verified by human is just filtered out. I guess that the present issue needs to be exposed to users (via API). E.g. do we want to do it always or as a choice? Are there variations in such choice?

@grololo06
Member

Code browsing:

  • Just found out that the taxonomic coverage production code does not filter on the Validated status. This has had no consequences because, so far, all collections were validated.
  • abundance (occurrence.txt), concentrations and biovolume (emof) do indeed filter out the non-validated objects
  • the validated objects are read from a clearly identified data source, so it's "just" a matter of building another one.

@grololo06
Member

grololo06 commented Feb 7, 2023

Doc browsing:

  • it looks like identifiedBy field is needed for validated images. I guess it's all people involved in identification of any object in this taxon. Could be quite long.
  • For not-validated images, identificationReferences has to contain, I guess, some information on the ML used for automatic classification.
  • associatedMedia is optional but can be filled in for EcoTaxa (url to project+sample)

grololo06 added a commit to ecotaxa/ecotaxa_back that referenced this issue Feb 7, 2023
grololo06 added a commit to ecotaxa/ecotaxa_back that referenced this issue Feb 8, 2023
grololo06 added a commit to ecotaxa/ecotaxa_back that referenced this issue Feb 9, 2023
grololo06 added a commit to ecotaxa/ecotaxa_back that referenced this issue Feb 12, 2023
grololo06 added a commit to ecotaxa/ecotaxa_back that referenced this issue Feb 22, 2023
@grololo06
Member

An example with a mix of Predicted and Validated occurrences. The corresponding emof records distinguish the two occurrences within the same sample.

@jiho
Contributor Author

jiho commented Apr 14, 2023

  • it looks like identifiedBy field is needed for validated images. I guess it's all people involved in identification of any object in this taxon. Could be quite long.

We decided to only mention the latest validator of each object, who has the authority on the validation. This field is therefore used to "know who to blame" 😉
Previous validators will be "thanked" through the co-authorship of the dataset.

Since one occurrence corresponds to one or more objects in EcoTaxa, this should be the concatenated list of the latest validators of all these objects (separated by | ).
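
A minimal sketch of how such an identifiedBy value could be built, assuming the latest validator name of each object is already available as a list of strings. The `identified_by` helper and the names are hypothetical; duplicates are dropped so that a validator who handled many objects appears only once.

```python
def identified_by(latest_validators):
    """Join unique validator names with ' | ', keeping first-seen order."""
    # dict.fromkeys removes duplicates while preserving insertion order;
    # empty or missing names are skipped.
    unique = dict.fromkeys(name for name in latest_validators if name)
    return " | ".join(unique)


# Hypothetical latest validators of the objects behind one occurrence.
print(identified_by(["A. Alpha", "B. Beta", "A. Alpha", None]))
```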

  • For not-validated images, identificationReferences has to contain, I guess, some information on the ML used for automatic classification.

When validated, this should be a paper/book. For us it would be the future EcoTaxoGuide. Storing this for each object seems like a waste of bits.

When predicted, the best practices document mentions that it should be a reference to the model. We don't store those and even if we did, they would not guarantee reproducibility.

=> We do not use this field for the moment.

  • associatedMedia is optional but can be filled in for EcoTaxa (url to project+sample)

Giving the links to all images is not realistic. Giving the link to the project is (i) not guaranteed to work forever, (ii) redundant with the link back to EcoTaxa at the level of the whole dataset.

=> We do not use this field for the moment.

grololo06 added a commit to ecotaxa/ecotaxa_back that referenced this issue Jun 16, 2023