Please add Darwin Core support #6243

Closed
kamil386 opened this issue Oct 2, 2019 · 18 comments
Labels
Feature: Metadata Type: Feature a feature request User Role: Depositor Creates datasets, uploads data, etc.


@kamil386

kamil386 commented Oct 2, 2019

I downloaded the spreadsheet https://docs.google.com/spreadsheets/d/1P9xvaRLhCKsYmjz9eXXVl0T9d2U34UgynbvxDp-2Bjc/edit#gid=1331272861 as TSV

and ran:
curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file /tmp/Comparative\ Zoology\ _\ Darwin\ Core\ Metadata\ -\ Sheet2.tsv
but received error response:
{"status":"ERROR","message":"For input string: \"\""}

That custom metadata block for Darwin Core does not work.
Comparative Zoology _ Darwin Core Metadata - Sheet2.zip
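
A note on the error response above: `For input string: ""` is the message Java typically produces when an empty string is parsed as a number, so one plausible cause (an assumption, not something confirmed in this thread) is an empty cell in a numeric column of the TSV. The sketch below scans a metadata block TSV for empty cells in chosen columns before uploading. The section marker `#datasetField` and the column name `displayOrder` are assumptions about the TSV layout, and `find_empty_cells` is a hypothetical helper, not part of Dataverse.

```python
import csv
import io


def find_empty_cells(tsv_text, section="#datasetField", columns=("displayOrder",)):
    """Return (line_number, field_name, column) for each empty cell found.

    Assumes a row whose first cell starts with '#' opens a section and
    names that section's columns; only rows in `section` are checked.
    """
    problems = []
    header = None
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if row[0].startswith("#"):
            # Track the header only while inside the section of interest.
            header = row if row[0] == section else None
            continue
        if header is None:
            continue
        # Pad short rows so that missing trailing cells count as empty.
        row = row + [""] * (len(header) - len(row))
        cells = dict(zip(header[1:], row[1:]))
        for col in columns:
            if col in cells and cells[col].strip() == "":
                problems.append((line_no, cells.get("name", ""), col))
    return problems


# Made-up sample data, for illustration only.
sample = (
    "#metadataBlock\tname\tdisplayName\n"
    "\tdwc\tDarwin Core\n"
    "#datasetField\tname\ttitle\tdisplayOrder\tadvancedSearchField\n"
    "\televation\tElevation\t\tTRUE\n"
    "\tcountry\tCountry\t1\tTRUE\n"
)
problems = find_empty_cells(sample)
print(problems)
```

Passing other column names through `columns` (for example the advanced search column mentioned in the next comment) checks those cells the same way.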

@jggautier
Contributor

jggautier commented Oct 2, 2019

Hi @kamil386. I don't know what this error message means.

There are a few issues I see with the spreadsheet you linked to, but I don't know if any of them are what's causing that error response:

  • The column called showabovefold was renamed to displayoncreate. But maybe the script for uploading TSVs doesn't care what the name of each column is? That's what I gather from this comment. So maybe that isn't actually an issue?

  • For the field called elevation, there's no advancedSearch value. Not sure if this is a problem

    Screen Shot 2019-10-02 at 12 02 25 PM

This Github issue title first made me think you were asking if Darwin Core could be added as a metadata block that comes with Dataverse (instead of a custom metadatablock that an installation would add). Is that the case? Or are you only asking for help to add the metadatablock to a Dataverse installation you're setting up?

@djbrooke
Contributor

djbrooke commented Oct 2, 2019

Thanks @jggautier !

@kamil386
Author

kamil386 commented Oct 3, 2019

I was asking whether Darwin Core could be added as a metadata block that comes with Dataverse, or more precisely whether Dataverse could support Darwin Core natively. Dataverse with Darwin Core could be adopted much more easily by the many public institutions dealing with biodiversity.
It would also be nice to mark one of the planned features as completed :)
https://dataverse.org/files/dataverseorg/files/iassistposter2016ecastro.pdf

But it would be a good starting point if adding this custom metadata block to a Dataverse installation worked, even just for testing purposes in the meantime.

@pdurbin
Member

pdurbin commented Oct 3, 2019

The list of five metadata blocks that ship with Dataverse at http://guides.dataverse.org/en/4.16/user/appendix.html has been frozen in time for a while. Recently, for #3976, we added a sixth to that list, a "Journals" metadata block, which has been available for a while but was never included in the list.

The idea was never that the list of metadata blocks that ship with Dataverse would be frozen. The idea was that we'd put a few out there that are important to our original user base (social science) and others that we have some experience with (astronomy, etc.) to show what's possible. Then we would work with the community to add additional metadata blocks as "official" blocks that ship with Dataverse.

@kamil386 my question to you is, is the Darwin Core support that you found in that spreadsheet good enough? Should we ship it? Or do you think it needs more work? Thanks!

@jggautier
Contributor

jggautier commented Oct 3, 2019

@kamil386, I thought at first that you created that spreadsheet, but @pdurbin made me take a second look and I see now that someone on the Dataverse team did. Sorry about that. I second @pdurbin's interest in learning what you think about the metadata. It seems to be a subset of Darwin Core. Do you think it's an appropriate subset? Are fields missing?

Also, were you trying to add it to your own installation so that you could see what it looks like in the UI?

@kamil386
Author

kamil386 commented Oct 7, 2019

First of all, thank you for your interest in this topic.

I think that the Natural History Museum in London has the reference subset of DarwinCore fields we should follow, as it probably has the biggest collection of specimens across the full range of biodiversity (Botany, Entomology, Zoology, Palaeontology, Mineralogy). NHM has chosen 71 fields from DarwinCore, but to my knowledge even they don't use all of them.

The scientists around the Bialowieza Forest have some precious and unique collections of specimens they want to digitize. They have been working on a DC schema for some time and haven't yet chosen a sufficient subset of fields.

@pdurbin @jggautier This DC subset isn't good enough and needs more work. Can we postpone this issue for a while? I think we can collaborate on an appropriate subset. Most importantly, Dataverse's template logic makes it possible to choose a further subset of that subset, better suited to adding customized metadata to, e.g., collections of mushrooms or skulls.

@jggautier I was trying to add it to our installation, but because of this string error none of the fields appeared in the UI. As you noticed, we should probably keep the number of fields in the UI to the necessary minimum. What do you think about that?

We also have plans to build an additional tool on top of Dataverse that will be useful for biodiversity portals and will also follow the NHM approach.
@pdurbin @jggautier Will the fields from the new DarwinCore metadata block be searchable via the API or OAI-PMH?

@pdurbin
Member

pdurbin commented Oct 7, 2019

@kamil386 thanks for your continued interest in Darwin Core support! I told @jggautier last week and I meant to tell you that you shouldn't feel like you have to clean up the proposed metadata block alone. I would suggest emailing https://groups.google.com/forum/#!forum/dataverse-community and asking if anyone would like to help you take a look at the spreadsheet you found and help you make that Darwin Core custom metadata block production ready.

Yes, Darwin Core fields will be searchable. No, fields will not be harvestable via OAI-PMH (some future work on the code side would be needed, please feel free to open an issue with a title like "allow fields from custom metadata blocks to be harvestable via OAI-PMH"). Thanks!
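
As a rough illustration of what "searchable" could look like from a client, the sketch below builds a Dataverse Search API URL (`/api/search`) that restricts the query to one field. The field name `dwcScientificName`, the host, and the helper `build_search_url` are hypothetical; the real field names depend on how the metadata block's TSV names them. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode


def build_search_url(base_url, field, value, result_type="dataset"):
    """Build a Search API URL that looks for `value` in one metadata field.

    The value is wrapped in double quotes as a phrase; it is not escaped,
    so values containing double quotes need extra handling.
    """
    query = f'{field}:"{value}"'
    params = urlencode({"q": query, "type": result_type})
    return f"{base_url.rstrip('/')}/api/search?{params}"


# Hypothetical host and field name, for illustration only.
url = build_search_url("http://localhost:8080", "dwcScientificName", "Bison bonasus")
print(url)
```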

@pdurbin
Member

pdurbin commented Dec 4, 2019

@kamil386 hi! I'm just checking in. If you're willing, I still think it would be good for someone to email the dataverse-community list to try to get some discussion going.

I did mention Darwin Core in a recent post I made to that list called "custom metadata blocks now easier to spin up and evaluate" at https://groups.google.com/d/msg/dataverse-community/uKretKox_io/4FyPVAMYBgAJ

@kamil386
Author

Olga Kurek from the Mammal Research Institute in Bialowieza (MRIPAS) created a DarwinCore schema, and after some tweaking, testing, and multiple DB rollbacks, I hope it is now production ready :)

Version without groups:
https://docs.google.com/spreadsheets/d/1P2y8Kz9pDJlZhPiZiT5EoJAeZGvUOPR1gZh8T3gIMw0/edit?usp=sharing

Version with groups (parent):
https://docs.google.com/spreadsheets/d/1p_myNEdbV-afBaF7D-I__-3CyybY8oiR3zP2VsyrKqU/edit?usp=sharing

Why isn't it possible to deselect single fields in a group of fields (with the same parent)? If this is by design in Dataverse, we would like to contribute the DwC version without groups.
Could you ship the new metadata block in some future release of Dataverse?

@jggautier And here is a screenshot of how the DwC schema looks in the Dataverse UI:
image

@jggautier
Contributor

Great news! Thanks @kamil386 and Olga!

Why isn't it possible to deselect single fields in a group of fields (with the same parent)?

Could you write more about what you mean?

I wonder if some of the fields are duplications of existing fields, such as License and Rights Holder, and if this duplication will confuse depositors and lessen the amount of metadata that Dataverse exports in other metadata standards (or make mapping to those standards a little more complicated).

@kamil386
Author

I mean that you can't deselect a single subfield from a "group" (strictly speaking, a parent field), as shown in the screenshot. I can select or deselect the whole "Geographic Coverage" group with all the subfields belonging to it, but I can't deselect, e.g., the "Geographic Coverage Other" subfield from this group. There is no way to select only a few necessary subfields from a group of, e.g., 50 subfields. The workaround is to use the DwC schema without groups/parents if Dataverse can't handle this case.

image

Yes, License and Rights Holder and a few others are duplicates of, or similar to, existing fields (not technically, since names are globally unique in SOLR), but we think the block should be compatible and consistent with the original DarwinCore schema. Of course, thanks to this great feature of Dataverse, users will be able to create their own subset of fields for their datasets according to their needs. We are open to discussing this and finding the best solution.

@jggautier
Contributor

jggautier commented Jan 21, 2020

Thanks for the screenshot. I see what you mean now about not being able to deselect a subfield. So in the example you gave, you imagine that a dataverse admin might want to hide/deselect the City field, so that dataset depositors don't see that field in the Geographic Coverage "parent" or compound field.

Screen Shot 2020-01-21 at 10 50 51 AM

You're right - it's not possible right now to deselect or hide a subfield of a parent field, and I couldn't find a GitHub issue that requests this functionality, so perhaps we could open an issue? I could see why it would be important for this Darwin Core metadata block, since one parent field has 21 subfields, and another has 44, and a depositor could be overwhelmed by the number of fields that she may not need to be concerned about.

But the Coverage group (or compound field) is a good example to talk about how Dataverse knows when each of these subfields is part of the same parent field, in this case Coverage. If each subfield is instead its own parent field, currently Dataverse won't know that a given Country/Nation and State/Province are part of the same Coverage. So the structure of that metadata should look something like:

  • Coverage 1
    • Country/Nation
    • State/Province
    • Other

But instead will be flat, like:

  • Country/Nation
  • State/Province
  • Other

(This structure is actually lost in some of Dataverse's metadata exports, which I consider a bug and I think is reported in other GitHub issues.)
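
The difference between the two structures can be sketched with plain Python data. This is only an illustration with made-up values and hypothetical key names, not Dataverse's internal model or export format: once compound values are flattened, nothing records which State/Province belonged to which Country/Nation.

```python
# Two coverage entries stored as compound (parent) values: each dict keeps
# its subfields together. Values and key names are made up for illustration.
coverages = [
    {"country": "Poland", "state": "Podlaskie"},
    {"country": "United Kingdom", "other": "London"},
]


def flatten(compound_values):
    """Collect subfield values by name, discarding the parent grouping."""
    flat = {}
    for value in compound_values:
        for subfield, text in value.items():
            flat.setdefault(subfield, []).append(text)
    return flat


flat = flatten(coverages)
print(flat)
# The flat form lists two countries and one state, but no longer says
# which country the state belongs to.
```

With only one value per parent field, as discussed in the next paragraph, the flat form loses nothing, which is why the flat layout may be acceptable when multiple values are not allowed.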

The tsv with the groups or compound fields includes a parent field called Occurrence that has 21 subfields and another parent field called Location with 44 subfields, and multiple values are not allowed for any of those parent fields or their subfields (allowMultiples is set to FALSE), so maybe this won't be a problem for that metadata. That is, maybe losing the relationship between parent and subfields won't be an issue if each dataset is only ever describing one "Occurrence" or one "Location." Does that make sense?

Regarding duplicate fields, I think it's optimistic to think that dataverse owners will know that duplicate fields exist, especially in self-curated repositories, so for example it's optimistic to think that dataverse owners would know that the License field in the Darwin Core metadata block can practically store the same information (or conflicting information) as the CC0 Waiver or Terms of Use fields in the Terms metadata tab, and that they will take that into account when customizing their dataverse's metadata fields and giving their depositors instructions. But this is just my hunch after seeing how metadata is entered in a repository with a lot of "self-curated" datasets.

A more solid problem is that anything entered in the new Darwin Core fields isn't mapped to fields in the metadata that Dataverse exports. For example, right now anything entered in DWC License field won't be included in the related fields in exports like Dublin Core, Schema.org and DataCite. How Dataverse metadata are mapped to fields in other standards would need to be adjusted, and then we would need to decide how to handle cases where someone uses CC-BY in the DWC License field and keeps CC0 in the Terms tab, whose fields Dataverse admins aren't able to hide/deselect.

@jggautier
Contributor

jggautier commented Jan 21, 2020

Could you write more about the metadata block being compatible and consistent with the original DarwinCore schema?

@kamil386
Author

You're right - it's not possible right now to deselect or hide a subfield of a parent field, and I couldn't find a GitHub issue that requests this functionality, so perhaps we could open an issue?

It would be a great feature so I'll open an issue.

I could see why it would be important for this Darwin Core metadata block, since one parent field has 21 subfields, and another has 44, and a depositor could be overwhelmed by the number of fields that she may not need to be concerned about.

I couldn't describe it more clearly; that's why we need that functionality. We even created the DwC schema without groups and are considering using it as a workaround. What's more, even the scientists and researchers who will create the metadata still haven't decided on the full range of DwC fields they will use. It seems to me that this is not an easy task with biodiversity data. I think that different data will need a different set of DwC fields in dataverses. This can be seen, for example, in the Darwin Core view of various objects in the NHM Data Portal.

(This structure is actually lost in some of Dataverse's metadata exports, which I consider a bug and I think is reported in other GitHub issues.)

Yes, it's lost, except in the JSON, OAI_ORE and Schema.org JSON-LD exports.

That is, maybe losing the relationship between parent and subfields won't be an issue if each dataset is only ever describing one "Occurrence" or one "Location." Does that make sense?

The schema with groups is necessary, because then we can easily add another group of fields with one click, and this hierarchy will be reflected in metadata exports (that's why we set the workaround without groups to not allow multiple values). We also need some changes in the UI, because right now Dataverse prints the fields of groups "flattened", e.g.:
Geographic Coverage
Country Province City
Country City

But this is just my hunch after seeing how metadata is entered in a repository with a lot of "self-curated" datasets.

You're absolutely right, thanks for pointing that out. The DwC schema needs some more work to handle this case and any others that appear; we need to take care of the details. It is true that instructions for depositors will not work, and there would be a mismatch between these fields.

A more solid problem is that anything entered in the new Darwin Core fields isn't mapped to fields in the metadata that Dataverse exports.

The new DwC metadata block is included only in the JSON and OAI_ORE exports. Currently that's enough for us to build an additional tool on top of Dataverse, similar to NHM's. In the future it would be nice if DwC (which is an extension of Dublin Core) were included in the Dublin Core metadata export and others, especially for Google and other machines, which will predictably use that metadata in the future.

Could you write more about the metadata block being compatible and consistent with the original DarwinCore schema?

Olga Kurek copied all the fields from the DarwinCore schema (https://dwc.tdwg.org/) to an Excel spreadsheet, so it's a 1:1 match. Categories/classes are mapped as groups/parents in the TSV (in the "DwC schema with groups"). Only dwc namespace is without colon char, because it's restricted char in SOLR.

@jggautier
Contributor

Thanks @kamil386! Is it right to say that you're proposing:

  • Revising the tsv file to avoid including fields that are already in Dataverse's citation metadata block and Terms tab?
  • Addressing any remaining export concerns later on?

I think I would be okay with both of these things if this is planned to be a metadata block that's added to Dataverse's "standard" metadata blocks.

When you mentioned that the metadata was tested, did that mean tested to make sure it worked technically and didn't cause bugs or tested with depositors or both?

@kamil386
Author

@jggautier Thanks for a good summary.

Yes, I'm proposing both, and I hope the second proposal won't break any compatibility in the future, but it's not a blocker for us and can wait. It will also require at least #6588 and preferably #6589; without them we'll need to revert to DwC without groups.
Jim Myers recently told me a lot about metadata, and we need to change the DwC schema according to his instructions:
https://groups.google.com/forum/#!topic/dataverse-community/uKretKox_io

BTW this probably won't be a problem anymore:

Only dwc namespace is without colon char, because it's restricted char in SOLR.

I tested it to make sure it worked technically and didn't cause bugs, but I can arrange some tests by scientists with real data.

@pdurbin
Member

pdurbin commented Oct 10, 2022

@kamil386 we now list "experimental metadata" in the appendix of the user guide like this: https://guides.dataverse.org/en/5.12/user/appendix.html#experimental-metadata

Screen Shot 2022-10-10 at 10 39 49 AM

Are you interested in advertising the Darwin Core metadata block here? If so, would you like to make a pull request? Thanks.

@cmbz

cmbz commented Sep 30, 2024

2024/09/30: @kamil386 please let us know if the suggestion here will work for you: #6243 (comment). We are closing the issue.

@cmbz closed this as not planned Sep 30, 2024
@github-project-automation github-project-automation bot moved this from 🔍 Interest to Done in Recherche Data Gouv Sep 30, 2024