Please add Darwin Core support #6243
Hi @kamil386. I don't know what this error message means. There are a few issues I see with the spreadsheet you linked to, but I don't know if any of them are what's causing that error response:
This GitHub issue title first made me think you were asking if Darwin Core could be added as a metadata block that comes with Dataverse (instead of a custom metadatablock that an installation would add). Is that the case? Or are you only asking for help to add the metadatablock to a Dataverse installation you're setting up?
Thanks @jggautier!
I was asking if Darwin Core could be added as a metadata block that comes with Dataverse, or more precisely whether Dataverse could support Darwin Core natively. Dataverse with Darwin Core could be adopted much more easily by the many public institutions dealing with biodiversity. But it would be a good starting point if adding this custom metadatablock to a Dataverse installation worked, even just for testing purposes in the meantime.
The list of five metadata blocks that ship with Dataverse at http://guides.dataverse.org/en/4.16/user/appendix.html has been frozen in time for a while. Recently for #3976 we added a sixth to that list, a "Journals" metadata block which has been available for a while but was never included in the list. The idea was never that the list of metadata blocks that ship with Dataverse would be frozen. The idea was that we'd put a few out there that are important to our original user base (social science) and others that we have some experience with (astronomy, etc.) to show what's possible. Then we would work with the community to add additional metadata blocks as "official" blocks that ship with Dataverse. @kamil386 my question to you is, is the Darwin Core support that you found in that spreadsheet good enough? Should we ship it? Or do you think it needs more work? Thanks!
@kamil386, I thought at first that you created that spreadsheet, but @pdurbin made me take a second look and I see now that someone on the Dataverse team did. Sorry about that. I second @pdurbin's interest in learning what you think about the metadata. It seems to be a subset of Darwin Core. Do you think it's an appropriate subset? Are fields missing? Also, were you trying to add it to your own installation so that you could see what it looks like in the UI?
First of all, thank you for your interest in this topic. I think that the Natural History Museum in London has the reference Darwin Core subset of fields we should follow, as it probably has the biggest collection of specimens across the full range of biodiversity (Botany, Entomology, Zoology, Palaeontology, Mineralogy). NHM has chosen 71 fields from Darwin Core, but to my knowledge even they don't use all of them. The scientists around the Bialowieza Forest have some precious and unique collections of specimens they want to digitize. They have been working on the DC schema for some time and haven't yet chosen a sufficient subset of fields. @pdurbin @jggautier This DC subset isn't good enough and it needs more work. Can we postpone this issue for a while? I think that we can collaborate on the appropriate subset. Most importantly, Dataverse's template logic makes it possible to choose a further subset of that subset that is better suited to adding customized metadata, e.g. for collections of mushrooms or skulls. @jggautier I was trying to add it to our installation, but because of this string error none of the fields appeared in the UI. As you noted, we should probably keep the number of fields in the UI to the necessary minimum. What do you think about that? We also plan to build an additional tool on top of Dataverse that will be useful for biodiversity portals and will also follow the NHM approach.
@kamil386 thanks for your continued interest in Darwin Core support! I told @jggautier last week and I meant to tell you that you shouldn't feel like you have to clean up the proposed metadata block alone. I would suggest emailing https://groups.google.com/forum/#!forum/dataverse-community and asking if anyone would like to help you take a look at the spreadsheet you found and help you make that Darwin Core custom metadata block production ready. Yes, Darwin Core fields will be searchable. No, fields will not be harvestable via OAI-PMH (some future work on the code side would be needed, please feel free to open an issue with a title like "allow fields from custom metadata blocks to be harvestable via OAI-PMH"). Thanks!
@kamil386 hi! I'm just checking in. If you're willing, I still think it would be good for someone to email the dataverse-community list to try to get some discussion going. I did mention Darwin Core in a recent post I made to that list called "custom metadata blocks now easier to spin up and evaluate" at https://groups.google.com/d/msg/dataverse-community/uKretKox_io/4FyPVAMYBgAJ
Olga Kurek from the Mammal Research Institute in Bialowieza (MRIPAS) created a Darwin Core schema, and after some tweaking, testing and multiple DB rollbacks I hope it is now production ready :) Version without groups: Version with groups (parent): Why isn't it possible to deselect single fields within a group of fields (the same parent)? If this is by Dataverse design, we would like to contribute the DwC version without groups. @jggautier And here is a screenshot of how the DwC schema looks in the Dataverse UI:
Great news! Thanks @kamil386 and Olga!
Could you write more about what you mean? I wonder if some of the fields are duplications of existing fields, such as License and Rights Holder, and if this duplication will confuse depositors and lessen the amount of metadata that Dataverse exports in other metadata standards (or make mapping to those standards a little more complicated).
I mean that you can't deselect a single subfield from the "group" (strictly speaking, the parent field), as shown in the screenshot. I can select/deselect the whole group "Geographic Coverage" with all the subfields belonging to the group, but I can't deselect e.g. the "Geographic Coverage Other" subfield from this group. There is no way to select only the few necessary subfields from a group of e.g. 50 subfields. The workaround is to use the DwC schema without groups/parents if Dataverse can't handle this case. Yes, License and Rights Holder and a few others are duplicates of, or similar to, existing fields (not technically, since names are globally unique in SOLR), but we think the block should be compatible and consistent with the original Darwin Core schema. Of course, thanks to this great feature of Dataverse, users will be able to create their own subset of fields for their datasets according to their needs. We are open to discussing this and finding the best solution.
Thanks for the screenshot. I see what you mean now about not being able to deselect a subfield. So in the example you gave, you imagine that a dataverse admin might want to hide/deselect the City field, so that dataset depositors don't see that field in the Geographic Coverage "parent" or compound field. You're right - it's not possible right now to deselect or hide a subfield of a parent field, and I couldn't find a GitHub issue that requests this functionality, so perhaps we could open an issue? I could see why it would be important for this Darwin Core metadata block, since one parent field has 21 subfields, and another has 44, and a depositor could be overwhelmed by the number of fields that she may not need to be concerned about. But the Coverage group (or compound field) is a good example to talk about how Dataverse knows when each of these subfields is part of the same parent field, in this case Coverage. If each subfield is instead its own parent field, currently Dataverse won't know that a given Country/Nation and State/Province are part of the same Coverage. So the structure of that metadata should look something like:
But instead will be flat, like:
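A rough Python sketch of the two shapes, using made-up values and hypothetical key names (this is not Dataverse's actual JSON layout). With compound entries each Country/Nation stays paired with its State/Province; once flattened, the pairing can no longer be recovered:

```python
# Hypothetical values and key names, for illustration only.
compound = [
    {"country": "Poland", "state": "Podlaskie"},
    {"country": "Germany"},  # no state recorded for this coverage
    {"country": "United Kingdom", "state": "England"},
]


def flatten(entries):
    """Collapse compound entries into one list of values per subfield name."""
    flat = {}
    for entry in entries:
        for name, value in entry.items():
            flat.setdefault(name, []).append(value)
    return flat


flat = flatten(compound)
print(flat)
# The flat form has three countries but only two states, so it is no longer
# clear which state belongs to which country.
```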
(This structure is actually lost in some of Dataverse's metadata exports, which I consider a bug and I think is reported in other GitHub issues.) The tsv with the groups or compound fields includes a parent field called Occurrence that has 21 subfields and another parent field called Location with 44 subfields, and multiple values are not allowed for any of those parent fields or their subfields (allowMultiples is set to FALSE), so maybe this won't be a problem for that metadata. That is, maybe losing the relationship between parent and subfields won't be an issue if each dataset is only ever describing one "Occurrence" or one "Location." Does that make sense? Regarding duplicate fields, I think it's optimistic to think that dataverse owners will know that duplicate fields exist, especially in self-curated repositories, so for example it's optimistic to think that dataverse owners would know that the License field in the Darwin Core metadata block can practically store the same information (or conflicting information) as the CC0 Waiver or Terms of Use fields in the Terms metadata tab, and that they will take that into account when customizing their dataverse's metadata fields and giving their depositors instructions. But this is just my hunch after seeing how metadata is entered in a repository with a lot of "self-curated" datasets. A more solid problem is that anything entered in the new Darwin Core fields isn't mapped to fields in the metadata that Dataverse exports. For example, right now anything entered in the DWC License field won't be included in the related fields in exports like Dublin Core, Schema.org and DataCite. How Dataverse metadata are mapped to fields in other standards would need to be adjusted, and then we would need to decide how to handle cases where someone uses CC-BY in the DWC License field and keeps CC0 in the Terms tab, whose fields Dataverse admins aren't able to hide/deselect.
Could you write more about the metadata block being compatible and consistent with the original Darwin Core schema?
It would be a great feature so I'll open an issue.
I couldn't describe it more clearly; that's why we need that functionality. We even created the DwC schema without groups and are considering using it as a workaround. What's more, even the scientists and researchers who will create the metadata have not yet settled on the full range of DwC fields they will use. It seems to me that this is not an easy task with biodiversity data. I think that different data will need a different scope of DwC fields in dataverses. This can be seen, for example, in the Darwin Core view of various objects in the NHM Data Portal.
Yes, it's lost, except in the JSON, OAI_ORE and Schema.org JSON-LD exports.
The schema with groups is necessary, because then we can easily add another group of fields with one click, and this hierarchy will be reflected in the metadata exports (that's why, in the workaround without groups, we set the fields not to allow multiple values). We also need some changes in the UI, because right now Dataverse displays the fields of a group "flattened", e.g.:
You're absolutely right, thanks for pointing that out. The DwC schema needs some more work to handle this case and any others that appear; we need to take care of the details. It is true that instructions to depositors will not be enough, and there would be a mismatch between these fields.
The new DwC schema metadata block is included only in the JSON and OAI_ORE exports. Currently that's enough for us to build an additional tool on top of Dataverse, similar to NHM's. In the future it would be nice if DwC (which is an extension of Dublin Core) were included in the Dublin Core metadata export and others, especially for Google and other machines, which will presumably use that metadata in the future.
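One way to check which exports carry the new block is to request each format from the dataset export API and look for the DwC field names. A minimal sketch that only builds the request URLs: the exporter names listed and the persistent identifier are assumptions for illustration, so check them against the API guide for your Dataverse version before relying on them.

```python
from urllib.parse import urlencode

# Assumed exporter names; verify against your installation's API guide.
EXPORTERS = ["dataverse_json", "OAI_ORE", "schema.org", "oai_dc", "Datacite"]


def build_export_url(server, persistent_id, exporter):
    """Build the URL for GET /api/datasets/export for one export format."""
    query = urlencode({"exporter": exporter, "persistentId": persistent_id})
    return f"{server.rstrip('/')}/api/datasets/export?{query}"


# Hypothetical DOI, used only to show the resulting URLs.
urls = [
    build_export_url("http://localhost:8080", "doi:10.5072/FK2/EXAMPLE", name)
    for name in EXPORTERS
]
for url in urls:
    print(url)
```

Each URL can then be fetched (for example with curl) and the response searched for a DwC field name to see whether that format includes the block.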
Olga Kurek copied all the fields from the Darwin Core schema (https://dwc.tdwg.org/) into an Excel spreadsheet, so it's a 1:1 match. Categories/Classes are mapped as groups/parents in the TSV (in "DwC schema with groups"). The only difference is that the dwc namespace is written without the colon character, because it's a restricted character in SOLR.
Thanks @kamil386! Is it right to say that you're proposing:
I think I would be okay with both of these things if this is planned to be a metadata block that's added to Dataverse's "standard" metadata blocks. When you mentioned that the metadata was tested, did that mean tested to make sure it worked technically and didn't cause bugs or tested with depositors or both?
@jggautier Thanks for a good summary. Yes, I'm proposing both, and I hope the second proposal won't break any compatibility in the future, but it's not a blocker for us and can wait. It will also require at least #6588 and preferably #6589; without these we'll need to revert to DwC without groups. BTW this probably won't be a problem anymore:
I tested it to make sure it worked technically and didn't cause bugs, but I can arrange some tests by scientists with real data.
@kamil386 we now list "experimental metadata" in the appendix of the user guide like this: https://guides.dataverse.org/en/5.12/user/appendix.html#experimental-metadata Are you interested in advertising the Darwin Core metadata block here? If so, would you like to make a pull request? Thanks.
2024/09/30: @kamil386 please let us know if the suggestion here will work for you: #6243 (comment). We are closing the issue.
I downloaded the spreadsheet https://docs.google.com/spreadsheets/d/1P9xvaRLhCKsYmjz9eXXVl0T9d2U34UgynbvxDp-2Bjc/edit#gid=1331272861 as TSV
and ran:
curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file /tmp/Comparative\ Zoology\ _\ Darwin\ Core\ Metadata\ -\ Sheet2.tsv
but received an error response:
{"status":"ERROR","message":"For input string: \"\""}
That custom metadata block for Darwin Core does not work.
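The wording `For input string: ""` matches what Java reports when an empty string is parsed as a number, so one plausible cause is a blank cell in a column the loader reads as an integer. The sketch below scans a metadata block TSV for such cells. It assumes that section headers start with `#` (for example `#datasetField`) and that `displayOrder` is a numeric column; both are assumptions about the file layout, not something confirmed for this spreadsheet.

```python
import csv
import io

# Assumption: columns the loader parses as integers.
NUMERIC_COLUMNS = {"displayOrder"}


def find_empty_numeric_cells(tsv_text, numeric_columns=NUMERIC_COLUMNS):
    """Return (row_number, column_name) pairs whose numeric cell is blank."""
    problems = []
    header = None
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        if not any(cell.strip() for cell in row):
            continue  # skip completely blank rows
        if row[0].startswith("#"):
            header = [cell.strip() for cell in row]  # new section header
            continue
        if header is None:
            continue  # data before any header cannot be interpreted
        for index, column in enumerate(header):
            if column in numeric_columns:
                value = row[index] if index < len(row) else ""
                if value.strip() == "":
                    problems.append((row_number, column))
    return problems


# Tiny made-up sample: the second data row has a blank displayOrder.
sample = (
    "#datasetField\tname\ttitle\tdisplayOrder\n"
    "\tdwcCountry\tCountry\t0\n"
    "\tdwcCity\tCity\t\n"
)
print(find_empty_numeric_cells(sample))
```

To check a real file, read it with `open(path, encoding="utf-8", newline="").read()` and pass the text to `find_empty_numeric_cells`.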
Comparative Zoology _ Darwin Core Metadata - Sheet2.zip