
Semantic Metadata API calls #6497

Closed
qqmyers opened this issue Jan 9, 2020 · 9 comments · Fixed by #7504
Labels
GDCC: SciencesPO related to GDCC work for Sciences PO

Comments


qqmyers commented Jan 9, 2020

The current API for getting, editing, and deleting dataset metadata requires understanding Dataverse's internal metadata representation and does not map local terms to global identifiers as is now supported in metadata blocks. In contrast, the OAI_ORE export exposes metadata in a way that is more repository-agnostic / globally interpretable (see the excerpt below).

The RDA grant to provide a standards-based archival format that can be exported/imported will need to import based on the OAI_ORE information. To do that, having a way to add/edit metadata, just given the information in the OAI_ORE file would be useful. (Nominally adding information about Dataverse's internal representation to the OAI_ORE would make it possible to format things as required by the current API, but part of the goal is to be interoperable with Bags/OAI_ORE files produced by other repositories which would not have this info.)

I propose adding new add/edit/delete API calls for dataset metadata with relatively flat JSON-LD payloads. These would support the RDA work, should be generally easier to use than the current calls, and could, I hope, be retained as further work to update Dataverse's metadata capabilities (which I understand may replace/supplement metadata blocks) moves forward.

The general concept would be to allow upload in a format similar to the following, which is a JSON-LD excerpt from an example dataset's ORE map. FWIW, I haven't fully scrubbed this example to remove things like the dataset version number, which is ~read-only, or the includedInDataCatalog entry, which represents the dataset's location in a dataverse rather than metadata. But it shows the general concept: there is just a set of key/value pairs where the keys are defined in the @context in terms of a global identifier that Dataverse or another repository can use to map each item to the correct internal field. (I.e., the initial implementation of these API calls would look at the global term and then look it up in the metadata block definition to identify how to store it, essentially reversing the logic used to produce the ORE map, which scans through metadata fields and looks at the block definition to find the global term.) Unless/until further work is done, e.g. to allow Dataverse to accept new metadata terms dynamically, these API calls could return an error if the metadata terms/values do not match the configured metadata blocks. Similarly, they could co-exist with the existing APIs, since they don't change anything w.r.t. the internal representation.

Given the relatively small size and short timescale for the RDA grant work, I'd appreciate any comments/concerns/suggested alternative approaches, etc. asap. (FWIW: I plan to start working on some of the logic to match a globally-defined term to an internal dataverse field to start - seems like this is going to be needed in some form to handle externally produced Bags, regardless of whether the metadata is eventually submitted through the existing apis or the ones proposed here.)

{
  "Title": "Halloween previewer testing",
  "Creator": {
    "author:Name": "Myers, Jim"
  },
  "citation:Contact": {
    "datasetContact:Name": "Myers, Jim",
    "datasetContact:E-mail": "qqmyers@hotmail.com"
  },
  "citation:Description": {
    "dsDescription:Text": "image and video - publishing to check in-page preview displays."
  },
  "Subject": "Computer and Information Science",
  "citation:Distributor": {
    "distributor:Name": "Qualitative Data Repository",
    "distributor:Affiliation": "Syracuse University",
    "distributor:Abbreviation": "QDR",
    "distributor:URL": "https://qdr.syr.edu"
  },
  "citation:Depositor": "Myers, Jim",
  "Deposit Date": "2019-11-01",
  "schema:version": "1.0",
  "schema:datePublished": "2019-11-26",
  "schema:name": "Halloween previewer testing",
  "schema:dateModified": "2019-11-26 21:58:45.161",
  "schema:license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "dvcore:fileTermsOfAccess": {
    "dvcore:fileRequestAccess": false
  },
  "schema:includedInDataCatalog": "QDR Main Collection",
  "@context": {
    "Creator": "http://purl.org/dc/terms/creator",
    "Deposit Date": "http://purl.org/dc/terms/dateSubmitted",
    "Identifier Scheme": "http://purl.org/spar/datacite/AgentIdentifierScheme",
    "Subject": "http://purl.org/dc/terms/subject",
    "Title": "http://purl.org/dc/terms/title",
    "author": "https://dataverse.org/schema/citation/author#",
    "citation": "https://dataverse.org/schema/citation/",
    "datasetContact": "https://dataverse.org/schema/citation/datasetContact#",
    "dcterms": "http://purl.org/dc/terms/",
    "distributor": "https://dataverse.org/schema/citation/distributor#",
    "dsDescription": "https://dataverse.org/schema/citation/dsDescription#",
    "dvcore": "https://dataverse.org/schema/core#",
    "keyword": "https://dataverse.org/schema/citation/keyword#",
    "ore": "http://www.openarchives.org/ore/terms/",
    "schema": "http://schema.org/"
  }
}
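To make the term-resolution idea above concrete, here is a minimal Python sketch of the proposed logic: expand each compact key in the payload against its @context to get a global IRI, then look that IRI up in the metadata block definitions to find the internal field. The FIELD_REGISTRY and field names below are invented for illustration; this is not the actual Dataverse implementation.

```python
# Illustrative sketch only: resolve compact JSON-LD keys to global term
# IRIs via @context, then map each IRI to a hypothetical internal field.
# FIELD_REGISTRY stands in for the configured metadata block definitions.

FIELD_REGISTRY = {  # global term IRI -> internal field name (invented)
    "http://purl.org/dc/terms/title": "title",
    "http://purl.org/dc/terms/subject": "subject",
}

def expand_key(key, context):
    """Turn a compact JSON-LD key into a full IRI using @context."""
    if key in context:                      # full-term alias, e.g. "Title"
        return context[key]
    if ":" in key:                          # prefixed term, e.g. "citation:Depositor"
        prefix, local = key.split(":", 1)
        if prefix in context:
            return context[prefix] + local
    return None                             # unrecognized term

def map_payload(doc):
    """Map a flat JSON-LD payload onto internal fields; collect unknowns."""
    context = doc.get("@context", {})
    mapped, unknown = {}, []
    for key, value in doc.items():
        if key == "@context":
            continue
        field = FIELD_REGISTRY.get(expand_key(key, context))
        if field is None:
            unknown.append(key)             # proposal: reject unknown terms
        else:
            mapped[field] = value
    return mapped, unknown

doc = {
    "Title": "Halloween previewer testing",
    "Subject": "Computer and Information Science",
    "@context": {
        "Title": "http://purl.org/dc/terms/title",
        "Subject": "http://purl.org/dc/terms/subject",
    },
}
mapped, unknown = map_payload(doc)
print(mapped)  # {'title': 'Halloween previewer testing', 'subject': 'Computer and Information Science'}
```

Returning the list of unrecognized terms separately matches the proposal's suggestion that the calls could error out when terms don't match configured metadata blocks.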


djbrooke commented Jan 9, 2020

Thanks @qqmyers for the detailed description here. Sounds cool. If the grant proposal is shareable, can you include a link? Any UI impacts, or just new APIs and backend changes?


qqmyers commented Jan 9, 2020

The proposal body is at https://docs.google.com/document/d/1L24GFIpp75Mc6PNodAk0619eLFAzhmFxv4zuX8Ihzd0/edit?usp=sharing. It promises a round-trip export/import of a dataset leveraging the OAI_ORE/BagIt mechanism already ~implemented for export, but doesn't get too specific about how. Initially I expect to make the DVUploader handle import, but that could turn into a button in Dataverse, etc. The intent/hope was to make this something the Dataverse community could adopt, and getting community input to refine the design and final deliverables is part of the plan. We're also expecting this to be an exemplar for how GDCC can coordinate community-requested developments going forward, so feedback on how to use issues, Google Docs, the community mailing list, meetings, etc. is all welcome.


pdurbin commented Jan 10, 2020

@qqmyers overall, I love the idea of being able to create a dataset in Dataverse using standards such as an ORE map (not that I'm very familiar with this specific standard).

Thanks also for linking to the grant proposal. I see BagPack is mentioned. Great. Do you plan to follow up on the conversation at https://groups.google.com/d/msg/dataverse-community/YlUErmoIl30/gkMXSmQPBQAJ with a link to this issue? Or I can if you want.

With regard to the JSON above, a few questions come to mind:


qqmyers commented Jan 10, 2020

@pdurbin - thanks - feel free to crosslink where appropriate. I've tried to look through issues to see how/where things might connect and I/we'll send community emails going forward, but any help in connecting the dots is appreciated.
W.r.t. questions:
First, the connection with standards is really to JSON-LD rather than ORE per se. ORE just covers how to describe that one thing contains others; one can then use other vocabularies to describe those things.

  • How do you create a dataset with multiple authors?
    To serialize this in JSON-LD, you'd just make 'Creator' an array of objects rather than a single object. The interoperability challenge here is what to do if a repository only allows one. Dataverse allows more than one, but there are metadata items where it expects a single value. Similar issues arise when a field is compound: it's pretty straightforward to serialize things in JSON-LD, and the interoperability challenge is in the next step beyond that, trying to interpret what's sent in terms of the repository's internal model.

  • Is the "dv" in "dvcore" for Dataverse? Where does this come from?
    Yes, the dvcore namespace is used for terms in Dataverse (internally or in the citation block) that don't match well with any in community vocabularies, either because no such term exists or because such a term has restrictions on its value that are inconsistent with Dataverse use. It is a namespace I invented when creating the original ORE export and updating the metadata block mechanism to allow terms to be associated with community vocabularies.

  • https://dataverse.org/schema/core# is a 404.
    Yes. My understanding is that it is valid to use a namespace/term URN that does not reference an online representation (e.g. people are represented by URNs in RDF/JSON-LD). That said, it might make sense to post a doc that provides a list/definitions for these terms. (I think this was discussed when I was originally proposing mappings for Dataverse terms, but no one was motivated at the time. That was probably reasonable, since these terms only appear now in the ORE/Bag stuff, but it becomes more relevant to document them and/or revisit whether there are good community terms to use if this becomes a public API as well.)

  • Since this is JSON-LD and so is SWORDv3 would this work make it easier to support SWORDv3 in the future? I'm not even sure how much interest or demand there is for SWORDv3 since when I asked for opinions here, no one replied (which is fine! 😄): https://groups.google.com/d/msg/dataverse-community/D7lRTA3f8Hc/G3ysd7NvAwAJ
    Possibly - I think this came up when I was working on the original ORE/Bag stuff. I looked at the draft SWORDv3 material then and saw the similarities. That said, JSON-LD is 'just' a serialization of a graph, and the bigger challenges are in matching the data models for different applications. In general, I think a JSON-LD-style API will be useful going forward in that it standardizes what no one wants to argue about (how to write subject-predicate-object statements using globally defined terms) and doesn't get in the way where they do (what terms can/may be used to describe something and what set of values is allowed). 😃
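To make the multiple-authors answer above concrete, here is an illustrative snippet building such a payload: 'Creator' becomes an array of objects rather than a single object. The second author's name is made up for illustration; this is a sketch of the serialization idea, not the actual API format.

```python
# Illustrative only: serializing the compound "Creator" field with
# multiple authors as a JSON-LD array of objects.
import json

payload = {
    "Creator": [
        {"author:Name": "Myers, Jim"},
        {"author:Name": "Author, Second"},  # hypothetical second author
    ],
    "@context": {
        "Creator": "http://purl.org/dc/terms/creator",
        "author": "https://dataverse.org/schema/citation/author#",
    },
}
print(json.dumps(payload, indent=2))
```

On import, a repository that only accepts a single creator would still have to decide how to handle the extra entries; the serialization itself is the easy part.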


qqmyers commented Jan 31, 2020

Per discussion, I'm going to move forward with implementing something here and will use API versioning in doing so.


pdurbin commented Feb 4, 2020

@qqmyers sounds good. Since there are purl URLs in this issue I'd like to point people to http://bitly.com/purl-crisis which is a Google doc ( https://docs.google.com/document/d/17TBUja8z8EJGx5ZEyknP3gWuPITf6lt-XkXDHU_CKaY/edit?usp=sharing ) I heard about last week in a lightning talk by @paulwalk at PIDapalooza 2020. From the doc, here are the issues he identified:

"Issues with PURL:

  • DCMI PURL namespace was corrupted by a malicious actor who started to create bogus sub-domains
  • ongoing inability to edit PURLs that have been tied to legacy user accounts (reported by several other organisations)
  • no response from support email address (reported by many)
  • PURL cannot compete with other PID systems (lack of investment in maintenance and further development, technical shortcomings)"

In short, he's calling anyone who wants to help with this to add themselves to the Google doc above and get involved. Here's the tweet from https://twitter.com/pidapalooza/status/1222814910209495041


@qqmyers to be clear, I'm not asking you to do anything. Again, I just wanted to mention it.


qqmyers commented Feb 4, 2020

Interesting. FWIW: the only purls here are ones assigned by orgs like Dublin Core, so presumably any action would come from them. Other than updating Dataverse if/when they switch, I don't see anything we can do in this API's design to help. In the larger work to reimport Bags, we should expect that we might have to translate from one term to another for reasons like this, and add support for that to the design.
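The term-translation idea mentioned above could be sketched as a simple alias table applied to an incoming @context on import. The alias pair below is invented purely for illustration, not a real migration decision.

```python
# Hypothetical sketch: rewrite term IRIs from a deprecated namespace to a
# preferred equivalent before mapping a Bag's metadata to internal fields.

TERM_ALIASES = {
    # old IRI -> preferred IRI (illustrative pair only)
    "http://purl.org/dc/terms/title": "http://schema.org/name",
}

def translate_context(context):
    """Return a copy of a JSON-LD @context with aliased IRIs rewritten."""
    return {key: TERM_ALIASES.get(iri, iri) for key, iri in context.items()}

ctx = {
    "Title": "http://purl.org/dc/terms/title",
    "Subject": "http://purl.org/dc/terms/subject",
}
print(translate_context(ctx))
# {'Title': 'http://schema.org/name', 'Subject': 'http://purl.org/dc/terms/subject'}
```

Keeping the table as data (rather than code) would make it easy to extend when handling Bags produced by other repositories that use different but equivalent terms.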

@skasberger

A flatter API JSON structure would definitely make sense, also for creating mappings from/to different data sources. I will do that, at least for pyDataverse, in the next two months, shifting this from the source-code level to an XML or JSON file level, which is easier to maintain and change. So I guess we have some similar issues here.

@qqmyers qqmyers added the GDCC: SciencesPO related to GDCC work for Sciences PO label Mar 8, 2021
qqmyers added a commit to GlobalDataverseCommunityConsortium/dataverse that referenced this issue Jul 22, 2021
add dev guide links to list of APIs IQSS#6497