No standard mapping from SciCat metadata to DataCite metadata #1175
Comments
Does it make sense for the core backend to provide DataCite metadata, or would it be sufficient to expand the oai-provider-service to provide more complete support?
Hi @sbliven, I don't have a strong opinion here, but I would like to share some thoughts and observations.

The primary goal for this issue is to make it easier to mint a DOI. Ideally, SciCat would provide a standard place (within the SciCat data model) for all DataCite metadata fields, so facilities don't require any customisation.

For minting DOIs, the metadata description (of a dataset) supplied to DataCite could come from OAI-PMH. Moreover, I believe there are generic DOI-minting services that acquire the DataCite metadata description by querying an OAI-PMH endpoint.

The current oai-provider-service seems to query the MongoDB directly, without using the SciCat data model classes. If so, there is a risk that changes to the underlying MongoDB storage layout could break OAI-PMH support unless a corresponding change is made to oai-provider-service, which makes combining these services somewhat fragile.

There are other potential scenarios that would benefit from having an easy, supported way of generating a description (of some dataset) that conforms to some external metadata standard. For example, a DOI landing page might include a JSON-LD description of a dataset based on Schema.org, to support harvesting of datasets.

So, I'm wondering whether backend could be updated to provide a REST API where the client can request a description of a dataset according to some external metadata standard (Dublin Core, DataCite, Schema.org, etc). The oai-provider-service could be refactored to take advantage of this API. Under this approach, the same API could also be used when minting DOIs. In some sense, such an interface would involve migrating some of the oai-provider-service functionality into backend, to allow more direct reuse in other scenarios.

Just to be clear: updating the oai-provider-service should be fine, but it might not be the best long-term approach.
@paulmillar if I understand correctly, you are suggesting extending the backend with an endpoint that returns the requested item in the requested format, where the field mapping between the format and the SciCat entity is configurable, although a default configuration should be provided. Marking this issue as "possible feature" and "needs discussion".
Hi @nitrosx, Sorry, the discussion perhaps went a little off on a tangent. Just to be clear, the goal would be to have a standard mapping from the SciCat data model to the DataCite metadata schema. This goal could be achieved by having a document (a text file, markdown, CSV, ODT, ...) that describes, for each DataCite metadata schema element (at least, those of interest) where that information is available from within the SciCat data model. For example:
The idea is that, if this mapping were standardised, DOI minting could be supported without requiring any facility-specific configuration or behaviour; that is, out-of-the-box support for DOI minting, with only the DataCite credentials left to configure. Instead of writing down the mapping as a human-readable document, the mapping could be described in some machine-actionable way (e.g., written in TypeScript). The API idea (allowing a client to ask backend for a DataCite description of a dataset) is just one way to use such a machine-actionable description. I hope that explains the issue a little better.
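To make the "machine-actionable" idea concrete, here is a minimal sketch of a mapping expressed as data in TypeScript. The `PublishedDataLike` interface, the `dataciteMapping` table and the `toDatacite` function are all hypothetical names for illustration; the field names are borrowed from the templates discussed in this thread, and the rules shown are not an agreed mapping.

```typescript
// Hypothetical subset of PublishedData; field names follow the templates in this thread.
interface PublishedDataLike {
  doi: string;
  title: string;
  abstract: string;
  publisher: string;
  publicationYear: number;
  resourceType: string;
}

// One mapping rule: a DataCite property and the SciCat field assumed to supply it.
interface MappingRule {
  datacite: string;
  scicat: keyof PublishedDataLike;
}

// Illustrative rules only; the authoritative mapping is what this issue asks for.
const dataciteMapping: MappingRule[] = [
  { datacite: "identifier", scicat: "doi" },
  { datacite: "titles/title", scicat: "title" },
  { datacite: "descriptions/description", scicat: "abstract" },
  { datacite: "publisher", scicat: "publisher" },
  { datacite: "publicationYear", scicat: "publicationYear" },
  { datacite: "resourceType", scicat: "resourceType" },
];

// Apply every rule to a record, yielding a DataCite property -> value lookup.
function toDatacite(record: PublishedDataLike): Record<string, string> {
  const out: Record<string, string> = {};
  for (const rule of dataciteMapping) {
    out[rule.datacite] = String(record[rule.scicat]);
  }
  return out;
}
```

Because the rules are plain data, the same table could be rendered as human-readable documentation or used by code that builds a DataCite record.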
@paulmillar would you like to contribute the first document in a PR?
A quick update. Previously, I hadn't appreciated that the metadata describing a dataset (used for DOI minting and OAI-PMH) is a separate class: I've gone through the <oai_dc:dc xmlns:oai_dc='http://www.openarchives.org/OAI/2.0/oai_dc/' xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:title>{{record.title}}</dc:title>
<dc:description>{{record.dataDescription}}</dc:description>
<dc:identifier>{{record[this.collection_id]}}</dc:identifier>
<dc:identifier>{{process.env.BASE_URL + "/detail/" + encodeURIComponent(record[this.collection_id])}}</dc:identifier>
<dc:date>{{record.publicationYear}}</dc:date>
<dc:creator>{{record.creator}}</dc:creator>
<dc:type>dataset</dc:type>
<dc:publisher>{{record.publisher}}</dc:publisher>
<dc:rights>Available to the public.</dc:rights>
</oai_dc:dc> and DataCite (again, as encoded as an OAI-PMH record): <datacite:resource xmlns:datacite="http://datacite.org/schema/kernel-4">
<datacite:titles>
<title>{{record.title}}</title>
</datacite:titles>
<datacite:identifier identifierType="URL">https://doi.org/{{record[this.collection_id].toString()}}</datacite:identifier>
<datacite:descriptions>
<description descriptionType="Abstract">
{{record.dataDescription}}
</description>
</datacite:descriptions>
<datacite:dates>
<datacite:date dateType="Issued">2020-01-01</datacite:date>
<datacite:date dateType="Available">2020-01-01</datacite:date>
</datacite:dates>
<datacite:publicationYear>{{record.publicationYear}}</datacite:publicationYear>
<datacite:creators>
<creator>
<creatorName>{{record.creator}}</creatorName>
<affiliation>{{record.affiliation}}</affiliation>
</creator>
</datacite:creators>
<datacite:publisher>{{record.publisher}}</datacite:publisher>
<datacite:rightsList>
<datacite:rights rightsURI="info:eu-repo/semantics/openAccess">OpenAccess</datacite:rights>
</datacite:rightsList>
</datacite:resource> In both cases, The DOI minting seems to use (simplifying slightly): <resource xmlns="http://datacite.org/schema/kernel-4">
<identifier identifierType="doi">${doi}</identifier>
<creators>
<creator>
<creatorName>${lastName}, ${firstName}</creatorName>
<givenName>${firstName}</givenName>
<familyName>${lastName}</familyName>
<affiliation>${affiliation}</affiliation>
</creator>
<!-- Repeated, as needed -->
</creators>
<titles>
<title>${title}</title>
</titles>
<publisher>${publisher}</publisher>
<publicationYear>${publicationYear}</publicationYear>
<descriptions>
<description xml:lang="en-us" descriptionType="Abstract">${abstract}</description>
</descriptions>
<resourceType resourceTypeGeneral="Dataset">${resourceType}</resourceType>
</resource> The part that's missing is which parts of Given this, instead of adding a new document, perhaps it would make more sense to update |
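As an illustration of how the minting template above could be filled in from a record, here is a minimal TypeScript sketch. The `MintingInput` and `MintingCreator` shapes and the function names are hypothetical; they simply mirror the placeholders in the template and are not the code SciCat actually uses. The sketch also escapes values before interpolating them, which is an addition of mine rather than something shown in the templates.

```typescript
// Escape the five characters that are special in XML text and attribute values.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical input shapes; names mirror the placeholders in the minting template.
interface MintingCreator {
  firstName: string;
  lastName: string;
  affiliation: string;
}

interface MintingInput {
  doi: string;
  title: string;
  publisher: string;
  publicationYear: number;
  abstract: string;
  resourceType: string;
  creators: MintingCreator[];
}

// Render one <creator> element.
function renderCreator(c: MintingCreator): string {
  return [
    "    <creator>",
    `      <creatorName>${escapeXml(c.lastName)}, ${escapeXml(c.firstName)}</creatorName>`,
    `      <givenName>${escapeXml(c.firstName)}</givenName>`,
    `      <familyName>${escapeXml(c.lastName)}</familyName>`,
    `      <affiliation>${escapeXml(c.affiliation)}</affiliation>`,
    "    </creator>",
  ].join("\n");
}

// Render the whole DataCite <resource> document for one record.
function renderDataciteXml(input: MintingInput): string {
  return [
    '<resource xmlns="http://datacite.org/schema/kernel-4">',
    `  <identifier identifierType="DOI">${escapeXml(input.doi)}</identifier>`,
    "  <creators>",
    input.creators.map(renderCreator).join("\n"),
    "  </creators>",
    "  <titles>",
    `    <title>${escapeXml(input.title)}</title>`,
    "  </titles>",
    `  <publisher>${escapeXml(input.publisher)}</publisher>`,
    `  <publicationYear>${input.publicationYear}</publicationYear>`,
    "  <descriptions>",
    `    <description xml:lang="en-us" descriptionType="Abstract">${escapeXml(input.abstract)}</description>`,
    "  </descriptions>",
    `  <resourceType resourceTypeGeneral="Dataset">${escapeXml(input.resourceType)}</resourceType>`,
    "</resource>",
  ].join("\n");
}
```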
@paulmillar thank you for the great investigative work.
@nitrosx One of my hopes here is that we can avoid facility-specific customisation. New features would likely be optional (not all facilities would support them), but if SciCat provides a "standard way of doing something" then it becomes easier for facilities to adopt that standard approach. To give a concrete example, with schema v4.5, DataCite introduced more complete support for DOIs as instrument PIDs. If instruments at a facility were to have DOIs (something in which SciCat might or might not have a role) then we would want a Dataset DOI to have DataCite metadata that links it to the corresponding instrument DOI. Similarly, published data currently cannot include ORCID and ROR identifiers in the DataCite metadata description. So, I would imagine that, over time, there would be a slow and steady improvement in the metadata that SciCat provides. In the short term, I think simply providing better documentation on the semantics of the different properties (e.g., OpenAPI/Swagger docs) would be a win.
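For illustration, the kind of DataCite fragments this would involve can be sketched in TypeScript as below. The `IdentifiedCreator` shape and both function names are hypothetical (as noted above, PublishedData does not carry these identifiers today), and the sketch assumes all values are already XML-safe. The element and attribute names are those of the DataCite schema, including the `IsCollectedBy` relation type added in v4.5.

```typescript
// Hypothetical creator shape carrying the identifiers discussed above.
interface IdentifiedCreator {
  name: string;         // "Family, Given"
  orcid?: string;       // bare ORCID iD, e.g. "0000-0002-1825-0097"
  affiliation?: string; // human-readable organisation name
  rorId?: string;       // full ROR URL for the affiliation
}

// Build a <creator> element, adding ORCID and ROR identifiers when present.
// Assumption: all values are already XML-safe.
function creatorWithIdentifiers(c: IdentifiedCreator): string {
  const lines = ["<creator>", `  <creatorName>${c.name}</creatorName>`];
  if (c.orcid) {
    lines.push(
      `  <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org">${c.orcid}</nameIdentifier>`,
    );
  }
  if (c.affiliation) {
    const ror = c.rorId
      ? ` affiliationIdentifier="${c.rorId}" affiliationIdentifierScheme="ROR" schemeURI="https://ror.org"`
      : "";
    lines.push(`  <affiliation${ror}>${c.affiliation}</affiliation>`);
  }
  lines.push("</creator>");
  return lines.join("\n");
}

// Link a dataset to the instrument that collected it, given the instrument's DOI.
function instrumentLink(instrumentDoi: string): string {
  return [
    "<relatedIdentifiers>",
    `  <relatedIdentifier relatedIdentifierType="DOI" relationType="IsCollectedBy">${instrumentDoi}</relatedIdentifier>`,
    "</relatedIdentifiers>",
  ].join("\n");
}
```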
Motivation:

Many of the fields exposed by the PublishedData endpoint currently have rather limited descriptions, with poorly defined semantics.

Modification:

Semantics of some fields were understood by examining how they are currently used both when minting a DOI and when building OAI-PMH records. The OpenAPI descriptions are updated following this understanding. An example DOI is added to the description of the `doi` field, to remove any ambiguity.

Result:

The PublishedData endpoint semantics are now better described.

Closes: SciCatProject#1175
Motivation:

Many of the fields exposed by the PublishedData endpoint currently have rather limited descriptions, with poorly defined semantics.

Modification:

Semantics of some fields were understood by examining how they are currently used both when minting a DOI and when building OAI-PMH records. The OpenAPI descriptions are updated following this understanding, by including the corresponding terms and including Markdown links that point to the definition. (Note: the links point to DataCite metadata v4.5 descriptions because it is with this version that the schema documentation became available as HTML.) An example DOI is added to the description of the `doi` field, to remove any ambiguity about how the DOI is represented. The description of the `pidArray` field is updated to remove ambiguity resulting from "pid" (meaning "persistent identifier") being an overloaded term.

Result:

The PublishedData endpoint semantics are now better described.

Closes: SciCatProject#1175
My proposal is that this issue is closed with pull request #1189. This pull request documents the mapping between the SciCat data model (and PublishedData, specifically) and the corresponding Dublin Core and DataCite schemata ... so, job done. We can then open another issue specifically about extending the PublishedData API to support clients requesting a description of the dataset under different metadata models (e.g., DataCite, Dublin Core, Schema.org). That new issue could refer to this issue (citing it as inspiration), so the above comments are not lost.
The SciCat data model has various fields that can describe a dataset. Together, these form a kind of metadata standard for describing datasets.
DataCite maintain a metadata standard that may be used when describing a dataset. Perhaps the most prominent use is when creating DOIs, but various OAI-PMH endpoints support clients querying for a DataCite description of their records.
Currently (to the best of my knowledge) SciCat does not provide a canonical mapping of SciCat metadata to DataCite metadata. The closest is partial support for such a mapping in the oai-provider-service module, which provides a (limited) DataCite-based description of datasets.
It would be helpful if SciCat provided a standard mapping from the SciCat data model to the DataCite metadata standard.
Simply documenting the mapping would be a good start, but perhaps the REST API could be extended so that it provides a way for a client to obtain a DataCite description of a given dataset.
Such support would make it easier to mint DOIs, as support for obtaining a DataCite description of the dataset would be provided by SciCat.
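As a rough sketch of what such an extension might look like behind the REST layer, the TypeScript below dispatches on a requested format name. Everything here is an assumption for illustration: the `DatasetRecord` shape, the format names and the `describeAs` function are hypothetical, and the output shapes are simplified rather than complete DataCite, Dublin Core or Schema.org records.

```typescript
// Hypothetical record shape, reusing field names that appear in this issue.
interface DatasetRecord {
  doi: string;
  title: string;
  abstract: string;
  publisher: string;
  publicationYear: number;
}

type MetadataFormat = "datacite" | "dublincore" | "schema.org";

// One serializer per supported metadata standard (simplified outputs).
const serializers: Record<MetadataFormat, (r: DatasetRecord) => object> = {
  datacite: (r) => ({
    identifier: r.doi,
    titles: [{ title: r.title }],
    publisher: r.publisher,
    publicationYear: r.publicationYear,
    descriptions: [{ description: r.abstract, descriptionType: "Abstract" }],
  }),
  dublincore: (r) => ({
    "dc:identifier": r.doi,
    "dc:title": r.title,
    "dc:description": r.abstract,
    "dc:publisher": r.publisher,
    "dc:date": String(r.publicationYear),
  }),
  "schema.org": (r) => ({
    "@context": "https://schema.org",
    "@type": "Dataset",
    identifier: `https://doi.org/${r.doi}`,
    name: r.title,
    description: r.abstract,
    publisher: { "@type": "Organization", name: r.publisher },
    datePublished: String(r.publicationYear),
  }),
};

// Return a description of the record in the requested format, or throw if unsupported.
function describeAs(record: DatasetRecord, format: string): object {
  if (!Object.prototype.hasOwnProperty.call(serializers, format)) {
    throw new Error(`Unsupported metadata format: ${format}`);
  }
  return serializers[format as MetadataFormat](record);
}
```

An endpoint could then pass a query parameter or an `Accept` header value through to `describeAs`, and both DOI minting and the oai-provider-service could call the same function.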