
records-api: add Data Package serializer #1905

Open
slint wants to merge 1 commit into master from datapackage-accept
Conversation

@slint (Member) commented Dec 9, 2024

  • Exposes the Data Package serializer under a JSON-LD serialization with
    the appropriate profile parameter.

slint force-pushed the datapackage-accept branch from 7b6915f to a6c7b9f on December 9, 2024 at 23:28
@slint (Member, Author) commented Dec 10, 2024

@roll I've hooked in the Data Package serializer with the MIME type I mentioned in your original PR. See also inveniosoftware/invenio-app-rdm#2936 for the integration under the "Export" menu on each record.

I assume the goal of this is to integrate with the Frictionless Data libraries, e.g. for fetching datasets, so let me know how we can help with that.

@roll (Contributor) commented Dec 10, 2024

Hi @slint, great thanks!

As I mentioned in the original PR, the application/ld+json;profile=PROFILE MIME type might not work for Data Package, since https://datapackage.org/profiles/2.0/datapackage.json is a JSON Schema, not a JSON-LD profile.

I think the most appropriate type would be just application/json, as application/schema+json is still a draft: https://stackoverflow.com/a/60148824

@roll (Contributor) commented Dec 10, 2024

So, yes, once the {dataset}/datapackage.json endpoint is available, all the Frictionless software like https://github.com/frictionlessdata/frictionless-py or https://github.com/frictionlessdata/frictionless-r will be able to access and explore any dataset in one line of code, like package = Package('{dataset}/datapackage.json') 🚀
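
For illustration, that one-liner with frictionless-py would look roughly like this (a minimal sketch; the record URL below is a hypothetical placeholder for a portal that exposes the endpoint discussed here):

```python
# Minimal sketch using frictionless-py; assumes the portal serves a
# datapackage.json for the record (the URL below is hypothetical).
from frictionless import Package

package = Package("https://zenodo.org/records/1234567/datapackage.json")

# Inspect the dataset's metadata and resources
print(package.title)
for resource in package.resources:
    print(resource.name, resource.path)
```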

@slint (Member, Author) commented Dec 16, 2024

The main issue is that we use Content-Types to differentiate between the different serialization formats in REST API responses, and application/json is already "taken" by the default JSON serialization of InvenioRDM.

Indeed, as you mentioned, there's still a draft proposal, similar to the profile=... convention, for something along the lines of application/schema-instance+json;schema="https://datapackage.org/profiles/2.0/datapackage.json".

I think there are two separate issues/solutions to address:

  • Figuring out the MIMEType to content-negotiate Data Package JSON responses
  • Adding a {dataset}/datapackage.json endpoint, which from what I understand is the current convention for retrieving Data Package metadata from a dataset. In our case that would be something like /records/{id}/datapackage.json. This endpoint should:
    • Check if there is a datapackage.json file already as part of the dataset itself and serve it
    • Fallback to the "global" serializer

If either of the above makes sense, I can adjust the implementation so that we at least get something in that works with the Frictionless libraries.

@roll (Contributor) commented Dec 16, 2024

Adding a {dataset}/datapackage.json endpoint, which from what I understand is the current convention for retrieving Data Package metadata from a dataset. In our case that would be something like /records/{id}/datapackage.json. This endpoint should:

Yes, so basically the only API that Data Package tooling uses is a link to a datapackage.json, so having https://data-portal/records/{id}/datapackage.json is all that's needed for the full integration.

Check if there is a datapackage.json file already as part of the dataset itself and serve it

If it's possible to do, it would be amazing! Data publishers really appreciate it when it's integrated into the system (so they don't need to share a custom data package link separately). cc @peterdesmet

Figuring out the MIMEType to content-negotiate Data Package JSON responses

In our tooling we don't use the MIME type, so technically any will work. I think we can use any correct one, e.g. the one you proposed (application/schema-instance+json;schema="https://datapackage.org/profiles/2.0/datapackage.json") or even something like application/vnd.datapackage+json.

@peterdesmet commented

Neat! Regarding:

Check if there is a datapackage.json file already as part of the dataset itself and serve it

  • What would the endpoint return if a datapackage.json file was not present (i.e. what do you lose by overwriting it with your own datapackage.json)?
  • Does the datapackage.json file need to be present as a separate file named datapackage.json in the root of the deposit? In my opinion yes (so no support for a datapackage.json file that is part of a zip).

If you want an example, see: https://zenodo.org/records/10053903

@slint (Member, Author) commented Dec 16, 2024

Figuring out the MIMEType to content-negotiate Data Package JSON responses

In our tooling we don't use the MIME type, so technically any will work. I think we can use any correct one, e.g. the one you proposed (application/schema-instance+json;schema="https://datapackage.org/profiles/2.0/datapackage.json") or even something like application/vnd.datapackage+json.

Since the JSON Schema draft proposals seem sort of "frozen" at the moment, and I can't really tell whether the schema=... variation is something that's generally accepted but not yet published, I would go with something like application/vnd.datapackage+json for now, if that's ok with you. In the future, if things become clearer, we can add the new JSON Schema-"blessed" MIME type (and potentially deprecate the vnd.* variant).
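
For illustration, a client could then content-negotiate the Data Package serialization roughly like this (a sketch only: the record ID is hypothetical, and the Accept value reflects the proposal in this thread, not a shipped feature):

```python
# Sketch of requesting the Data Package serialization of a record via the
# Accept header; the vnd.* MIME type is the proposal discussed above.
import requests

resp = requests.get(
    "https://zenodo.org/api/records/1234567",
    headers={"Accept": "application/vnd.datapackage+json"},
)
datapackage = resp.json()
print(datapackage.get("resources", []))
```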


Check if there is a datapackage.json file already as part of the dataset itself and serve it

If it's possible to do it will be amazing! Data publishers really appreciate when it's integrated into the system (so they don't need to share a custom data package link separately)

  • What would the endpoint return if a datapackage.json file was not present (i.e. what do you lose by overwriting it with your own datapackage.json)?
  • Does the datapackage.json file need to be present as a separate file named datapackage.json in the root of the deposit? In my opinion yes (so no support for a datapackage.json file that is part of a zip).

If you want an example, see: https://zenodo.org/records/10053903

What I had in mind e.g. for the record at https://zenodo.org/records/10053903 when accessing the https://zenodo.org/records/10053903/datapackage.json endpoint*:

  • If it has a datapackage.json file (like in this case), it would return that file as-is. Basically, we're choosing to "trust" the record uploader and the file that they have provided to be the "standard" Data Package JSON for this record
    • This would help users take "manual" control of how rich their Data Package metadata can be.
    • It does indeed have the caveat that a datapackage.json file inside an archive (ZIP, tar, etc.) wouldn't be taken into account
  • If the record did not have a datapackage.json we would serve the autogenerated Data Package JSON from the serializer.
    • It could be less rich than one provided purposefully by the uploader. Still, at least there would be something returned with correct autogenerated content concerning the files that the record has.

*Note that the /records/{id}/datapackage.json endpoint I'm referring to is not the same as the one we have for downloading files of a record (i.e. /records/{id}/files/{file_key}).
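
As a rough illustration of that fallback logic, a minimal Flask-style sketch could look like the following (the record store and serializer are hypothetical stubs standing in for InvenioRDM internals, not the actual implementation):

```python
# Sketch of the /records/{id}/datapackage.json fallback behaviour described
# above; RECORDS, find_uploaded_datapackage and serialize_record_as_datapackage
# are hypothetical stand-ins for the real record store and serializer.
from flask import Flask, Response, abort, jsonify

app = Flask(__name__)

RECORDS = {"1234567": {"files": {}, "metadata": {"title": "Example dataset"}}}

def find_uploaded_datapackage(record):
    """Return the raw datapackage.json uploaded with the record, if any."""
    return record["files"].get("datapackage.json")

def serialize_record_as_datapackage(record):
    """Autogenerate Data Package JSON from the record's metadata and files."""
    return {"title": record["metadata"]["title"], "resources": []}

@app.route("/records/<record_id>/datapackage.json")
def datapackage_view(record_id):
    record = RECORDS.get(record_id)
    if record is None:
        abort(404)
    uploaded = find_uploaded_datapackage(record)
    if uploaded is not None:
        # "Trust" the uploader: serve their datapackage.json as-is.
        return Response(uploaded, mimetype="application/vnd.datapackage+json")
    # Otherwise fall back to the autogenerated Data Package JSON.
    return jsonify(serialize_record_as_datapackage(record))
```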


One of my concerns with allowing users to provide their own datapackage.json as a regular file in the main record fileset is that it's "frozen" in time, since we don't allow changing the files of a published record that has a DOI registered for it. The only way to change files is by creating a new version, which by extension means registering a new DOI (similar to how the Git VCS works).

Since the datapackage.json is basically "metadata" for describing the dataset, it really suffers from this limitation: if the uploader wishes to enhance their dataset's Data Package JSON metadata, they'll have to create a new version with a new DOI, just to update metadata...

We have two solutions in the works for circumventing this limitation:

  • allow users to edit records and supply additional metadata for their files, in the same way as editing their record metadata (e.g. title, creators, description, etc.)
  • allow users to upload a set of separate special "metadata files", that can be updated at any time without the need to create new versions and DOIs

There are no short or mid-term plans for developing these features at the moment, but just something to have in mind :)

@roll (Contributor) commented Dec 17, 2024

Since the JSON Schema draft proposals seem sort of "frozen" at the moment, and I can't really tell whether the schema=... variation is something that's generally accepted but not yet published, I would go with something like application/vnd.datapackage+json for now, if that's ok with you. In the future, if things become clearer, we can add the new JSON Schema-"blessed" MIME type (and potentially deprecate the vnd.* variant).

Yeah, that totally makes sense. Also, we're going to try to register the MIME type this year, so it might end up as something like application/datapackage+json, which can be updated in RDM later.

@roll (Contributor) commented Dec 17, 2024

What I had in mind e.g. for the record at https://zenodo.org/records/10053903 when accessing the https://zenodo.org/records/10053903/datapackage.json endpoint*:

  • If it has a datapackage.json file (like in this case), it would return that file as-is. Basically, we're choosing to "trust" the record uploader and the file that they have provided to be the "standard" Data Package JSON for this record
    • This would help users take "manual" control of how rich their Data Package metadata can be.
    • It does indeed have the caveat that a datapackage.json file inside an archive (ZIP, tar, etc.) wouldn't be taken into account
  • If the record did not have a datapackage.json we would serve the autogenerated Data Package JSON from the serializer.
    • It could be less rich than one provided purposefully by the uploader. Still, at least there would be something returned with correct autogenerated content concerning the files that the record has.

*Note that the /records/{id}/datapackage.json endpoint I'm referring to is not the same as the one we have for downloading files of a record (i.e. /records/{id}/files/{file_key}).

I agree, as it's the exact workflow I had in mind (TBH, I didn't expect we could get there in the first iteration), and it's the same one I'm going to propose to the CKAN community.

Regarding immutability, I would pass this question to @peterdesmet, but personally I don't see a problem here: data + metadata is an artifact, so if the platform (RDM in this case) operates with immutable versioned artifacts, then data publishers just have to use this model and publish accordingly (e.g. develop a dataset on GitHub and then publish to Zenodo when it's released and immutable). As @peterdesmet and others already publish datapackage.json with this model, I guess it works fine for them.

@peterdesmet commented

@slint everything in your #1905 (comment) makes sense to me 👍

Regarding immutability:

As @peterdesmet and others already publish datapackage.json with this model I guess it works fine for them.

Well, uploading an immutable datapackage.json is the only workflow that is currently available, so that is the one I use. 😄 It is the reason, though, why I keep datapackage.json limited to a technical description of the resources. I don't add the license, contributors, etc., because a) I have to add that to the deposit metadata anyway, and b) I can't correct it like I can correct the deposit metadata. The only real metadata element I add is the DOI (so software reading that file has a link to more metadata), like:

"id": "https://doi.org/10.5281/zenodo.10053903"

Sometimes I make mistakes (like here) or forget to upload the datapackage.json altogether (like here), which I can only fix by creating a new version (and writing something in Notes for the incorrect version). This proposal:

allow users to upload a set of separate special "metadata files", that can be updated at any time without the need to create new versions and DOIs

Sounds like a very good solution to me.

@roll (Contributor) commented Dec 17, 2024

Would it make sense and be possible to merge user's datapackage.json and system's datapackage.json then?

@peterdesmet commented

Would it make sense and be possible to merge user's datapackage.json and system's datapackage.json then?

It would have the benefit that the user's datapackage.json is enriched with metadata recorded in the deposit. It would require rules, though (e.g. if a property is defined in the user's datapackage.json, then don't overwrite it with the system's), which could get fairly complicated and could potentially create invalid packages (e.g. one with a generated v2 $schema while the user defined a v1 profile). So I'm fine with picking one or the other if that is more feasible.
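
To make the kind of rule concrete, here is a naive sketch (not a proposed implementation) where top-level user-provided properties win over system-generated ones; the trailing note shows how even this simple rule can yield an invalid package:

```python
# Naive merge sketch: user-provided top-level properties override the
# system-generated ones. Real rules would also need to handle nested
# resources, licenses, and $schema/profile version mismatches.
def merge_datapackages(system: dict, user: dict) -> dict:
    merged = dict(system)
    merged.update(user)
    return merged

system_dp = {
    "$schema": "https://datapackage.org/profiles/2.0/datapackage.json",
    "title": "Autogenerated title",
    "resources": [],
}
user_dp = {"profile": "tabular-data-package", "title": "My dataset"}  # v1-style

print(merge_datapackages(system_dp, user_dp))
# The result mixes a generated v2 $schema with a user-defined v1 profile,
# which is exactly the kind of invalid package mentioned above.
```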

@roll (Contributor) commented Dec 19, 2024

Shall we just start with the simplest solution, i.e. having the system's Data Package at /records/{id}/datapackage.json, and then expand it (e.g. with the ability to override or merge) based on feedback from the community and the Data Package Working Group? It feels like the safest bet to me for now. WDYT?

There are also other options, like having a merging mechanism in the standard itself (like extends in tsconfig.json) for cases when the metadata is split across two places, as in Peter's use case.
