RO-Crate exporter PoC #10086
Conversation
Implements an exporter that generates an RO-Crate JSON representation of a dataset, using the metadatablocks of the dataset as the schema of the RO-Crate. This class has been extracted from the ARP project (https://science-research-data.hu/en) in the framework of FAIR-IMPACT's 1st Open Call "Enabling FAIR Signposting and RO-Crate for content/metadata discovery and consumption".
Interesting work! Two quick comments: First, fields should always get a valid context entry. If the field and the metadatablock don't have an assigned URI, one should be created via getJsonLDNamespace(), which uses getAssignedNamespaceUri() to make sure a valid URI is always available.
Second, as of 5.14/6.0, exporters can be created as stand-alone jars - see https://github.com/gdcc/dataverse-exporters (which is linked from https://guides.dataverse.org/en/latest/developers/metadataexport.html#building-an-exporter). Although this is a new feature, I think it is probably now the preferred route for adding new exporters (versus a PR to add them to the Dataverse code base). I think this would be the first real exporter to use that mechanism, and I'd/we'd be happy to help if you run into any problems. (Hopefully not much of a code change for you, as you're already using the new ExportDataProvider interface.)
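The fallback chain described here can be sketched in plain Java. `contextUri` below is a hypothetical stand-in for the behavior of `getJsonLDNamespace()`/`getAssignedNamespaceUri()`, not the actual Dataverse code; in particular, the generated-URI pattern at the end is an assumption for illustration:

```java
public class ContextUriSketch {

    // Hypothetical mirror of the fallback chain: prefer the field's explicit URI,
    // then the metadatablock's namespaceUri + field name, and finally a generated
    // URI, so that @context never ends up containing null.
    static String contextUri(String fieldName, String explicitUri,
                             String blockNamespaceUri, String installationUrl) {
        if (explicitUri != null) {
            return explicitUri;
        }
        if (blockNamespaceUri != null) {
            return blockNamespaceUri + fieldName;
        }
        // Assumed fallback: derive a URI from the installation's own URL.
        return installationUrl + "/schema/" + fieldName;
    }

    public static void main(String[] args) {
        // "title" has an explicit URI in the citation MDB
        System.out.println(contextUri("title", "http://purl.org/dc/terms/title",
                null, "https://demo.dataverse.org"));
        // a field whose block has a namespaceUri
        System.out.println(contextUri("country", null,
                "https://dataverse.org/schema/geospatial/", "https://demo.dataverse.org"));
        // neither: fall back to a generated URI instead of null
        System.out.println(contextUri("unitOfAnalysis", null, null,
                "https://demo.dataverse.org"));
    }
}
```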
@qqmyers, thanks for the suggestions!
Unfortunately
Sure, I could try that. One thing is that ExportDataProvider provides JSON/XML/etc. formats of the metadata, while our RoCrateManager works directly with the Dataset/DatasetField/DatasetFieldType/etc. objects. Moreover, the ultimate goal is to generate a downloadable RO-Crate .zip at the end, which packages all the files in the dataset together with the ro-crate-metadata.json. This would require deeper integration and UI changes.
Ah - I didn't look closely enough to see that you are going back to using the internal classes. Nominally the ExportDataProvider.getDatasetORE() gives you the json-ld context and 'everything' you'd need (all metadata, links to retrieve all the files), so you shouldn't have to go back to the database. (There are a couple of known missing things in the ORE output, e.g. auxiliary files, which were added after the ORE/archival bags functionality. If there's anything you need for an RO-Crate that isn't available, we should add it.)
Great work, and as I said in the container meeting this morning, it's fun that you used containers! I let the RO-Crate folks know about it: https://seek4science.slack.com/archives/C01LQQAAAS1/p1698956581547419 Also (back to containers), I'm glad it motivated you to work on faster redeploys! I can't wait to try this. Thanks!!
Was curious about what Java library was used (I use RO-Crate in Python, but I'm a former Java developer). The code looks interesting! It's a bit hard to follow all the for-loops and what's happening without knowing Dataverse. But if you share some generated JSON files, maybe I can have a look and run them through runcrate too (if you haven't done it). @simleo knows RO-Crate a lot more, and is a master at spotting issues in these JSON files 👍
@kinow, thanks for the suggestions! We use https://github.com/kit-data-manager/ro-crate-java. It's quite a nice library, but sometimes a bit rigid. Here's an example ro-crate-metadata.json with multiple metadatablocks used:

And regarding the implementation: it could be improved in many ways, but we just wanted to start a discussion about whether this MDB-based approach or a Schema.org-mapped one is a better solution for Dataverse. Or maybe both, depending on the user's use case. The Schema.org-mapped version could provide easier transfer to other repositories, while the MDB-based approach could allow easier transfer to other DV installations.

Moreover, in our ARP project we use RO-Crate to extend the file metadata capabilities of Dataverse. Namely, in its current form DV only provides minimal metadata for files. In our system we allow editing the ro-crate-metadata.json to add (practically any) metadata to files, so that when the dataset is exported as an RO-Crate you get all the metadata you entered in Dataverse plus all additional metadata you added at the "RO-Crate layer".
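Since the attached example file isn't reproduced here, the following is a minimal sketch of the overall shape of such a ro-crate-metadata.json: an extended `@context` (the array form from the RO-Crate 1.1 spec) that maps an MDB field to a namespace URI. The extra term and its URI are illustrative stand-ins, not the PR's actual output:

```json
{
  "@context": [
    "https://w3id.org/ro/crate/1.1/context",
    { "unitOfAnalysis": "https://dataverse.org/schema/socialscience/unitOfAnalysis" }
  ],
  "@graph": [
    {
      "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork",
      "conformsTo": { "@id": "https://w3id.org/ro/crate/1.1" },
      "about": { "@id": "./" }
    },
    {
      "@id": "./",
      "@type": "Dataset",
      "name": "Example dataset",
      "unitOfAnalysis": "household"
    }
  ]
}
```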
Saw that in the
Got it!
I am no expert, so I rely on … I think that's because you have no action in your file. You don't have the provenance of what was used to generate the data? A person, a workflow, some software? If so, I think a …

Cheers
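For context, the kind of "action" being asked about here is typically expressed in the Workflow Run RO-Crate profiles as a `CreateAction` entity linking inputs, outputs, the software used, and the person who ran it. A minimal illustrative `@graph` entry (all IDs and names made up) might look like:

```json
{
  "@id": "#run-1",
  "@type": "CreateAction",
  "name": "Run of the data generation script",
  "agent": { "@id": "https://orcid.org/0000-0000-0000-0000" },
  "instrument": { "@id": "generate.py" },
  "object": [ { "@id": "input.csv" } ],
  "result": [ { "@id": "output.csv" } ]
}
```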
FWIW: In general, Dataverse maps metadata items to schema.org if there's a term with the same meaning and range. If there are more that we could map, I think there would be interest in updating the metadata blocks, etc. Due to that, I'm not sure there's much difference between a schema.org vs. MDB approach, other than that a schema.org-only export would be lossy.
This looks so interesting! We are also working on an external exporter but took the first strategy instead, with customizability to counter the disadvantage of that approach.
@kinow Re your comments about runcrate above -- note that runcrate is a family of RO-Crate profiles for a particular purpose (documenting the outcomes of workflows); a general-purpose repository like Dataverse is more geared to general descriptions of many kinds of data. Not that workflows might not be some of them, but not all RO-Crates are expected to be runcrate-compatible.
Just to clarify: the collection of RO-Crate profiles for capturing workflow provenance is called Workflow Run RO-Crate, while runcrate is a toolkit to interact with RO-Crates that follow the Workflow Run RO-Crate profiles. |
Merge v6.1 into master
Related: Also, @beepsoft I just noted there are merge conflicts. Would you be able to resolve them? Thanks!
Merge 6.2 into master
If you are still interested in this PR, can you please merge and resolve any merge conflicts with the latest from develop? If so, we can prioritize reviewing and QAing the changes. If we don’t hear from you by May 22, 2024, we’ll go ahead and close this PR (it can always be reopened after that date, if there is still interest).
Thanks @scolapasta for the heads up! I just fixed the issues, sorry for the long delay. Do you have any other suggestions for updates to this PR? For example, it is now added as a default exporter, so once it is merged it will be available in all Dataverse installations. Can this be made an optional exporter somehow, which users can install if they need it?
FWIW: There is now an external exporter mechanism and separate repo for creating exporters that can be dropped into an installation as a separate jar file. See https://github.com/gdcc/dataverse-exporters and the examples and PRs there for details. That would make the exporter optional and something you could update outside the Dataverse release cycle.
@qqmyers the problem with the ExportDataProvider I faced was that it only provides the dataset JSON, and for the RO-Crate I needed a deeper understanding of the dataset; for example, I needed to access the …
I haven't tried it but I think you may be able to access those classes in an external exporter - you would just have to build against Dataverse rather than the smaller spi jar. That said, the intent with the interface was to give you all the info needed to create an export so if there's additional information that needs to get sent, let us know and perhaps we can add it. |
Probably, based on a combination of the JSON and OAI_ORE exports, the necessary data could be gathered, but we started implementing our solution before the new exporter API was released.
@beepsoft |
This PR was in "ready for review" so I picked it up, but my understanding is that we have three different RO-Crate implementations to evaluate:
Generally speaking, now that we have an external exporter framework, we are favoring implementations that use it for a number of reasons:
So at a minimum, I'd like the guides to be updated to reference one or more external RO-Crate exporters. As for this pull request, there are merge conflicts at present. @beepsoft please feel free to resolve them, but I think I'm going to play around first with @okaradeniz's exporter. Then I'm hoping to look at the other two.
@beepsoft sounds good. Thanks. I created a new issue to track testing those other two implementations:
What this PR does / why we need it:
This PR is a proof of concept implementation of an RO-Crate metadata JSON exporter, as a followup to this issue: #8688
There are two strategies to generate ro-crate-metadata.json:
The RO-Crate spec suggests the use of Schema.org for describing datasets (https://www.researchobject.org/ro-crate/1.1/metadata.html#base-metadata-standard-schemaorg). For this we need to map the Dataverse dataset metadata fields to Schema.org properties. The advantage of this approach is that the created ro-crate-metadata.json could be interpreted by all RO-Crate tools using Schema.org. The disadvantage, however, is that there could be data fields in a DV dataset that cannot be mapped to a Schema.org property, which would lead to a lossy export.
The RO-Crate spec allows the use of alternative schema/vocabularies as well:
This opens up the possibility to use Dataverse's metadatablocks as our schemas/vocabularies and generate the ro-crate-metadata.json accordingly.
This PR implements RO-Crate metadata generation the second way, based on metadatablocks, thus exporting all available data of a dataset.
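To make the trade-off between the two strategies concrete, here is a rough, self-contained sketch of why a pure Schema.org mapping is lossy. The mapping table and field names are an illustrative subset, not Dataverse's actual mapping:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaOrgMappingSketch {

    // Illustrative subset of a DV-field -> schema.org mapping (assumed, not exhaustive).
    static final Map<String, String> FIELD_TO_SCHEMA_ORG = Map.of(
            "title", "https://schema.org/name",
            "dsDescription", "https://schema.org/description",
            "keyword", "https://schema.org/keywords");

    // Fields without a schema.org equivalent are silently dropped: this is the
    // lossiness of the first strategy. The MDB-based strategy keeps every field,
    // because the metadatablock itself supplies the vocabulary.
    static Map<String, Object> toSchemaOrg(Map<String, Object> dvFields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : dvFields.entrySet()) {
            String term = FIELD_TO_SCHEMA_ORG.get(e.getKey());
            if (term != null) {
                out.put(term, e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> dv = new LinkedHashMap<>();
        dv.put("title", "My dataset");
        dv.put("unitOfAnalysis", "household"); // socialscience field, no schema.org term
        System.out.println(toSchemaOrg(dv)); // the socialscience field is lost
    }
}
```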
Known issues:
RO-Crate and JSON-LD require the specification of URIs for properties in `@context`. For this to work, every metadatablock field must have a URI associated with it. In the case of the citation MDB, some fields have an explicitly associated URI, e.g. http://purl.org/dc/terms/title for the "title" field, and some fields have an implicit URI based on the MDB's `namespaceUri` + the field's name. However, other MDBs like geospatial, socialscience, etc. have neither explicit URIs nor a `namespaceUri` associated. This results in `null` values in the `@context`. This can be easily overcome by associating a `namespaceUri` with every MDB, similar to the citation MDB.

Suggestions on how to test this:
The easiest way to test it is to run Dataverse in Docker, following the container guide.
Once a dataset is created and published the "RO-Crate (ARP style)" button appears under "Export Metadata":
Clicking that generates the RO-Crate metadata JSON.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No, it just adds a new item to the "Export Metadata" dropdown.
Additional documentation:
The code for this implementation has been extracted from the ARP project in the framework of FAIR-IMPACT's 1st Open Call "Enabling FAIR Signposting and RO-Crate for content/metadata discovery and consumption" by SZTAKI DSD (Department of Distributed Systems).