As a researcher, I want my provenance to be made available to other researchers at publish time so that they can make better use of my data #4381

Closed
djbrooke opened this issue Dec 13, 2017 · 11 comments

Comments

@djbrooke
Contributor

djbrooke commented Dec 13, 2017

In #4343, we've added the ability to add Provenance files/freetext to Dataverse.

When this story is done, we'd expect those files to be sent to the provenance system and saved.

@djbrooke
Contributor Author

ping @MKLau

@jacksonokuhn

On the upload page, can we refer to the provenance as "Documents" rather than "Bundles"? We're storing the data as bundles on the backend, but they're gonna be uploaded as standard PROV-JSON docs. The current wording may confuse people.

@mheppler
Contributor

@jacksonokuhn Your suggested revisions to the label/help text in the UI have been committed by @matthew-a-dunlap as part of issue #4343 (commit b058423). @dlmurphy was going to address the User Guide revisions in that branch as well.

@dlmurphy
Contributor

Update: instead of "bundle file", we're using "provenance file" as our terminology to describe the JSON file that users upload which contains provenance information about their data.

@mheppler
Contributor

mheppler commented Jan 30, 2018

Updated the mockups link in the first comment from @djbrooke to point to the mockups that include the Provenance tab on the file page, as opposed to the file upload workflow, which is covered in issue #4343.

Included in that 4343 issue is a comment of mine that outlines "what we are building". Here is an outline for this preview issue.

What we are building

File Page

  • New Provenance tab, in between Metadata and Versions
  • Add Provenance button that links to Provenance popup
  • Provenance option in Edit button dropdown menu that links to Provenance popup
  • "Why Use Provenance?" info text if provenance does not exist
  • Info text if provenance exists
  • Description label and provenance free-form text value displayed if it exists
  • Graph (from info in Provenance system, after publish) if provenance info exists from JSON file upload and/or system transformations
  • Export Provenance button if provenance exists

(See issue #4345 for pre-publish provenance preview.)

@jacksonokuhn

jacksonokuhn commented Feb 11, 2018

Here's basically what needs to happen on publish, as a series of API calls:

  • create a bundle: POST provapi/bundle
  • create an activity called something like "[dataset-id]-publish": POST provapi/object
  • add the property "startedAtTime" to that activity: POST provapi/object//property
  • lookup or create the user who's publishing the dataset: POST provapi/lookup/object/ then POST provapi/object if not found
  • connect the publish activity to that user (type: "wasAssociatedWith"): POST provapi/object//relation

for each datafile:

  • create an object called "[datafile-id]" for each datafile that isn't in the prov system: POST provapi/object

  • connect each of those datafiles to the publish activity ("wasGeneratedBy"): POST provapi/object//relation

  • add the property "generatedAtTime" to each datafile: POST provapi/object//property

    for any datafile that's a replacement:

    • lookup or create the previous datafile: POST provapi/lookup/object/ then POST provapi/object if not found
    • connect the current datafile to its predecessor (type: "wasDerivedFrom"): POST provapi/object//relation

    for any datafile that was changed on ingest:

    • create an activity called "[datafile-id]-ingest": POST provapi/object
    • add properties representing what was done on ingest to the activity (not sure what this should be yet)
    • connect the datafile to that activity (type "wasGeneratedBy"): POST provapi/object//relation
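
The sequence above can be sketched as a function that returns the ordered list of calls without contacting any server. This is only an illustration under assumptions: the `DataFile` fields and the name `plan_publish_calls` are hypothetical, and the paths are copied verbatim from the comment, including the id segments that are elided there. Each lookup-or-create step is listed as two calls, although the create would presumably be skipped when the lookup succeeds.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Call = Tuple[str, str, str]  # (HTTP method, path as written in the comment, note)

OBJECT = "provapi/object"
PROPERTY = "provapi/object//property"  # id segment is elided in the comment
RELATION = "provapi/object//relation"  # id segment is elided in the comment
LOOKUP = "provapi/lookup/object/"      # lookup key is elided in the comment


@dataclass
class DataFile:
    # Hypothetical stand-in for a Dataverse DataFile; field names are assumptions.
    file_id: str
    in_prov_system: bool = False
    replaces: Optional[str] = None    # id of the previous datafile, if any
    changed_on_ingest: bool = False


def plan_publish_calls(dataset_id: str, user: str,
                       files: List[DataFile]) -> List[Call]:
    """Return the calls in the order listed in the comment above."""
    calls: List[Call] = [
        ("POST", "provapi/bundle", "create bundle"),
        ("POST", OBJECT, f"create activity {dataset_id}-publish"),
        ("POST", PROPERTY, "startedAtTime"),
        ("POST", LOOKUP, f"look up user {user}"),
        ("POST", OBJECT, f"create user {user} if not found"),
        ("POST", RELATION, "wasAssociatedWith"),
    ]
    for f in files:
        if not f.in_prov_system:
            calls.append(("POST", OBJECT, f"create object {f.file_id}"))
        calls.append(("POST", RELATION, "wasGeneratedBy"))
        calls.append(("POST", PROPERTY, "generatedAtTime"))
        if f.replaces is not None:
            calls.append(("POST", LOOKUP, f"look up {f.replaces}"))
            calls.append(("POST", OBJECT, f"create {f.replaces} if not found"))
            calls.append(("POST", RELATION, "wasDerivedFrom"))
        if f.changed_on_ingest:
            calls.append(("POST", OBJECT, f"create activity {f.file_id}-ingest"))
            # Ingest properties are still undecided in the comment, so none are planned.
            calls.append(("POST", RELATION, "wasGeneratedBy"))
    return calls
```

Keeping the plan separate from the HTTP layer would let the ordering be checked in a unit test before any of it is wired into the publish command.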

@jacksonokuhn

jacksonokuhn commented Feb 11, 2018

To connect the prov generated by publish to user-uploaded prov, we do the following.
If the datafile was not changed on ingest:

  • upload the json with anchor_objects:{[id: internal id of datafile, name: name of datafile in uploaded json]}

If the datafile was changed on ingest:

  • upload the json with no anchor objects
  • lookup the datafile in the json
  • connect the ingest activity to the datafile (type "used")
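
The two branches above can be sketched as a small helper that decides what to send for one datafile. Everything about the request shape here is an assumption: the `prov_json` key, the reading of `anchor_objects` as a list of `id`/`name` mappings, the empty list standing for "no anchor objects", and the name `build_upload` are all hypothetical, since the comment only gives the outline.

```python
from typing import Any, Dict


def build_upload(prov_json: Dict[str, Any], datafile_id: str,
                 name_in_json: str, changed_on_ingest: bool) -> Dict[str, Any]:
    """Build the upload body and any follow-up steps for one datafile."""
    if not changed_on_ingest:
        # Anchor the uploaded document directly to the published datafile.
        return {
            "body": {
                "prov_json": prov_json,  # hypothetical key name
                "anchor_objects": [{"id": datafile_id, "name": name_in_json}],
            },
            "follow_up": [],
        }
    # Changed on ingest: upload without anchors, then link via the ingest activity.
    return {
        "body": {"prov_json": prov_json, "anchor_objects": []},
        "follow_up": [
            ("lookup", name_in_json),
            ("relate", f"{datafile_id}-ingest", "used", name_in_json),
        ],
    }
```

For example, `build_upload(doc, "42", "data.csv", False)` yields a body anchored to datafile 42 and no follow-up steps, while passing `True` yields an unanchored body plus the lookup and the "used" relation.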

@matthew-a-dunlap
Contributor

During our weekly technical meeting we identified an additional requirement for how Dataverse communicates with the provenance system (CPL): CPL needs us to combine our uploads of provenance JSON into one call whenever the provenance JSON is the same. To do this, we will need to look at all the provenance JSON being sent to CPL at publish time and check whether any of the bundles match. Where they do match, we will need to tell CPL that the bundle points to multiple DataFiles and provide those files in a list. (Note: this description is a tad vague, as I do not understand PROV syntax and the CPL APIs well enough to give better details.)
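
The matching step described above could be sketched as grouping DataFiles by a digest of their provenance JSON. This is a sketch under assumptions: "the same" is taken to mean equal after normalizing key order, which may not be how CPL defines a match, and `canonical_digest` and `group_by_prov` are hypothetical names.

```python
import hashlib
import json
from typing import Any, Dict, List


def canonical_digest(prov: Any) -> str:
    """Hash a provenance document independently of key order and whitespace."""
    text = json.dumps(prov, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def group_by_prov(prov_by_file: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Group DataFile ids whose provenance JSON matches, one entry per upload."""
    groups: Dict[str, Dict[str, Any]] = {}
    for file_id, prov in prov_by_file.items():
        entry = groups.setdefault(
            canonical_digest(prov), {"prov": prov, "datafiles": []}
        )
        entry["datafiles"].append(file_id)
    return list(groups.values())
```

Each returned entry would then correspond to a single upload call carrying the document once, along with the list of DataFiles it applies to.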

Someday this could be improved with a UI enhancement that lets the user select, for one DataFile, a prov.json file already uploaded for another file.

Open questions:

  • What happens if a user uploaded a prov JSON file in the last published version of the dataset and then needs to use that same JSON file with new DataFiles in a subsequent dataset version? Is there something special we need to do in this case, or can CPL handle us re-uploading the same JSON?

@pdurbin
Member

pdurbin commented Feb 16, 2018

Hopefully b23c396 gives a sense of what I've been working on lately, which is the "automatic" creation of prov data based on normal, boring interactions that Dataverse has supported since the beginning of time. That is to say, all I'm doing is creating a dataset, uploading a file, and publishing the dataset (all in ProvIT.java). The difference is that I stubbed out a new API endpoint that is supposed to represent what happens when the user clicks "Export Provenance" as in the screenshot below from #4346, which is that some JSON is supposed to be downloaded. You will notice many FIXMEs in the code because I'm not sure how the JSON is supposed to look. @jacksonokuhn will be able to advise on this.

[Screenshot: export-prov]

After looking at that commit and seeing that I was starting to implement something that's different than what @jacksonokuhn had in mind, he updated his comment at #4381 (comment) to put in more detail (thanks!) of the REST API endpoints that should be called into from the FinalizeDatasetPublicationCommand. Those endpoints are documented at https://github.com/ProvTools/prov-cpl/blob/master/bindings/python/RestAPI/rest-docs.txt and on the Dataverse side I'm using ProvenanceRestServiceBeanIT.java to test calling into them statically before moving the logic into the publish command above.

To work on this issue, one must have the prov system running, which @matthew-a-dunlap and @sekmiller have done following instructions in pull request #4461. In #4364 @jacksonokuhn is working on moving the config from the Dataverse git repo to the prov-cpl repo, but for now the configs in the pull request work fine. You download a couple of files and spin up an Ubuntu VM in Vagrant running prov.

I'm on vacation next week so I'll take myself off this issue.

@pdurbin removed their assignment Feb 16, 2018

@jacksonokuhn

So @matthew-a-dunlap and I were talking and it looks like there was a bit of confusion around order of operations. We should generate the automatic publish PROV and THEN upload the PROV bundle. I can talk about it more with whoever is working on this.

@djbrooke
Contributor Author

Closing for now; will reopen if we decide to take a similar approach in further provenance work.
