GDCC/8449 use original files in archival bag #8901
Conversation
Just a quick note to capture some results from exporting from develop (04bfd3d) vs. this pull request (47c570c, called "archive" below). The doi-10-5072-fk2-tgbhlb-datacite.v1.0.xml file is identical. The zip files differ. My example is a little odd because the original file is a TSV, but the TSV (the original) is indeed in the bag as of this pull request and the .tab (preservation file) is out. Non-tabular files (text files, in my case) are unaffected. Below I show comparisons of the following files.

Here are the bags/zips of "develop" and "archive" (this PR) if someone (@shlake ?) wants to inspect them. Testing was a bit challenging because the API endpoint changed, both the verb and path. In develop it's …

Which files changed:

$ diff -Naur --brief develop archive

How the oai-ore.jsonld file changed

How the bag-info.txt, manifest-md5.txt, and pid-mapping.txt files changed

FWIW, in terms of file changes between the two bags:
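(For reference, a sketch of how such a comparison can be reproduced locally, assuming the two exported bags were unzipped into develop/ and archive/; the zip names and the metadata/oai-ore.jsonld path inside the bag are my assumptions, not taken from the PR:)

$ unzip -q develop-bag.zip -d develop
$ unzip -q archive-bag.zip -d archive
# Which files changed
$ diff -Naur --brief develop archive
# How the OAI-ORE map changed
$ diff -Naur develop/metadata/oai-ore.jsonld archive/metadata/oai-ore.jsonld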
Also allow strings as setting keys because :BagItLocalPath isn't in the usual enum.
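(For local testing, a setting like :BagItLocalPath can be set through the admin settings API in the usual way; a minimal sketch, with the directory path just an example:)

$ curl -X PUT -d /tmp/bags http://localhost:8080/api/admin/settings/:BagItLocalPath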
The code looks fine and, more importantly, seems to work fine. See my detailed comment showing a diff of before and after bags.
I added a very basic BagIT API test and Jenkins executed it. There should be a bag in /tmp on the EC2 instance. Here's the output of the test: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-8901/6/testReport/edu.harvard.iq.dataverse.api/BagIT/testBagItExport/
Moving to QA.
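(To reproduce that test run locally, the new test class can be invoked with Maven; a sketch, assuming the usual Dataverse API test setup with a deployed instance reachable by the tests, typically on localhost:8080:)

$ mvn test -Dtest=BagIT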
I've been using this PR to learn/figure out the QA process itself. For all practical purposes, @pdurbin had already QA-ed the functionality provided in this PR during his code review, before he approved it (see his very detailed notes above). So it would have been safe enough to just click "merge" without further ado... In other words, a good PR to use as a learning one. My process was as follows:
@qqmyers You mentioned in the PR description that you were going to add something to the 5.12 release note. Was it about running a re-export, because of the OAI_ORE format changes? We will have at least one other PR in 5.12 that comes with a re-export, so that part will be covered.

@qqmyers perfect.

Jenkins tests have passed, so I think this is ready to be merged.
What this PR does / why we need it: This PR updates the OAI-ORE metadata export to include a consistent set of name/mimetype, size, download URL, and checksum values that all refer to the original file for tabular/ingested files.

As noted in the issue, the current OAI_ORE export, and hence archival bags as well, includes the checksum of the original file but, for tabular/ingested files, uses the name/size/mimetype and download URL (which is used to generate the bag) of the ingested version. This makes the OAI_ORE internally inconsistent and makes it impossible to use the checksum to validate the bag contents.
Which issue(s) this PR closes:
Closes #8449
Special notes for your reviewer: This is a relatively straightforward change to the OAI-ORE generator class. It doesn't address the broader issue(s) related to moving the original file upon ingest, as recently discussed in tech hour. The part that's relevant here is that we don't change/calculate the checksum of the ingested version, so in addition to it being reasonable to include the original file in an archival bag (since ingest would occur if/when you read the bag back in), we don't have a checksum to use with the ingested version anyway. (In making the fix here, I also realized there's a minor issue in that the per-version filenames exist only for the ingested version, and the original file download only has the original name; i.e., the original-format download for 'newfilename1.tab' in v2 of a dataset is 'oldname.csv' rather than 'newfilename1.csv'. Not sure if that's worth an issue or is something I can address when refactoring how ingested files are stored for the remote overlay post-5.12 work.)
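(The filename quirk above can be seen via the access API; a sketch, with $FILE_ID standing in for the database id of a tabular file that was renamed after ingest. -OJ makes curl save under the filename the server sends:)

# Ingested version downloads as, e.g., newfilename1.tab
$ curl -OJ "http://localhost:8080/api/access/datafile/$FILE_ID"
# Original downloads under the original upload name, e.g. oldname.csv
$ curl -OJ "http://localhost:8080/api/access/datafile/$FILE_ID?format=original"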
Suggestions on how to test this: Publish datasets with a mix of normal and ingested files. Verify that the OAI_ORE output and an archival bag (e.g. using the File Archiver) include the original file/metadata including name/mimetype/size/checksum for ingested files and include the name from the latest version, mimetype/size/checksum for normal files. Also verify that the schema:sameAs link in the OAI-ORE includes '&format=original' for the ingested files (so if you use it, you'll get the original file downloaded).
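(A quick way to spot-check those sameAs links, sketched with the standard metadata export API; the PID below is the test dataset from the comparison earlier in this thread:)

$ export PID=doi:10.5072/FK2/TGBHLB
# List the schema:sameAs download URLs in the OAI_ORE export;
# entries for ingested files should carry &format=original
$ curl -s "http://localhost:8080/api/datasets/export?exporter=OAI_ORE&persistentId=$PID" | grep -o '"schema:sameAs": *"[^"]*"'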
Does this PR introduce a user interface change? If mockups are available, please link/include them here: no
Is there a release notes update needed for this change?: Will include in the main release note file.
[All the release notes for HDC 1 and 3A/3B, including this PR, are combined in #8894 - L.A.]
Additional documentation: