GDCC/8449 use original files in archival bag #8901

qqmyers · 2022-08-06T17:35:57Z

What this PR does / why we need it: This PR updates the OAI-ORE metadata export to include a consistent set of name/mimetype, size, download URL, and checksum values that are for

non-tabular/not ingested: the file as is in the current version
tabular/ingested: the original file.

As noted in the issue, the current OAI_ORE, and hence archival bags as well include the checksum of the original file but, for tabular/ingested files, use the name/size/mimetype, and download URL (which is used to generate the bag) of the ingested version which makes the OAI_ORE inconsistent and makes it impossible to use the checksum to validate the bag contents.

Which issue(s) this PR closes:

Closes #8449

Special notes for your reviewer: This is a relatively straight-forward change to the OAI-ORE generator class. It doesn't address the broader issue(s) related to moving the original file upon ingest as recently discussed in tech hour. The part that's relevant here is that we don't change the checksum/don't calculate the checksum of the ingested version, so in addition to it being reasonable to include the original file in an archival bag (since ingest would occur if/when you read the bag back in), we don't have a checksum to use with the ingested version anyway. (In making the fix here, I also realized that there's also a minor issue in that the per-version filenames are only for the ingested version and the original file download only has the original name, i.e. the original download button for 'newfilename1.tab' in v2 of a dataset is 'oldname.csv' (versus newfilename1.csv). - not sure if that's worth an issue or is something I can address when refactoring how ingested files are stored for the remote overalay post 5.12 work).

Suggestions on how to test this: Publish datasets with a mix of normal and ingested files. Verify that the OAI_ORE output and an archival bag (e.g. using the File Archiver) include the original file/metadata including name/mimetype/size/checksum for ingested files and include the name from the latest version, mimetype/size/checksum for normal files. Also verify that the schema:sameAs link in the OAI-ORE includes '&format=original' for the ingested files (so if you use it, you'll get the original file downloaded).

Does this PR introduce a user interface change? If mockups are available, please link/include them here: no

Is there a release notes update needed for this change?: will include in main release note file.
[all the release notes for HDC 1 and 3A/3B, including this PR are combined in #8894 - L.A.]

Additional documentation:

pdurbin · 2022-08-09T19:21:43Z

Just a quick note to capture some results from exporting from develop (04bfd3d) vs. this pull request (47c570c, called "archive" below).

The doi-10-5072-fk2-tgbhlb-datacite.v1.0.xml file is identical.

The zip files differ. My example is a little weird because the original file is a tsv but it looks like indeed the tsv (original) is in the bag as of this pull request and the tab (preservation file) is out. Non-tabular files (text files, in my case) are unaffected.

Below I show comparisons of the following files:

oai-ore.jsonld
baginfo.txt
manifest-md5.txt
pid-mapping.txt

Here are the bags/zips of "develop" and "archive" (this PR) if someone (@shlake ?) wants to inspect them.

Testing was a bit challenging because the API endpoint changed, both the verb and path.

In develop it's .post("/api/admin/submitDatasetVersionToArchive/"
In this PR it's .get("/api/admin/submitDataVersionToArchive/"

Which files changed

$ diff -Naur --brief develop archive
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/bag-info.txt and archive/doi-10-5072-FK2-TGBHLBv-1-0/bag-info.txt differ
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/data/2019-02-25.tab and archive/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/data/2019-02-25.tab differ
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/data/2019-02-25.tsv and archive/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/data/2019-02-25.tsv differ
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/results/language-count.tab and archive/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/results/language-count.tab differ
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/results/language-count.tsv and archive/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/results/language-count.tsv differ
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/manifest-md5.txt and archive/doi-10-5072-FK2-TGBHLBv-1-0/manifest-md5.txt differ
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/metadata/oai-ore.jsonld and archive/doi-10-5072-FK2-TGBHLBv-1-0/metadata/oai-ore.jsonld differ
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/metadata/pid-mapping.txt and archive/doi-10-5072-FK2-TGBHLBv-1-0/metadata/pid-mapping.txt differ

How the oai-ore.jsonld file changed

$ diff <(cat develop/doi-10-5072-FK2-TGBHLBv-1-0/metadata/oai-ore.jsonld | jq .) <(cat archive/doi-10-5072-FK2-TGBHLBv-1-0/metadata/oai-ore.jsonld | jq .)

52c52
<         "schema:name": "2019-02-25.tab",
---
>         "schema:name": "2019-02-25.tsv",
61c61
<         "schema:sameAs": "http://localhost:8080/api/access/datafile/8",
---
>         "schema:sameAs": "http://localhost:8080/api/access/datafile/8?format=original",
63,64c63,64
<         "schema:fileFormat": "text/tab-separated-values",
<         "dvcore:filesize": 17232,
---
>         "schema:fileFormat": "text/tsv",
>         "dvcore:filesize": 17307,
66,67c66
<         "dvcore:originalFileFormat": "text/tsv",
<         "dvcore:originalFormatLabel": "Tab-Separated Values",
---
>         "dvcore:currentIngestedName": "2019-02-25.tab",
121c120
<         "schema:name": "language-count.tab",
---
>         "schema:name": "language-count.tsv",
131c130
<         "schema:sameAs": "http://localhost:8080/api/access/datafile/6",
---
>         "schema:sameAs": "http://localhost:8080/api/access/datafile/6?format=original",
133,134c132,133
<         "schema:fileFormat": "text/tab-separated-values",
<         "dvcore:filesize": 203,
---
>         "schema:fileFormat": "text/tsv",
>         "dvcore:filesize": 180,
136,137c135
<         "dvcore:originalFileFormat": "text/tsv",
<         "dvcore:originalFormatLabel": "Tab-Separated Values",
---
>         "dvcore:currentIngestedName": "language-count.tab",

How the baginfo.txt, manifest-md5.txt, and pid-mapping.txt files changed

diff -Naur develop/doi-10-5072-FK2-TGBHLBv-1-0/bag-info.txt archive/doi-10-5072-FK2-TGBHLBv-1-0/bag-info.txt
--- develop/doi-10-5072-FK2-TGBHLBv-1-0/bag-info.txt	2022-08-09 14:36:20.000000000 -0400
+++ archive/doi-10-5072-FK2-TGBHLBv-1-0/bag-info.txt	2022-08-09 14:59:46.000000000 -0400
@@ -12,6 +12,6 @@
  open-source-at-harvard</a>.
 Bagging-Date: 2022-08-09
 External-Identifier: https://doi.org/10.5072/FK2/TGBHLB
-Bag-Size: 17.79 KB
-Payload-Oxum: 17789.5
+Bag-Size: 17.84 KB
+Payload-Oxum: 17841.5
 Internal-Sender-Identifier: Root:Open Source at Harvard
diff -Naur develop/doi-10-5072-FK2-TGBHLBv-1-0/manifest-md5.txt archive/doi-10-5072-FK2-TGBHLBv-1-0/manifest-md5.txt
--- develop/doi-10-5072-FK2-TGBHLBv-1-0/manifest-md5.txt	2022-08-09 14:36:20.000000000 -0400
+++ archive/doi-10-5072-FK2-TGBHLBv-1-0/manifest-md5.txt	2022-08-09 14:59:46.000000000 -0400
@@ -1,5 +1,5 @@
-8af47fd3e16134e1aa4700fbfc48ff50 data/Open Source at Harvard/data/2019-02-25.tab
+8af47fd3e16134e1aa4700fbfc48ff50 data/Open Source at Harvard/data/2019-02-25.tsv
 815c7d68f7de886fc5df715c3e4f8a80 data/Open Source at Harvard/doc/README.md
 4cbc5116116ee776466b613e77693c2f data/Open Source at Harvard/code/language-count.sh
-0521aec1df05efd6733b8428912095bb data/Open Source at Harvard/results/language-count.tab
+0521aec1df05efd6733b8428912095bb data/Open Source at Harvard/results/language-count.tsv
 1ff97714bdeaf41908c05a2271bd5afa data/Open Source at Harvard/run.sh
diff -Naur develop/doi-10-5072-FK2-TGBHLBv-1-0/metadata/pid-mapping.txt archive/doi-10-5072-FK2-TGBHLBv-1-0/metadata/pid-mapping.txt
--- develop/doi-10-5072-FK2-TGBHLBv-1-0/metadata/pid-mapping.txt	2022-08-09 14:36:20.000000000 -0400
+++ archive/doi-10-5072-FK2-TGBHLBv-1-0/metadata/pid-mapping.txt	2022-08-09 14:59:46.000000000 -0400
@@ -1,6 +1,6 @@
 https://doi.org/10.5072/FK2/TGBHLB data/Open Source at Harvard/
-http://localhost:8080/file.xhtml?fileId=8 data/Open Source at Harvard/data/2019-02-25.tab
+http://localhost:8080/file.xhtml?fileId=8 data/Open Source at Harvard/data/2019-02-25.tsv
 http://localhost:8080/file.xhtml?fileId=7 data/Open Source at Harvard/doc/README.md
 http://localhost:8080/file.xhtml?fileId=5 data/Open Source at Harvard/code/language-count.sh
-http://localhost:8080/file.xhtml?fileId=6 data/Open Source at Harvard/results/language-count.tab
+http://localhost:8080/file.xhtml?fileId=6 data/Open Source at Harvard/results/language-count.tsv
 http://localhost:8080/file.xhtml?fileId=4 data/Open Source at Harvard/run.sh

qqmyers · 2022-08-09T19:31:53Z

FWIW: In terms of file changes between two bags:

bag-info.txt contains the Bagging-Date and Bag-Size which change with when the bagging is done and what files are included.
manfiest-md5.txt and pid-mapping.txt contain file names and so will change when .tsv vs .tab extensions are used
I think the other changes above would be the direct changes to the included files and the oai-ore metadata file itself.

…al_files_in_archival_bag

coveralls · 2022-08-09T19:41:33Z

Coverage decreased (-0.003%) to 20.045% when pulling cf2e996 on GlobalDataverseCommunityConsortium:GDCC/8449-use_original_files_in_archival_bag into 6ca1c9d on IQSS:develop.

…al_files_in_archival_bag

…//github.com/GlobalDataverseCommunityConsortium/dataverse.git into GDCC/8449-use_original_files_in_archival_bag

Also allow strings as setting keys because :BagItLocalPath isn't in the usual enum.

pdurbin

The code looks fine and, more importantly, seems to work fine. See my detailed comment showing a diff of before and after bags.

I added a very basic BagIT API test and Jenkins executed it. There should be a bag in /tmp on the EC2 instance. Here's the output of the test: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-8901/6/testReport/edu.harvard.iq.dataverse.api/BagIT/testBagItExport/

Moving to QA.

…al_files_in_archival_bag

…//github.com/GlobalDataverseCommunityConsortium/dataverse.git into GDCC/8449-use_original_files_in_archival_bag

…al_files_in_archival_bag

landreev · 2022-08-24T01:09:59Z

I've been using this PR to learn/figure out the QA process itself. The functionality provided in this PR, for all practical purposes @pdurbin had QA-ed it all during his code review before he approved it (see his very detailed notes above). So it would have been safe enough to just click "merge" without further ado... In other words, a good PR to use as a learning one.

My process was as follows:
(the steps are all standard, I'm not doing anything outside of Kevin's normal process. but this describes the order in which I was going about it, and my thought process and questions/ideas about it)

Read the PR description and the notes, the one for the review and the "how to test" one;
Read all the comments added during the review (which, once again, included a detailed report of testing the functionality; something we normally don't do in reviews);
Checked if any user documentation was changed/added as part of the PR (none);
Checked if there are any configuration/upgrade instructions (none);
Looked at the code files changed in the PR; didn't try to read/re-review the code, focused solely on the scope of the changes/tried to make an assessment of the likelihood of the changes in the PR breaking something elsewhere in the application, outside of the BagIT functionality proper. (in my professional opinion, very low). I feel like this is a super important step, because based on this conclusion you decide how much of the functionality outside of the immediate feature(s) covered by the PR to re-test.
Checked out the branch, made sure it builds on my own dev. environment.
I assumed it wasn't necessary to run all the restassured tests on my local environment. But I made sure the test for the functionality in question passed (thanks to Phil for adding the test during review!).
Tested the output of the api "before" and "after" on my own build; compared the results. Basically made sure to replicate Phil's result above. (everything checks out; minus the difference in the method and path he describes, which is no longer there).
Tested the rest of the application. Based on the conclusion in 4. above only ran a limited smoke test (Kevin's standard test mostly with a few extra clicks).
Deployed the war file on dataverse-internal; repeated the steps 8. and 9. there.
Checked on the status of the restassured tests run in the PR. This may have been a mistake - I'm thinking next time I'll start with this step; and use the war file built by Jenkins for all the further testing, if immediately available (instead of my own-built war file). With this PR, it turned out that the automatic Jenkins run that was triggered after the last commit 5 days ago failed to build; so there was no war file. It looked like that was a Jenkins hickup. Asked Don to rerun the tests; it built this time around - but then died for some other reason - it still looks like it's unrelated to the contents of the PR itself; we will figure this part out tomorrow.
Based on the testing above still have one minor question to ask the developer.

landreev · 2022-08-24T01:18:40Z

@qqmyers You mentioned in the PR description that you were going to add something to the 5.12 release note. Was it about running a re-export, because of the OAI_ORE format changes? We will have at least one other PR in 5.12 that comes with a reexport, so that part will be covered.
If there's anything else to include in the release doc, I would propose, just for the sake of process, that we always include a note under doc/release-notes when needed - even if it's just 1 sentence that says "this one will need an upgrade instruction, will add in the main release doc". Just to make sure we don't forget.

qqmyers · 2022-08-24T11:26:14Z

@landreev - In the #8894 release notes PR, lines 17 and 33 involve the functionality from this PR and the upgrade instructions include reexporting. If there's anything else to add, let me know.

landreev · 2022-08-24T15:03:26Z

@qqmyers perfect.

landreev · 2022-08-24T15:27:27Z

Jenkins tests have passed. So I think this is ready to be merged.

qqmyers added 3 commits August 6, 2022 13:18

8449 switch to original

8b8ee12

fix ORE file DOI case

277c848

use original name/mimetype/size if ingested

47c570c

qqmyers added HDC Harvard Data Commons HDC: 3a Harvard Data Commons Obj. 3A labels Aug 6, 2022

scolapasta added this to the 5.12 milestone Aug 8, 2022

pdurbin self-assigned this Aug 9, 2022

Merge remote-tracking branch 'IQSS/develop' into GDCC/8449-use_origin…

41105ba

…al_files_in_archival_bag

pdurbin mentioned this pull request Aug 9, 2022

Release Notes related to HDC 1 and 3A/3B #8894

Merged

pdurbin and others added 4 commits August 9, 2022 16:36

add test to (locally) exercise submitDatasetVersionToArchive IQSS#8449

bf35115

Merge remote-tracking branch 'IQSS/develop' into GDCC/8449-use_origin…

970456c

…al_files_in_archival_bag

Merge branch 'GDCC/8449-use_original_files_in_archival_bag' of https:…

a9e4802

…//github.com/GlobalDataverseCommunityConsortium/dataverse.git into GDCC/8449-use_original_files_in_archival_bag

add BagIT to API test suite IQSS#8449

6210a71

Also allow strings as setting keys because :BagItLocalPath isn't in the usual enum.

pdurbin approved these changes Aug 10, 2022

View reviewed changes

pdurbin removed their assignment Aug 10, 2022

landreev self-assigned this Aug 11, 2022

qqmyers added 3 commits August 12, 2022 18:11

Merge remote-tracking branch 'IQSS/develop' into GDCC/8449-use_origin…

39f8b89

…al_files_in_archival_bag

Merge branch 'GDCC/8449-use_original_files_in_archival_bag' of https:…

932c941

…//github.com/GlobalDataverseCommunityConsortium/dataverse.git into GDCC/8449-use_original_files_in_archival_bag

Merge remote-tracking branch 'IQSS/develop' into GDCC/8449-use_origin…

cf2e996

…al_files_in_archival_bag

landreev merged commit 1ef73b0 into IQSS:develop Aug 24, 2022

pdurbin mentioned this pull request Aug 25, 2022

test failure in develop: HarvestingServerIT.testOaiFunctionality #8937

Closed

qqmyers mentioned this pull request Jan 10, 2023

Client-side multifile zip download #9245

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GDCC/8449 use original files in archival bag #8901

GDCC/8449 use original files in archival bag #8901

qqmyers commented Aug 6, 2022 •

edited by landreev

Loading

pdurbin commented Aug 9, 2022 •

edited

Loading

qqmyers commented Aug 9, 2022

coveralls commented Aug 9, 2022 •

edited

Loading

pdurbin left a comment •

edited

Loading

landreev commented Aug 24, 2022

landreev commented Aug 24, 2022

qqmyers commented Aug 24, 2022

landreev commented Aug 24, 2022

landreev commented Aug 24, 2022

GDCC/8449 use original files in archival bag #8901

GDCC/8449 use original files in archival bag #8901

Conversation

qqmyers commented Aug 6, 2022 • edited by landreev Loading

pdurbin commented Aug 9, 2022 • edited Loading

Which files changed

How the oai-ore.jsonld file changed

How the baginfo.txt, manifest-md5.txt, and pid-mapping.txt files changed

qqmyers commented Aug 9, 2022

coveralls commented Aug 9, 2022 • edited Loading

pdurbin left a comment • edited Loading

Choose a reason for hiding this comment

landreev commented Aug 24, 2022

landreev commented Aug 24, 2022

qqmyers commented Aug 24, 2022

landreev commented Aug 24, 2022

landreev commented Aug 24, 2022

qqmyers commented Aug 6, 2022 •

edited by landreev

Loading

pdurbin commented Aug 9, 2022 •

edited

Loading

coveralls commented Aug 9, 2022 •

edited

Loading

pdurbin left a comment •

edited

Loading