Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GDCC/8449 use original files in archival bag #8901

Conversation

qqmyers
Copy link
Member

@qqmyers qqmyers commented Aug 6, 2022

What this PR does / why we need it: This PR updates the OAI-ORE metadata export to include a consistent set of name/mimetype, size, download URL, and checksum values that are for

  • non-tabular/not ingested: the file as is in the current version
  • tabular/ingested: the original file.

As noted in the issue, the current OAI_ORE, and hence archival bags as well include the checksum of the original file but, for tabular/ingested files, use the name/size/mimetype, and download URL (which is used to generate the bag) of the ingested version which makes the OAI_ORE inconsistent and makes it impossible to use the checksum to validate the bag contents.

Which issue(s) this PR closes:

Closes #8449

Special notes for your reviewer: This is a relatively straight-forward change to the OAI-ORE generator class. It doesn't address the broader issue(s) related to moving the original file upon ingest as recently discussed in tech hour. The part that's relevant here is that we don't change the checksum/don't calculate the checksum of the ingested version, so in addition to it being reasonable to include the original file in an archival bag (since ingest would occur if/when you read the bag back in), we don't have a checksum to use with the ingested version anyway. (In making the fix here, I also realized that there's also a minor issue in that the per-version filenames are only for the ingested version and the original file download only has the original name, i.e. the original download button for 'newfilename1.tab' in v2 of a dataset is 'oldname.csv' (versus newfilename1.csv). - not sure if that's worth an issue or is something I can address when refactoring how ingested files are stored for the remote overalay post 5.12 work).

Suggestions on how to test this: Publish datasets with a mix of normal and ingested files. Verify that the OAI_ORE output and an archival bag (e.g. using the File Archiver) include the original file/metadata including name/mimetype/size/checksum for ingested files and include the name from the latest version, mimetype/size/checksum for normal files. Also verify that the schema:sameAs link in the OAI-ORE includes '&format=original' for the ingested files (so if you use it, you'll get the original file downloaded).

Does this PR introduce a user interface change? If mockups are available, please link/include them here: no

Is there a release notes update needed for this change?: will include in main release note file.
[all the release notes for HDC 1 and 3A/3B, including this PR are combined in #8894 - L.A.]

Additional documentation:

@qqmyers qqmyers added HDC Harvard Data Commons HDC: 3a Harvard Data Commons Obj. 3A labels Aug 6, 2022
@scolapasta scolapasta added this to the 5.12 milestone Aug 8, 2022
@pdurbin pdurbin self-assigned this Aug 9, 2022
@pdurbin
Copy link
Member

pdurbin commented Aug 9, 2022

Just a quick note to capture some results from exporting from develop (04bfd3d) vs. this pull request (47c570c, called "archive" below).

The doi-10-5072-fk2-tgbhlb-datacite.v1.0.xml file is identical.

The zip files differ. My example is a little weird because the original file is a tsv but it looks like indeed the tsv (original) is in the bag as of this pull request and the tab (preservation file) is out. Non-tabular files (text files, in my case) are unaffected.

Below I show comparisons of the following files:

  • oai-ore.jsonld
  • baginfo.txt
  • manifest-md5.txt
  • pid-mapping.txt

Here are the bags/zips of "develop" and "archive" (this PR) if someone (@shlake ?) wants to inspect them.

Testing was a bit challenging because the API endpoint changed, both the verb and path.

In develop it's .post("/api/admin/submitDatasetVersionToArchive/"
In this PR it's .get("/api/admin/submitDataVersionToArchive/"

Which files changed

$ diff -Naur --brief develop archive
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/bag-info.txt and archive/doi-10-5072-FK2-TGBHLBv-1-0/bag-info.txt differ
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/data/2019-02-25.tab and archive/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/data/2019-02-25.tab differ
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/data/2019-02-25.tsv and archive/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/data/2019-02-25.tsv differ
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/results/language-count.tab and archive/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/results/language-count.tab differ
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/results/language-count.tsv and archive/doi-10-5072-FK2-TGBHLBv-1-0/data/Open Source at Harvard/results/language-count.tsv differ
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/manifest-md5.txt and archive/doi-10-5072-FK2-TGBHLBv-1-0/manifest-md5.txt differ
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/metadata/oai-ore.jsonld and archive/doi-10-5072-FK2-TGBHLBv-1-0/metadata/oai-ore.jsonld differ
Files develop/doi-10-5072-FK2-TGBHLBv-1-0/metadata/pid-mapping.txt and archive/doi-10-5072-FK2-TGBHLBv-1-0/metadata/pid-mapping.txt differ

How the oai-ore.jsonld file changed

$ diff <(cat develop/doi-10-5072-FK2-TGBHLBv-1-0/metadata/oai-ore.jsonld | jq .) <(cat archive/doi-10-5072-FK2-TGBHLBv-1-0/metadata/oai-ore.jsonld | jq .)

52c52
<         "schema:name": "2019-02-25.tab",
---
>         "schema:name": "2019-02-25.tsv",
61c61
<         "schema:sameAs": "http://localhost:8080/api/access/datafile/8",
---
>         "schema:sameAs": "http://localhost:8080/api/access/datafile/8?format=original",
63,64c63,64
<         "schema:fileFormat": "text/tab-separated-values",
<         "dvcore:filesize": 17232,
---
>         "schema:fileFormat": "text/tsv",
>         "dvcore:filesize": 17307,
66,67c66
<         "dvcore:originalFileFormat": "text/tsv",
<         "dvcore:originalFormatLabel": "Tab-Separated Values",
---
>         "dvcore:currentIngestedName": "2019-02-25.tab",
121c120
<         "schema:name": "language-count.tab",
---
>         "schema:name": "language-count.tsv",
131c130
<         "schema:sameAs": "http://localhost:8080/api/access/datafile/6",
---
>         "schema:sameAs": "http://localhost:8080/api/access/datafile/6?format=original",
133,134c132,133
<         "schema:fileFormat": "text/tab-separated-values",
<         "dvcore:filesize": 203,
---
>         "schema:fileFormat": "text/tsv",
>         "dvcore:filesize": 180,
136,137c135
<         "dvcore:originalFileFormat": "text/tsv",
<         "dvcore:originalFormatLabel": "Tab-Separated Values",
---
>         "dvcore:currentIngestedName": "language-count.tab",

How the baginfo.txt, manifest-md5.txt, and pid-mapping.txt files changed

diff -Naur develop/doi-10-5072-FK2-TGBHLBv-1-0/bag-info.txt archive/doi-10-5072-FK2-TGBHLBv-1-0/bag-info.txt
--- develop/doi-10-5072-FK2-TGBHLBv-1-0/bag-info.txt	2022-08-09 14:36:20.000000000 -0400
+++ archive/doi-10-5072-FK2-TGBHLBv-1-0/bag-info.txt	2022-08-09 14:59:46.000000000 -0400
@@ -12,6 +12,6 @@
  open-source-at-harvard</a>.
 Bagging-Date: 2022-08-09
 External-Identifier: https://doi.org/10.5072/FK2/TGBHLB
-Bag-Size: 17.79 KB
-Payload-Oxum: 17789.5
+Bag-Size: 17.84 KB
+Payload-Oxum: 17841.5
 Internal-Sender-Identifier: Root:Open Source at Harvard
diff -Naur develop/doi-10-5072-FK2-TGBHLBv-1-0/manifest-md5.txt archive/doi-10-5072-FK2-TGBHLBv-1-0/manifest-md5.txt
--- develop/doi-10-5072-FK2-TGBHLBv-1-0/manifest-md5.txt	2022-08-09 14:36:20.000000000 -0400
+++ archive/doi-10-5072-FK2-TGBHLBv-1-0/manifest-md5.txt	2022-08-09 14:59:46.000000000 -0400
@@ -1,5 +1,5 @@
-8af47fd3e16134e1aa4700fbfc48ff50 data/Open Source at Harvard/data/2019-02-25.tab
+8af47fd3e16134e1aa4700fbfc48ff50 data/Open Source at Harvard/data/2019-02-25.tsv
 815c7d68f7de886fc5df715c3e4f8a80 data/Open Source at Harvard/doc/README.md
 4cbc5116116ee776466b613e77693c2f data/Open Source at Harvard/code/language-count.sh
-0521aec1df05efd6733b8428912095bb data/Open Source at Harvard/results/language-count.tab
+0521aec1df05efd6733b8428912095bb data/Open Source at Harvard/results/language-count.tsv
 1ff97714bdeaf41908c05a2271bd5afa data/Open Source at Harvard/run.sh
diff -Naur develop/doi-10-5072-FK2-TGBHLBv-1-0/metadata/pid-mapping.txt archive/doi-10-5072-FK2-TGBHLBv-1-0/metadata/pid-mapping.txt
--- develop/doi-10-5072-FK2-TGBHLBv-1-0/metadata/pid-mapping.txt	2022-08-09 14:36:20.000000000 -0400
+++ archive/doi-10-5072-FK2-TGBHLBv-1-0/metadata/pid-mapping.txt	2022-08-09 14:59:46.000000000 -0400
@@ -1,6 +1,6 @@
 https://doi.org/10.5072/FK2/TGBHLB data/Open Source at Harvard/
-http://localhost:8080/file.xhtml?fileId=8 data/Open Source at Harvard/data/2019-02-25.tab
+http://localhost:8080/file.xhtml?fileId=8 data/Open Source at Harvard/data/2019-02-25.tsv
 http://localhost:8080/file.xhtml?fileId=7 data/Open Source at Harvard/doc/README.md
 http://localhost:8080/file.xhtml?fileId=5 data/Open Source at Harvard/code/language-count.sh
-http://localhost:8080/file.xhtml?fileId=6 data/Open Source at Harvard/results/language-count.tab
+http://localhost:8080/file.xhtml?fileId=6 data/Open Source at Harvard/results/language-count.tsv
 http://localhost:8080/file.xhtml?fileId=4 data/Open Source at Harvard/run.sh

@qqmyers
Copy link
Member Author

qqmyers commented Aug 9, 2022

FWIW: In terms of file changes between two bags:

  • bag-info.txt contains the Bagging-Date and Bag-Size which change with when the bagging is done and what files are included.
  • manfiest-md5.txt and pid-mapping.txt contain file names and so will change when .tsv vs .tab extensions are used
    I think the other changes above would be the direct changes to the included files and the oai-ore metadata file itself.

@coveralls
Copy link

coveralls commented Aug 9, 2022

Coverage Status

Coverage decreased (-0.003%) to 20.045% when pulling cf2e996 on GlobalDataverseCommunityConsortium:GDCC/8449-use_original_files_in_archival_bag into 6ca1c9d on IQSS:develop.

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks fine and, more importantly, seems to work fine. See my detailed comment showing a diff of before and after bags.

I added a very basic BagIT API test and Jenkins executed it. There should be a bag in /tmp on the EC2 instance. Here's the output of the test: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-8901/6/testReport/edu.harvard.iq.dataverse.api/BagIT/testBagItExport/

Moving to QA.

@pdurbin pdurbin removed their assignment Aug 10, 2022
@landreev landreev self-assigned this Aug 11, 2022
@landreev
Copy link
Contributor

I've been using this PR to learn/figure out the QA process itself. The functionality provided in this PR, for all practical purposes @pdurbin had QA-ed it all during his code review before he approved it (see his very detailed notes above). So it would have been safe enough to just click "merge" without further ado... In other words, a good PR to use as a learning one.

My process was as follows:
(the steps are all standard, I'm not doing anything outside of Kevin's normal process. but this describes the order in which I was going about it, and my thought process and questions/ideas about it)

  1. Read the PR description and the notes, the one for the review and the "how to test" one;
  2. Read all the comments added during the review (which, once again, included a detailed report of testing the functionality; something we normally don't do in reviews);
  3. Checked if any user documentation was changed/added as part of the PR (none);
  4. Checked if there are any configuration/upgrade instructions (none);
  5. Looked at the code files changed in the PR; didn't try to read/re-review the code, focused solely on the scope of the changes/tried to make an assessment of the likelihood of the changes in the PR breaking something elsewhere in the application, outside of the BagIT functionality proper. (in my professional opinion, very low). I feel like this is a super important step, because based on this conclusion you decide how much of the functionality outside of the immediate feature(s) covered by the PR to re-test.
  6. Checked out the branch, made sure it builds on my own dev. environment.
  7. I assumed it wasn't necessary to run all the restassured tests on my local environment. But I made sure the test for the functionality in question passed (thanks to Phil for adding the test during review!).
  8. Tested the output of the api "before" and "after" on my own build; compared the results. Basically made sure to replicate Phil's result above. (everything checks out; minus the difference in the method and path he describes, which is no longer there).
  9. Tested the rest of the application. Based on the conclusion in 4. above only ran a limited smoke test (Kevin's standard test mostly with a few extra clicks).
  10. Deployed the war file on dataverse-internal; repeated the steps 8. and 9. there.
  11. Checked on the status of the restassured tests run in the PR. This may have been a mistake - I'm thinking next time I'll start with this step; and use the war file built by Jenkins for all the further testing, if immediately available (instead of my own-built war file). With this PR, it turned out that the automatic Jenkins run that was triggered after the last commit 5 days ago failed to build; so there was no war file. It looked like that was a Jenkins hickup. Asked Don to rerun the tests; it built this time around - but then died for some other reason - it still looks like it's unrelated to the contents of the PR itself; we will figure this part out tomorrow.
  12. Based on the testing above still have one minor question to ask the developer.

@landreev
Copy link
Contributor

@qqmyers You mentioned in the PR description that you were going to add something to the 5.12 release note. Was it about running a re-export, because of the OAI_ORE format changes? We will have at least one other PR in 5.12 that comes with a reexport, so that part will be covered.
If there's anything else to include in the release doc, I would propose, just for the sake of process, that we always include a note under doc/release-notes when needed - even if it's just 1 sentence that says "this one will need an upgrade instruction, will add in the main release doc". Just to make sure we don't forget.

@qqmyers
Copy link
Member Author

qqmyers commented Aug 24, 2022

@landreev - In the #8894 release notes PR, lines 17 and 33 involve the functionality from this PR and the upgrade instructions include reexporting. If there's anything else to add, let me know.

@landreev
Copy link
Contributor

@qqmyers perfect.

@landreev
Copy link
Contributor

Jenkins tests have passed. So I think this is ready to be merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
HDC Harvard Data Commons HDC: 3a Harvard Data Commons Obj. 3A
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tabular File from Ingest Included in Bag - not original Uploaded File
5 participants