T300 bulkrax8.0.0 #555

kerchner · 2024-06-03T20:04:34Z

This PR does the following:

Upgrades Bulkrax from 2.3.0 to a post-8.0.0 commit. Once that commit is included in a Bulkrax release, we can upgrade to that release.

Set up the code branch for testing:

Do a bundle install to upgrade Bulkrax.
Make sure to run db:migrate -- the Bulkrax upgrade affects Bulkrax-related database tables.
On the host server, create a /opt/scholarspace/scholarspace-ingest directory. (Creation of this directory, if not already present, has been added to Dockerfile.)

Set up data for testing:

There is an AWS S3 bucket called proquest-etds-test containing a set of ProQuest ETD zip files suitable for testing. Steps to set up the data so that it can be accessed from within the docker container app-server and sidekiq volumes:

Install s3fs-fuse on your server
Set up credentials in ${HOME}/.passwd-s3fs - please see me or @kilahimm for the bulkrax access key credentials.
Mount the bucket on your file system. For example, you might create a /data-s3-etds directory and then run s3fs proquest-etds-test /data-s3-etds
Copy some of the ETDs from the S3 mount point to /opt/scholarspace/scholarspace-ingest or a directory beneath it.
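The copy step can also be scripted. A minimal Ruby sketch, assuming the bucket is mounted as in the example above and that the zips sit directly inside the source directory passed in; the helper name `copy_sample_zips`, the sample size, and the `etd-zips` subdirectory are hypothetical:

```ruby
require 'fileutils'

# Hypothetical helper: copy up to `limit` ProQuest zips from the S3 mount
# into the ingest directory and return the paths of the copies.
def copy_sample_zips(source_dir, dest_dir, limit: 10)
  FileUtils.mkdir_p(dest_dir)
  zips = Dir.glob(File.join(source_dir, '*.zip')).sort.first(limit)
  zips.map do |zip|
    target = File.join(dest_dir, File.basename(zip))
    FileUtils.cp(zip, target)
    target
  end
end

# Example call (mount point from the step above; subdirectory name is an assumption):
# copy_sample_zips('/data-s3-etds', '/opt/scholarspace/scholarspace-ingest/etd-zips', limit: 5)
```

Sorting before taking the first few zips keeps the sample stable between runs, which makes repeated test ingests easier to compare.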

Test:

Run the ingest_bulkrax_prep rake task inside the container (or from the outside using docker exec). The task requires one argument: the path to the directory containing the ProQuest zips you wish to include in the ingest. For example, if the ETDs are in /opt/scholarspace/scholarspace-ingest/etd-zips: bundle exec rails gwss:ingest_pq_etds['/opt/scholarspace/scholarspace-ingest/etd-zips']
This will create /opt/scholarspace/scholarspace-hyrax/tmp/bulkrax_zip and, within this folder, metadata.csv and files/ containing attachments (within subdirectories, one per ETD).
You may wish to scp tmp/bulkrax_zip/metadata.csv to your local computer so that you can more conveniently view it, for example using Excel, in order to validate test results.
Within the GW ScholarSpace web application, log in as an administrative user. On the Dashboard, click on Importers. Create a New importer with the following values:
- Name = any name
- Administrative Set = ETDs
- Frequency = Once (on save)
- Limit = leave blank
- Parser = CSV - Comma Separated Values
- Visibility = Public
- Rights Statement = leave blank
- Add CSV File to Import: Specify a Path on the Server. Import file path = /opt/scholarspace/scholarspace-hyrax/tmp/bulkrax_zip/metadata.csv

Before starting the import, open a tab to the Sidekiq administrator (at /sidekiq) so that you can watch progress of the queues.

Then proceed and click Create and Import.

*If you wish to re-run the task to generate the bulkrax-ready metadata and files, then you'll need to first clear out the results of the previous run: rm -r /opt/scholarspace/scholarspace-hyrax/tmp/bulkrax_zip

Validate the test results

Verify in the Importer UI that there were no errors.
After waiting for all jobs to finish in Sidekiq:
- confirm that an ETD object was created for each ETD in metadata.csv (and that at least the count of ETDs matches the number of ETD zip files).
- confirm that files which should be embargoed are embargoed. Note that there seems to be a bug in the Hyrax UI, whereby embargo release date does not display. However, the embargo release date on a FileSet (or other object) can be confirmed via the rails console if desired.
- confirm that the thumbnail derivatives are created for attachments of thumbnail-able types.
- confirm that the thumbnail for the GwETD is the thumbnail for the main thesis/dissertation PDF.
- confirm that after the ETDs have all finished importing, the ETDs are public, even if some of them have embargoed FileSets
In Sidekiq, there may be occasional objects (PDFs, etc.) that failed to characterize using Ghostscript. These seem to fail in the same manner when attached through the UI. As long as the objects (PDF, etc.) do attach and are downloadable via the UI, and the failures do not somehow crash the whole ingest (they shouldn't), the PR should pass.
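The count check in the list above can be scripted against the prep output before importing. A minimal sketch using Ruby's standard csv library; it assumes metadata.csv has a `model` column and that work rows carry the value `GwEtd`. Both names are assumptions and should be adjusted to the actual header written by the task:

```ruby
require 'csv'

# Count rows in the prep CSV per value of the (assumed) `model` column.
def rows_by_model(csv_path, model_column: 'model')
  CSV.foreach(csv_path, headers: true).map { |row| row[model_column] }.tally
end

# Count ProQuest zips directly inside the ingest directory.
def zip_count(zip_dir)
  Dir.glob(File.join(zip_dir, '*.zip')).length
end

# True when the number of work rows equals the number of zips.
def counts_match?(csv_path, zip_dir, work_model: 'GwEtd')
  rows_by_model(csv_path).fetch(work_model, 0) == zip_count(zip_dir)
end
```

This only checks the prep step. Confirming that each work was actually created in the repository still requires the Importer UI or the rails console.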

Other notes to the reviewer

I recognize that the code for the rake task, as currently written, is probably not the most ruby-esque in style. I am very open to suggestions for improvements to the style (and, of course, function) of the code.

Final thoughts

While this procedure for parsing the ProQuest zips, then loading via the front end, is still somewhat manual, requiring running back-end tasks, loading ETDs is generally something that we do just a few times per year. I also recognize that separating out ETDs from the S3 mount to be newly loaded, and avoiding re-loading ETDs that we already loaded, will also be somewhat of a manual process, although I intend to post suggested bash commands to make this a bit easier. The concept is to get this much working, try it for a few cycles of loads, and take what we learn and then decide what enhancements to the overall workflow would be most worthwhile.
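Selecting only the zips that have not been loaded yet could look roughly like the sketch below. It assumes a plain-text ledger of previously loaded zip file names, one per line; that ledger file and the helper name `zips_not_yet_loaded` are hypothetical and not part of this PR:

```ruby
require 'set'

# Hypothetical helper: list zips under mount_dir whose file names are not
# recorded in the ledger at loaded_list_path (one file name per line).
def zips_not_yet_loaded(mount_dir, loaded_list_path)
  loaded =
    if File.exist?(loaded_list_path)
      File.readlines(loaded_list_path, chomp: true).to_set
    else
      Set.new
    end
  Dir.glob(File.join(mount_dir, '*.zip')).sort.reject do |zip|
    loaded.include?(File.basename(zip))
  end
end
```

After a successful load, appending the newly ingested file names to the ledger would keep the next run from picking them up again.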


alepbloyd

Everything seems to work as described! Woo! I took a quick look at the ingest code, and I think it's fine as is - if it's working properly and probably won't be changing frequently, I don't think it's worth trying to pick it apart to make it more ruby-ish.

Two thoughts for future iterations:

1. I'll try to take a stab at writing a handful of RSpec tests for this. I think we could make a dummy ETD zip with embargoed files as a fixture, and then run the ingest task against that zip file. I don't think that needs to hold up merging this, though; I can do that as a separate PR.
2. It seems like there's a way to use AWS Lambda to trigger events when the contents of an S3 bucket change (https://docs.aws.amazon.com/lambda/latest/dg/lambda-services.html#supported-event-source-s3). I wonder if we could set that up to run the ingest when a new ETD zip is deposited in the S3 bucket and remove the need to manually run this? Definitely something we'd want to discuss more, but I might mess around and see if I can get a proof of concept running when I have some time.

kerchner · 2024-06-10T17:00:42Z

Thanks @alepbloyd !

Re: 1. I am definitely interested to see what you come up with for RSpec tests, I realize that those need to be written as well.

Re: 2. That sounds like a nice technological solution. However - and we should discuss this - I don't think we're near the point where we'd want any works at all being loaded into production without a human triggering the process and without a human doing QC. In practice we should first run any ingest in the test environment and watch the process to observe any issues and check the results. There are ETDs that can choke, especially on the steps where ghostscript characterizes them and where ImageMagick generates thumbnails; in theory we could also encounter issues parsing the ProQuest XML. So, running the ingest is always well served by having a person monitor the process.

@dolsysmith do you also want to test this PR, or should we proceed with merge?

dolsysmith

I ran the tests on the complete contents of the etds folder in the proquest-etds-test bucket.

The ingest_bulkrax_prep task yielded 99 works with associated files. The metadata sheet seems correctly formatted, etc.

I then created an importer for these 99 works. Most works were successfully imported, but a few errors did appear:

Derivatives failed to be created for at least three filesets. See the screenshots below from the Sidekiq logs.
- In at least one case, the log revealed a Ghostscript error. The file is attached to the work and can be downloaded, but the derivative is lacking.
- In another case, the logs showed a timeout error. Based on what I was seeing in the logs, I think it might have been triggered by a Java heap error on the Solr app. The PDF in question is pretty large: Li_gwu_0075A_16600.pdf.
One work's page (work, not fileset) is not accessible at all: Investigating Multicolor Affine Urn Models with Multiple Drawings. I can retrieve the ActiveRecord instance in the Rails console, and the metadata looks fine, but the app throws an error in the UI -- I think because it's looking for an image method on the WorkShowPresenter instance.

Embargoing seems to work, but I am seeing that files that are supposed to be embargoed as restricted -- per the metadata CSV -- are actually marked private in GWSS. It's my understanding that the restricted permissions are supposed to allow GW community access (but maybe it has a different meaning in the Bulkrax context?).

Another observation, which may be more of a Bulkrax bug, is that these errors didn't bubble up to the Importer UI in GWSS. All of the works were marked as Completed, not Completed with failures, despite the aforementioned errors' appearing in the logs and the derivatives' not being created.

Let me know if you want me to re-test anything with a smaller batch. Perhaps it was the sheer amount of files that caused some of these errors.

kerchner · 2024-06-10T18:57:04Z

As per discussion, an embargo value of restricted in the bulkrax metadata CSV seems to map to private visibility in the app (and a label of embargo is correctly applied as well).
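The naming here is easy to trip over. A small lookup sketch, assuming the visibility strings follow the usual Hydra access-control values; the hash name and the descriptions are illustrative and not taken from the GWSS code:

```ruby
# Stored visibility strings mapped to what they mean for access.
# Assumption: these mirror the Hydra access-control visibility values.
VISIBILITY_MEANING = {
  'open'          => 'public: anyone can view',
  'authenticated' => 'institution: logged-in users can view',
  'restricted'    => 'private: only users granted explicit access can view'
}.freeze

# Describe a visibility string, flagging values outside the table.
def describe_visibility(value)
  VISIBILITY_MEANING.fetch(value) { "unknown visibility: #{value}" }
end

puts describe_visibility('restricted')
```

Under that assumption, GW-community access would correspond to authenticated rather than restricted.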

@dolsysmith is going to try re-loading the ETDs in question via the UI to see if the failure to generate derivatives occurs there as well, and whether the Investigating Multicolor Affine Urn Models with Multiple Drawings ETD can be uploaded or not.

alepbloyd · 2024-06-10T19:02:15Z

If additional context is helpful, I only ran the import on the ETDs numbered up through 1029184 when I was testing.

I'm working on trying to get some RSpec tests going for this, so if there's any additional troubleshooting y'all want me to try on my server - just let me know.

dolsysmith · 2024-06-10T19:44:22Z

@kerchner I was able to replicate the errors in creating derivatives by uploading the works through the UI. One of these is throwing the ImageMagick/GhostScript error, and the other (the 17 MB file) is just causing a timeout -- I can't see any other errors for that in the logs. But that's the same behavior as when uploaded via Bulkrax, so I don't think the problem is with the code in this branch.

Regarding the work "Investigating Multicolor Affine Urn Models with Multiple Drawings," I was able to upload this work directly into the UI without issue. So I then tried to replicate by re-uploading it as part of a small batch with Bulkrax (only 4 works), and this also worked. So I'm afraid I'm not sure what happened the first time. I deleted the original, error-inducing work using the console, so I'm afraid I don't have a record of it anymore for troubleshooting.

dolsysmith · 2024-06-10T19:45:49Z

Also, I notice that both of my Bulkrax imports still show a "pending" status in the Bulkrax menu, even though when you click on either, there are no pending jobs. Are you seeing the same on your instance?

kerchner · 2024-06-10T20:11:01Z

> Also, I notice that both of my Bulkrax imports still show a "pending" status in the Bulkrax menu, even though when you click on either, there are no pending jobs. Are you seeing the same on your instance?

At least when I try the full 99 test ETDs, the importer itself gets a green checkmark pretty quickly, even while its component objects are still pending. In other situations, I've also gotten the red x quickly as well. Not sure why yours still shows pending.

kerchner · 2024-06-10T23:36:02Z

@dolsysmith I tried reloading all 99 ETDs as a single import. Investigating Multicolor Affine Urn Models with Multiple Drawings looks like it loaded fine, including thumbnail. Items with issues:

Li loaded without a problem, but there is no thumbnail. (and note that this batch has ETDs much larger than Li, so I'm not sure that file size per se is the issue)
Clementi loaded, but there is no thumbnail.
Moore loaded, and shows a thumbnail for the catalog result, but not on the item view.

The errors I'm seeing with the Sidekiq jobs are consistent with the above, in that for each of the 3, the CreateRelationships job is not stuck, but the CreateDerivatives job is. Each has a different error.

I haven't tried reloading via the UI, but I would expect to get the same results as you did (i.e. ok but no thumbnail).

I believe these are errors that we cannot easily fix in Hyrax/Ghostscript/Minimagick, but we can live with. Plus they are neither new nor unique to this new Bulkrax loader. Over the past few years they have become less frequent, but they do still occur as we see here.

@alepbloyd @dolsysmith @kilahimm What are your thoughts on proceeding with merge given these results of testing?

dolsysmith · 2024-06-11T12:56:51Z

lib/tasks/ingest_bulkrax_prep.rake

+      etd_date_created = get_date_created(doc)
+      work_metadata['date_created'] = etd_date_created unless etd_date_created.nil?
+      work_metadata['committee_member'] = get_committee_members(doc).join(';')
+      work_metadata['rights_statement'] = 'http://rightsstatements.org/vocab/InC/1.0/'


I think we should add the license and resource_type fields, since those are required when updating/editing a work in the UI.

kerchner · 2024-06-11

Thanks for catching this - I'll fix.

kerchner · 2024-08-27

I will add resource_type, and @dolsysmith, you'll add license in the branch for #458.

dolsysmith · 2024-06-11T12:58:20Z

@kerchner Merging makes sense. See my comment above about a minor change to ensure that the ETDs created have values for all of the fields required by the UI. If required values are missing, one cannot save changes when manually editing the work, which may prove confusing for users.
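One way to guard against this in the prep task would be to check each metadata hash before it is written to the CSV. A minimal sketch; `REQUIRED_FIELDS` and `missing_required_fields` are hypothetical names, and the field list is limited to the two fields named in this review:

```ruby
# Fields the edit form requires, per the review comment (assumed list).
REQUIRED_FIELDS = %w[license resource_type].freeze

# Return the required fields that are absent or blank in a metadata hash.
def missing_required_fields(work_metadata, required: REQUIRED_FIELDS)
  required.select do |field|
    value = work_metadata[field]
    value.nil? || value.to_s.strip.empty?
  end
end

work_metadata = {
  'rights_statement' => 'http://rightsstatements.org/vocab/InC/1.0/',
  'resource_type' => ''
}
puts missing_required_fields(work_metadata).inspect
# prints ["license", "resource_type"]
```

Treating blank strings as missing matters here, since an empty CSV cell would otherwise pass a simple nil check.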

kerchner · 2024-06-11T23:01:53Z

Work-in-progress: Adding the "All rights reserved" license will require some problem-solving because Bulkrax wants to append a / to the license ID value in the license metadata CSV field (see https://github.com/samvera/bulkrax/blob/v8.1.0/app/models/concerns/bulkrax/import_behavior.rb#L145-L146 ) - and the "corrected" value no longer matches the value in the authority yaml file of http://www.europeana.eu/portal/rights/rr-r.html (which does not have a trailing /, nor should it)
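The mismatch can be reproduced in isolation. The sketch below is a simplified stand-in for the slash-appending behaviour described above, not Bulkrax's actual code; `append_trailing_slash` and `AUTHORITY_IDS` are hypothetical names, and the authority list holds only the one URI quoted in this comment:

```ruby
# IDs as they appear in the license authority file (single entry for illustration).
AUTHORITY_IDS = ['http://www.europeana.eu/portal/rights/rr-r.html'].freeze

# Simplified stand-in: ensure a URI-like value ends with "/".
def append_trailing_slash(value)
  value.end_with?('/') ? value : "#{value}/"
end

csv_value  = 'http://www.europeana.eu/portal/rights/rr-r.html'
normalized = append_trailing_slash(csv_value)

puts AUTHORITY_IDS.include?(csv_value)   # true: the raw value matches
puts AUTHORITY_IDS.include?(normalized)  # false: the slash-terminated value does not
```

Any fix therefore has to either keep the value from being treated as a URI or use an authority ID that survives the normalization.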

Also noting that we do not seem to ever display the license field on works in GW ScholarSpace to the user, yet the field is required. Should it be required? I've sent the question to Barrett.


kerchner · 2024-06-14T19:36:54Z

@dolsysmith @kilahimm @alepbloyd
I would propose the following while the question of the desired value for license is reviewed by ScholComm: I can create a new issue to track the deficiency where the ingest task is not setting license - which does have the aforementioned effect on content-admin users. Proceed with merge. If desired, we can use a script or task to set license for all ETDs lacking it, to either the currently used value or some new value. This would be a temporary measure to address the next batch of ETDs. If the currently used URI for "All Rights Reserved" is confirmed going forward, we will need to find a solution to work around Bulkrax aggressively slash-terminating values in the license column of the metadata CSV.

Thoughts on this?

dolsysmith · 2024-06-17T15:17:18Z

I updated this ticket to reflect ScholComm's discussion of the license field.

dolsysmith · 2024-08-08T15:19:13Z

@kerchner The deprecated (and problematic) license value has been replaced in #458 with a value that doesn't get parsed as a URI, so it doesn't seem to cause a problem for Bulkrax. I think the license field needs to be added to the ingest_bulkrax_prep task, and it should be set to the new value, All rights reserved.

kerchner added 22 commits May 8, 2024 18:05

Upgrade from Bulkrax 2.3.0 to 8.0.0, no configuration just yet

18de431

Fixes uploads-with-files issue by pointing to bulkrax branch

c9a3b2b

Work in Progress - tasks to ingest ProQuest ETD zips

b73746c

WIP - next need to create CSV from array of metadata hashes

c9b1c42

WIP - fixed problem creating header row

3018d19

Fixed embargo logic; fixed CSV structure

8c69d97

Eliminated folder names from metadata csv FileSet entries; copy files to bulkrax zip staging directory, but will need to segregate files from each ETD into separate directories

7d7b47e

Adds 'bulkrax_identifier' metadata; fixes imports of works w/files, using prerelease of next bulkrax release.

dd9eb7c

implemented parent work/child FileSet bulkrax_identifier, repaired embargo attributes

c13eb43

refactor file paths for extracted zip; parse creator/contributors

5fdf6a7

Repair attachment filenames with spaces (or else bulkrax will); fix author parsing

e97b032

Add degree, advisors, committee members

d5342b1

Add gw_affiliation, date_created

d6a515d

Simplify embargo date; add rights statement; clean up

c47c8ec

Fix truncated file; clarify configs, set default rights

4419701

Update bulkrax hash, now contains db migration fix

caff6dc

Code cleanup for PR

d3e87e5

Add scholarspace-ingest directory and volume mapping

01d91df

Add mapping for scholarspace-ingest directory

020ad50

Add CI directive to create ingest folder

99cbb75

Upgrade Bulkrax to 8.1.0

9fd14dd

Allow admin user to visit /importers and /exporters even when there isn't yet an admin set for admin to deposit to

c764271

kerchner marked this pull request as ready for review June 7, 2024 05:56

kerchner requested review from alepbloyd and dolsysmith June 7, 2024 05:57

alepbloyd approved these changes Jun 10, 2024


dolsysmith reviewed Jun 10, 2024


dolsysmith reviewed Jun 11, 2024


alepbloyd added 6 commits June 11, 2024 21:53

Add fixture zips for bulkrax rspec testing

22ea974

Add sidekiq inline testing setting

7a09fa8

Set testing queue for inline sidekiq

b3aa9c1

Modify ingest_bulkrax_prep when in test mode

86762e0

Add bulkrax importer tests

906e622

Simplify bulkrax tests

d33a54a

Populates degree and resource_type. License is still WIP, pending input from ScholComm

9030ee8

Added resource_type field

bf8c139

dolsysmith mentioned this pull request Aug 8, 2024

T458 license field display #562

Merged

kerchner self-assigned this Aug 27, 2024

dolsysmith merged commit 5dfd206 into master Sep 4, 2024
1 check passed

dolsysmith deleted the t300-bulkrax8.0.0 branch September 4, 2024 18:16

This was referenced Sep 24, 2024

Test and configure ETD Loader and Bulkrax #386

Closed

Replacing ETD bulk loader with Bulkrax #300

Closed

ingest_etd breaks when metadata contains control characters #96

Closed
