T300 bulkrax8.0.0 #555
Conversation
… to bulkrax zip staging directory, but will need to segregate files from each ETD into separate directories
…sing prerelease of next bulkrax release.
…sn't yet an admin set for admin to deposit to
Everything seems to work as described! Woo! I took a quick look at the ingest code, and I think it's fine as is - if it's working properly and probably won't be changing frequently, I don't think it's worth trying to pick it apart to make it more ruby-ish.
Two thoughts for future iterations:
- I'll try to take a stab at writing a handful of RSpec tests for this. I think we could make a dummy ETD zip with embargoed files as a fixture, and then run the ingest task against that zip file. I don't think that needs to hold up merging this, though, and I can do it as a separate PR (a rough sketch of such a spec appears after this list).
- It seems like there's a way to use AWS Lambda to trigger events when the contents of an S3 bucket change (https://docs.aws.amazon.com/lambda/latest/dg/lambda-services.html#supported-event-source-s3). I wonder if we could set that up to run the ingest when a new ETD zip is deposited in the S3 bucket and remove the need to run this manually? Definitely something we'd want to discuss more, but I might mess around and see if I can get a proof of concept running when I have some time (a rough sketch of such a handler is also below).
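On the first point, here is a rough sketch of what such a spec might look like. It assumes a dummy ETD zip fixture under `spec/fixtures/etds/`, that the task is invoked as `gwss:ingest_pq_etds`, and that output lands in `tmp/bulkrax_zip` under the Rails root -- all of which would need to be checked against the actual task in this PR:

```ruby
# spec/tasks/ingest_pq_etds_spec.rb -- sketch only; fixture paths and output
# locations are assumptions, not confirmed against the task in this branch.
require 'rails_helper'
require 'rake'
require 'csv'
require 'fileutils'

RSpec.describe 'gwss:ingest_pq_etds', type: :task do
  let(:fixture_dir) { Rails.root.join('spec', 'fixtures', 'etds') } # holds one dummy ETD zip with embargoed files
  let(:output_dir)  { Rails.root.join('tmp', 'bulkrax_zip') }

  before do
    Rails.application.load_tasks unless Rake::Task.task_defined?('gwss:ingest_pq_etds')
    FileUtils.rm_rf(output_dir) # the task expects a clean staging directory
  end

  it 'produces a metadata.csv and a files/ directory for the fixture ETD' do
    Rake::Task['gwss:ingest_pq_etds'].invoke(fixture_dir.to_s)

    expect(File).to exist(output_dir.join('metadata.csv'))
    expect(Dir.exist?(output_dir.join('files'))).to be true

    rows = CSV.read(output_dir.join('metadata.csv'), headers: true)
    expect(rows.count).to eq(1) # one work row per ETD zip in the fixture directory
  end
end
```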
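And on the second point, a minimal proof-of-concept sketch using Lambda's Ruby runtime. The bucket wiring and what the function would actually do next (enqueue the ingest, notify a human, etc.) are deliberately left open, so this version only identifies newly deposited zips:

```ruby
# lambda_function.rb -- proof-of-concept sketch only.
# Assumes the S3 bucket (e.g. proquest-etds-test) is configured to send
# ObjectCreated events to this function.
require 'json'

def handler(event:, context:)
  (event['Records'] || []).each do |record|
    bucket = record.dig('s3', 'bucket', 'name')
    key    = record.dig('s3', 'object', 'key')
    next unless key&.end_with?('.zip')

    # Here we would kick off (or request) the ingest -- e.g. post to an app
    # endpoint or write to a queue. For now, just log the new deposit.
    puts "New ETD zip deposited: s3://#{bucket}/#{key}"
  end
  { statusCode: 200 }
end
```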
Thanks @alepbloyd! Re: 1, I am definitely interested to see what you come up with for RSpec tests; I realize those need to be written as well. Re: 2, that sounds like a nice technological solution. However -- and we should discuss this -- I don't think we're near the point where we'd want any works loaded into production without a human triggering the process and without a human doing QC. In practice we should first run any ingest in the test environment, watch the process for issues, and check the results. Some ETDs can choke the pipeline, especially on the steps where Ghostscript characterizes them and where ImageMagick generates thumbnails; in theory we could also hit issues parsing the ProQuest XML. So running the ingest is always well served by having a person monitor the process. @dolsysmith do you also want to test this PR, or should we proceed with merge?
I ran the tests on the complete contents of the `etds` folder in the `proquest-etds-test` bucket. The `ingest_bulkrax_prep` task yielded 99 works with associated files. The metadata sheet seems correctly formatted, etc.
I then created an importer for these 99 works. Most works were successfully imported, but a few errors did appear:
- Derivatives failed to be created for at least three filesets. See the screenshots below from the Sidekiq logs.
- In at least one case, the log revealed a Ghostscript error. The file is attached and downloaded to the work, but the derivative is lacking.
- In another case, the logs showed a timeout error. Based on what I was seeing in the logs, I think it might have been triggered by a Java heap error on the Solr app. The PDF in question is pretty large: Li_gwu_0075A_16600.pdf.
- One work's page (work, not fileset) is not accessible at all: Investigating Multicolor Affine Urn Models with Multiple Drawings. I can retrieve the ActiveRecord instance in the Rails console, and the metadata looks fine, but the app throws an error in the UI -- I think because it's looking for an `image` method on the `WorkShowPresenter` instance.
Embargoing seems to work, but I am seeing that files that are supposed to be embargoed as `restricted` -- per the metadata CSV -- are actually marked `private` in GWSS. It's my understanding that the `restricted` permissions are supposed to allow GW community access (but maybe it has a different meaning in the Bulkrax context?).
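If GWSS follows the stock Hyrax/Hydra visibility vocabulary, one possible source of the mismatch is that the string `restricted` is Hyrax's internal value for *private*, while institution-only access is `authenticated` -- so a CSV value of `restricted` may be doing exactly what Hyrax thinks it means. A minimal Rails console check (the FileSet id is hypothetical, and this assumes stock Hyrax embargo behavior):

```ruby
# Rails console sketch -- inspect how the embargo actually landed on one file set
fs = FileSet.find('abc12345x')

fs.visibility
# Hyrax visibility strings: 'open' (public), 'authenticated' (institution-only),
# 'restricted' (private). Note that 'restricted' is Hyrax's *private* value.

if fs.embargo
  fs.embargo.embargo_release_date
  fs.embargo.visibility_during_embargo  # what the CSV's embargo value mapped to
  fs.embargo.visibility_after_embargo
end
```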
Another observation, which may be more of a Bulkrax bug, is that these errors didn't bubble up to the Importer UI in GWSS. All of the works were marked as Completed, not Completed with failures, despite the aforementioned errors appearing in the logs and the derivatives not being created.
Let me know if you want me to re-test anything with a smaller batch. Perhaps it was the sheer number of files that caused some of these errors.
As per discussion of the embargo value, @dolsysmith is going to try re-loading the ETDs in question via the UI to see if the failure to generate derivatives occurs there as well, and whether the embargo value is applied the same way.
If additional context is helpful, I only ran the import on the ETDs numbered up through … . I'm working on trying to get some RSpec tests going for this, so if there's any additional troubleshooting y'all want me to try on my server, just let me know.
@kerchner I was able to replicate the errors in creating derivatives by uploading the works through the UI. One of these is throwing the ImageMagick/Ghostscript error, and the other (the 17 MB file) is just causing a timeout -- I can't see any other errors for that in the logs. But that's the same behavior as when uploaded via Bulkrax, so I don't think the problem is with the code in this branch. Regarding the work …
@dolsysmith I tried reloading all 99 ETDs as a single import.
The errors I'm seeing with the Sidekiq jobs are consistent with the above, in that for each of the 3, the CreateRelationships job is not stuck, but the CreateDerivatives job is, and each has a different error. I haven't tried reloading via the UI, but I would expect to get the same results as you did (i.e. OK but no thumbnail). I believe these are errors that we cannot easily fix in Hyrax/Ghostscript/MiniMagick, but that we can live with. Plus, they are neither new nor unique to this new Bulkrax loader: over the past few years they have become less frequent, but they do still occur, as we see here. @alepbloyd @dolsysmith @kilahimm What are your thoughts on proceeding with merge given these results of testing?
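For anyone repeating this check, one way to pull the per-job errors out of Sidekiq from the Rails console rather than the web UI is sketched below. It assumes GWSS uses Hyrax's stock `CreateDerivativesJob` class name; failed jobs may also end up in the dead set (`Sidekiq::DeadSet`) rather than the retry set:

```ruby
# Rails console sketch -- list the errors attached to retrying derivative jobs
require 'sidekiq/api'

Sidekiq::RetrySet.new.each do |job|
  next unless job.display_class.to_s.include?('CreateDerivativesJob')

  # error_class / error_message are stored in the job payload on failure
  puts "#{job.display_class}: #{job['error_class']} -- #{job['error_message']}"
end
```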
etd_date_created = get_date_created(doc)
work_metadata['date_created'] = etd_date_created unless etd_date_created.nil?
work_metadata['committee_member'] = get_committee_members(doc).join(';')
work_metadata['rights_statement'] = 'http://rightsstatements.org/vocab/InC/1.0/'
I think we should add the `license` and `resource_type` fields, since those are required when updating/editing a work in the UI.
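For concreteness, a sketch of what that could look like alongside the existing defaults in the rake task; the actual values, especially the license, were still being decided (see the later discussion and #458), so these are placeholders only:

```ruby
# Sketch only -- placeholder values, pending ScholComm's decision on the license
work_metadata['rights_statement'] = 'http://rightsstatements.org/vocab/InC/1.0/'
work_metadata['license']          = 'All rights reserved'    # placeholder; see #458
work_metadata['resource_type']    = 'Thesis or Dissertation' # placeholder
```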
Thanks for catching this - I'll fix.
I will add `resource_type`, and @dolsysmith you'll add `license` in the branch for #458.
@kerchner Merging makes sense. See my comment above about a minor change to ensure that the ETDs created have values for all of the fields required by the UI. If required values are missing, one cannot save changes when manually editing the work, which may prove confusing for users.
Work-in-progress: Adding the "All rights reserved" license will require some problem-solving because of how Bulkrax handles the value (see the discussion of URI parsing below). Also noting that we do not seem to ever display the license field.
…ut from ScholComm
@dolsysmith @kilahimm @alepbloyd Thoughts on this?
I updated this ticket to reflect ScholComm's discussion of the license field.
@kerchner The deprecated (and problematic) license value has been replaced in #458 with a value that doesn't get parsed as a URI, so it doesn't seem to cause a problem for Bulkrax. I think the …
This PR does the following:
Set up the code branch for testing:
- Run `bundle install` to upgrade Bulkrax.
- Run `db:migrate` -- the Bulkrax upgrade affects Bulkrax-related database tables.
- Make sure the `/opt/scholarspace/scholarspace-ingest` directory exists. (Creation of this directory, if not already present, has been added to `Dockerfile`.)

Set up data for testing:
There is an AWS S3 bucket called `proquest-etds-test` containing a set of ProQuest ETD zip files suitable for testing. Steps to set up the data so that it can be accessed from within the docker container (`app-server` and `sidekiq` volumes):
- Add the S3 access key credentials to `${HOME}/.passwd-s3fs` -- please see me or @kilahimm for the bulkrax access key credentials.
- Create a `/data-s3-etds` directory and then run `s3fs proquest-etds-test /data-s3-etds`.
- Copy the ETD zips you want to test into `/opt/scholarspace/scholarspace-ingest` or a directory beneath it.

Test:
- Run the `ingest_bulkrax_prep` rake task inside the container (or from the outside using `docker exec`). The task requires an argument, which is the path to the directory containing the ProQuest zips you wish to include in the ingest. For example:
  `bundle exec rails gwss:ingest_pq_etds['/opt/scholarspace/scholarspace-ingest/etd-zips'] # where etds are in /opt/scholarspace/scholarspace-ingest/etd-zips`
- The task creates `/opt/scholarspace/scholarspace-hyrax/tmp/bulkrax_zip` and, within this folder, `metadata.csv` and a `files/` directory containing attachments (within subdirectories, one per ETD).
- You may want to `scp` `tmp/bulkrax_zip/metadata.csv` to your local computer so that you can more conveniently view it, for example using Excel, in order to validate test results.
- Before starting the import, open a tab to the Sidekiq administrator (at `/sidekiq`) so that you can watch progress of the queues.
- Create a new Bulkrax importer in the UI, then proceed and click Create and Import.
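Once the import has run, one way to surface failures that may not bubble up to the importer UI (a concern raised in the review discussion above) is to check entry statuses from the Rails console. A minimal sketch, assuming the models and status fields in recent Bulkrax releases:

```ruby
# Rails console sketch -- assumes Bulkrax::Importer/Entry and their status fields
importer = Bulkrax::Importer.order(:created_at).last

# Tally entries by status, e.g. {"Complete" => 96, "Failed" => 3}
importer.entries.group_by(&:status_message).transform_values(&:count)

# List the entries that did not complete cleanly
importer.entries.reject { |e| e.status_message == 'Complete' }.each do |entry|
  puts "#{entry.identifier}: #{entry.status_message}"
end
```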
Note: if you wish to re-run the task to generate the bulkrax-ready metadata and files, you'll need to first clear out the results of the previous run:
`rm -r /opt/scholarspace/scholarspace-hyrax/tmp/bulkrax_zip`
Validate the test results:
- Check the contents of `metadata.csv` (and that at least the count of ETDs matches the number of ETD zip files).

Other notes to the reviewer
I recognize that the code for the rake task, as currently written, is probably not the most ruby-esque in style. I am very open to suggestions for improvements to the style (and, of course, function) of the code.
Final thoughts
While this procedure for parsing the ProQuest zips and then loading via the front end is still somewhat manual, requiring running back-end tasks, loading ETDs is generally something we do just a few times per year. I also recognize that separating out the ETDs from the S3 mount that are to be newly loaded, and avoiding re-loading ETDs that we have already loaded, will also be somewhat of a manual process, although I intend to post suggested bash commands to make this a bit easier (a rough sketch of the idea is below). The concept is to get this much working, try it for a few cycles of loads, take what we learn, and then decide which enhancements to the overall workflow would be most worthwhile.
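As one possible starting point for that segregation step, here is a hypothetical Ruby sketch (not part of this PR, and not the promised bash commands) that stages only ETD zips not recorded in a simple "already loaded" list. The mount point, staging path, and log file are all assumptions:

```ruby
#!/usr/bin/env ruby
# copy_new_etds.rb -- hypothetical helper, not part of this PR
require 'fileutils'

s3_mount   = '/data-s3-etds/etds'                                    # s3fs mount of the ProQuest bucket
staging    = '/opt/scholarspace/scholarspace-ingest/etd-zips'        # directory passed to the rake task
loaded_log = '/opt/scholarspace/scholarspace-ingest/loaded_etds.txt' # one zip filename per line

already_loaded = File.exist?(loaded_log) ? File.readlines(loaded_log, chomp: true) : []
FileUtils.mkdir_p(staging)

Dir.glob(File.join(s3_mount, '*.zip')).each do |zip|
  name = File.basename(zip)
  next if already_loaded.include?(name) # skip ETDs we have already loaded

  FileUtils.cp(zip, staging)
  File.open(loaded_log, 'a') { |f| f.puts(name) }
  puts "Staged #{name}"
end
```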