- Initial export from Hyrax
- Export thumbnails
- Export derivatives
- Extract images
- Convert to Pyramidal Tiffs
- Extract text
- Create manifest
Export scripts are in \scripts
To run using the provided docker container:
docker compose run python python /code/getThumbnails.py <args>
To run in the container manually:
docker compose up -d
docker exec -it python1 bash
To run locally in WSL:
wsl
mkdir /mnt/host/t
mount -t drvfs T: /mnt/host/t
There is a rake task in our Hyrax instance that performs the initial export of objects from Hyrax. The export includes metadata.yml and content files in subdirectories named by file extension:
SPE_DAO/
└── ua531/
    └── 9019sk86q/
        └── v1/
            ├── csv/
            ├── pdf/
            ├── xlsx/
            └── metadata.yml
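As a rough illustration of how a script might walk this layout, here is a minimal sketch. It assumes every object version lives at `<root>/<collection>/<object>/v<N>/` and contains a metadata.yml; the function name `iter_exported_objects` is hypothetical and is not taken from the export scripts.

```python
from pathlib import Path


def iter_exported_objects(root):
    """Yield (collection_id, object_id, version_dir) for each exported version.

    Assumes the layout <root>/<collection>/<object>/v<N>/metadata.yml.
    """
    for metadata in sorted(Path(root).glob("*/*/v*/metadata.yml")):
        version_dir = metadata.parent
        object_dir = version_dir.parent
        yield object_dir.parent.name, object_dir.name, version_dir


if __name__ == "__main__":
    # "SPE_DAO" here is a placeholder for wherever the export root is mounted.
    for collection_id, object_id, version_dir in iter_exported_objects("SPE_DAO"):
        print(collection_id, object_id, version_dir)
```

Folders without a metadata.yml are silently skipped, which may or may not be what the real scripts do.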
You can export individual objects by ID:
rake export:export_files ID=8336hn42n
Or export all objects in a collection with a collection ID:
rake export:export_files COLLECTION=ua531
Neither option exports files when the object folder already exists in SPE_DAO, for example if this is present:
\\Lincoln\Library\SPE_DAO\ua531\8910kc626
You can override this with FORCE=true. This will overwrite the metadata.yml and content files for the object(s):
rake export:export_files ID=8336hn42n FORCE=true
rake export:export_files COLLECTION=ua531 FORCE=true
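The skip-unless-forced behavior described above can be sketched as follows. This is only an illustration in Python of the decision logic, not the rake task's actual Ruby code; the function name `should_export` and its parameters are hypothetical.

```python
from pathlib import Path


def should_export(root, collection_id, object_id, force=False):
    """Return True when an object should be exported.

    An object is skipped if its folder already exists under the export
    root, unless force is set (mirroring FORCE=true).
    """
    object_dir = Path(root) / collection_id / object_id
    return force or not object_dir.exists()
```

For example, `should_export(root, "ua531", "8910kc626")` would return False when that object's folder is already present, and True again when called with `force=True`.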
There is no good way to get the thumbnails out of Hyrax via the DB. However, we can get them over HTTP based on an identifier in metadata.yml:
python getThumbnails.py ua200
Running this with a collection ID will re-download and overwrite all thumbnails for each exported object in a collection.
Running it without an argument downloads thumbnails for every object in SPE_DAO:
python getThumbnails.py
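A minimal sketch of fetching a thumbnail over HTTP is shown below. It assumes the Hyrax instance serves thumbnails at `/downloads/<file set id>?file=thumbnail`, which is the usual route in a stock Hyrax application but has not been checked against getThumbnails.py. The base URL, the ID, and both function names are hypothetical.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


def thumbnail_url(base_url, file_set_id):
    """Build a thumbnail URL (assumed route: /downloads/<id>?file=thumbnail)."""
    query = urlencode({"file": "thumbnail"})
    return urljoin(base_url.rstrip("/") + "/", f"downloads/{file_set_id}?{query}")


def download_thumbnail(url, dest):
    """Download the thumbnail at url and write it to dest, overwriting it."""
    with urlopen(url, timeout=30) as response, open(dest, "wb") as out:
        out.write(response.read())


if __name__ == "__main__":
    # Placeholder host and ID; the real ID would come from metadata.yml.
    print(thumbnail_url("https://hyrax.example.org", "abc123"))
```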
Hyrax also creates useful derivative files, such as WebM for videos and PDFs for office documents:
python getDerivatives.py ua200
This also runs for everything in SPE_DAO if you don't give it a collection ID.
It also updates metadata.yml to list the original file and format.
extractImages.py has some Linux dependencies, so we should run it in Docker.
docker compose up
In new terminal:
docker exec -it python1 bash
python extractImages.py
python extractImages.py apap214
python extractImages.py apap214 pk02cv45j
Add collection and object IDs to limit the run to those collections/objects.
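The optional collection and object arguments could be handled along these lines. This is a sketch using argparse; the argument names and the `is_selected` helper are hypothetical, and the real scripts may parse their arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse an optional collection ID followed by an optional object ID."""
    parser = argparse.ArgumentParser(description="Process exported objects.")
    parser.add_argument("collection", nargs="?", default=None,
                        help="limit the run to this collection ID")
    parser.add_argument("object_id", nargs="?", default=None,
                        help="limit the run to this object ID")
    return parser.parse_args(argv)


def is_selected(args, collection_id, object_id):
    """Return True if this object falls within the requested limits."""
    if args.collection and args.collection != collection_id:
        return False
    if args.object_id and args.object_id != object_id:
        return False
    return True
```

With no arguments every object is selected; one argument limits the run to a collection, and two arguments limit it to a single object.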
findOriginals.py needs to be run on either Windows or railsdev for access.
python findOriginals.py
docker compose up
In new terminal:
docker exec -it python1 bash
python makeTiffs.py
python makeTiffs.py ua807
python makeTiffs.py ua807 pk02cv4fj
Add collection and object IDs to limit the run to those collections/objects.
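For reference, one common way to produce a tiled pyramidal TIFF is the libvips `tiffsave` operation. The sketch below builds such a command; it assumes the `vips` command-line tool is installed and is not necessarily how makeTiffs.py does the conversion. The function names, tile size, and compression choice are illustrative assumptions.

```python
import subprocess


def vips_pyramid_command(source, dest, tile_size=256, compression="jpeg"):
    """Build a `vips tiffsave` command for a tiled, pyramidal TIFF."""
    return [
        "vips", "tiffsave", str(source), str(dest),
        "--tile",      # write tiles rather than strips
        "--pyramid",   # include the reduced-resolution layers
        f"--compression={compression}",
        f"--tile-width={tile_size}",
        f"--tile-height={tile_size}",
    ]


def make_pyramidal_tiff(source, dest):
    """Run the conversion; raises CalledProcessError if vips fails."""
    subprocess.run(vips_pyramid_command(source, dest), check=True)
```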
Running tesseract creates structured text, which we eventually need, but it will take a while.
docker compose up
In new terminal:
docker exec -it python1 bash
python tesseract.py apap042
python extractText.py apap015
This either extracts existing OCR text from scanned PDFs as a temporary measure, or extracts better text from born-digital PDFs.
For born-digital PDFs, if this is run after tesseract, it will not overwrite the structured HOCR, but will produce better content.txt files for indexing.
docker compose up
In new terminal:
docker exec -it python1 bash
python createTranscription.py apap138
IIIF Presentation API v3 manifests
docker compose up
In new terminal:
docker exec -it python1 bash
python manifest.py apa042
This also runs for everything in SPE_DAO if you don't give it a collection ID.
This still needs work.
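For orientation, a minimal IIIF Presentation API v3 manifest has roughly the shape sketched below: a Manifest holding Canvases, each with an AnnotationPage and a painting Annotation that points at an image. This is not what manifest.py produces; the function name, the input shape of `pages`, and the URL patterns are all assumptions made for illustration.

```python
import json


def build_manifest(base_url, label, pages):
    """Build a minimal Presentation v3 manifest as a dict.

    pages is a list of dicts with "service" (an IIIF Image API base URL),
    "width", and "height". This input shape is hypothetical.
    """
    manifest = {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{base_url}/manifest.json",
        "type": "Manifest",
        "label": {"en": [label]},
        "items": [],
    }
    for number, page in enumerate(pages, start=1):
        canvas_id = f"{base_url}/canvas/{number}"
        image = {
            "id": f"{page['service']}/full/max/0/default.jpg",
            "type": "Image",
            "format": "image/jpeg",
            "width": page["width"],
            "height": page["height"],
            "service": [{"id": page["service"], "type": "ImageService3",
                         "profile": "level1"}],
        }
        annotation = {
            "id": f"{canvas_id}/annotation/1",
            "type": "Annotation",
            "motivation": "painting",
            "body": image,
            "target": canvas_id,
        }
        manifest["items"].append({
            "id": canvas_id,
            "type": "Canvas",
            "width": page["width"],
            "height": page["height"],
            "items": [{"id": f"{canvas_id}/page/1", "type": "AnnotationPage",
                       "items": [annotation]}],
        })
    return manifest


if __name__ == "__main__":
    # Placeholder URLs; real values depend on where manifests and images are served.
    pages = [{"service": "https://iiif.example.org/image/p1", "width": 2000, "height": 3000}]
    print(json.dumps(build_manifest("https://iiif.example.org/obj1", "Example", pages), indent=2))
```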