Bulkrax
Bulkrax is a tool developed by Samvera Labs for importing metadata from CSV, XML and other sources. For more detailed documentation see the project's wiki.
Bulkrax `Importer`s and `Exporter`s are the models listed in the importers and exporters pages. They can have many `Entries`, each corresponding to a Hyrax work or collection. The entries use the `Parser`s to extract the work metadata from its source during an `ImporterRun` or `ExporterRun`. These two represent an attempt to execute the `Importer` or `Exporter`, and create a `Status` for each work when running it. The `Status` instances store the result of the attempt to import or export the data and save the exception stack trace if something fails.
Bulkrax uses a metadata mapping file that defines how the work is imported and exported. In HykuAddons we use a custom one defined in the `hyku_addons.bulkrax_overrides` part of the `engine.rb` file.
Three important aspects to consider here:
- Bulkrax needs a `system_identifier` field to know how to reference the identifiers of the imported and exported works. We use `source_identifier` for that, so make sure the imported CSV files always have one.
- Bulkrax needs to know how to map the CSV data fields to work metadata. All the fields that can potentially have arrays need an entry in the field mappings. We use the pipe (`|`) character to separate multiple values in the CSV file, which is configured with the `split: '\|'` option. Also note that the escaping behaviour for this split string is somewhat mind-bending and, after much investigation, the team has decided that `'\|'` is definitely the best option.
- We define a custom Parser that adds a lot of custom behaviour at `app/parsers/hyku_addons/csv_parser.rb`. Details on this at [ParserCustomisation].

It is also possible to exclude fields from the import and export process, which is done by setting `field_name => { exclude: true }`.
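The snippet below is a hedged sketch of what such a mapping entry could look like, assuming the standard `Bulkrax.setup` / `field_mappings` configuration hash; the field names are illustrative and the real mapping lives in `hyku_addons.bulkrax_overrides` inside `engine.rb`.

```ruby
# Illustrative only: field names and the exclude key follow the wording of
# this page, not the actual HykuAddons mapping.
Bulkrax.setup do |config|
  config.field_mappings["HykuAddons::CsvParser"] = {
    # The identifier Bulkrax uses to match CSV rows to works
    "source_identifier" => { from: ["source_identifier"], source_identifier: true },
    # A multi-valued field: cells are split on the pipe character
    "keyword" => { from: ["keyword"], split: '\|' },
    # A field excluded from import and export
    "internal_note" => { from: ["internal_note"], exclude: true }
  }
end
```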
The create action at `Bulkrax::ImportersController` initializes the `Bulkrax::Importer` class with its parser options and sets up the attachment file we use to parse the import entries. After successfully saving the importer in the DB, Bulkrax delegates to the `Bulkrax::ImporterJob` to perform the import.
This job calls the `create_collections` and `create_works` methods of `HykuAddons::CsvParser`, which first create a `HykuAddons::CsvEntry` for every work and collection found in the CSV file and then launch a child job for each item (either `ImportWorkJob` or `ImportCollectionJob`, depending on the item's type).
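A simplified sketch of that pattern, assuming the usual Bulkrax parser helpers (`records`, `entry_class`, `importerexporter`, `current_run`); error handling, counters and limits are omitted and the real method lives in `app/parsers/hyku_addons/csv_parser.rb`.

```ruby
# Sketch only: one entry per CSV row, then a child job per entry.
def create_works
  records.each do |record|
    entry = entry_class.find_or_create_by!(
      importerexporter: importerexporter,
      identifier: record[:source_identifier]
    )
    entry.update(raw_metadata: record)

    # Collections go through ImportCollectionJob instead
    Bulkrax::ImportWorkJob.perform_later(entry.id, current_run.id)
  end
end
```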
In Bulkrax v0.1 it is not possible to import a collection's title, but since title is a required field on collections, the importer must choose a default; Bulkrax sets this to be the collection's id. `HykuAddons::CollectionBehavior` overrides this by allowing the import of collection titles. If no title is set, it defaults to `"New collection #{i + 1}"`, where `i` is the current index in the list of collections the importer has found. Bulkrax has deprecated the name `collection` in favour of `parent`, and version 3 only supports the latter. See CollectionBehavior for more details.
The `ImportWorkJob` loads the `Bulkrax::Entry`, which is automatically cast to whatever subclass we use for imports (`HykuAddons::CsvEntry` in our case) via Rails' single table inheritance.
The importers' and exporters' metadata extraction logic lives in our `CsvEntry` class at `app/models/hyku_addons/csv_entry.rb`.
The `build_metadata` method defined there sets each of the metadata fields by calling `add_metadata` and then handles all the fields that require special treatment: files, file subfields, visibility, etc. Each of these methods adds or transforms values in the `parsed_metadata` hash that is finally returned.
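A minimal sketch of that flow, not the actual implementation; the helpers called after `add_metadata` (`add_visibility`, `add_local`) are taken from Bulkrax's own `CsvEntry` and are assumptions about what HykuAddons calls.

```ruby
# Sketch of build_metadata: map every column, then handle the special cases.
def build_metadata
  self.parsed_metadata = {}

  # Run each CSV column through the configured field mappings
  record.each do |key, value|
    add_metadata(key, value)
  end

  # Fields needing special treatment are handled by dedicated helpers
  add_visibility
  add_local # files referenced by the CSV row

  parsed_metadata
end
```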
Once the work metadata is done, we need to emulate the work form that creates the work, stores its metadata in Solr and triggers the whole actor stack, so that the import behaves exactly like the work form. That is done using a `Bulkrax::ObjectFactory`, which is initialized with the work metadata and some options that customise its behaviour. The `run` method of the factory ensures that the work is created or updated and the appropriate callbacks are run.
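A hedged sketch of the hand-off; the keyword arguments differ between Bulkrax versions, so treat the argument list below as an assumption rather than the canonical call.

```ruby
# Illustrative only: exact options vary by Bulkrax version.
factory = Bulkrax::ObjectFactory.new(
  attributes: parsed_metadata,
  source_identifier_value: parsed_metadata["source_identifier"],
  work_identifier: :source_identifier,
  klass: factory_class,
  user: importerexporter.user,
  update_files: true
)
factory.run # creates or updates the work and runs the actor stack callbacks
```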
To summarise the import process:

- We create an importer with parser options and a CSV file.
- `Bulkrax::ImporterJob` calls `import_collections` and `import_works`, which create a `Bulkrax::Entry` for every collection and work found in the CSV file.
- It calls an `ImportWorkJob` or `ImportCollectionJob` for each entry to parse all the work or collection metadata.
- It passes this metadata to the `Bulkrax::ObjectFactory`, which behaves like a work form and executes the whole actor stack.
`HykuAddons::ExportersControllerOverride` overrides Bulkrax's original controller to create a `Bulkrax::Exporter`, save it and delegate the heavy lifting to a `HykuAddons::MultitenantExporterJob` that performs the export as a background job.
This job performs the account elevation into the tenant and creates a `Bulkrax::ExporterJob` to run in its same thread.
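A hypothetical outline of that job; the elevation call assumes Hyku's `AccountElevator` and the argument list is illustrative, not the real signature.

```ruby
module HykuAddons
  class MultitenantExporterJob < ApplicationJob
    def perform(account_cname, exporter_id)
      # Switch into the tenant so the export sees the right data and storage
      AccountElevator.switch!(account_cname)

      # Run the standard Bulkrax export synchronously, in this same thread
      Bulkrax::ExporterJob.perform_now(exporter_id)
    end
  end
end
```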
The selection of the works to export is done in the `current_work_ids` method. When creating the export, there is an option to choose what data we want to export: the works in a collection, all the works of a given work type, or the works of a previously created importer (useful for round-tripping). The `current_work_ids` method makes the Solr query it needs to fetch the works to export.
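The shape of that selection might look roughly like the sketch below; the `export_from` values and Solr field names are assumptions based on common Hyrax conventions, not the actual method body.

```ruby
# Illustrative sketch of selecting work ids for an export.
def current_work_ids
  case exporter.export_from
  when "collection"
    # All works that are members of the chosen collection
    ActiveFedora::SolrService
      .query("member_of_collection_ids_ssim:#{exporter.export_source}", fl: "id", rows: 10_000)
      .map(&:id)
  when "worktype"
    # All works of the chosen model
    ActiveFedora::SolrService
      .query("has_model_ssim:#{exporter.export_source}", fl: "id", rows: 10_000)
      .map(&:id)
  when "importer"
    # Round-tripping: re-export whatever a previous importer created
    Bulkrax::Entry
      .where(importerexporter_id: exporter.export_source, importerexporter_type: "Bulkrax::Importer")
      .pluck(:identifier)
  end
end
```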
The export of the work metadata is done in the `export` method of the `Bulkrax::Exporter` itself. This calls the `Bulkrax::CsvParser`'s `create_new_entries` method to create a `Bulkrax::CsvEntry` for every work to export and runs a `Bulkrax::ExportWorkJob` to do the metadata extraction.
When all the works' metadata is ready, it calls the `write` method to store it in the format the user selected (CSV, XML, etc.) and zip it.
`HykuAddons::CsvParser` at `app/parsers/hyku_addons/csv_parser.rb` contains the majority of the CSV parsing customisation for imports and exports (a simplified sketch follows this list). It:

- Sets `entry_class` and `admin_set_entry_class` so that the `Bulkrax::Entry` instances the importers and exporters create use the subclasses we want.
- Overrides `file_paths` to allow splitting multiple files on the `:`, `;` and `|` characters.
- Overrides `retrieve_cloud_files`, originally defined by `Bulkrax::ApplicationParser`.
- Customises the `create_collections` behaviour to create not only collections but admin sets too; the `collection_totals` method is customised to include the total of admin sets as well.
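An illustrative excerpt of the kind of overrides described above; the real class does considerably more and the method bodies here are assumptions.

```ruby
module HykuAddons
  class CsvParser < Bulkrax::CsvParser
    # Make the importer and exporter build our Entry subclass
    def entry_class
      HykuAddons::CsvEntry
    end

    # Allow one cell to reference several files, split on :, ; or |
    def file_paths
      @file_paths ||= super.flat_map { |path| path.to_s.split(/[:;|]/) }
    end
  end
end
```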
TG TODO
The importers need to be run in Import Mode because we store the files in a Google Cloud folder that we create for every tenant, and these folders are only available to the workers specific to that tenant.
For example, a tenant named demo will use the worker named demo-import-workers to process its importers.
This means that only these workers can process the imports, so we need to make sure that Import mode is turned on on the settings page and the workers have active pods on the Kubernetes platform.
- Importers appear not to be running or progressing: Import mode is off and/or the tenant's import workers are not there or have no pods assigned.
- Some of the imported items show a `FileNotFound` error: this normally means the Google Cloud folder is not correctly mounted.
If you see the following error on your work fields, it could mean that the field has not been added to the `Bulkrax.field_map` method, which is found inside `config/initializers/bulkrax.rb`:
```
1.4) Failure/Error: expect(expectation[:actual]).to eq(expectation[:test]), error_message
     expected place_of_publication to equal ["place_of_publication-0", "place_of_publication-1", "place_of_publication-2"] but got #<ActiveTriples::Relation:0x00005643c816cc88>
     Diff:
     @@ -1 +1 @@
     -["place_of_publication-0", "place_of_publication-1", "place_of_publication-2"]
     +["place_of_publication-0|place_of_publication-1|place_of_publication-2"]
# ./spec/features/bulkrax_dynamic_import_spec.rb:77:in `block (6 levels) in <top (required)>'
# ./spec/features/bulkrax_dynamic_import_spec.rb:75:in `each'
# ./spec/features/bulkrax_dynamic_import_spec.rb:75:in `block (5 levels) in <top (required)>'
# ./spec/features/bulkrax_dynamic_import_spec.rb:71:in `block (4 levels) in <top (required)>'
```
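Under the assumption that the mapping hash follows the usual Bulkrax shape, the fix would be adding an entry for the missing multi-valued field, something like:

```ruby
# Hypothetical mapping entry for the failing field above.
"place_of_publication" => { from: ["place_of_publication"], split: '\|' }
```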
- Go to Importers
- Click on the import
- Check for any errors listed
- Click to see the stack trace
- View jobs in Sidekiq
To rerun inside the console:

```ruby
e = Bulkrax::Entry.find(ID_FROM_DASHBOARD_JOB)
e.build
```
There are 3 feature specs which test the behaviour of Bulkrax:
- /spec/features/bulkrax_import_spec.rb
- /spec/features/bulkrax_export_spec.rb
- /spec/features/bulkrax_dynamic_import_spec.rb
The first uses fixtures to test the importer. It checks that some basic fields are populated, files are imported with the correct visibility, collections are imported, and DOIs are minted. The export spec tests that basic fields can be exported, that works which have been exported can be imported again (round-tripping), and that files are correctly exported and zipped.
The last test uses the schema to build a CSV of faked data, which tests all fields for schema-driven work types. When a WorkType is migrated to use a schema, it should be added to this test. In `spec/support/bulkrax/csv_writer_helper.rb` you will find methods which use a schema-driven work's `field_configs` to build CSV headers and data. As there is little validation, the data is either dates, singular text values, multiple text values, textareas, or values from a particular authority. Note that dates can be full dates, just years, or months within a particular year. `spec/support/bulkrax/csv_reader_helper.rb` uses the faked data in the CSV file to build expectations about the state of the works Bulkrax should have created.