
Bulkrax

Paul Danelli edited this page Feb 28, 2022 · 13 revisions

How Bulkrax works

Bulkrax is a tool developed by Samvera Labs for importing and exporting metadata from CSV, XML and other sources. For more detailed documentation see the project's wiki.

Data model:

Bulkrax Importers and Exporters are the models listed on the Importers and Exporters pages. Each can have many Entries, one per Hyrax work or collection. The entries use Parsers to extract work metadata from the source during an ImporterRun or ExporterRun. These two models represent a single attempt to execute the Importer or Exporter, and create a Status for each work processed. The Status instances store the result of the attempt to import or export the data, including the exception stack trace if something fails.

Bulkrax uses a metadata mapping file that defines how the work is imported and exported. In HykuAddons we use a custom one defined in the hyku_addons.bulkrax_overrides part of the engine.rb file. Three important aspects to consider here:

  • Bulkrax needs a system_identifier field to reference the identifiers of imported and exported works. We use source_identifier for this, so make sure imported CSV files always include one.
  • Bulkrax needs to know how to map the CSV data fields to work metadata. All fields that can potentially hold multiple values (Arrays) need an entry in the field mappings. We use the pipe character | to separate multiple values in a CSV cell, which is configured with the split: '\|' option. Note that the escaping behaviour for this split string is somewhat mind-bending; after much investigation the team settled on '\|' as the best option.
  • We define a custom Parser that adds a lot of custom behaviour at app/parsers/hyku_addons/csv_parser.rb; see the Parser Customisation section below for details. It is also possible to exclude fields from the import and export process, which is done by setting field_name => { exclude: true }.
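The mapping options above can be sketched as follows. This is an illustrative fragment only: the field names and values are examples, not copied from the actual engine.rb overrides.

```ruby
# Illustrative field mappings in the shape described above.
# Field names here are examples, not the actual HykuAddons mapping.
field_mappings = {
  'source_identifier' => { from: ['source_identifier'], source_identifier: true },
  'keyword'           => { from: ['keyword'], split: '\|' },
  'refereed'          => { from: ['refereed'], exclude: true }
}

# Splitting a pipe-separated CSV cell the way the split option intends:
raw    = 'history|art|science'
values = raw.split(Regexp.new(field_mappings['keyword'][:split]))
# values == ["history", "art", "science"]
```

Because `'\|'` is a backslash followed by a pipe, the resulting regex matches a literal `|` rather than the regex alternation operator, which is the source of the escaping confusion mentioned above.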

Importers

The create action in Bulkrax::ImportersController initializes the Bulkrax::Importer with its parser options and sets up the attached file we use to parse the import entries. After successfully saving the importer in the DB, Bulkrax delegates to Bulkrax::ImporterJob to perform the import. This job calls the create_collections and create_works methods from HykuAddons::CsvParser, which first create a HykuAddons::CsvEntry for every work and collection found in the CSV file and then launch a child job for each item (either ImportWorkJob or ImportCollectionJob, depending on the item's type).

In Bulkrax v0.1 it is not possible to import a collection's title, but since title is a required field on collections, the importer must choose a default; Bulkrax uses the collection's id. HykuAddons::CollectionBehavior overrides this to allow importing collection titles. If no title is set, it defaults to "New collection #{i + 1}", where i is the current index in the list of collections the importer has found. Bulkrax has deprecated the name collection in favour of parent, and version 3 only supports the latter. See CollectionBehavior for more details.

The ImportWorkJob loads the Bulkrax::Entry, which is automatically cast to the subclass we use for imports (HykuAddons::CsvEntry in our case) via Rails' single table inheritance. The metadata extraction logic for importers and exporters lives in our CsvEntry class at app/models/hyku_addons/csv_entry.rb. The build_metadata method defined there sets each metadata field by calling add_metadata, then handles all the fields that require special treatment: files, file subfields, visibility, etc. Each of these methods adds to or transforms the values in the parsed_metadata hash, which is finally returned.
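The build_metadata pattern just described can be sketched in a self-contained, simplified form. This is not the actual HykuAddons implementation, only an illustration of the shape: map every CSV column into parsed_metadata, then post-process special-case fields.

```ruby
# Simplified sketch (not the actual HykuAddons::CsvEntry) of the
# build_metadata pattern: each raw field is mapped into parsed_metadata,
# then fields needing special treatment are handled afterwards.
class SketchCsvEntry
  attr_reader :raw_metadata, :parsed_metadata

  def initialize(raw_metadata)
    @raw_metadata = raw_metadata
    @parsed_metadata = {}
  end

  def build_metadata
    raw_metadata.each { |field, value| add_metadata(field, value) }
    add_visibility # stand-in for the special-case handlers (files, visibility, etc.)
    parsed_metadata
  end

  private

  # Multi-valued cells arrive pipe-separated, per the field mappings.
  def add_metadata(field, value)
    parsed_metadata[field] = value.to_s.include?('|') ? value.split('|') : value
  end

  def add_visibility
    parsed_metadata['visibility'] ||= 'open'
  end
end
```

The real class handles many more cases (file subfields, controlled vocabularies), but all of them follow this add-or-transform pattern against parsed_metadata.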

Once the work metadata is built, we need to emulate the work form that creates the work, stores its metadata in Solr and triggers the full actor stack, so the import behaves exactly like the work form. That is done using a Bulkrax::ObjectFactory, initialized with the work metadata and some options that customise its behaviour. The factory's run method ensures that the work is created or updated and the appropriate callbacks are run.
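The create-or-update behaviour of the factory's run method can be sketched as follows. The class and keyword names here are hypothetical stand-ins, not Bulkrax's actual API; the real factory goes through the actor stack rather than a plain hash store.

```ruby
# Minimal, self-contained sketch of the create-or-update pattern the
# ObjectFactory's run method implements. Names are hypothetical; the
# store hash stands in for the repository/Solr lookup.
class SketchObjectFactory
  def initialize(attributes:, source_identifier_value:, store:)
    @attributes = attributes
    @sid = source_identifier_value
    @store = store
  end

  # Create the work if no record matches the source identifier,
  # otherwise update the existing one in place.
  def run
    if (existing = @store[@sid])
      existing.merge!(@attributes)
    else
      @store[@sid] = @attributes.dup
    end
    @store[@sid]
  end
end
```

This is why the source_identifier field is mandatory: without it the factory cannot decide between creating a new work and updating an existing one.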

Summary:

  1. We create an importer with parser options and a CSV file
  2. Bulkrax::ImporterJob calls import_collections and import_works, which create a Bulkrax::Entry for every collection and work found in the CSV file.
  3. It calls an ImportWorkJob or ImportCollectionJob for each entry to parse the work or collection metadata.
  4. It passes this metadata to the Bulkrax::ObjectFactory, which behaves like a work form and executes the full actor stack.

Exporters

HykuAddons::ExportersControllerOverride overrides Bulkrax's original controller to create a Bulkrax::Exporter, save it, and delegate the heavy lifting to a HykuAddons::MultitenantExporterJob that performs the export as a background job. This job performs the account elevation (switching into the tenant) and creates a Bulkrax::ExporterJob to run in the same thread.

The selection of the works to export is done in the current_work_ids method. When creating the export, there is an option to choose which data to export: the works in a collection, all works of a given work type, or the works of a previously created importer (useful for round-tripping). The current_work_ids method builds the Solr query needed to fetch the works to export.

The export of the work metadata is done in the export method of the Bulkrax::Exporter itself. This calls the Bulkrax::CsvParser's create_new_entries method to create a Bulkrax::CsvEntry for every work to export, and runs a Bulkrax::ExportWorkJob to do the metadata extraction.

When all the work metadata is ready, the exporter calls the write method to store it in the format the user selected (CSV, XML, etc.) and zip it.
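The write step for CSV can be sketched as follows: serialise each entry's parsed metadata using the union of all keys as headers. The entries here are invented examples, and the zipping of the resulting file is omitted.

```ruby
require 'csv'

# Sketch of the final write step for the CSV format: headers are the
# union of every entry's keys, and missing values are left blank.
entries = [
  { 'source_identifier' => 'w1', 'title' => 'First work' },
  { 'source_identifier' => 'w2', 'title' => 'Second work', 'creator' => 'Jane' }
]
headers = entries.flat_map(&:keys).uniq

csv_string = CSV.generate do |csv|
  csv << headers
  entries.each { |entry| csv << headers.map { |h| entry[h] } }
end
```

Taking the union of keys matters because different work types expose different metadata fields, yet a single CSV file needs one consistent header row.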

Parser Customisation

HykuAddons::CsvParser, at app/parsers/hyku_addons/csv_parser.rb, implements most of the CSV parsing customisation for imports and exports.

  • Sets entry_class and admin_set_entry_class so that the Bulkrax::Entry instances created by the importers and exporters have the subclasses we want.
  • Overrides file_paths to allow splitting multiple files on the :, ; and | characters.
  • Overrides retrieve_cloud_files, originally defined by Bulkrax::ApplicationParser.
  • Customises the create_collections behaviour to create admin sets as well as collections. The collection_totals method is customised to include the total of admin sets too.
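The multi-file splitting described above can be illustrated with a one-liner. The exact regex is an assumption based on the characters listed, not the actual override:

```ruby
# Split a CSV cell listing several files on any of ':', ';' or '|'.
# The character class is an assumption based on the description above.
raw_cell = 'images/a.pdf;images/b.jpg|images/c.png'
file_paths = raw_cell.split(/[:;|]/).map(&:strip)
# file_paths == ["images/a.pdf", "images/b.jpg", "images/c.png"]
```
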

Collection customisation

TG TODO

Things to consider

The importers need to be run in Import Mode because we store the files in a Google Cloud folder created for every tenant, which is only available to the workers specific to that tenant. For example, a tenant named demo will use the worker named demo-import-workers to process its importers.

This means that only these workers can process the imports, so we need to make sure that Import Mode is turned on in the settings page and that the workers have active pods on the Kubernetes platform.

Common Issues and How to debug and fix them

  • Importers appear not to be running or progressing: Import Mode is off and/or the tenant import workers are missing or have no pods assigned.
  • Some of the imported items show a FileNotFound error: this normally means the Google Cloud folder is not correctly mounted.

Debugging

If you see the following error on your work fields, it could mean that the field has not been added to the Bulkrax.field_map method found inside config/initializers/bulkrax.rb:

    1.4) Failure/Error: expect(expectation[:actual]).to eq(expectation[:test]), error_message

            expected place_of_publication to equal ["place_of_publication-0", "place_of_publication-1", "place_of_publication-2"] but got #<ActiveTriples::Relation:0x00005643c816cc88>
            Diff:
            @@ -1 +1 @@
            -["place_of_publication-0", "place_of_publication-1", "place_of_publication-2"]
            +["place_of_publication-0|place_of_publication-1|place_of_publication-2"]

          # ./spec/features/bulkrax_dynamic_import_spec.rb:77:in `block (6 levels) in <top (required)>'
          # ./spec/features/bulkrax_dynamic_import_spec.rb:75:in `each'
          # ./spec/features/bulkrax_dynamic_import_spec.rb:75:in `block (5 levels) in <top (required)>'
          # ./spec/features/bulkrax_dynamic_import_spec.rb:71:in `block (4 levels) in <top (required)>'

Debugging Process

Login to Dashboard

  • Go to Importers
  • Click on the import
  • Review any errors listed
  • Click through to see the stack trace
  • View the jobs in Sidekiq

In the console

To rerun an entry inside the console:

    e = Bulkrax::Entry.find(ID_FROM_DASHBOARD_JOB)
    e.build

Testing

There are 3 feature specs which test the behaviour of Bulkrax:

  • /spec/features/bulkrax_import_spec.rb
  • /spec/features/bulkrax_export_spec.rb
  • /spec/features/bulkrax_dynamic_import_spec.rb

The first uses fixtures to test the importer: it checks that some basic fields are populated, files are imported with the correct visibility, collections are imported, and DOIs are minted. The export spec tests that basic fields can be exported, that exported works can be re-imported (round-tripping), and that files are correctly exported and zipped. The last spec uses the schema to build a CSV of faked data that exercises all fields for schema-driven work types; when a WorkType is migrated to use a schema, it should be added to this test. In spec/support/bulkrax/csv_writer_helper.rb you will find methods that use schema-driven works' field_configs to build CSV headers and data. As there is little validation, data is either a date, a singular text value, multiple text values, a textarea, or a value from a particular authority. Note that dates can be full dates, just years, or months within a particular year. spec/support/bulkrax/csv_reader_helper.rb uses the faked data in the CSV file to build expectations about the state of the works Bulkrax should have created.
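The pipe-joined strings seen in the debugging error earlier (place_of_publication-0|place_of_publication-1|...) come from faked data generated in this style. A hypothetical sketch, with field names and helper shape invented for illustration:

```ruby
# Hypothetical sketch of building one CSV row of faked data from a
# field configuration (names illustrative, not HykuAddons' helpers).
field_configs = {
  'title'                => { multiple: false },
  'keyword'              => { multiple: true },
  'place_of_publication' => { multiple: true }
}

row = field_configs.map do |field, config|
  if config[:multiple]
    # Multi-valued fields are pipe-joined, matching the '\|' split mapping.
    Array.new(3) { |i| "#{field}-#{i}" }.join('|')
  else
    "#{field}-0"
  end
end
# row == ["title-0", "keyword-0|keyword-1|keyword-2",
#         "place_of_publication-0|place_of_publication-1|place_of_publication-2"]
```

The reader helper then splits these predictable values back apart to assert that each multi-valued field round-trips through the importer as an array.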
