
Imports Validation

Eneko Taberna edited this page Oct 27, 2021 · 3 revisions

Validation

How Validation works

Validation works by comparing the metadata of the work in its source (original) form against its current (destination) form. It iterates over every metadata field and adds a diff entry for every difference it finds. We call them diff entries because they can represent metadata additions (fields that are present in the destination metadata but not in the source), deletions (present in the source but not in the destination) or modifications of a value. These diffs are stored in the error_backtrace column of the latest Bulkrax::Status created for the Bulkrax::Entry, because the validation is considered to run against the result of the last import attempt.
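The comparison described above can be sketched in plain Ruby. This is only an illustration, not the HykuAddons code: the method name metadata_diff, the shape of each diff entry and the sample field names are all assumptions.

```ruby
# Hypothetical helper: NOT the actual HykuAddons implementation.
# Compares two metadata hashes and returns one diff entry per difference.
def metadata_diff(source, destination)
  (source.keys | destination.keys).each_with_object([]) do |field, diffs|
    in_source = source.key?(field)
    in_destination = destination.key?(field)

    if in_source && !in_destination
      # Present in the source but not in the destination
      diffs << { field: field, type: :deletion, source: source[field] }
    elsif !in_source && in_destination
      # Present in the destination but not in the source
      diffs << { field: field, type: :addition, destination: destination[field] }
    elsif source[field] != destination[field]
      # Present in both, with different values
      diffs << { field: field, type: :modification,
                 source: source[field], destination: destination[field] }
    end
  end
end

# Sample data invented for the example
source      = { "title" => ["A"], "creator" => ["X"], "date" => ["2020"] }
destination = { "title" => ["A"], "creator" => ["Y"], "license" => ["CC-BY"] }

diffs = metadata_diff(source, destination)
diffs.each { |diff| puts diff.inspect }
```

An empty array would mean the source and destination metadata match for every field.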

These diffs are then presented in the "Validation issues" panel of the importer entries pages. Remember to enable "Bulkrax validations" on the tenant settings page to be able to see the validation results. To avoid alarming customers with validation issues, you may want to switch it off again after checking that the import went OK.

Code structure and interesting methods

The base class for validations is HykuAddons::Validations::EntryValidationService at app/services/hyku_addons/validations/entry_validation_service.rb. It takes an account and an instance of any subclass of Bulkrax::Entry as params; invoking the validate method does the work.

This class serves as the base for more specific validators, which override the excluded_fields, the renamed_fields, the separator and the data source of the metadata to compare.

Excluded fields never trigger validation issues. The excluded_fields_with_names behave the same way, but a field is only ignored when it has the value passed for it in the hash. The renamed fields are used to map "old" field names to the ones defined in Hyku.
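As a rough sketch of how these three settings could interact, the snippet below renames fields first and then drops the excluded ones. The constant names, the field names, their values and the normalise_source helper are all hypothetical; the real services define this behaviour in their own methods.

```ruby
# Hypothetical configuration: names and values are invented for this sketch.
EXCLUDED_FIELDS = %w[id date_modified].freeze
EXCLUDED_FIELDS_WITH_VALUES = { "visibility" => "open" }.freeze
RENAMED_FIELDS = { "old_title" => "title" }.freeze

# Hypothetical helper: prepares source metadata before it is compared.
def normalise_source(metadata)
  metadata.each_with_object({}) do |(field, value), result|
    # Map "old" field names to the ones used in the destination
    field = RENAMED_FIELDS.fetch(field, field)

    # Fields that are always ignored
    next if EXCLUDED_FIELDS.include?(field)

    # Fields that are ignored only when they hold the configured value
    next if EXCLUDED_FIELDS_WITH_VALUES.key?(field) &&
            EXCLUDED_FIELDS_WITH_VALUES[field] == value

    result[field] = value
  end
end

ignored  = normalise_source("old_title" => "A", "id" => "1", "visibility" => "open")
reported = normalise_source("old_title" => "A", "visibility" => "restricted")
```

In this sketch a visibility of "open" is skipped, while any other visibility value stays in the result and can still produce a diff.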

For example, CsvEntryValidationService is a validation subclass that customises part of this behaviour: it overrides the destination_metadata method to get the metadata from the export of the current work instead of from SolrService. RedlandsEntryValidationService in turn subclasses CsvEntryValidationService, basically to customise the excluded and renamed fields.

SolrEntryValidationService is a subclass (with a very bad name, it should be Hyku1MigrationValidationService) designed to validate migrations from Hyku 1 to 2, and hopefully Hyku 2 to 3 as well. It can connect to a Blacklight endpoint as the source of metadata, using HTTP Authentication or a valid cookie passed as params to the constructor.

Apart from its definition of fields to rename and exclude, this file shows how we can apply data transformations to each of the metadata fields to ensure we make semantically valid comparisons. For example, the values of the creator and contributor fields need to be compared as hashes instead of strings, because "{'a': 1, 'b': 2}" is a different string from "{'b': 2, 'a': 1}" even though they represent the same hash. The same applies to fields whose values change in the destination, like resource_type, which changes from "ArticleWork" to just "Article" in Hyku 2.
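The string-versus-hash point is easy to check in plain Ruby. The snippet below assumes the values are serialised as JSON, which is an assumption made for the example (the quoting in the strings above is illustrative, so double quotes are used here to keep them parseable).

```ruby
require "json"

# Two serialisations of the same data with the keys in a different order
first  = '{"a": 1, "b": 2}'
second = '{"b": 2, "a": 1}'

# Comparing the raw strings reports a difference
strings_equal = (first == second)

# Comparing the parsed hashes does not: Hash equality ignores key order
hashes_equal = (JSON.parse(first) == JSON.parse(second))

puts "strings equal: #{strings_equal}, hashes equal: #{hashes_equal}"
```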

All these transformations are done by defining methods named reevaluate_#{field_name}, like reevaluate_resource_type_tesim or reevaluate_creator_tesim. Before comparing the values of a field in the source and destination metadata, the validator checks whether a method named reevaluate_#{field_name} is defined for it.
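This naming convention can be sketched with respond_to? and send. The class name SketchValidator, the normalise method and the body of the reevaluate method are assumptions made for illustration; only the reevaluate_#{field_name} naming pattern comes from the description above.

```ruby
# Hypothetical class: illustrates the naming convention only.
class SketchValidator
  # Applies reevaluate_<field_name> to the value when such a method exists,
  # and returns the value untouched otherwise.
  def normalise(field_name, value)
    method_name = "reevaluate_#{field_name}"
    # The second argument makes respond_to? include private methods
    respond_to?(method_name, true) ? send(method_name, value) : value
  end

  private

  # Assumed transformation: "ArticleWork" becomes "Article"
  def reevaluate_resource_type_tesim(value)
    Array(value).map { |item| item.sub(/Work\z/, "") }
  end
end

validator = SketchValidator.new
resource_types = validator.normalise("resource_type_tesim", ["ArticleWork"])
titles = validator.normalise("title_tesim", ["A title"])
```

With this approach, supporting a new field only requires defining one more method with the matching name; fields without such a method are compared as they are.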

Launching Validations

All the validations are run using the rake tasks included in lib/tasks/imports.rake. They are divided into two namespaces, hyku:validations:importers and hyku:validations:entries: the former has tasks to validate Importers and the latter has tasks to validate a single entry. Both namespaces have tasks to launch HTTP, Cookie or CSV validations.

HTTP validations and Cookie validations are used for Hyku 1 to 2 migrations, authenticating against the Blacklight endpoints with HTTP Authentication or a valid cookie respectively.

The most versatile validation is ValidateCsvImporterEntryJob, which accepts three arguments: an account, an entry and a string with the name of the validator class that the job instantiates to run the validation. This allows running any validator subclass directly from the rake tasks, using an invocation of the form:

rake hyku:validations:importers:csv[tenant_uuid:entry_id:validator_class_name]

For example:

rake hyku:validations:importers:csv[123abc:42:HykuAddons::Validations::RedlandsCsvValidationsService]

This will run a RedlandsCsvValidationsService against the Bulkrax::Entry with id 42 in the account with tenant uuid 123abc.
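Note that the argument uses ":" as a separator while the validator class name itself contains "::". The sketch below shows one way such a string could be split safely; parse_validation_args is a hypothetical helper, and the actual parsing in lib/tasks/imports.rake may differ.

```ruby
# Hypothetical helper: NOT taken from lib/tasks/imports.rake.
# Splits "tenant_uuid:entry_id:validator_class_name" into its three parts.
def parse_validation_args(raw)
  # The limit of 3 keeps the "::" inside the class name intact
  tenant_uuid, entry_id, validator_class_name = raw.split(":", 3)

  {
    tenant_uuid: tenant_uuid,
    entry_id: Integer(entry_id),
    validator_class_name: validator_class_name
  }
end

args = parse_validation_args(
  "123abc:42:HykuAddons::Validations::RedlandsCsvValidationsService"
)
puts args.inspect
```

When calling the task from a shell that treats square brackets as glob characters (zsh, for example), the whole task name with its arguments needs to be quoted.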
