Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Allow Bulkrax to assign queue based on number of rows in CSV #671

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

thenapking
Copy link
Contributor

@thenapking thenapking commented Feb 28, 2022

Note: In addition to this description a Wiki article has been created to expand on this information

This PR replaces the import_mode feature flipper with code that checks how many works are being imported as part of an import run and switches to bulk processing queues if a threshold has passed.

#Methodology
This adapts the currently logic for import mode, and essentially flips the feature on based on a database lookup. The ImportMode module is included on ActiveJob::Base.

The portable_object

For any arbitrary job we need a way of fetch the work or collection the job is operating on (which we call the portable_object) and looking up the parser class associated with that object in order to fetch the total number of objects being imported. There are a number of different ways a job may refer to a portable_object. Some jobs will be initialised with the UUID of a work or collection. From that UUID we would be able to find whether it is currently being updated by Bulkrax, and if so, how many objects are in the importer run. In some cases the work or collection may not yet exist (perhaps because the job will create it), but a Bulkrax::Entry will exist. Once we have the Bulkrax::Entry we can again fetch the parser and count the total number of objects

The strategy classes

There are four strategies the jobs use to identify the portable_object. The most simple is the PortableBulkraxEntryBehavior which finds the Bulkrax::Entry from the id passed to the job. The PortableActiveFedoraBehavior fetches an ActiveFedora record and finds an associated Bulkrax::Entry either from the source_id or the id itself. The PortableGenericBehavior is used when the job is passed the ActiveFedora object itself rather than its id. The final strategy is the PortableBulkraxImporterBehavior. This strategy applies only to the job which kicks off the import (Bulkrax::ImportJob), which is called on neither a work nor a collection, but a table internal to Bulkrax.

Import mode

Once we have identified the portable_object, we look up its parser and return the total attribute. If this is above a threshold then import mode is triggered. The tests for this can proceed as previously. We create a generic job class and include ImportMode. All we need to add is some mocks which will return the count of the number of works being imported. Which mocks we setup depends on which strategy is used. The spec loops through the four strategies and sets up slightly different mocks.

TODO

Whilst the tests demonstrate in principle that the queues are being assigned correctly in practice it appears that the web process picks up most of the Bulkrax jobs, at least in the development environment. This is because the jobs are not being called asynchronously at all, but are called with perform now. Thus it is difficult to tell whether this PR will do what we expect without manual testing on a separate test environment.

Warning: This PR needs to be thoroughly tested before it is merged, and the change will need to be monitored for several weeks after deployment. It risks breaking import mode which will severely impact the business's ability to onboard new clients.

@thenapking thenapking force-pushed the REPO-1625-allow-bulkrax-to-assign-queue-based-on-number-of-rows-in-csv branch from 9950d29 to 0f5520c Compare February 28, 2022 15:09
@thenapking thenapking marked this pull request as draft February 28, 2022 15:10
module HykuAddons
module ImportMode
extend ActiveSupport::Concern

def queue_name
return super if non_tenant_job?
switch do
return super unless Flipflop.enabled?(:import_mode)
puts "Job: #{self.class}. Bulk is #{bulk?}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These logger/puts statements are very useful for debugging but must be removed before this is merged.

@@ -0,0 +1,15 @@
# frozen_string_literal: true
# rubocop:disable Rails/Output
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This disable statement should be removed, along with the puts statement before merge

@@ -0,0 +1,15 @@
# frozen_string_literal: true
# rubocop:disable Rails/Output
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This disable statement should be removed, along with the puts statement before merge

@@ -0,0 +1,14 @@
# frozen_string_literal: true
# rubocop:disable Rails/Output
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This disable statement should be removed, along with the puts statement before merge

@@ -0,0 +1,18 @@
# frozen_string_literal: true
# rubocop:disable Rails/Output
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This disable statement should be removed, along with the puts statement before merge

end
end

private

def works
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes clean up the class a little and allow the new code to share logic with the existing code. However they are not strictly needed for this update. Only the total_entries method is needed and that could be refactored if you do not wish to add all these refactors.

@prdanelli prdanelli changed the title Allow Bulkrax to assign queue based on number of rows in CSV [WIP] Allow Bulkrax to assign queue based on number of rows in CSV Mar 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant