Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support batch record collections #402

Closed
wants to merge 1 commit into from

Conversation

adrianna-chang-shopify
Copy link
Contributor

@adrianna-chang-shopify adrianna-chang-shopify commented May 5, 2021

Closes: #401

This PR allows users to specify batch collections via the following API:

# app/tasks/maintenance/update_posts_in_batches_task.rb
module Maintenance
  class UpdatePostsInBatchesTask < MaintenanceTasks::Task
    in_batches 50

    def collection
      Post.all
    end

    def count
      collection.count
    end

    def process(batch_of_posts)
      Post.where(id: batch_of_posts.map(&:id)).update_all(content: "New content added on #{Time.now.utc}")    
    end
  end
end

What are you trying to accomplish?

  • Batch support has been requested by numerous users because it can make Tasks that need to do batch operations much more efficient
  • I'm proposing a simple API that allows users to specify that their Task should process batches
  • Batch size is 100 by default (to match JobIteration's default batch_size), but can be customized
  • #process will accept a batch of records instead of a single record

Approach Taken

  • Users will call in_batches when defining their batch Task in order to denote it as a batch task
  • This sets a class variable @batch_size, defining the batch size for the Task
  • Define methods so we can retrieve whether a Task is #in_batches? and it's #batch_size at the instance level
  • If a Task is in batches, use Job Iteration's #active_record_on_batches method with the Task's batch size when building the enumerator in TaskJob#build_enumerator
  • Tweaker Ticker#tick to take an increment, defaulting to 1.
  • In TaskJob#each_iteration, I've tweaked the call to #tick to use input.size. This makes it so that tick count is still based on the number of records, rather than the number of batches. I felt that having progress in terms of records processed made more sense than the number of batches processed. Does this make sense?

What about using ActiveRecord::Batches::BatchEnumerator?

@etiennebarrie
Copy link
Member

This is unfortunately not compatible with Job Iteration's EnumeratorBuilder#build_active_record_enumerator_on_records because it explicitly requires an ActiveRecord::Relation

I think that's unfortunate. Like you said, records and batches both go through JobIteration::ActiveRecordEnumerator which uses ActiveRecordCursor to do most of the work, and the private build_active_record_enumerator really wants a Relation, but maybe we could extract the relation and batch size back from the ActiveRecord::Batches::BatchEnumerator and go from there?

@adrianna-chang-shopify
Copy link
Contributor Author

@etiennebarrie I can put up a PR that accepts ActiveRecord::Batches::BatchEnumerator collections for our API and then turns it back into a relation to appease Job Iteration, although I'm not entirely convinced yet that that's better than introducing our own DSL for defining batches that is more compatible with Job Iteration's API 😄

end

def process(batch_of_posts)
Post.where(id: batch_of_posts.map(&:id)).update_all(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we get the batches, we shouldn't need the call to where, shouldn't we be able to call update_all directly on the batch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@adrianna-chang-shopify adrianna-chang-shopify May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although there's no reason we couldn't turn this back into a Relation / Batch Enumerator again on our end, so it probably makes sense for us to do that (despite being annoying to be moving this back and forth between a BatchEnumerator / Array / Relation 😛 )

@etiennebarrie
Copy link
Member

better than introducing our own DSL

I think that's much better than introducing a class-level method because we're actually changing the items to be processed, and not just something about them, like in the example task, we're now getting a batch_of_posts, so I really think this should be expressed inside collection, and not outside of it.

A DSL would be fine it was just changing the size of the batches that loads the records, but still passing the record one by one to the process method, e.g.:

module Maintenance
  class UpdatePostsInBatchesTask < MaintenanceTasks::Task
    batch_size 50 # here adding or removing this line doesn't change anything about the rest of the code
    # we load 50 records at a time, but still feed them one at a time to `process`

    def collection
      Post.all
    end

    def count
      collection.count
    end

    def process(post)
       post.update(content: "New content added on #{Time.now.utc}")    
    end
  end
end

In our proposition here, adding a call to the in_batches method does not require changing collection, but requires changing process to handle batches instead.

Whereas if we choose whether to batch inside the collection, it feels natural to update process because now we're not iterating over the same thing anymore, it's not longer a collection of records, but a collection of batches, so we need to process a batch and not a record anymore.

@adrianna-chang-shopify
Copy link
Contributor Author

@etiennebarrie okay fair, that's a sensible argument to me 👍 I'll make the changes

@adrianna-chang-shopify
Copy link
Contributor Author

Closing in favour of #409

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support batched collections
2 participants