Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds as_relation flag to EnumeratorBuilder#active_record_on_batches #86

Conversation

adrianna-chang-shopify
Copy link
Contributor

Currently, the resulting batch from #active_record_on_batches is an Array of records. Many job authors end up converting this into a relation to perform batch operations such as #update_all (even the existing BatchesJob example in the README does this!). By expanding the API of #active_record_on_batches to return a relation if a flag is set, users who require their batch as an ActiveRecord::Relation will no longer need to perform an additional query to turn records back into a relation.

This has no effect on existing jobs, since by default as_relation will be set to false and batches will continue to be converted to Arrays.

Not sure whether it's worth documenting an example of this in the README, or whether the API documentation is sufficient.

@adrianna-chang-shopify
Copy link
Contributor Author

cc @Shopify/job-patterns

Our motivation for doing this is to offer an API in the Maintenance Tasks gem that supports Active Record's Batch Enumerator (see this issue for the proposed approach). Obviously Job Iteration won't work directly with Batch Enumerators, and we intend to use the existing batch enum mechanism under the hood, but we'd like the enumerator to still yield Active Record relations instead of arrays of records.

Beyond our use case with the MT gem's API, I think it's also nice to give users the flexibility of having their batches come in as relations.

@GustavoCaso
Copy link
Contributor

GustavoCaso commented May 11, 2021

(even the existing BatchesJob example in the README does this!)

The README example actually gets an array of ActiveRecord objects, query the DB to get instances and apply update_all
Comment.where(id: batch_of_comments.map(&:id)).update_all(deleted: true)
Which seems unnecessary since we could avoid that extra call to the DB 🙁

Not sure whether it's worth documenting an example of this in the README

I believe it would be worth it since this is a new feature and people might need to figure out how to use it.

@GustavoCaso
Copy link
Contributor

I'm debating if having this as a flag is the right approach.
Maybe it would be better to have a different enumerator builder method that uses a different ActiveRecordCursor

We could use composition and we could pass the cursor class to the ActiveRecordEnumerator

I was thinking something like:

# frozen_string_literal: true

module JobIteration
  class ActiveRecordCursor 
    # OMITTED for ease of read
    def next_batch(batch_size)
      return nil if @reached_end

      relation = @base_relation.limit(batch_size)

      if (conditions = self.conditions).any?
        relation = relation.where(*conditions)
      end

      records = relation.uncached do
        materialize(relation)
      end

      update_from_record(records.last) unless records.empty?
      @reached_end = records.size < batch_size

      records.empty? ? nil : records
    end

    protected

    def materialize(relation)
      relation.to_a
    end
  end
end

module JobIteration
  class ActiveRecordCursorAsRelation < ActiveRecordCursor # @private
    def materialize(relation)
      relation
    end
  end
end


def build_active_record_enumerator_on_batches_as_relation(scope, cursor:, **args)
  enum = build_active_record_enumerator(
    scope,
    cursor: cursor,
    cursor_klass: ActiveRecordCursorAsRelation,
    **args
  ).batches
  wrap(self, enum)
end

@adrianna-chang-shopify Let me know what you think 😄

@@ -65,7 +65,7 @@ def next_batch(batch_size)
end

records = relation.uncached do
relation.to_a
as_relation ? relation : relation.to_a
end

update_from_record(records.last) unless records.empty?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think actually this line here is an issue, as mentioned in Shopify/maintenance_tasks#409
records.empty? will make an additional SQL COUNT query, then records.last a SELECT.

So it looks like we're trading one SELECT that allocates a bunch of memory in Ruby against a COUNT + a SELECT *.

I think we could aim towards a pluck(*@columns) and a nil check in Ruby, which would be just SELECT cursor columns and a nil check?

@adrianna-chang-shopify
Copy link
Contributor Author

Hey @GustavoCaso ! @etiennebarrie and I had actually discussed a separate API for batch relations, but then opted for a flag on the existing active_record_on_batches. The reason for this is because we largely believe that #build_active_record_enumerator_on_batches should yield relations in most cases. From the use cases we've seen (internally in Shopify Core / other apps), it seems like most of the time users want a relation. We're thinking it makes sense to keep this single API with the intent to migrate over to batches as a relation by default, rather than adding API surface now only to change most use cases to use #build_active_record_enumerator_on_batches_as_relation moving forward.

What do you think?

@GustavoCaso
Copy link
Contributor

GustavoCaso commented May 11, 2021

We're thinking it makes sense to keep this single API with the intent to migrate over to batches as a relation by default, rather than adding API surface now only to change most use cases to use #build_active_record_enumerator_on_batches_as_relation moving forward.

Would migrate batches as a relation by default, would mean that it would be a breaking change, no? We will have to update a bunch of code in core to change the jobs that expect to have an object to actually works with a relation.

Having the two options doesn't seem like such a bad idea.

  • The current API build_active_record_enumerator_on_batches return batches of records
  • build_active_record_enumerator_on_batches_as_relation return batches of activerecord relations

Also, I'm not an expert on ActiveRecord by any means, so @adrianna-chang-shopify if you think that I'm missing something please let me know.

@adrianna-chang-shopify
Copy link
Contributor Author

Would migrate batches as a relation by default, would mean that it would be a breaking change, no?

Oh yes absolutely, this would definitely require a major version update & changes to the jobs in Core (although most of these jobs want a relation, so we'd be reducing db calls here at least).

You're right, I think having the two options is fine for now. If we decide we want to make relations the default moving forward, it will indeed be easier to have users migrate to a new API first rather than flipping a default on an existing API.

I'll make the changes, and see if I can reduce some of those queries with the optimization suggestions Étienne made.

@etiennebarrie
Copy link
Member

would mean that it would be a breaking change, no? We will have to update a bunch of code in core to change the jobs

Yes and no: it's probably fair to say it's a breaking change, but most code won't have to change, e.g. records.map(&:id) or records.each would still work fine with a relation.

@adrianna-chang-shopify
Copy link
Contributor Author

adrianna-chang-shopify commented May 11, 2021

+1 to @etiennebarrie's point, although the flip side of that is that if we're bothering to make batches relations by default to remove that extra call, we'd obviously want to go in there and remove any .where(id: batch.map(&:id)) calls to actually reap the benefits of switching to a relation. So there would still be changes involved.

If our end goal is to come back to a single API for active record batch enumerators, I think a flag would make more sense because we could:

  1. Ship this with the flag; 2) Change the jobs in Core to use as_relation: true and remove excess db calls 3) Ship a major version that defaults the flag to true

But to @GustavoCaso's point, it might make sense to ship an additional API and unblock the work in the Maintenance Tasks gem, and defer any decisions about yielding batches as relations by default until a further date. If we're unsure about whether we'll have a single default moving forward, it would make sense to offer separate entrypoints in BuildEnumerator.

@adrianna-chang-shopify adrianna-chang-shopify force-pushed the active-record-on-batches-with-relation-option branch from 5d7ddce to 8de86c8 Compare May 12, 2021 14:05
…ions

Co-authored-by: Étienne Barrié <etienne.barrie@shopify.com>
@adrianna-chang-shopify adrianna-chang-shopify force-pushed the active-record-on-batches-with-relation-option branch 2 times, most recently from e9881d6 to c996046 Compare May 12, 2021 14:13
@adrianna-chang-shopify adrianna-chang-shopify force-pushed the active-record-on-batches-with-relation-option branch from c996046 to bae9b71 Compare May 12, 2021 14:17
@adrianna-chang-shopify
Copy link
Contributor Author

👋 @GustavoCaso sorry for the delay on this, we made a couple of changes! Notably:

  • Optimized the logic in #next_batch when returning relations to avoid calling #size, which was producing an extra COUNT db query as @etiennebarrie mentioned here
    - Note that before, each batch was performing a SELECT (records.last) and a COUNT (records.empty), and even if we removed the records.empty, we had a records.size that would end up doing a COUNT
    - Now, we call next_batch one extra time (once with an empty batch to set @reached_end, and then again where we return early)
    - next_batch on records still works the same, and since we can use #size on the records without a db call, we can set @reached_end based on this, without needing to do an extra next_batch call like with records
    - Tests for all this logic are here: 1d0c570#diff-51044c0cc0b5d8318cedf297cb91496754e8fb32f2820b44cf031ee04ce5c5f0
    - I also made #update_from_record private, I don't see a reason why it needs public visibility
  • Separate enumerator builder for relations as you'd suggested, #active_record_on_batches_as_relation
  • I've opted to keep it as a flag rather than splitting out Cursor classes for now - I think it makes sense to eventually separate out different ActiveRecord cursors, but I don't think it necessarily makes sense for the relation one to be a subclass of the records one (I'd prefer for them both to subclass an abstract Active Record Cursor class). I think there's also room to split ActiveRecordEnumerator into multiple enumerators as well - one for records and one for batches? Anyways, TL;DR is I think there's room for some refactoring in general, and I don't want to commit to defining an abstraction yet. So, I think it makes sense to leave as a flag for now

Let me know what you think!

@etiennebarrie
Copy link
Member

etiennebarrie commented May 12, 2021

  • Now, we call next_batch one extra time (once with an empty batch to set @reached_end, and then again where we return early)

I think we can clarify that we need to do this because we can't rely on the "optimization" that checks whether the current batch is full.

@reached_end = records.size < batch_size

Like one test shows, there are cases where actually this optimization does not avoid fetching one more empty batch:

test "#batches yielding records performs one query for each batch when relation_size % batch_size isn't 0" do

Also one thing that I had mentioned in #86 (comment) that we gave up on for now is to only pluck the columns we need for the next position. We still load the whole record, because that wouldn't save us anything because it's still loaded on the enumerator side:

yielder.yield(records, cursor_value(records.last)) if records.any?

Hum that lead me to investigate, because we should not be calling any? here, it would cause another query. But we're still not loading only the last one, so this still doesn't change much at all 😬
That's because ActiveRecord::Relation.last will still load everything if the relation has a limit, which it does because we're batching:
https://github.com/rails/rails/blob/v6.1.3.2/activerecord/lib/active_record/relation/finder_methods.rb#L147-L148
https://github.com/rails/rails/blob/v6.1.3.2/activerecord/lib/active_record/relation.rb#L744-L745

So the problem being in the enumerator calling last on the batch, I don't think we have a choice but to write a new enumerator if we want to enumerate on batch relations without loading all the records…

@adrianna-chang-shopify
Copy link
Contributor Author

@etiennebarrie I could be missing something, but seems to me that

yielder.yield(records, cursor_value(records.last)) if records.any?

doesn't actually load everything again.

We're doing relation.last here in #prepare_next_batch_as_relation: https://github.com/Shopify/job-iteration/pull/86/files#diff-a62e3d57d0bb79088cbdb20f8f1480e8422c2b02222057de2624ed0c1d4eac82R101, which ends up calling load on the relation and setting @records

So even though we're doing:
https://github.com/rails/rails/blob/75ac626c4e21129d8296d4206a1960563cc3d4aa/activerecord/lib/active_record/relation/finder_methods.rb#L147-L148 and then using #find_last because we have a limit on our relation, this ends up using records.last, and in #records when we #load it won't actually perform any more db queries because we already are loaded?.

Similarly, any? should be fine because we're also loaded, so it'll just use @records.empty? without another query.

@etiennebarrie
Copy link
Member

etiennebarrie commented May 12, 2021

Yes you got that right but what that means, is that even using relations, the records are loaded, which is basically the same as calling to_a.
There's a difference, because doing relation.update_all will re-use the previous conditions on the relation, including the limit, whereas doing Record.where(id: records.map(&:id)).update_all will use the ids as the condition to update records, but the end result is still that we loaded the full fields of 100 records from the DB and created active record objects with them.

For example we could test this by changing our test to:

    def test_activerecord_batches_as_relation
      push(BatchActiveRecordRelationIterationJob)

      work_one_job
      assert_jobs_in_queue(0)

      records_performed = BatchActiveRecordIterationJob.records_performed
      assert_equal([3, 3, 3, 1], records_performed.map(&:size))
      assert(records_performed.all? { |relation|
        relation.is_a?(ActiveRecord::Relation) && !relation.loaded?
      })
    end

You can also see this by logging the SQL queries:

diff --git c/test/unit/active_record_enumerator_test.rb i/test/unit/active_record_enumerator_test.rb
index f5a17ee..494f5a7 100644
--- c/test/unit/active_record_enumerator_test.rb
+++ i/test/unit/active_record_enumerator_test.rb
@@ -140,7 +140,7 @@ def build_enumerator(relation: Product.all, batch_size: 2, columns: nil, cursor:
     def count_uncached_queries(&block)
       count = 0
 
-      query_cb = ->(*, payload) { count += 1 unless payload[:cached] }
+      query_cb = ->(*, payload) { p payload.slice(:sql, :cached); count += 1 unless payload[:cached] }
       ActiveSupport::Notifications.subscribed(query_cb, "sql.active_record", &block)
       count
     end

You'll see that even using the relations, we load the records:

{:sql=>"SELECT `products`.* FROM `products` ORDER BY products.id LIMIT 2"}
{:sql=>"SELECT `products`.* FROM `products` WHERE (products.id > 2) ORDER BY products.id LIMIT 2"}
{:sql=>"SELECT `products`.* FROM `products` WHERE (products.id > 4) ORDER BY products.id LIMIT 2"}
{:sql=>"SELECT `products`.* FROM `products` WHERE (products.id > 6) ORDER BY products.id LIMIT 2"}
{:sql=>"SELECT `products`.* FROM `products` WHERE (products.id > 8) ORDER BY products.id LIMIT 2"}
{:sql=>"SELECT `products`.* FROM `products` WHERE (products.id > 10) ORDER BY products.id LIMIT 2"}

When we would expect completely different queries, e.g. something like

SELECT `products`.* FROM `products` ORDER BY products.id OFFSET 1 LIMIT 1
SELECT `products`.* FROM `products` WHERE (products.id > 2) ORDER BY products.id OFFSET 1 LIMIT 1
SELECT `products`.* FROM `products` WHERE (products.id > 4) ORDER BY products.id OFFSET 1 LIMIT 1

i.e. loading the last record of the batch only.

@adrianna-chang-shopify
Copy link
Contributor Author

We are going to take a different approach of building a new enumerator so that we can pluck only the cursor columns and don't have to load full records in order to compute the cursor value. This will also allow us to return a relation based on PK ids, rather than the original relation, which in some cases may be quite complex. We'll open this on a separate branch 😁

@adrianna-chang-shopify adrianna-chang-shopify deleted the active-record-on-batches-with-relation-option branch May 17, 2021 20:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants