Adds as_relation flag to EnumeratorBuilder#active_record_on_batches #86
cc @Shopify/job-patterns
Our motivation for doing this is to offer an API in the Maintenance Tasks gem that supports Active Record's Batch Enumerator (see this issue for the proposed approach). Obviously Job Iteration won't work directly with Batch Enumerators, and we intend to use the existing batch enum mechanism under the hood, but we'd like the enumerator to still yield Active Record relations instead of arrays of records. Beyond our use case with the MT gem's API, I think it's also nice to give users the flexibility of having their batches come in as relations.
79ffa6c to 5a00479
The README example actually gets an array of ActiveRecord objects, query the DB to get instances and apply
I believe it would be worth it since this is a new feature and people might need to figure out how to use it.
I'm debating if having this as a flag is the right approach. We could use composition and we could pass the cursor class to the
I was thinking something like:

    # frozen_string_literal: true
    module JobIteration
      class ActiveRecordCursor
        # OMITTED for ease of read

        def next_batch(batch_size)
          return nil if @reached_end

          relation = @base_relation.limit(batch_size)

          if (conditions = self.conditions).any?
            relation = relation.where(*conditions)
          end

          records = relation.uncached do
            materialize(relation)
          end

          update_from_record(records.last) unless records.empty?
          @reached_end = records.size < batch_size

          records.empty? ? nil : records
        end

        protected

        def materialize(relation)
          relation.to_a
        end
      end
    end

    module JobIteration
      class ActiveRecordCursorAsRelation < ActiveRecordCursor # @private
        def materialize(relation)
          relation
        end
      end
    end

    def build_active_record_enumerator_on_batches_as_relation(scope, cursor:, **args)
      enum = build_active_record_enumerator(
        scope,
        cursor: cursor,
        cursor_klass: ActiveRecordCursorAsRelation,
        **args
      ).batches
      wrap(self, enum)
    end

@adrianna-chang-shopify Let me know what you think 😄
    @@ -65,7 +65,7 @@ def next_batch(batch_size)
           end

           records = relation.uncached do
    -        relation.to_a
    +        as_relation ? relation : relation.to_a
           end

           update_from_record(records.last) unless records.empty?
I think actually this line here is an issue, as mentioned in Shopify/maintenance_tasks#409: records.empty? will make an additional SQL COUNT query, then records.last a SELECT.
So it looks like we're trading one SELECT that allocates a bunch of memory in Ruby against a COUNT + a SELECT *.
I think we could aim towards a pluck(*@columns) and a nil check in Ruby, which would be just SELECT cursor columns and a nil check?
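To make the pluck-and-nil-check idea concrete, here is a minimal plain-Ruby sketch. It is not code from this PR or from the gem: the table is an in-memory array, and `pluck`, `next_position`, `cursor_positions` and `CURSOR_DEMO_ROWS` are hypothetical names used only for illustration. The point is that advancing the cursor only needs the cursor column values of the last row in the batch, and that a nil result is enough to detect the end.

```ruby
# Illustrative in-memory "table"; none of these names come from the gem.
CURSOR_DEMO_ROWS = [
  { id: 1, updated_at: 10, title: "a" },
  { id: 2, updated_at: 10, title: "b" },
  { id: 3, updated_at: 20, title: "c" },
  { id: 4, updated_at: 20, title: "d" },
  { id: 5, updated_at: 30, title: "e" },
].freeze

# Stand-in for `relation.pluck(*columns)`: returns one array of values per row
# and ignores every other column. (Unlike Active Record, it returns arrays
# even when a single column is requested.)
def pluck(rows, *columns)
  rows.map { |row| row.values_at(*columns) }
end

# Returns the cursor position reached by the next batch, or nil when no rows
# remain after `position`.
def next_position(rows, columns, position, batch_size)
  ordered = rows.sort_by { |row| row.values_at(*columns) }
  ordered = ordered.select { |row| (row.values_at(*columns) <=> position) > 0 } if position
  pluck(ordered.first(batch_size), *columns).last
end

# Walks the whole table and collects the cursor position after each batch.
# The loop ends on the nil check alone, with no separate emptiness query.
def cursor_positions(rows, columns, batch_size)
  positions = []
  position = nil
  while (position = next_position(rows, columns, position, batch_size))
    positions << position
  end
  positions
end

cursor_positions(CURSOR_DEMO_ROWS, [:updated_at, :id], 2)
# => [[10, 2], [20, 4], [30, 5]]
```

In a real cursor the `sort_by`/`select`/`first` steps would be the ORDER BY, WHERE and LIMIT of a single query, so this only models the Ruby side of the suggestion.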
Hey @GustavoCaso! @etiennebarrie and I had actually discussed a separate API for batch relations, but then opted for a flag on the existing What do you think?
Wouldn't migrating batches to relations by default be a breaking change? We would have to update a bunch of code in Core so that the jobs that expect to have an object actually work with a relation. Having the two options doesn't seem like such a bad idea.
Also, I'm not an expert on ActiveRecord by any means, so @adrianna-chang-shopify if you think that I'm missing something please let me know.
Oh yes absolutely, this would definitely require a major version update & changes to the jobs in Core (although most of these jobs want a relation, so we'd be reducing db calls here at least). You're right, I think having the two options is fine for now. If we decide we want to make relations the default moving forward, it will indeed be easier to have users migrate to a new API first rather than flipping a default on an existing API. I'll make the changes, and see if I can reduce some of those queries with the optimization suggestions Étienne made.
Yes and no: it's probably fair to say it's a breaking change, but most code won't have to change, e.g. |
+1 to @etiennebarrie's point, although the flip side of that is that if we're bothering to make batches relations by default to remove that extra call, we'd obviously want to go in there and remove any If our end goal is to come back to a single API for active record batch enumerators, I think a flag would make more sense because we could:
But to @GustavoCaso's point, it might make sense to ship an additional API and unblock the work in the Maintenance Tasks gem, and defer any decisions about yielding batches as relations by default until a further date. If we're unsure about whether we'll have a single default moving forward, it would make sense to offer separate entrypoints in |
5d7ddce to 8de86c8
…ions Co-authored-by: Étienne Barrié <etienne.barrie@shopify.com>
e9881d6 to c996046
c996046 to bae9b71
👋 @GustavoCaso sorry for the delay on this, we made a couple of changes! Notably:
Let me know what you think!
I think we can clarify that we need to do this because we can't rely on the "optimization" that checks whether the current batch is full.
Like one test shows, there are cases where actually this optimization does not avoid fetching one more empty batch:
Also, one thing that I had mentioned in #86 (comment) that we gave up on for now is to only pluck the columns we need for the next position. We still load the whole record: plucking wouldn't save us anything, because the record is still loaded on the enumerator side:
Hmm, that led me to investigate, because we should not be calling So the problem being in the enumerator calling
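As an illustration of why the full-batch check cannot be relied on, here is a small plain-Ruby sketch. It is not the test from this PR: `FakeCursor` and `drain` are hypothetical names, and the "table" is just a sorted array of ids. It mirrors the `records.size < batch_size` check and counts how many fetches are issued.

```ruby
# Hypothetical stand-in for a cursor over a table: `rows` is a sorted array of
# ids and `fetches` counts how many batch "queries" were issued.
class FakeCursor
  attr_reader :fetches

  def initialize(rows, batch_size)
    @rows = rows
    @batch_size = batch_size
    @position = nil
    @reached_end = false
    @fetches = 0
  end

  # Returns the next batch, or nil when iteration is over.
  def next_batch
    return nil if @reached_end

    @fetches += 1
    batch = @rows.select { |id| @position.nil? || id > @position }.first(@batch_size)
    @position = batch.last unless batch.empty?
    # The "optimization": a short batch means there is nothing left to fetch.
    @reached_end = batch.size < @batch_size
    batch.empty? ? nil : batch
  end
end

# Collects every batch the cursor returns until it signals the end with nil.
def drain(cursor)
  batches = []
  while (batch = cursor.next_batch)
    batches << batch
  end
  batches
end

odd = FakeCursor.new([1, 2, 3, 4, 5], 2)
even = FakeCursor.new([1, 2, 3, 4], 2)
drain(odd)   # three fetches: [1, 2], [3, 4], [5]
drain(even)  # also three fetches: [1, 2], [3, 4], then an empty batch
```

When the row count is a multiple of the batch size, the last full batch gives no hint that the table is exhausted, so one extra empty fetch still happens.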
@etiennebarrie I could be missing something, but seems to me that
doesn't actually load everything again. We're doing So even though we're doing: Similarly, |
Yes you got that right but what that means, is that even using relations, the records are loaded, which is basically the same as calling
For example we could test this by changing our test to:

    def test_activerecord_batches_as_relation
      push(BatchActiveRecordRelationIterationJob)
      work_one_job
      assert_jobs_in_queue(0)

      records_performed = BatchActiveRecordIterationJob.records_performed
      assert_equal([3, 3, 3, 1], records_performed.map(&:size))
      assert(records_performed.all? { |relation|
        relation.is_a?(ActiveRecord::Relation) && !relation.loaded?
      })
    end

You can also see this by logging the SQL queries:

    diff --git c/test/unit/active_record_enumerator_test.rb i/test/unit/active_record_enumerator_test.rb
    index f5a17ee..494f5a7 100644
    --- c/test/unit/active_record_enumerator_test.rb
    +++ i/test/unit/active_record_enumerator_test.rb
    @@ -140,7 +140,7 @@ def build_enumerator(relation: Product.all, batch_size: 2, columns: nil, cursor:
       def count_uncached_queries(&block)
         count = 0
    -    query_cb = ->(*, payload) { count += 1 unless payload[:cached] }
    +    query_cb = ->(*, payload) { p payload.slice(:sql, :cached); count += 1 unless payload[:cached] }
         ActiveSupport::Notifications.subscribed(query_cb, "sql.active_record", &block)
         count
       end

You'll see that even using the relations, we load the records:
When we would expect completely different queries, e.g. something like

    SELECT `products`.* FROM `products` ORDER BY products.id OFFSET 1 LIMIT 1
    SELECT `products`.* FROM `products` WHERE (products.id > 2) ORDER BY products.id OFFSET 1 LIMIT 1
    SELECT `products`.* FROM `products` WHERE (products.id > 4) ORDER BY products.id OFFSET 1 LIMIT 1

i.e. loading the last record of the batch only.
We are going to take a different approach of building a new enumerator so that we can pluck only the cursor columns and don't have to load full records in order to compute the cursor value. This will also allow us to return a relation based on PK ids, rather than the original relation, which in some cases may be quite complex. We'll open this on a separate branch 😁
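A rough plain-Ruby sketch of that direction, under loud assumptions: this is not the enumerator that was eventually written, the table is an in-memory array of hashes, and `IdScopedBatch` and `each_id_batch` are hypothetical names. It only models the two properties described above: the cursor advances from primary keys alone, and each yielded batch is scoped by ids and stays unloaded until the job enumerates it.

```ruby
# Hypothetical stand-in for an unloaded relation scoped to a list of primary keys.
class IdScopedBatch
  include Enumerable

  attr_reader :ids

  def initialize(table, ids)
    @table = table
    @ids = ids
    @loaded = false
  end

  def loaded?
    @loaded
  end

  # Full rows are only looked up when a consumer enumerates the batch.
  def each(&block)
    @loaded = true
    @table.select { |row| @ids.include?(row[:id]) }.each(&block)
  end
end

# Yields one IdScopedBatch per batch; only ids are read to advance the cursor.
def each_id_batch(table, batch_size)
  return enum_for(:each_id_batch, table, batch_size) unless block_given?

  cursor = nil
  loop do
    ids = table.map { |row| row[:id] }.sort # stand-in for plucking the primary key
    ids = ids.select { |id| id > cursor } if cursor
    ids = ids.first(batch_size)
    break if ids.empty?

    cursor = ids.last
    yield IdScopedBatch.new(table, ids)
  end
end

products = (1..7).map { |i| { id: i, name: "product #{i}" } }
batches = each_id_batch(products, 3).to_a
batches.map(&:ids)       # => [[1, 2, 3], [4, 5, 6], [7]]
batches.none?(&:loaded?) # => true, nothing has been enumerated yet
```

Scoping each batch by ids keeps the yielded object simple even when the original relation carries joins or complex conditions, which is the motivation given in the comment above.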
Currently, the resulting batch from #active_record_on_batches is an Array of records. Many job authors end up converting this into a relation to perform batch operations such as #update_all (even the existing BatchesJob example in the README does this!). By expanding the API of #active_record_on_batches to return a relation if a flag is set, users who require their batch as an ActiveRecord::Relation will no longer need to perform an additional query to turn records back into a relation.

This has no effect on existing jobs, since by default as_relation will be set to false and batches will continue to be converted to Arrays.

Not sure whether it's worth documenting an example of this in the README, or whether the API documentation is sufficient.