Add while enumerator #446
Hello @angeloocana, can't you use |
Hey @rporrasluc! Thanks for the suggestion, here are the results using the code from the original PR:
Testing
Queries for
Hello @angeloocana can you share the code where you use
def build_enumerator(params, cursor:)
  enumerator_builder.active_record_on_batch_relations(
    your_query_that_returns_a_relation_using_your_input_params,
    cursor: cursor,
    batch_size: 1000,
  )
end

def each_iteration(relation, _params)
  relation.delete_all
end
From your queries, it seems you are still plucking ids + using them to build a new query in order to use
@angeloocana Just wanted to remind everyone that this is a public repo, so we should be careful about exposing what may be considered sensitive data. Also, links to private repos are not accessible, so we shouldn't justify changes to the library based on something the community can't access.
I don't want to share a link to a private repo here, but @rporrasluc is right, the job is calling
I think this is also a misunderstanding. Unless the job itself does it, While there still might be a valid use-case for a new enumerator, I just want it to come with a clear understanding of the use-case for it and what are the actual disadvantages of the existing one |
Ah, well, the original job uses
And in this case the benefit is not that clear. Specifically, I wonder if deleting by primary key will perform better because the database takes fewer row locks. I'm not super familiar with this area but can imagine that
Thanks for your insights here. I might be wrong, but my understanding is that using I think an example of the job you are trying to implement based on the proposal to use Thanks! |
README.md
Outdated
include JobIteration::Iteration

def build_enumerator(params)
  enumerator_builder.while query_model_xyz(params).exists?
While this is valid Ruby syntax, is it the intended one? I believe it was supposed to be a block
enumerator_builder.while query_model_xyz(params).exists?
enumerator_builder.while { query_model_xyz(params).exists? }
Good catch! Thanks! Updating it now.
def build_while_enumerator(&condition)
  count = 0
  Enumerator.new do |yielder|
    yielder << count += 1 while condition.call
What would be a use-case for yielding count? It will reset to 0 between interruptions, and unless there were no interruptions I don't think this number will be useful in any way.
There should be a way to start the counter from the cursor position, so that count at least represents the number of times we iterated regardless of whether the job was interrupted or not.
But perhaps it would be even better to avoid yielding anything until we have a solid use-case for that.
I agree 100%, I wasn't sure about what to yield to the job. I'll remove the counter.
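To illustrate the reviewer's point about resuming from the cursor, here is a minimal sketch and not the code from this PR. It assumes the enumerator yields an (object, cursor) pair and that the cursor passed back after an interruption is the last iteration number that completed; the cursor: keyword on build_while_enumerator is hypothetical.

```ruby
# Hypothetical variant of build_while_enumerator that resumes its counter.
# Assumption: `cursor` is the last iteration number completed before an
# interruption, or nil on the very first run.
def build_while_enumerator(cursor: nil, &condition)
  count = cursor.to_i # nil.to_i is 0, so a fresh run starts counting at 1
  Enumerator.new do |yielder|
    while condition.call
      count += 1
      # Yield the iteration number both as the object and as the new cursor.
      yielder.yield(count, count)
    end
  end
end

rows = 5   # pretend five batches of work remain
seen = []

# First run: stop after two iterations to simulate an interruption.
build_while_enumerator { rows > 0 }.each do |count, _cursor|
  rows -= 1
  seen << count
  break if count == 2
end

# Resumed run: the counter continues from the saved cursor instead of 0.
build_while_enumerator(cursor: seen.last) { rows > 0 }.each do |count, _cursor|
  rows -= 1
  seen << count
end
```

With this shape the counter reflects the total number of iterations across interruptions, which is the only case where yielding it seems useful; dropping the counter entirely, as agreed above, avoids the question.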
Hey @rporrasluc and @nvasilevski, Thanks for the suggestions! Good call on removing the
Here is the example code, I replaced the real table name with
class DeleteXyzJob < ApplicationJob
  queue_as :low

  def build_enumerator(params, cursor:)
    enumerator_builder.active_record_on_batch_relations(
      Xyz.where(
        shop_id: params[:shop_id],
        owner_id: params[:owner_id]
      ),
      cursor: cursor,
      batch_size: 100
    )
  end

  def each_iteration(xyz_relation, params)
    xyz_relation.delete_all
  end
end
SELECT `xyz`.`id` FROM `xyz` WHERE `xyz`.`shop_id` = 26371970 AND `xyz`.`owner_id` = 80485634 ORDER BY xyz.id LIMIT 100
SELECT `xyz`.`id`, `xyz`.`shop_id` FROM `xyz` WHERE `xyz`.`shop_id` = 26371970 AND `xyz`.`owner_id` = 80485634 AND `xyz`.`id` IN (585506933, 585506934, 585506935) ORDER BY xyz.id
DELETE FROM `xyz` WHERE `xyz`.`id` IN (585506933, 585506934, 585506935) AND `xyz`.`shop_id` = 26371970
|
@angeloocana sorry I'm jumping in a little late here, but that second query you've highlighted is specific to code in our private repo. I can elaborate further on the related PR, but in the context of job-iteration, we'd see the same number of queries for the batch relation enumerator as for the while enumerator introduced here (what @nvasilevski described here). Perhaps we could run a benchmark to compare the performance of the two? @rporrasluc regarding
We do pluck the ids here: job-iteration/lib/job-iteration/active_record_batch_enumerator.rb Lines 74 to 86 in f3a89f8
This is so that we can build a relation that is already filtered by the primary key values, and yield it to
|
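The idea of plucking ids first and then working on a batch that is already filtered by primary key can be sketched in plain Ruby. This is an in-memory illustration only, not the code in active_record_batch_enumerator.rb; each_id_batch and the row hashes are hypothetical stand-ins for the relation and its queries.

```ruby
# Hypothetical in-memory sketch of cursor-based batching by primary key.
# Each pass "plucks" the next ids above the cursor, then yields that id list
# (standing in for a relation filtered by those primary keys) with the new cursor.
def each_id_batch(rows, batch_size:, cursor: nil)
  loop do
    ids = rows.map { |row| row[:id] }
              .select { |id| cursor.nil? || id > cursor }
              .sort
              .first(batch_size)
    break if ids.empty?

    cursor = ids.last # the largest id in the batch becomes the resume point
    yield ids, cursor
  end
end

rows = [5, 1, 9, 3, 7].map { |id| { id: id } }

batches = []
each_id_batch(rows, batch_size: 2) { |ids, cursor| batches << [ids, cursor] }

# Resuming after an interruption that happened once cursor 3 was reached.
resumed = []
each_id_batch(rows, batch_size: 2, cursor: 3) { |ids, _cursor| resumed << ids }
```

Because every batch is pinned to explicit ids, the work done in each iteration touches exactly the rows that were selected, and a resumed run skips everything at or below the saved cursor.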
@adrianna-chang-shopify Thanks for asking for a benchmark, that is what we really needed! Here are the astonishing results:
Batches of 1000
Both solutions have a similar performance for a few thousand rows, with the while solution being a little bit faster. But when the number of records is higher than 100k the
Batches of 100
I also made one test using batches of 100 items (value from the original PR) and it takes 3 to 5 times longer than 1k batches on both solutions.
Average individual query time
Using batches of 1000
While
active_record_on_batch_relations
Conclusion
The solution using
Example queries
Exists for the
Pluck on id for
Both queries filter the data using the same columns but the
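For readers who want to repeat a comparison like this, the shape of such a harness can be sketched with Ruby's standard Benchmark module. Everything below is a hypothetical in-memory stand-in: drain_with_while and drain_with_id_batches only mimic the two strategies on an array, so the timings say nothing about real database behaviour.

```ruby
require "benchmark"

# Hypothetical stand-in for the "while" approach: check, delete a batch, repeat.
def drain_with_while(ids, batch_size)
  ids = ids.dup
  batches = 0
  until ids.empty?        # plays the role of the exists? check
    ids.shift(batch_size) # plays the role of a limited delete
    batches += 1
  end
  batches
end

# Hypothetical stand-in for the batch approach: collect ids, then delete by id.
def drain_with_id_batches(ids, batch_size)
  remaining = ids.dup
  batches = 0
  ids.sort.each_slice(batch_size) do |batch| # plays the role of plucking ids
    remaining -= batch                       # plays the role of a delete by primary key
    batches += 1
  end
  batches
end

ids = (1..10_000).to_a.shuffle
results = {}
timings = {}

{ while_loop: :drain_with_while, id_batches: :drain_with_id_batches }.each do |label, strategy|
  timings[label] = Benchmark.realtime { results[label] = send(strategy, ids, 1000) }
end

timings.each { |label, seconds| puts format("%-12s %.6fs", label, seconds) }
```

A real comparison would replace the two stand-ins with the actual jobs, run against a table of realistic size, and repeat each measurement several times.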
Thanks @adrianna-chang-shopify, that was the obvious part I was missing 😅 |
@angeloocana thanks for the investigation! To be honest I'm a little surprised that
To be honest I wasn't expecting these results and was mainly leaning towards
Anyway, I wanted to thank you for your proposal, as I find the solution very clean. Even though our current use-case may not need a new enumerator, I can totally see how in the future someone may need "an enumerator that allows the job to run in an interruptible manner until a certain condition is met", so I wouldn't be surprised if this PR gets re-opened or the implementation is borrowed for a similar proposal. Great work!
Follow up for:
The original implementation of that PR queried records in batches using enumerator_builder.active_record_on_batches, collected the ids for iteration, and then deleted the records. The problem is that we don't really need to query and load all the records into memory just to grab the ids to delete them. We can just run a super fast query like .exists?, and then run the delete command in batches while there are records.
This PR adds a new enumerator_builder.while, to which any code block can be passed; iteration continues as long as the block returns true.
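The pattern described here, a cheap existence check followed by a batched delete, can be sketched without a database. FakeTable below is a hypothetical in-memory stand-in, not ActiveRecord: exists? and delete_batch only mimic the roles of the existence query and the batched delete.

```ruby
# Hypothetical in-memory stand-in for a table; a real job would use a relation.
class FakeTable
  attr_reader :rows

  def initialize(rows)
    @rows = rows
  end

  # Plays the role of the existence check: answers yes/no without returning rows.
  def exists?(shop_id:)
    @rows.any? { |row| row[:shop_id] == shop_id }
  end

  # Plays the role of a batched delete: removes up to `limit` matching rows
  # and returns how many were removed.
  def delete_batch(shop_id:, limit:)
    doomed = @rows.select { |row| row[:shop_id] == shop_id }.first(limit)
    @rows -= doomed
    doomed.size
  end
end

# 250 rows; even ids belong to shop 1, odd ids to shop 2.
table = FakeTable.new((1..250).map { |id| { id: id, shop_id: id.even? ? 1 : 2 } })

batches = 0
while table.exists?(shop_id: 1)                # the condition block of the proposed enumerator
  table.delete_batch(shop_id: 1, limit: 50)    # the work done in each iteration
  batches += 1
end
```

In the proposed API the loop condition would live in the block given to enumerator_builder.while and the delete would live in each_iteration, so the job can be interrupted between batches and simply re-check the condition when it resumes.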