Add while enumerator #446

Closed
wants to merge 1 commit into from

Conversation

angeloocana
Contributor

commented Jan 3, 2024

Follow up for:

The original implementation in that PR queried records in batches using enumerator_builder.active_record_on_batches, collected the ids for iteration, and then deleted the records. The problem is that we don't really need to query and load all the records into memory just to grab the ids to delete them. We can instead run a very fast query like .exists?, and then run the delete command in batches while records remain.

This PR adds a new enumerator_builder.while, where any code block can be passed and iteration continues while the returned value is true.
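The idea can be pictured in plain, dependency-free Ruby. This is a minimal sketch: the helper name mirrors the PR's build_while_enumerator, no Rails or job-iteration APIs are involved, and the decrementing counter stands in for the batched delete work.

```ruby
# Sketch of the proposed behavior: an Enumerator that keeps yielding
# while a condition block returns true. Mirrors the PR's
# build_while_enumerator helper; not the released job-iteration API.
def build_while_enumerator(&condition)
  Enumerator.new do |yielder|
    yielder << nil while condition.call
  end
end

# The condition plays the role of `.exists?`; each iteration would run
# the batched DELETE in the real job.
remaining = 3
enum = build_while_enumerator { remaining > 0 }
enum.each { remaining -= 1 }
remaining # => 0
```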

@rporrasluc

Hello @angeloocana, can't you use active_record_on_batch_relations to keep the records from being eager loaded?

@angeloocana
Contributor Author

Hello @angeloocana, can't you use active_record_on_batch_relations to keep the records from being eager loaded?

Hey @rporrasluc ! Thanks for the suggestion, here are the results using the code from the original PR:

Testing

Queries for active_record_on_batches:

1. Select all record columns:
SELECT `delivery_promise_sku_settings`.* FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`shop_id` = 26371970 AND `delivery_promise_sku_settings`.`delivery_promise_provider_id` = 80485628 ORDER BY delivery_promise_sku_settings.id ASC LIMIT 100
2. Delete records using the IDs returned from the previous query:
DELETE FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`shop_id` = 26371970 AND `delivery_promise_sku_settings`.`id` IN (585506915, 585506916, 585506917)

A new iteration is not triggered because the job is aware that there is no next page for the cursor.

Queries for active_record_on_batch_relations:

1. Select record ids:
SELECT `delivery_promise_sku_settings`.`id` FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`shop_id` = 26371970 AND `delivery_promise_sku_settings`.`delivery_promise_provider_id` = 80485629 ORDER BY delivery_promise_sku_settings.id LIMIT 100
2. Select records again, this time returning id and shop_id using the ids from the previous query:
SELECT `delivery_promise_sku_settings`.`id`, `delivery_promise_sku_settings`.`shop_id` FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`shop_id` = 26371970 AND `delivery_promise_sku_settings`.`delivery_promise_provider_id` = 80485629 AND `delivery_promise_sku_settings`.`id` IN (585506918, 585506919, 585506920) ORDER BY delivery_promise_sku_settings.id
3. Select records again, this time returning only the id for the given ids:
SELECT `delivery_promise_sku_settings`.`id` FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`id` IN (585506918, 585506919, 585506920) AND `delivery_promise_sku_settings`.`shop_id` = 26371970
4. Delete records using the IDs from the previous query:
DELETE FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`shop_id` = 26371970 AND `delivery_promise_sku_settings`.`id` IN (585506918, 585506919, 585506920)
5. On the next iteration, run query 1 again to check if there are remaining records.

Queries for the proposed while solution:

1. Check whether there are records to delete (at most one row is returned):
SELECT 1 AS one FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`shop_id` = 26371970 AND `delivery_promise_sku_settings`.`delivery_promise_provider_id` = 80485632 LIMIT 1
2. Delete records using the same filter as the existence check:
DELETE FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`shop_id` = 26371970 AND `delivery_promise_sku_settings`.`delivery_promise_provider_id` = 80485633 LIMIT 100
3. On the next iteration, run query 1 again to check if there are remaining records.
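For illustration, the two-query loop above can be simulated in plain Ruby. An in-memory array stands in for the table; `empty?` plays the role of the existence check and `shift` plays the bounded DELETE (these are stand-ins, not the job's real code).

```ruby
# Toy run of the two-query pattern: a cheap existence check, then a
# bounded delete, repeated until the check fails.
table = Array.new(250) { |i| i }
exists_checks = 0
deletes = 0

loop do
  exists_checks += 1          # query 1: SELECT 1 AS one ... LIMIT 1
  break if table.empty?
  table.shift(100)            # query 2: DELETE ... LIMIT 100
  deletes += 1
end

[exists_checks, deletes] # => [4, 3]
```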

Considerations

Using active_record_on_batch_relations made the problem worse: instead of 2 queries, we now have 5.

I really don't see the benefit of querying for the data when we just need to delete it. The job was created because we expect thousands to millions of rows to be deleted. Why do we need to load the data? Maybe I'm missing something...

@rporrasluc

rporrasluc commented Jan 8, 2024

Hello @angeloocana, can you share the code where you use active_record_on_batch_relations? I see you are selecting ids; if you use select, the query will be executed and you'll get the actual IDs back rather than a relation. The idea is:

def build_enumerator(params, cursor:)
  enumerator_builder.active_record_on_batch_relations(
    your_query_that_returns_a_relation_using_your_input_params,
    cursor: cursor,
    batch_size: 1000,
  )
end

def each_iteration(relation, _params)
  relation.delete_all
end

From your queries, it seems you are still plucking ids and using them to build a new query in order to call delete_all. I can't be sure since I can't see the code, but your job can likely be written in a way that avoids that many queries.

@nvasilevski
Contributor

@angeloocana Just wanted to remind everyone that this is a public repo, so we should be careful about exposing what may be considered sensitive data. Also, links to private repos are not accessible, so we shouldn't justify changes to the library based on something the community can't access.

I don't want to share a link to a private repo here, but @rporrasluc is right: the job is calling pluck(:id) on the relation to build a new relation, so at least one query from active_record_on_batch_relations can be avoided.

The problem is that we don't really need to query and load all the records into memory

I think this is also a misunderstanding. Unless the job itself does it, job-iteration should not be loading records into memory when using active_record_on_batch_relations.
Also, since your examples come from core, we have custom enumerators that may perform additional necessary checks, so again, this is not something a regular user of job-iteration will experience.

While there still might be a valid use-case for a new enumerator, I just want it to come with a clear understanding of the use-case and of the actual disadvantages of the existing ones.

@nvasilevski
Contributor

Ah, well, the original job uses pluck(:id) on an Array and not a relation, since it uses active_record_on_batches (not relations), so pluck itself shouldn't cause a query.
But anyway, what we should be comparing is:
- 2 queries (SELECT ids for the batch, DELETE by ids) in the active_record_on_batch_relations case
- 2 queries (SELECT 1 to check existence, DELETE with a limit) in the while case

And in this case the benefit is not that clear. Specifically, I wonder if deleting by primary key will perform better due to the database taking fewer row locks. I'm not super familiar with this area, but I can imagine that DELETE FROM records WHERE tenant_id = 1 LIMIT 100 may end up taking more locks at the database level than DELETE FROM records WHERE id IN (1, 2, 3, ..., 100).

@rporrasluc

Ah, well, the original job uses pluck(:id) on an Array and not a relation, since it uses active_record_on_batches (not relations), so pluck itself shouldn't cause a query. But anyway, what we should be comparing is: 2 queries (SELECT ids for the batch, DELETE by ids) in the active_record_on_batch_relations case, and 2 queries (SELECT 1 to check existence, DELETE with a limit) in the while case.

And in this case the benefit is not that clear. Specifically, I wonder if deleting by primary key will perform better due to the database taking fewer row locks. I'm not super familiar with this area, but I can imagine that DELETE FROM records WHERE tenant_id = 1 LIMIT 100 may end up taking more locks at the database level than DELETE FROM records WHERE id IN (1, 2, 3, ..., 100).

Thanks for your insights here. I might be wrong, but my understanding is that with active_record_on_batch_relations we should run only one query: building the relation doesn't execute anything, and calling delete_all on it triggers a DELETE with conditions and a limit. We don't need to pluck ids or check whether the relation EXISTS, unless I'm missing something very obvious.

I think an example of the job you are trying to implement based on the proposal to use active_record_on_batch_relations (with no private details) will help to understand the issue.

Thanks!

README.md Outdated
include JobIteration::Iteration

def build_enumerator(params)
  enumerator_builder.while query_model_xyz(params).exists?
Contributor
While this is valid Ruby syntax, is it the intended one? I believe it was supposed to be a block:

Suggested change
enumerator_builder.while query_model_xyz(params).exists?
enumerator_builder.while { query_model_xyz(params).exists? }

Contributor Author

Good catch! Thanks! Updating it now.

def build_while_enumerator(&condition)
  count = 0
  Enumerator.new do |yielder|
    yielder << count += 1 while condition.call
Contributor

What would be the use-case for yielding count? It resets to 0 across interruptions, so unless there were no interruptions I don't think this number will be useful in any way.
There should be a way to start the counter from the cursor position, so that count represents the number of iterations regardless of whether the job was interrupted.

But perhaps it would be even better to avoid yielding anything until we have a solid use-case for that

Contributor Author

I agree 100%, I wasn't sure what to yield to the job. I'll remove the counter.

@angeloocana
Contributor Author

Hey @rporrasluc and @nvasilevski, thanks for the suggestions!

Good call on removing the pluck(:id); it is definitely not needed, but the result is still not perfect: instead of 5 queries there are now 4. Do you understand why?

Here is the example code, I replaced the real table name with Xyz:

class DeleteXyzJob < ApplicationJob
  queue_as :low

  def build_enumerator(params, cursor:)
    enumerator_builder.active_record_on_batch_relations(
      Xyz.where(
        shop_id: params[:shop_id],
        owner_id: params[:owner_id]
      ),
      cursor: cursor,
      batch_size: 100
    )
  end

  def each_iteration(xyz_relation, params)
    xyz_relation.delete_all
  end
end
1. Query for ids:
SELECT `xyz`.`id` FROM `xyz` WHERE `xyz`.`shop_id` = 26371970 AND `xyz`.`owner_id` = 80485634 ORDER BY xyz.id LIMIT 100
2. Query for ids and shop_id (not sure why this triggers):
SELECT `xyz`.`id`, `xyz`.`shop_id` FROM `xyz` WHERE `xyz`.`shop_id` = 26371970 AND `xyz`.`owner_id` = 80485634 AND `xyz`.`id` IN (585506933, 585506934, 585506935) ORDER BY xyz.id
3. Delete:
DELETE FROM `xyz` WHERE `xyz`.`id` IN (585506933, 585506934, 585506935) AND `xyz`.`shop_id` = 26371970
4. Run query 1 again to check whether the job should do another iteration.

angeloocana force-pushed the add-while-enumerator branch from f0d26be to fc5ca4b on January 8, 2024 at 16:45
@adrianna-chang-shopify
Contributor

@angeloocana sorry I'm jumping in a little late here, but that second query you've highlighted is specific to code in our private repo. I can elaborate further on the related PR, but in the context of job-iteration, we'd see the same number of queries for the batch relation enumerator as for the while enumerator introduced here (what @nvasilevski described here).

Perhaps we could run a benchmark to compare the performance of the two?

@rporrasluc regarding

Building the relation won't perform any query, we can call delete_all on the relation and it will trigger a DELETE with conditions and a limit. We do not need to pluck ids or check if the relation EXISTS unless I am missing something very obvious.

We do pluck the ids here:

def pluck_columns(relation)
  if @pluck_columns.size == 1 # only the primary key
    column_values = relation.pluck(*@pluck_columns)
    return [column_values, column_values]
  end

  column_values = relation.pluck(*@pluck_columns)
  primary_key_index = @primary_key_index || -1
  primary_key_values = column_values.map { |values| values[primary_key_index] }

  serialize_column_values!(column_values)
  [column_values, primary_key_values]
end

This is so that we can build a relation that is already filtered by the primary key values, and yield it to #each_iteration:

@base_relation.where(@primary_key => ids)
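For a self-contained picture of that flow, here is a toy rendition in plain Ruby. A hypothetical ToyRelation object replaces ActiveRecord, serialize_column_values! is omitted, and the helper takes its columns as arguments instead of instance variables; the logic otherwise follows the quoted pluck_columns.

```ruby
# Toy stand-in for an ActiveRecord relation: rows are hashes and
# pluck maps out the requested columns.
ToyRelation = Struct.new(:rows) do
  def pluck(*cols)
    return rows.map { |r| r[cols.first] } if cols.size == 1
    rows.map { |r| cols.map { |c| r[c] } }
  end
end

# Same shape as the quoted pluck_columns: return the plucked column
# values plus the primary-key values used to rebuild the batch relation.
def pluck_columns(relation, columns, primary_key_index = nil)
  if columns.size == 1 # only the primary key
    column_values = relation.pluck(*columns)
    return [column_values, column_values]
  end

  column_values = relation.pluck(*columns)
  index = primary_key_index || -1
  primary_key_values = column_values.map { |values| values[index] }
  [column_values, primary_key_values]
end

rel = ToyRelation.new([{ id: 1, shop_id: 9 }, { id: 2, shop_id: 9 }])
_values, ids = pluck_columns(rel, [:id, :shop_id], 0)
ids # => [1, 2]  (the enumerator then yields base_relation.where(id: ids))
```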

@angeloocana
Contributor Author

@adrianna-chang-shopify Thanks for asking for a benchmark, that is what we really needed!

Here are the astonishing results:

Batches of 1000

| Rows    | while        | active_record_on_batch_relations |
|---------|--------------|----------------------------------|
| 10      | 0.493995923s | 0.524343613s                     |
| 100     | 0.558978765s | 0.565930096s                     |
| 1000    | 0.540043806s | 0.543253054s                     |
| 10000   | 0.938633587s | 1.368673145s                     |
| 100000  | 27.29757184s | 16.55144373s                     |
| 1000000 | 120.7427221s | 89.3870157s                      |

Both solutions have similar performance up to a few thousand rows, with the while solution being slightly faster. But when the number of records reaches 100k or more, active_record_on_batch_relations is almost twice as fast 🤯🤯🤯

Batches of 100

| Rows   | while        | active_record_on_batch_relations |
|--------|--------------|----------------------------------|
| 100000 | 120.6691102s | 49.36938663s                     |

I also ran one test using batches of 100 items (the value from the original PR); it takes 3 to 5 times longer than 1k batches in both solutions.
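Part of the batch-size effect is simply iteration count: each batch pays a fixed per-iteration overhead (queries, enumerator round trip, checkpointing), so a 10x smaller batch means roughly 10x as many iterations. A rough in-memory stand-in (real timings depend entirely on the database):

```ruby
# Count how many batch iterations it takes to drain row_count rows.
# An array stands in for the table; shift plays the bounded DELETE.
def iterations_needed(row_count, batch_size)
  rows = Array.new(row_count) { |i| i }
  iterations = 0
  until rows.empty?
    rows.shift(batch_size)
    iterations += 1
  end
  iterations
end

iterations_needed(100_000, 100)   # => 1000
iterations_needed(100_000, 1_000) # => 100
```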

Average individual query time

Using batches of 1000

While

  1. Exists? ~160ms
  2. Delete all ~300ms

active_record_on_batch_relations

  1. Pluck on id (table to delete data) ~15ms
  2. Shop Pluck (shops table) ~0.8ms
  3. ShopRoute Pluck (shops table) ~0.8ms
  4. Pluck on id and shop_id (table to delete data) ~20ms
  5. Delete all ~15ms

Conclusion

The solution using active_record_on_batch_relations performs more queries and transfers more data, but it runs faster because it makes better use of the database indexes.

Example queries

Exists for the while solution:

SELECT 1 AS one FROM `xyz` WHERE `xyz`.`shop_id` = 26371970 AND `xyz`.`owner_id` = 80485674 LIMIT 1

Pluck on id for active_record_on_batch_relations:

SELECT `xyz`.`id` FROM `xyz` WHERE `xyz`.`shop_id` = 26371970 AND `xyz`.`owner_id` = 80485673 AND (xyz.id > '598770502') ORDER BY xyz.id LIMIT 1000

Both queries filter the data using the same columns, but the active_record_on_batch_relations query also filters by the id value from the last cursor and adds an ORDER BY. I think this lets the database know exactly which part of the index to use, making the query a lot faster.

angeloocana closed this Jan 9, 2024
@rporrasluc

We do pluck the ids here:

Thanks @adrianna-chang-shopify, that was the obvious part I was missing 😅

@nvasilevski
Contributor

@angeloocana thanks for the investigation! To be honest I'm a little surprised that exists?, which is a SELECT 1, takes 160ms; I would expect it to be much faster as long as there is an index on (shop_id, owner_id). But I guess if the ~300ms DELETE measurement is correct it won't matter, since that still makes the while approach slower.

To be honest I wasn't expecting these results. I was mainly leaning towards active_record_on_batch_relations because it's a well-tested approach that is unlikely to bring any surprises while being only unnoticeably slower. I was also expecting deletion by primary key to be somewhat healthier for the platform, but it looks like it ended up being significantly faster.

Anyway, I wanted to thank you for your proposal, as I find the solution very clean. Even though our current use-case may not need a new enumerator, I can totally see how someone in the future may need "an enumerator that allows the job to run in an interruptible manner until a certain condition is met", so I wouldn't be surprised if this PR gets re-opened or its implementation borrowed for a similar proposal.

Great work!
