Add while enumerator #446

Closed
wants to merge 1 commit into from

Conversation

angeloocana
Contributor

commented Jan 3, 2024

Follow up for:

The original implementation in that PR queried records in batches using enumerator_builder.active_record_on_batches, collected the ids for iteration, and then deleted the records. The problem is that we don't really need to query and load all the records into memory just to grab the ids to delete them. We can instead run a very fast query like .exists?, and then run the delete command in batches while records remain.

This PR adds a new enumerator_builder.while, where any code block can be passed and iteration continues while the returned value is true.
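The idea can be pictured in plain, dependency-free Ruby. This is a minimal sketch: the helper name mirrors the PR's build_while_enumerator, no Rails or job-iteration APIs are involved, and the decrementing counter stands in for the batched delete work.

```ruby
# Sketch of the proposed behavior: an Enumerator that keeps yielding
# while a condition block returns true. Mirrors the PR's
# build_while_enumerator helper; not the released job-iteration API.
def build_while_enumerator(&condition)
  Enumerator.new do |yielder|
    yielder << nil while condition.call
  end
end

# The condition plays the role of `.exists?`; each iteration would run
# the batched DELETE in the real job.
remaining = 3
enum = build_while_enumerator { remaining > 0 }
enum.each { remaining -= 1 }
remaining # => 0
```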

@rporrasluc

Hello @angeloocana, can't you use active_record_on_batch_relations to keep the records from being eager loaded?

@angeloocana
Contributor Author

Hello @angeloocana, can't you use active_record_on_batch_relations to keep the records from being eager loaded?

Hey @rporrasluc ! Thanks for the suggestion, here are the results using the code from the original PR:

Testing

Queries for active_record_on_batches:

1. Select all record columns:
SELECT `delivery_promise_sku_settings`.* FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`shop_id` = 26371970 AND `delivery_promise_sku_settings`.`delivery_promise_provider_id` = 80485628 ORDER BY delivery_promise_sku_settings.id ASC LIMIT 100
2. Delete records using the IDs returned from the previous query:
DELETE FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`shop_id` = 26371970 AND `delivery_promise_sku_settings`.`id` IN (585506915, 585506916, 585506917)

A new iteration is not triggered because the job is aware that there is no next page for the cursor.

Queries for active_record_on_batch_relations:

1. Select record ids:
SELECT `delivery_promise_sku_settings`.`id` FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`shop_id` = 26371970 AND `delivery_promise_sku_settings`.`delivery_promise_provider_id` = 80485629 ORDER BY delivery_promise_sku_settings.id LIMIT 100
2. Select records again, this time returning id and shop_id using the ids from the previous query:
SELECT `delivery_promise_sku_settings`.`id`, `delivery_promise_sku_settings`.`shop_id` FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`shop_id` = 26371970 AND `delivery_promise_sku_settings`.`delivery_promise_provider_id` = 80485629 AND `delivery_promise_sku_settings`.`id` IN (585506918, 585506919, 585506920) ORDER BY delivery_promise_sku_settings.id
3. Select records again, this time returning only the id for the given ids:
SELECT `delivery_promise_sku_settings`.`id` FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`id` IN (585506918, 585506919, 585506920) AND `delivery_promise_sku_settings`.`shop_id` = 26371970
4. Delete records using the IDs from the previous query:
DELETE FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`shop_id` = 26371970 AND `delivery_promise_sku_settings`.`id` IN (585506918, 585506919, 585506920)
5. On the next iteration, run query 1 again to check if there are remaining records.

Queries for the proposed while solution:

1. Check whether there are records to delete (at most one row is returned):
SELECT 1 AS one FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`shop_id` = 26371970 AND `delivery_promise_sku_settings`.`delivery_promise_provider_id` = 80485632 LIMIT 1
2. Delete records using the same filter as the existence check:
DELETE FROM `delivery_promise_sku_settings` WHERE `delivery_promise_sku_settings`.`shop_id` = 26371970 AND `delivery_promise_sku_settings`.`delivery_promise_provider_id` = 80485633 LIMIT 100
3. On the next iteration, run query 1 again to check if there are remaining records.
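For illustration, the two-query loop above can be simulated in plain Ruby. An in-memory array stands in for the table; `empty?` plays the role of the existence check and `shift` plays the bounded DELETE (these are stand-ins, not the job's real code).

```ruby
# Toy run of the two-query pattern: a cheap existence check, then a
# bounded delete, repeated until the check fails.
table = Array.new(250) { |i| i }
exists_checks = 0
deletes = 0

loop do
  exists_checks += 1          # query 1: SELECT 1 AS one ... LIMIT 1
  break if table.empty?
  table.shift(100)            # query 2: DELETE ... LIMIT 100
  deletes += 1
end

[exists_checks, deletes] # => [4, 3]
```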

Considerations

Using active_record_on_batch_relations made the problem worse: instead of 2 queries, we now have 5.

I really don't see the benefit of querying for the data when we just need to delete it. The job was created because we expect thousands to millions of rows to be deleted. Why do we need to load the data? Maybe I'm missing something...

@rporrasluc

rporrasluc commented Jan 8, 2024

Hello @angeloocana, can you share the code where you use active_record_on_batch_relations? I see you are selecting ids; if you use select, the query will be executed and you'll get the actual IDs back rather than a relation. The idea is:

def build_enumerator(params, cursor:)
  enumerator_builder.active_record_on_batch_relations(
    your_query_that_returns_a_relation_using_your_input_params,
    cursor: cursor,
    batch_size: 1000,
  )
end

def each_iteration(relation, _params)
  relation.delete_all
end

From your queries, it seems you are still plucking ids and using them to build a new query in order to call delete_all. I can't be sure since I can't see the code, but your job can likely be written in a way that avoids that many queries.

@nvasilevski
Contributor

@angeloocana Just wanted to remind everyone that this is a public repo, so we should be careful about exposing what may be considered sensitive data. Also, links to private repos are not accessible, so we shouldn't justify changes to the library based on something the community can't access.

I don't want to share a link to a private repo here, but @rporrasluc is right: the job is calling pluck(:id) on the relation to build a new relation, so at least one query from active_record_on_batch_relations can be avoided.

The problem is that we don't really need to query and load all the records into memory

I think this is also a misunderstanding. Unless the job itself does it, job-iteration should not be loading records into memory when using active_record_on_batch_relations.
Also, since your examples come from core, we have custom enumerators that may perform additional necessary checks, so again, this is not something a regular user of job-iteration will experience.

While there still might be a valid use-case for a new enumerator, I just want it to come with a clear understanding of the use-case and of the actual disadvantages of the existing ones.

@nvasilevski
Contributor

Ah, well, the original job uses pluck(:id) on an Array and not a relation, since it uses active_record_on_batches (not relations), so pluck itself shouldn't cause a query.
But anyway, what we should be comparing is:
- 2 queries (SELECT ids for the batch, DELETE by ids) in the active_record_on_batch_relations case
- 2 queries (SELECT 1 to check existence, DELETE with a limit) in the while case

And in this case the benefit is not that clear. Specifically, I wonder if deleting by primary key will perform better due to the database taking fewer row locks. I'm not super familiar with this area, but I can imagine that DELETE FROM records WHERE tenant_id = 1 LIMIT 100 may end up taking more locks at the database level than DELETE FROM records WHERE id IN (1, 2, 3, ..., 100).

@rporrasluc

Ah, well, the original job uses pluck(:id) on an Array and not a relation, since it uses active_record_on_batches (not relations), so pluck itself shouldn't cause a query. But anyway, what we should be comparing is: 2 queries (SELECT ids for the batch, DELETE by ids) in the active_record_on_batch_relations case, and 2 queries (SELECT 1 to check existence, DELETE with a limit) in the while case.

And in this case the benefit is not that clear. Specifically, I wonder if deleting by primary key will perform better due to the database taking fewer row locks. I'm not super familiar with this area, but I can imagine that DELETE FROM records WHERE tenant_id = 1 LIMIT 100 may end up taking more locks at the database level than DELETE FROM records WHERE id IN (1, 2, 3, ..., 100).

Thanks for your insights here. I might be wrong, but my understanding is that with active_record_on_batch_relations we should run only one query: building the relation doesn't execute anything, and calling delete_all on it triggers a DELETE with conditions and a limit. We don't need to pluck ids or check whether the relation EXISTS, unless I'm missing something very obvious.

I think an example of the job you are trying to implement based on the proposal to use active_record_on_batch_relations (with no private details) will help to understand the issue.

Thanks!

README.md Outdated
include JobIteration::Iteration

def build_enumerator(params)
  enumerator_builder.while query_model_xyz(params).exists?
Contributor
While this is valid Ruby syntax, is it the intended one? I believe it was supposed to be a block:

Suggested change
enumerator_builder.while query_model_xyz(params).exists?
enumerator_builder.while { query_model_xyz(params).exists? }

Contributor Author

Good catch! Thanks! Updating it now.

def build_while_enumerator(&condition)
  count = 0
  Enumerator.new do |yielder|
    yielder << count += 1 while condition.call
Contributor

What would be the use-case for yielding count? It resets to 0 across interruptions, so unless there were no interruptions I don't think this number will be useful in any way.
There should be a way to start the counter from the cursor position, so that count represents the number of iterations regardless of whether the job was interrupted.

But perhaps it would be even better to avoid yielding anything until we have a solid use-case for that

Contributor Author

I agree 100%, I wasn't sure what to yield to the job. I'll remove the counter.

@angeloocana
Contributor Author

Hey @rporrasluc and @nvasilevski, thanks for the suggestions!

Good call on removing the pluck(:id); it is definitely not needed, but the result is still not perfect: instead of 5 queries there are now 4. Do you understand why?

Here is the example code, I replaced the real table name with Xyz:

class DeleteXyzJob < ApplicationJob
  queue_as :low

  def build_enumerator(params, cursor:)
    enumerator_builder.active_record_on_batch_relations(
      Xyz.where(
        shop_id: params[:shop_id],
        owner_id: params[:owner_id]
      ),
      cursor: cursor,
      batch_size: 100
    )
  end

  def each_iteration(xyz_relation, params)
    xyz_relation.delete_all
  end
end
1. Query for ids:
SELECT `xyz`.`id` FROM `xyz` WHERE `xyz`.`shop_id` = 26371970 AND `xyz`.`owner_id` = 80485634 ORDER BY xyz.id LIMIT 100
2. Query for ids and shop_id (not sure why this triggers):
SELECT `xyz`.`id`, `xyz`.`shop_id` FROM `xyz` WHERE `xyz`.`shop_id` = 26371970 AND `xyz`.`owner_id` = 80485634 AND `xyz`.`id` IN (585506933, 585506934, 585506935) ORDER BY xyz.id
3. Delete:
DELETE FROM `xyz` WHERE `xyz`.`id` IN (585506933, 585506934, 585506935) AND `xyz`.`shop_id` = 26371970
4. Run query 1 again to check whether the job should do another iteration.

angeloocana force-pushed the add-while-enumerator branch from f0d26be to fc5ca4b on January 8, 2024 at 16:45
@adrianna-chang-shopify
Contributor

@angeloocana sorry I'm jumping in a little late here, but that second query you've highlighted is specific to code in our private repo. I can elaborate further on the related PR, but in the context of job-iteration, we'd see the same number of queries for the batch relation enumerator as for the while enumerator introduced here (what @nvasilevski described here).

Perhaps we could run a benchmark to compare the performance of the two?

@rporrasluc regarding

Building the relation won't perform any query, we can call delete_all on the relation and it will trigger a DELETE with conditions and a limit. We do not need to pluck ids or check if the relation EXISTS unless I am missing something very obvious.

We do pluck the ids here:

def pluck_columns(relation)
  if @pluck_columns.size == 1 # only the primary key
    column_values = relation.pluck(*@pluck_columns)
    return [column_values, column_values]
  end

  column_values = relation.pluck(*@pluck_columns)
  primary_key_index = @primary_key_index || -1
  primary_key_values = column_values.map { |values| values[primary_key_index] }

  serialize_column_values!(column_values)
  [column_values, primary_key_values]
end

This is so that we can build a relation that is already filtered by the primary key values, and yield it to #each_iteration:

@base_relation.where(@primary_key => ids)
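For a self-contained picture of that flow, here is a toy rendition in plain Ruby. A hypothetical ToyRelation object replaces ActiveRecord, serialize_column_values! is omitted, and the helper takes its columns as arguments instead of instance variables; the logic otherwise follows the quoted pluck_columns.

```ruby
# Toy stand-in for an ActiveRecord relation: rows are hashes and
# pluck maps out the requested columns.
ToyRelation = Struct.new(:rows) do
  def pluck(*cols)
    return rows.map { |r| r[cols.first] } if cols.size == 1
    rows.map { |r| cols.map { |c| r[c] } }
  end
end

# Same shape as the quoted pluck_columns: return the plucked column
# values plus the primary-key values used to rebuild the batch relation.
def pluck_columns(relation, columns, primary_key_index = nil)
  if columns.size == 1 # only the primary key
    column_values = relation.pluck(*columns)
    return [column_values, column_values]
  end

  column_values = relation.pluck(*columns)
  index = primary_key_index || -1
  primary_key_values = column_values.map { |values| values[index] }
  [column_values, primary_key_values]
end

rel = ToyRelation.new([{ id: 1, shop_id: 9 }, { id: 2, shop_id: 9 }])
_values, ids = pluck_columns(rel, [:id, :shop_id], 0)
ids # => [1, 2]  (the enumerator then yields base_relation.where(id: ids))
```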

@angeloocana
Contributor Author

@adrianna-chang-shopify Thanks for asking for a benchmark, that is what we really needed!

Here are the astonishing results:

Batches of 1000

| Rows    | while        | active_record_on_batch_relations |
|---------|--------------|----------------------------------|
| 10      | 0.493995923s | 0.524343613s                     |
| 100     | 0.558978765s | 0.565930096s                     |
| 1000    | 0.540043806s | 0.543253054s                     |
| 10000   | 0.938633587s | 1.368673145s                     |
| 100000  | 27.29757184s | 16.55144373s                     |
| 1000000 | 120.7427221s | 89.3870157s                      |

Both solutions have similar performance up to a few thousand rows, with the while solution being slightly faster. But when the number of records reaches 100k or more, active_record_on_batch_relations is almost twice as fast 🤯🤯🤯

Batches of 100

| Rows   | while        | active_record_on_batch_relations |
|--------|--------------|----------------------------------|
| 100000 | 120.6691102s | 49.36938663s                     |

I also ran one test using batches of 100 items (the value from the original PR); it takes 3 to 5 times longer than 1k batches in both solutions.
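Part of the batch-size effect is simply iteration count: each batch pays a fixed per-iteration overhead (queries, enumerator round trip, checkpointing), so a 10x smaller batch means roughly 10x as many iterations. A rough in-memory stand-in (real timings depend entirely on the database):

```ruby
# Count how many batch iterations it takes to drain row_count rows.
# An array stands in for the table; shift plays the bounded DELETE.
def iterations_needed(row_count, batch_size)
  rows = Array.new(row_count) { |i| i }
  iterations = 0
  until rows.empty?
    rows.shift(batch_size)
    iterations += 1
  end
  iterations
end

iterations_needed(100_000, 100)   # => 1000
iterations_needed(100_000, 1_000) # => 100
```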

Average individual query time

Using batches of 1000

While

  1. Exists? ~160ms
  2. Delete all ~300ms

active_record_on_batch_relations

  1. Pluck on id (table to delete data) ~15ms
  2. Shop Pluck (shops table) ~0.8ms
  3. ShopRoute Pluck (shops table) ~0.8ms
  4. Pluck on id and shop_id (table to delete data) ~20ms
  5. Delete all ~15ms

Conclusion

The solution using active_record_on_batch_relations performs more queries and transfers more data, but it runs faster because it makes better use of the database indexes.

Example queries

Exists for the while solution:

SELECT 1 AS one FROM `xyz` WHERE `xyz`.`shop_id` = 26371970 AND `xyz`.`owner_id` = 80485674 LIMIT 1

Pluck on id for active_record_on_batch_relations:

SELECT `xyz`.`id` FROM `xyz` WHERE `xyz`.`shop_id` = 26371970 AND `xyz`.`owner_id` = 80485673 AND (xyz.id > '598770502') ORDER BY xyz.id LIMIT 1000

Both queries filter the data using the same columns, but the active_record_on_batch_relations query also filters by the id value from the last cursor and adds an ORDER BY. I think this lets the database know exactly which part of the index to use, making the query a lot faster.

angeloocana closed this Jan 9, 2024
@rporrasluc

We do pluck the ids here:

Thanks @adrianna-chang-shopify, that was the obvious part I was missing 😅

@nvasilevski
Contributor

@angeloocana thanks for the investigation! To be honest I'm a little surprised that exists?, which is a SELECT 1, takes 160ms; I would expect it to be much faster as long as there is an index on (shop_id, owner_id). But I guess if the ~300ms DELETE measurement is correct it won't matter, since that still makes the while approach slower.

To be honest I wasn't expecting these results. I was mainly leaning towards active_record_on_batch_relations because it's a well-tested approach that is unlikely to bring any surprises while being only unnoticeably slower. I was also expecting deletion by primary key to be somewhat healthier for the platform, but it looks like it ended up being significantly faster.

Anyway, I wanted to thank you for your proposal, as I find the solution very clean. Even though our current use-case may not need a new enumerator, I can totally see how someone in the future may need "an enumerator that allows the job to run in an interruptible manner until a certain condition is met", so I wouldn't be surprised if this PR gets re-opened or its implementation borrowed for a similar proposal.

Great work!
