🚀 [READY FOR REVIEW] MONGOID-5556: #tally should support splatting array results #5541

johnnyshields · 2023-02-08T21:40:39Z

This uses the $unwind aggregation operator under the hood.

It turned out pretty nicely; it works for embedded and localized fields too.

Using this code I was able to run an array tally on a collection with 100,000,000 docs in 5 minutes 😎
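To illustrate what "splatting" means for a tally, here is a minimal in-memory sketch in plain Ruby. It is not the PR's implementation: `splat_tally` and the sample documents are hypothetical, and the sketch does not model how missing or nil values are treated by the server.

```ruby
# Hypothetical helper: tallies a field across documents (plain hashes).
# With splat_arrays: true, each array element is counted individually;
# otherwise the whole array value is counted as a single key.
def splat_tally(docs, field, splat_arrays: false)
  values = docs.map { |doc| doc[field] }
  values = values.flat_map { |value| value.is_a?(Array) ? value : [value] } if splat_arrays
  values.tally
end

docs = [
  { 'tags' => %w[ruby mongo] },
  { 'tags' => %w[ruby] },
  { 'tags' => %w[ruby mongo] }
]

splat_tally(docs, 'tags')
# counts whole arrays: { ["ruby", "mongo"] => 2, ["ruby"] => 1 }

splat_tally(docs, 'tags', splat_arrays: true)
# counts individual elements: { "ruby" => 3, "mongo" => 2 }
```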

lib/mongoid/contextual/memory.rb

alexbevi · 2023-02-09T19:35:44Z

lib/mongoid/contextual/mongo.rb

-        pipeline.unshift("$match" => view.filter) unless view.filter.blank?
+        pipeline = []
+        pipeline << { "$match" => view.filter } if view.filter.present?
+        pipeline << { "$project" => { "#{projected}" => "$#{name}" } } if projected


Having a $project stage in the middle of a pipeline can negatively impact performance.

I would be extremely wary of introducing a change like this.

See https://www.mongodb.com/docs/rapid/core/aggregation-pipeline-optimization/#-project-stage-placement

@alexbevi if you don't use $project, the $unwind on nested fields will yield an empty result. I have raised SERVER-73888 for this because I believe it is a bug; other operators such as $group handle nested fields just fine.

As you can see in the PR code, the $project is only used when it is needed, i.e. when both of the following hold:

the :splat_arrays arg is true (note it is false by default), AND

the field arg is a nested field (either a . is present, or it is a translation)

# Must add a $project stage when using $unwind with nested fields
projected = 'p' if splat_arrays && (is_translation || name.include?('.'))

Note that the code in my PR does not apply the $project stage under default conditions, so there is no behavior change for any existing usage of the #tally method.
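The stage ordering described above can be sketched as a standalone pipeline builder. This is a hedged illustration, not the actual Mongoid source: `tally_pipeline` and its `filter:` keyword (standing in for `view.filter`) are hypothetical, and the `$unwind` and `$group` stages are assumptions about how the rest of the pipeline is shaped, since the diff excerpt only shows the `$match` and `$project` lines.

```ruby
# Hypothetical sketch of the pipeline construction; not the PR's code.
def tally_pipeline(name, filter: {}, splat_arrays: false, is_translation: false)
  # A $project stage is only added when unwinding a nested field.
  projected = 'p' if splat_arrays && (is_translation || name.include?('.'))
  field = projected || name

  pipeline = []
  pipeline << { '$match' => filter } unless filter.empty?
  pipeline << { '$project' => { projected => "$#{name}" } } if projected
  # Assumed stages: unwind the array, then count occurrences per value.
  pipeline << { '$unwind' => "$#{field}" } if splat_arrays
  pipeline << { '$group' => { '_id' => "$#{field}", 'counts' => { '$sum' => 1 } } }
  pipeline
end

# Default usage adds no $project and no $unwind:
tally_pipeline('addresses.city').map { |stage| stage.keys.first }
# only a "$group" stage

# Nested field with splatting adds the $project workaround:
tally_pipeline('addresses.city', splat_arrays: true).map { |stage| stage.keys.first }
# "$project", "$unwind", "$group"
```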

Re: performance, fortunately I have a large production database, so we can try it and see what happens 😄. Using it on a MongoDB 6.0.x collection with 109 million+ records:

Foobar.estimated_count
#=> 109601406

# non-nested array<string> field, does *not* use $project
Foobar.tally('tags', splat_arrays: true)
#=> 481 seconds
#=> 761774 tally keys, total tallied value count 8031404

# nested field (array<document> -> string), uses $project
Foobar.tally('addresses.city', splat_arrays: true)
#=> 592 seconds
#=> 65362 tally keys, total tallied value count 3284490

CPU, memory, disk util etc. were all equivalent & healthy on both throughout. Clearly, nothing catastrophic/"hockey-stick" happens when adding $project where it is needed due to as-is server behavior.

Note: (a) I had already run the tally on tags earlier, but not yet on addresses.city, so caches were possibly pre-warmed for tags; (b) there is an index on tags, not on addresses.city. If anything, the "with $project" case should be even better than the above.
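For reference, the two timings can be turned into rough throughput figures. This is only arithmetic on the numbers reported above; since the two runs scan different fields with different index and cache conditions, the ratio is not a controlled measurement of the cost of $project.

```ruby
total_docs   = 109_601_406
tags_seconds = 481.0 # 'tags', no $project
city_seconds = 592.0 # 'addresses.city', with $project

tags_rate = total_docs / tags_seconds # documents scanned per second
city_rate = total_docs / city_seconds
slowdown  = city_seconds / tags_seconds

puts format('tags: %d docs/s', tags_rate.round)
puts format('addresses.city: %d docs/s', city_rate.round)
puts format('ratio: %.2fx', slowdown)
```

That works out to roughly 228k versus 185k documents per second, i.e. the nested run took about 1.23x as long.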

TLDR;

One final note:

The link you posted, https://www.mongodb.com/docs/rapid/core/aggregation-pipeline-optimization/#-project-stage-placement, does not say that putting a $project at the beginning will hurt performance; it merely says it will not improve performance. (Of course, we aren't using it for performance but rather to work around a bug.)

If SERVER-73888 is resolved in a future server release, the $project we have here won't affect the behavior, it will just become unnecessary.

alexbevi · 2023-03-13T16:16:17Z

Closing this PR. We can revisit array tally functionality once SERVER-59713 is available as this would provide a huge performance boost over a client-side implementation.

johnnyshields · 2023-03-13T16:23:35Z

@alexbevi what do you mean by "client-side implementation"?

SERVER-59713 is very unlikely to change any performance here at all--there is no evidence that it will. It will just add a virtual $project stage where this PR declares it explicitly.

On the contrary, my data above demonstrates there is not a significant difference between the $project/non-$project cases, i.e. we should not expect any performance change from SERVER-59713.

Please reopen this.

alexbevi · 2023-03-13T16:32:57Z

> SERVER-59713 is very unlikely to change any performance here at all--there is no evidence that it will. It will just add a virtual $project stage where this PR declares it explicitly.

I appreciate your feedback, however I am not best suited to make this determination. I will have to defer to our query execution teams to benchmark this appropriately once they've implemented the functionality.

We have this PR accessible via MONGOID-5556 when the time comes to review this functionality again.

johnnyshields · 2023-03-13T17:32:17Z

@alexbevi The question here should not be about potential performance improvements from SERVER-59713. SERVER-59713 is not a performance ticket; it is merely a functional bug fix. (Moreover, there is no timeline to deliver it.)

The question should be: "does adding a $project stage in front of an $unwind on the current MongoDB versions cause any performance regression?"

My benchmark data above clearly indicates that it does not, at least not any sort of hockey-stick or O(n^2) explosion, even when working with a 100 million document collection.

If you still have concerns, why not ask someone at MongoDB who is knowledgeable about the aggregation pipeline about the effects of putting a $project before an $unwind?

I'm also happy to provide additional benchmarks if you can provide specifics of what you'd like to see.
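As a rough idea of what such a benchmark could look like, here is a minimal timing harness in plain Ruby. Everything in it is hypothetical (`time_runs`, `median`, and the stand-in lambdas); in a real comparison the two lambdas would run the same tally on the same field, once with and once without the $project stage, against the same collection.

```ruby
# Hypothetical harness: times a block over several runs with a monotonic clock.
def time_runs(runs: 3)
  Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end

# Median is less sensitive than the mean to a single cold-cache run.
def median(samples)
  sorted = samples.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Stand-in workloads only; replace with the two aggregation calls.
without_project = -> { 10_000.times.sum }
with_project    = -> { 10_000.times.map { |i| i }.sum }

baseline  = median(time_runs { without_project.call })
candidate = median(time_runs { with_project.call })
puts format('with/without ratio: %.2f', candidate / baseline)
```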

johnnyshields · 2023-03-13T17:55:32Z

I'd also like to remind you that the $project workaround is ONLY applied for nested fields, i.e. it's a workaround for a corner case. We're losing the forest for the trees here.

Would you be willing to merge this if I remove that workaround (i.e. drop support for nested fields), and then we add the workaround once additional homework is done?

amitbeck · 2023-03-16T10:31:28Z

Hi, reporter of SERVER-59713 here.

> SERVER-59713 is not a performance ticket, it is merely a functional bug fix.

Indeed, it's either a feature request or functional bug fix. My sole concern is the misalignment and unexpected behavior - $projecting "$items.name" results in an array (as expected), but attempting to $unwind that field results in nothing although "$items.name" is allegedly an array.
Seems like a legitimate issue to me, but it got little attention in the past 1.5 years.
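The first half of that observation (a dotted path over an array of subdocuments evaluating to an array) can be modeled in a few lines of plain Ruby. This is a simplified, hypothetical emulation for illustration only; `resolve_path` is not a MongoDB or Mongoid API, and the server's real path semantics have more cases than this sketch covers.

```ruby
# Simplified model of evaluating a dotted field path such as "items.name".
# When the path reaches an array of subdocuments, the remaining path is
# applied to each element, so the overall result is an array.
def resolve_path(value, path)
  key, rest = path.split('.', 2)
  case value
  when Array
    value.map { |element| resolve_path(element, path) }.compact
  when Hash
    child = value[key]
    rest ? resolve_path(child, rest) : child
  end
end

doc = { 'items' => [{ 'name' => 'a' }, { 'name' => 'b' }] }

resolve_path(doc, 'items.name')
# the array a $project of "$items.name" would carry: ["a", "b"]
```

Under this model the projected value is an ordinary top-level array, which is why unwinding the projected field behaves as expected while, per the reports in this thread, unwinding the nested path directly does not.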

johnnyshields · 2023-04-11T09:06:49Z

Merged into Mongoid Ultra 🎸

#tally should support splatting array results

98515b3

johnnyshields marked this pull request as draft February 8, 2023 21:40

johnnyshields added 3 commits February 9, 2023 06:46

More WIP

28bdf79

Refactoring for clarity

98315fb

Use $unwind operator

dc1561c

johnnyshields changed the title ~~DRAFT: #tally should support splatting array results~~ [DRAFT] MONGOID-5556: #tally should support splatting array results Feb 8, 2023

johnnyshields commented Feb 9, 2023

lib/mongoid/contextual/memory.rb (resolved)

johnnyshields added 3 commits February 9, 2023 23:08

Add support for $unwind + nested fields

811d1f6

Add docs

ce20201

Fix contextual/memory

2821753

johnnyshields changed the title ~~[DRAFT] MONGOID-5556: #tally should support splatting array results~~ [READY FOR REVIEW] MONGOID-5556: #tally should support splatting array results Feb 9, 2023

johnnyshields marked this pull request as ready for review February 9, 2023 14:31

johnnyshields added 3 commits February 9, 2023 23:37

Clarify comment

b9521ae

Get rid of whitespace

5270bd4

Be positive

aa24c1b

alexbevi reviewed Feb 9, 2023

johnnyshields added 3 commits February 20, 2023 07:46

Update mongo.rb

3751178

Update mongo.rb

2871753

Update mongo.rb

c8766ce

alexbevi closed this Mar 13, 2023

This was referenced Apr 11, 2023

Mongoid Ultra Roadmap tablecheck/mongoid-ultra#13

Open

MONGOID-5556: #tally should support :unwind arg to splat array results tablecheck/mongoid-ultra#19

Merged

johnnyshields changed the title ~~[READY FOR REVIEW] MONGOID-5556: #tally should support splatting array results~~ 🚀 [READY FOR REVIEW] MONGOID-5556: #tally should support splatting array results Apr 11, 2023
