-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🚀 [READY FOR REVIEW] MONGOID-5556: #tally should support splatting array results #5541
Conversation
pipeline.unshift("$match" => view.filter) unless view.filter.blank? | ||
pipeline = [] | ||
pipeline << { "$match" => view.filter } if view.filter.present? | ||
pipeline << { "$project" => { "#{projected}" => "$#{name}" } } if projected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a $project
stage in the middle of a pipeline can negatively impact performance.
I would be extremely wary of introducing a change like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexbevi if you don't use $project
, the $unwind
on nested fields will yield an empty result. I have raised SERVER-73888 for this because I believe it is a bug; other operators such as $group
handle nested fields just fine.
As you can see in the PR code, the $project
is only used in the case it needed:
- the
:splat_arrays
arg istrue
(note it isfalse
by default) AND - the
field
arg is a nested field (either.
is present, or it is translation)
# Must add a $project stage when using $unwind with nested fields
projected = 'p' if splat_arrays && (is_translation || name.include?('.'))
Note that the code in my PR does not apply the $project
stage under default conditions, so there is no behavior change for any existing usage of the #tally
method.
Re: performance, fortunately I have a large production database we can try it and see what happens😄. Using it on MongoDB 6.0.x collection with 109 million+ records:
Foobar.estimated_count #=> 109601406
# non-nested array<string> field, does *not* use $project
Foobar.tally('tags', splat_arrays: true)
#=> 481 seconds
#=> 761774 tally keys, total tallied value count 8031404
# nested field (array<document> -> string), uses $project
Foobar.tally('addresses.city', splat_arrays: true)
#=> 592 seconds
#=> 65362 tally keys, total tallied value count 3284490
CPU, memory, disk util etc. were all equivalent & healthy on both throughout. Clearly, nothing catastrophic/"hockey-stick" happens when adding $project
where it is needed due to as-is server behavior.
- Note: (a) I had already run the tally on
tags
earlier, but not yet onaddresses.city
, so caches were possibly pre-warmed fortags
; (b) there is an index ontags
, not onaddresses.city
. If anything, the "with $project" case should be even better than the above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One final note:
- The link you posted: https://www.mongodb.com/docs/rapid/core/aggregation-pipeline-optimization/#-project-stage-placement does not say putting a
$project
in the beginning will hurt performance, it merely says it will not improve performance. (Of course, we aren't using it for performance but rather to workaround a bug.) - If SERVER-73888 is resolved in a future server release, the
$project
we have here won't affect the behavior, it will just become unnecessary.
Closing this PR. We can revisit array tally functionality once SERVER-59713 is available as this would provide a huge performance boost over a client-side implementation. |
@alexbevi what do you mean by "client-side implementation"? SERVER-59713 is very unlikely to change any performance here at all--there is no evidence that it will. It will just add a virtual On the contrary, my data above demonstrates there is not a significant difference between the $project/non-$project cases, i.e. we should not expect any performance change from SERVER-59713. Please reopen this. |
I appreciate your feedback, however I am not best suited to make this determination. I will have to defer to our query execution teams to benchmark this appropriately once they've implemented the functionality. We have this PR accessible via MONGOID-5556 when the time comes to review this functionality again. |
@alexbevi The question here should not be about potential performance improvements SERVER-59713. SERVER-59713 is not a performance ticket, it is merely a functional bug fix. (Moreover, there is no timeline to deliver it.) The question should be: "does adding a $project stage in front of an $unwind on the current MongoDB versions cause any performance regression?" My benchmark data above clearly indicates that it does not, at least not any sort of hockey-stick or O(n^2) explosion, even when working with a 100 million document collection. If you still have concern, why not ask someone at MongoDB who is knowledgeable about the aggregation pipeline about the effects of putting a $project before an $unwind? I'm also happy to provide additional benchmarks if you can provide specifics of what you'd like to see. |
I'd also like to remind that the $project workaround is ONLY applied for nested fields, i.e. its a workaround for a corner-case. We're losing the forest for the trees here. Would you be willing to merge this if I remove that workaround (i.e. drop support for nested fields), and then we add the workaround once additional homework is done? |
Hi, reporter of SERVER-59713 here.
Indeed, it's either a feature request or functional bug fix. My sole concern is the misalignment and unexpected behavior - |
Merged into Mongoid Ultra 🎸 |
This uses the
$unwind
aggregation operator under the hood.It turned out pretty nicely, it works for embedded and localized fields too.
Using this code I was able to run an array tally on a collection with 100,000,000 docs in 5 minutes 😎