Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🚀 [READY FOR REVIEW] MONGOID-5556: #tally should support splatting array results #5541

Closed
wants to merge 13 commits into from

Conversation

johnnyshields
Copy link
Contributor

@johnnyshields johnnyshields commented Feb 8, 2023

This uses the $unwind aggregation operator under the hood.

It turned out pretty nicely, it works for embedded and localized fields too.

Using this code I was able to run an array tally on a collection with 100,000,000 docs in 5 minutes 😎

@johnnyshields johnnyshields marked this pull request as draft February 8, 2023 21:40
@johnnyshields johnnyshields changed the title DRAFT: #tally should support splatting array results [DRAFT] MONGOID-5556: #tally should support splatting array results Feb 8, 2023
@johnnyshields johnnyshields changed the title [DRAFT] MONGOID-5556: #tally should support splatting array results [READY FOR REVIEW] MONGOID-5556: #tally should support splatting array results Feb 9, 2023
@johnnyshields johnnyshields marked this pull request as ready for review February 9, 2023 14:31
pipeline.unshift("$match" => view.filter) unless view.filter.blank?
pipeline = []
pipeline << { "$match" => view.filter } if view.filter.present?
pipeline << { "$project" => { "#{projected}" => "$#{name}" } } if projected
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a $project stage in the middle of a pipeline can negatively impact performance.

I would be extremely wary of introducing a change like this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@johnnyshields johnnyshields Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexbevi if you don't use $project, the $unwind on nested fields will yield an empty result. I have raised SERVER-73888 for this because I believe it is a bug; other operators such as $group handle nested fields just fine.

As you can see in the PR code, the $project is only used in the case it needed:

  • the :splat_arrays arg is true (note it is false by default) AND
  • the field arg is a nested field (either . is present, or it is translation)
        # Must add a $project stage when using $unwind with nested fields
        projected = 'p' if splat_arrays && (is_translation || name.include?('.'))

Note that the code in my PR does not apply the $project stage under default conditions, so there is no behavior change for any existing usage of the #tally method.

Re: performance, fortunately I have a large production database we can try it and see what happens😄. Using it on MongoDB 6.0.x collection with 109 million+ records:

Foobar.estimated_count #=> 109601406

# non-nested array<string> field, does *not* use $project
Foobar.tally('tags', splat_arrays: true)
#=> 481 seconds
#=> 761774 tally keys, total tallied value count 8031404

# nested field (array<document> -> string), uses $project
Foobar.tally('addresses.city', splat_arrays: true)
#=> 592 seconds
#=> 65362 tally keys, total tallied value count 3284490

CPU, memory, disk util etc. were all equivalent & healthy on both throughout. Clearly, nothing catastrophic/"hockey-stick" happens when adding $project where it is needed due to as-is server behavior.
image

  • Note: (a) I had already run the tally on tags earlier, but not yet on addresses.city, so caches were possibly pre-warmed for tags; (b) there is an index on tags, not on addresses.city. If anything, the "with $project" case should be even better than the above.

TLDR;
image

Copy link
Contributor Author

@johnnyshields johnnyshields Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One final note:

@alexbevi
Copy link
Contributor

Closing this PR. We can revisit array tally functionality once SERVER-59713 is available as this would provide a huge performance boost over a client-side implementation.

@alexbevi alexbevi closed this Mar 13, 2023
@johnnyshields
Copy link
Contributor Author

johnnyshields commented Mar 13, 2023

@alexbevi what do you mean by "client-side implementation"?

SERVER-59713 is very unlikely to change any performance here at all--there is no evidence that it will. It will just add a virtual $project stage where this PR declares it explicitly.

On the contrary, my data above demonstrates there is not a significant difference between the $project/non-$project cases, i.e. we should not expect any performance change from SERVER-59713.

Please reopen this.

@alexbevi
Copy link
Contributor

SERVER-59713 is very unlikely to change any performance here at all--there is no evidence that it will. It will just add a virtual $project stage where this PR declares it explicitly.

I appreciate your feedback, however I am not best suited to make this determination. I will have to defer to our query execution teams to benchmark this appropriately once they've implemented the functionality.

We have this PR accessible via MONGOID-5556 when the time comes to review this functionality again.

@johnnyshields
Copy link
Contributor Author

johnnyshields commented Mar 13, 2023

@alexbevi The question here should not be about potential performance improvements SERVER-59713. SERVER-59713 is not a performance ticket, it is merely a functional bug fix. (Moreover, there is no timeline to deliver it.)

The question should be: "does adding a $project stage in front of an $unwind on the current MongoDB versions cause any performance regression?"

My benchmark data above clearly indicates that it does not, at least not any sort of hockey-stick or O(n^2) explosion, even when working with a 100 million document collection.

If you still have concern, why not ask someone at MongoDB who is knowledgeable about the aggregation pipeline about the effects of putting a $project before an $unwind?

I'm also happy to provide additional benchmarks if you can provide specifics of what you'd like to see.

@johnnyshields
Copy link
Contributor Author

johnnyshields commented Mar 13, 2023

I'd also like to remind that the $project workaround is ONLY applied for nested fields, i.e. its a workaround for a corner-case. We're losing the forest for the trees here.

Would you be willing to merge this if I remove that workaround (i.e. drop support for nested fields), and then we add the workaround once additional homework is done?

@amitbeck
Copy link

amitbeck commented Mar 16, 2023

Hi, reporter of SERVER-59713 here.

SERVER-59713 is not a performance ticket, it is merely a functional bug fix.

Indeed, it's either a feature request or functional bug fix. My sole concern is the misalignment and unexpected behavior - $projecting "$items.name" results in an array (as expected), but attempting to $unwind that field results in nothing although "$items.name" is allegedly an array.
Seems like a legitimate issue to me, but it got little attention in the past 1.5 years.

@johnnyshields johnnyshields changed the title [READY FOR REVIEW] MONGOID-5556: #tally should support splatting array results 🚀 [READY FOR REVIEW] MONGOID-5556: #tally should support splatting array results Apr 11, 2023
@johnnyshields
Copy link
Contributor Author

Merged into Mongoid Ultra 🎸

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants