
[Concurrent Segment Search] Perform buildAggregation in parallel #11673

Closed
jed326 opened this issue Dec 26, 2023 · 2 comments · Fixed by #12697
Assignees: jed326
Labels: enhancement, Search:Performance, v2.14.0, v3.0.0

Comments

@jed326
Collaborator

jed326 commented Dec 26, 2023

Is your feature request related to a problem? Please describe

In the current concurrent search paradigm we create slice_count times the number of collectors compared to non-concurrent search, and collect is called concurrently for these collectors. Afterwards, reduce is called sequentially on the search threadpool for all of these collectors.
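To make the shape of this flow concrete, here is a minimal self-contained sketch; the names (indexSearcherPool, collectSlice, buildAggregation) are illustrative stand-ins, not the actual OpenSearch classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simplified model of the current flow: collect() fans out per slice,
// but buildAggregation/reduce for every slice collector still runs
// one-by-one on the single search thread.
public class SequentialBuildAggregationSketch {

    public static void main(String[] args) throws Exception {
        int sliceCount = 4;
        ExecutorService indexSearcherPool = Executors.newFixedThreadPool(sliceCount);

        // Phase 1: collect runs concurrently, one task per slice.
        List<Future<long[]>> collected = new ArrayList<>();
        for (int slice = 0; slice < sliceCount; slice++) {
            final int s = slice;
            collected.add(indexSearcherPool.submit(() -> collectSlice(s)));
        }

        // Phase 2: buildAggregation runs sequentially on the search thread,
        // once per slice collector -- the bottleneck described above.
        List<Long> perSliceBuckets = new ArrayList<>();
        for (Future<long[]> f : collected) {
            perSliceBuckets.add(buildAggregation(f.get()));
        }
        System.out.println("built " + perSliceBuckets.size() + " slice aggregations sequentially");
        indexSearcherPool.shutdown();
    }

    // Placeholder for per-slice document collection.
    static long[] collectSlice(int slice) {
        return new long[] { slice };
    }

    // Placeholder for the (potentially expensive) bucket-building work.
    static long buildAggregation(long[] collectedDocs) {
        return collectedDocs.length;
    }
}
```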

There are 2 main scenarios where this sequential behavior can be problematic: [1] when the aggregators are nested and buildAggregation needs to BFS/DFS through the collector tree, and [2] when buildAggregation itself is an expensive operation for the given collector.

One such example is the keyword-terms-numeric-terms query, which is a nested terms aggregation. Even in the non-concurrent search case, nested terms aggregations suffer from a combinatorial explosion of buckets for each additional nested layer, and for concurrent search this problem is essentially multiplied by slice_count because the bucket creation during buildAggregation is done sequentially. The combinatorial explosion was partially addressed by #11585; however, the sequential work is still a bottleneck.
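For reference, the slice_size=1150 in the profile below follows from applying the 1.5 * x + 10 formula twice, assuming the default terms-aggregation shard_size heuristic of 1.5 * size + 10:

```
size       = 500
shard_size = 1.5 * size       + 10 = 1.5 * 500 + 10 = 760
slice_size = 1.5 * shard_size + 10 = 1.5 * 760 + 10 = 1150
```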

Here is a basic query profiler breakdown to further illustrate this point:

(Cells containing two values show time_in_nanos followed by the millisecond equivalent in parentheses.)

| Metric | weather-data-2016 - disabled - size=500 (in ms) | weather-data-2016 - slice=2 - slice_size=1150 (1.5 * shard_size + 10, size=500) (in ms) |
| --- | --- | --- |
| Took | 10918 | 17567 |
| ConstantScoreQuery - time_in_nanos | 3021298801 (3021.2988) | 8023 |
| MatchAllDocsQuery - time_in_nanos | 1087697018 (1087.69702) | 2338 |
| rewrite_time | 27406 | 10733 |
| collector - time_in_nanos | 4758597960 (4758.59796) | 3919422108 (3919.42211) |
| GlobalOrdinalsStringTermsAggregator - time_in_nanos | 7655394436 (7655.39444) | 9238307858 (9238.30786) |
| collect | 2670211530 (2670.21153) | 2349533379 (2349.53338) |
| avg_collect | | 2223026375 (2223.02638) |
| collect_count | 33659481 | 33659481 |
| build_aggregation | 4976192164 (4976.19216) | 12500064156 (12500.06416) |
| avg_build_aggregation | | 6250026755 (6250.02676) |
| build_aggregation_count | 1 | 2 |
| NumericTermsAggregator - time_in_nanos | 4167152226 (4167.15223) | 5953806596 (5953.8066) |
| collect | 281349935 (281.34994) | 7061608537 (7061.60854) |
| avg_collect | | 271087497 (271.0875) |
| collect_count | 832960 | 1329446 |
| build_aggregation | 3885328997 (3885.329) | 11216763404 (11216.7634) |
| avg_build_aggregation | | 5220390427 (5220.39043) |
| build_aggregation_count | 1 | 2 |
| MaxAggregator - time_in_nanos | 540335 (0.54034) | 2229363 (2.22936) |
| collect | 458491 (0.45849) | 5622515353 (5622.51535) |
| avg_collect | | 1576283 (1.57628) |
| collect_count | 8360 | 59800 |
| build_aggregation | 69834 (0.06983) | 5617813957 (5617.81396) |
| avg_build_aggregation | | 251549 (0.25155) |
| build_aggregation_count | 1 | 2 |

We can see that for the NumericTermsAggregator, build_aggregation takes 3-4x as long and runs sequentially, and the combinatorial explosion shows up in the collect_count.

Describe the solution you'd like

The additional combinatorial explosion is due to slice_size > shard_size, which is something we can revisit (i.e. does slice_size really need to be 1.5 * shard_size + 10, or can slice_size == shard_size?). However, the long pole is actually build_aggregation as a whole taking a long time and happening sequentially. I propose that we move the buildAggregation steps to processPostCollection so that they can happen in parallel on the index_searcher threads.
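As a rough sketch of the proposed shape (reusing the hypothetical helpers from the sketch earlier in this issue, not the actual processPostCollection code), buildAggregation folds into the per-slice task so it executes on the index_searcher pool, leaving only the final combine on the search thread:

```java
// Proposed shape: buildAggregation moves into the per-slice task and runs
// on the index_searcher pool during post-collection.
List<Future<Long>> perSlice = new ArrayList<>();
for (int slice = 0; slice < sliceCount; slice++) {
    final int s = slice;
    perSlice.add(indexSearcherPool.submit(() -> {
        long[] docs = collectSlice(s);   // collect, as before
        return buildAggregation(docs);   // now built on the index_searcher thread
    }));
}

// The search thread only gathers and combines the already-built results.
List<Long> results = new ArrayList<>();
for (Future<Long> f : perSlice) {
    results.add(f.get());
}
```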

I will follow up with a PR for this change as well as some benchmarking data for further discussion.

Related component

Search:Performance

Describe alternatives you've considered

No response

Additional context

No response

@jed326 jed326 added enhancement Enhancement or improvement to existing feature or request untriaged labels Dec 26, 2023
@jed326 jed326 self-assigned this Dec 26, 2023
@jed326 jed326 removed the untriaged label Dec 27, 2023
@jed326
Collaborator Author

jed326 commented Dec 27, 2023

I added a small POC for this here: jed326@42ba232

However, there is an AssertingLeafReader problem: the global ordinal doc values are created on the search thread but we are now trying to access them on the index_searcher thread, so Lucene throws an exception for all aggs that use global ordinals.

As of now I don't see a way forward to parallelize build_aggregation as-is for any agg that uses global ordinals. We will also need to evaluate whether the lookupOrd() method is thread-safe.
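For illustration, here is a minimal sketch of the hand-off that trips those assertions; the leafReader parameter, the "category" field, and the single-thread executor are assumed for the example and are not taken from the POC:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;

class GlobalOrdinalsThreadHandOff {
    // Illustrates the problematic hand-off: doc values created on one thread,
    // looked up on another. leafReader is assumed to come from an open index.
    static BytesRef lookupOnAnotherThread(LeafReader leafReader) throws Exception {
        // Doc values iterator created on the calling (search) thread.
        SortedSetDocValues ords = DocValues.getSortedSet(leafReader, "category");

        ExecutorService indexSearcherPool = Executors.newSingleThreadExecutor();
        try {
            // lookupOrd now runs on a different (index_searcher) thread.
            // AssertingLeafReader in the Lucene test framework asserts that
            // doc values are used on the thread that created them, so tests
            // fail here; plain readers give no thread-safety guarantee either.
            return indexSearcherPool.submit(() -> ords.lookupOrd(0)).get();
        } finally {
            indexSearcherPool.shutdown();
        }
    }
}
```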

@reta
Collaborator

reta commented Dec 27, 2023

> However, the long pole is actually build_aggregation as a whole taking a long time and happening sequentially. I propose that we move the buildAggregation steps to processPostCollection so that they can happen in parallel on the index_searcher threads.

Besides the AssertingLeafReader problem, we would be keeping the index_searcher threads busy for a longer time, potentially impacting other searches and aggregations (since the index searcher pool is shared between all indices on the node).
