[Profiling deep dive] Default aggregation vs. optimization code path #14438

Open
bowenlan-amzn opened this issue Jun 18, 2024 · 0 comments
Labels: Performance, Search:Aggregations

We have seen good performance gains from the optimization on range-type aggregations.
In short, the optimization gets aggregation results from the index structure, instead of the default path, which iterates over every document's values and collects them into the results.
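
To make the contrast concrete, here is a minimal sketch of the two paths. It is only an illustration: a sorted Python list stands in for the index structure, and the function names are hypothetical, not OpenSearch or Lucene APIs.

```python
from bisect import bisect_left


def default_range_counts(doc_values, ranges):
    """Default path: visit every document value and place it in its bucket."""
    counts = [0] * len(ranges)
    for value in doc_values:
        for i, (lo, hi) in enumerate(ranges):
            if lo <= value < hi:
                counts[i] += 1
    return counts


def index_range_counts(sorted_values, ranges):
    """Index-style path: derive each bucket count from a sorted structure
    with two binary searches, without visiting individual documents."""
    return [
        bisect_left(sorted_values, hi) - bisect_left(sorted_values, lo)
        for lo, hi in ranges
    ]


doc_values = [5, 17, 42, 8, 23, 42, 99, 0]
ranges = [(0, 10), (10, 50), (50, 100)]  # half-open [lo, hi) buckets

# Assumption: sorting here stands in for an index built at write time.
sorted_values = sorted(doc_values)

print(default_range_counts(doc_values, ranges))   # [3, 4, 1]
print(index_range_counts(sorted_values, ranges))  # [3, 4, 1]
```

The default path does work proportional to the number of documents, while the index-style path does work proportional to the number of buckets (times a search cost), which is where the gain comes from when buckets are few and documents are many.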

Currently, this optimization is only applied to a single aggregation; our next move is to also apply it to aggregations with sub-aggregations. To support sub-aggs, we need to get not only the agg results from the index but also the docID sets, so the sub-agg knows which docs to collect in the second pass (#12602).
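
A rough sketch of that two-pass idea, under the same assumptions as before (a sorted list of `(value, doc_id)` pairs stands in for the index; `bucket_doc_ids` and `sub_agg_sum` are hypothetical names, and a sum is used as an example sub-agg):

```python
from bisect import bisect_left


def bucket_doc_ids(sorted_points, ranges):
    """First pass: for each [lo, hi) range, return the doc IDs whose value
    falls in it. sorted_points is a list of (value, doc_id) sorted by value."""
    values = [value for value, _ in sorted_points]
    buckets = []
    for lo, hi in ranges:
        start, end = bisect_left(values, lo), bisect_left(values, hi)
        buckets.append(sorted(doc_id for _, doc_id in sorted_points[start:end]))
    return buckets


def sub_agg_sum(buckets, sub_field):
    """Second pass: collect the sub-aggregation only over each bucket's docs."""
    return [sum(sub_field[doc_id] for doc_id in doc_ids) for doc_ids in buckets]


price = [5, 17, 42, 8]      # doc_id -> value of the range-aggregated field
quantity = [1, 2, 3, 10]    # doc_id -> value of the sub-aggregated field
ranges = [(0, 10), (10, 50)]

points = sorted((value, doc_id) for doc_id, value in enumerate(price))
buckets = bucket_doc_ids(points, ranges)

print(buckets)                         # [[0, 3], [1, 2]]
print(sub_agg_sum(buckets, quantity))  # [11, 5]
```

Note that once doc IDs have to be materialized, the first pass is no longer just a count, which is part of why the cost of this path needs to be profiled.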

However, even after supporting sub-aggs, the supported use cases may still be limited in some real-world scenarios, because we don't support a user-supplied top-level query alongside the aggregation. Currently the only supported query is a range query on the same field as the aggregation; otherwise it has to be match-all (though we do check this at the segment level).

I haven't experimented with this yet. To support more flexible query execution within the optimization, the query itself would become a conjunction of two groups of queries: the top-level query and the ones built from the range aggregation. Theoretically, this conjunction query could still be faster than the default aggregation, but as the complexity grows we should also understand the low-level query operations more deeply: which parts of the code take the most CPU cycles, which allocate the most memory, and how these compare to the default way of doing aggregation.
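
As a sketch of what that conjunction means per bucket, assuming both sides can be represented as ascending doc ID lists (an assumption for illustration; the actual query execution has not been designed or tested, and the names below are hypothetical):

```python
def intersect_sorted(a, b):
    """Two-pointer intersection of two ascending doc ID lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def conjunction_counts(top_level_docs, bucket_docs):
    """Count, per bucket, the docs matching both the top-level query
    and the range query built for that bucket."""
    return [len(intersect_sorted(top_level_docs, docs)) for docs in bucket_docs]


top_level_docs = [0, 2, 3, 7]             # docs matching the top-level query
bucket_docs = [[0, 3, 5], [1, 2], [4, 6]]  # docs matching each bucket's range

print(conjunction_counts(top_level_docs, bucket_docs))  # [2, 1, 0]
```

Each bucket now pays for an intersection rather than a plain count, so the extra cost scales with the number of buckets. That is the kind of trade-off the profiling should quantify.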

Previously, we created a follow-up task, #13549, to decide on a threshold for applying the optimization, because we sometimes see the optimized path perform worse than the default method, for example on the pmc workload when the dataset is small and the date histogram interval is also small, such as a minute or second interval.
We can merge that into this task, as the research directions are the same.

Some previous work: #13171

bowenlan-amzn self-assigned this Jun 18, 2024
bowenlan-amzn converted this from a draft issue Jun 18, 2024
bowenlan-amzn added the Search:Aggregations and Performance labels Jun 18, 2024
bowenlan-amzn changed the title from "[Profiling deep dive] Default aggregation vs. Rewrite optimization code path" to "[Profiling deep dive] Default aggregation vs. optimization code path" Jun 18, 2024
bowenlan-amzn moved this from Now (This Quarter) to In Progress in Performance Roadmap Jun 21, 2024
bowenlan-amzn moved this from In Progress to Now (This Quarter) in Performance Roadmap Jul 4, 2024
getsaurabh02 moved this from New to Later (6 months plus) in Search Project Board Aug 15, 2024