Mixing regexp-based and set-based include and exclude in *Terms aggregations #62246
Comments
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)
Hiya @hchargois, thanks for raising this issue! I'm going to mark this with a discuss label so that the analytics team can chat about it later this week. I'll get back to you after we've discussed it.
I think the constraint is largely there to simplify the code, and for some performance reasons (for most of the specializations, it's easy to check include/exclude at the same time). Splitting include and exclude out into separate checks/specializations should work in theory and probably won't be more expensive, but we might want to work up some benchmarks to verify. There are also some oddities in there, like iirc
In any case, it seems like a reasonable request to me... will see what the rest of the team thinks!
Thanks for your reply! For reference, here is the patch of my proof of concept: hchargois@3d12563
I'm not sure what you mean by "splitting include and exclude out into separate checks/specializations". My idea behind the POC was, on the contrary, to unify the
I didn't try merging the partitions-based filter, but I think it shouldn't be too hard either, if desired (I personally don't need it, but I guess it may make sense to unify it too).
For the previously supported cases, it does the same computations as before, so the performance should be the same.
Of course it's just a POC that needs to be refined, but I think the general idea is nevertheless correct. And obviously I'm not too attached to this or any particular solution; as long as it solves our performance issue, that would be fine by me.
Ah, I see. Yeah, that approach would probably work fine as well :)
We just discussed this a little earlier today in our team meeting, and the consensus is that it seems like a very reasonable enhancement. If you'd like to polish up the POC and send a PR, we'd be happy to review, and we can help work up a Rally benchmark on our end if it seems necessary (it might not be, depending on the diff; I agree that if it's largely just moving bits around, it's probably not necessary).
Thanks for taking this on! :)
I think this can close, now that #63325 has merged! :)
Background
The Terms, Significant Terms and Rare Terms aggregations support the include and exclude options to filter the buckets via either a regular expression, e.g. "this.*|that.*", or a set of exact terms, e.g. ["thisTerm", "thatTerm"].
You can give both an include and an exclude at the same time, but they have to be of the same type: both regexp-based or both set-based. If you mix and match, you get an error (and this is not documented, BTW).
The problem
The problem we faced is that we need both a regexp-based include and a set-based exclude. We thought about converting the set-based exclude into a regexp-based exclude so that both could be regexps, like so:
And obviously both give exactly the same buckets in the aggregation... But the performance is way worse with the regexp.
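As an illustration of the conversion described above, here is a minimal Python sketch that builds a set-based exclude and an equivalent regexp-based exclude. The field name my_field, the helper names escape_term and set_to_regexp, and the exact list of reserved characters are assumptions made for this example; they are not taken from the issue, and the reserved characters should be checked against the regexp syntax of the Elasticsearch version in use.

```python
# Characters treated as reserved in Lucene-style regular expressions.
# Assumed list for illustration; verify against your Elasticsearch version.
RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_term(term: str) -> str:
    """Backslash-escape reserved characters so the term matches literally."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in term)


def set_to_regexp(terms) -> str:
    """Turn a list of exact terms into one alternation regexp."""
    return "|".join(escape_term(t) for t in terms)


terms = ["thisTerm", "thatTerm", "a.b"]

# Set-based exclude: the terms are passed as an array.
set_based = {"terms": {"field": "my_field", "exclude": terms}}

# Regexp-based exclude: the same terms folded into a single pattern.
regexp_based = {"terms": {"field": "my_field", "exclude": set_to_regexp(terms)}}

print(regexp_based["terms"]["exclude"])  # thisTerm|thatTerm|a\.b
```

Both bodies select the same buckets; only the way the exclusion is evaluated differs.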
To give an idea of how much worse the performance is, this is the performance of a typical aggregation that we can run on our index (to allow testing and comparing both set and regexp, I removed any include parameter):
So, while setting a set-based exclude is just a bit slower than no exclude, the regexp-based exclude is 20 times slower than the set-based one. We can't afford that kind of performance unfortunately.
The proposed solution
We would like to lift that limitation about having to use the same type of include and exclude. We want to be able to mix and match both kinds of include with both kinds of exclude. That way, we could use a regexp when we need the flexibility of one, and use a set when we can, to keep performance high.
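For concreteness, this is roughly what a mixed request would look like, written as a Python dict that serializes to the JSON search body. The aggregation name filtered_terms and the field my_field are placeholders invented for this sketch; the include and exclude values reuse the examples from the Background section.

```python
import json

# A terms aggregation combining a regexp-based include with a set-based
# exclude. "filtered_terms" and "my_field" are hypothetical names.
request_body = {
    "size": 0,
    "aggs": {
        "filtered_terms": {
            "terms": {
                "field": "my_field",
                "include": "this.*|that.*",            # regexp-based
                "exclude": ["thisTerm", "thatTerm"],   # set-based
            }
        }
    },
}

print(json.dumps(request_body, indent=2))
```

With the limitation described in this issue, a body like this is rejected because the two options are of different types.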
I've seen the relevant code (server/src/main/java/org/elasticsearch/search/aggregations/bucket/terms/IncludeExclude.java) and with just a bit of refactoring in this file, I've been able to make a proof of concept that achieves that goal. I've confirmed that mixing a regexp and a set gives the performance we expect, which is much faster than two regexps.
So, I'm opening this issue to gather feedback before hopefully getting the green light to implement this properly in a pull request.
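To make the intended semantics concrete, here is a small Python sketch of a filter in which include and exclude are resolved independently, so each can be a regexp or a set. This is not the code from IncludeExclude.java or from the proof of concept; the names to_predicate and make_filter are invented for the example, and Python's re.fullmatch stands in for Lucene's regexp matching on the assumption that a pattern must match the whole term.

```python
import re
from typing import Callable, Iterable, Optional, Union

# A spec is a regexp string, a collection of exact terms, or None (absent).
Spec = Union[str, Iterable[str], None]


def to_predicate(spec: Spec) -> Optional[Callable[[str], bool]]:
    """Build a membership test from either kind of spec."""
    if spec is None:
        return None
    if isinstance(spec, str):
        pattern = re.compile(spec)
        # Whole-term match, standing in for Lucene regexp semantics.
        return lambda term: pattern.fullmatch(term) is not None
    members = frozenset(spec)
    return lambda term: term in members


def make_filter(include: Spec = None, exclude: Spec = None):
    """Accept a term if it passes include (when given) and is not excluded."""
    inc = to_predicate(include)
    exc = to_predicate(exclude)

    def accept(term: str) -> bool:
        if inc is not None and not inc(term):
            return False
        return exc is None or not exc(term)

    return accept


accept = make_filter(include="th.*", exclude=["thatTerm"])
print([t for t in ["thisTerm", "thatTerm", "other"] if accept(t)])  # ['thisTerm']
```

Because each side is turned into its own predicate, the set-based side keeps its cheap hash lookup regardless of what the other side uses.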