-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ESQL: Use faster field caps #105067
ESQL: Use faster field caps #105067
Conversation
The field capabilities has an internal-only option to produce un-merged output. This expands that option to be available to any caller inside of Elasticsearch and uses it in ES|QL to speed up queries across many indices with many fields. Across 5,000 indices with a couple thousand fields each (metricbeat) the `FROM *` query went from 600ms to 60ms. Across 50,000 indices that went from 6600ms to 600ms. 600ms is still too slow for such a simple query, but one step at a time! This is faster because field capabilities wants to present a field-centric result but ES|QL actually needs a different flavor of field-centric result with some differences smoothed away. If we take over the merging process we can use a few tools that the field caps API uses internally to be fast - mostly the sha256 of the mapping - to save on doing work that wasn't available in the other view. Also, two merges is more expensive than one. That 90% reduction in runtime doesn't banish field caps from the flamegraphs. You still see it, but it's now much less prominent. And you don't see the merging process at all. Now it's all data-node side operations field caps. Relates to elastic#103369
Hi @nik9000, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I imagine it'd be nice to port all of the resolution infrastructure to un-merged results but there is lot of it, and a lot of layers that I don't understand. So I did the simplest thing I could think of - I made my own version that i could be sure was fairly efficient and then asserted it had the same output as the original.
), | ||
new EsqlIndexResolver( | ||
services.client(), | ||
EsqlDataTypeRegistry.INSTANCE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm only keep the original IndexResolver so I can compare the results.
); | ||
esqlIndexResolver.resolveAsMergedMapping(table.index(), fieldNames, false, Map.of(), listener); | ||
|
||
// indexResolver.resolveAsMergedMapping(table.index(), fieldNames, false, Map.of(), new ActionListener<>() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been swapping the line above with these lines to make sure I output the same result.
I'm not quit mimicing |
… into faster_field_caps
Sometimes we produce output with a *ton* of `null`s and when we do a significant amount of time is spent on tracking the memory usage of each chunk of null colunms. Whole milliseconds in a request that takes dozens of milliseconds. We can avoid all of this by sharing all constant-null vectors produced for a single block. When running `FROM *` in ESQL on a data-less coordinating node this cuts reading reading pages from 15% of the time to 3.4% of the time, cutting the entire operation from 634ms to 599ms. Note that elastic#105067 is already open and should save about 540ms from the operation. It's unlikely that combining these two will yield a 59ms operation, but we live in hope. Relates to elastic#103369
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like those numbers in performance improvements and it seems we are going in the right direction.
But, the most scary part about the IndexResolver removal is that it has some years of maintenance and fixes and some of the tests that are indirectly using it live in other projects. For example, one of the most important test suites I've created that deals with many edge cases is FieldExtractorTestCase
in ES SQL.
If we are to use our own flavor of the IndexResolver (and I am in favor of this given the preliminary performance gains) we'd need to port some of the existent tests in SQL and QL projects here, in ESQL.
isAlias
is being used to build the hierarchy of fields when the fieldcaps response is not telling us if the type of a field is actually an alias
type. It is possible this was useful to ES SQL to extract data from _source when the fields
API didn't exist but I am not 100% sure it is still needed now (FieldExtractorTestCase.testAliasFromDocValueField
is a test that checks this type of hierarchy).
I talked to @astefan and the plan now is to try and port some tests from SQL that have more complex mappings to ESQL. And to run both resolution mechanisms if assertions are enabled - basically to uncomment that big testing blob I have. He'll do a little more research on his side so he can review this in more depth. I'll port the test. |
I've ported it! |
@astefan, I've ported the tests from SQL and everything seems to be passing. Could you have another look? |
Sometimes we produce output with a *ton* of `null`s and when we do a significant amount of time is spent on tracking the memory usage of each chunk of null colunms. Whole milliseconds in a request that takes dozens of milliseconds. We can avoid all of this by sharing all constant-null vectors produced for a single block. When running `FROM *` in ESQL on a data-less coordinating node this cuts reading reading pages from 15% of the time to 3.4% of the time, cutting the entire operation from 634ms to 599ms. Note that #105067 is already open and should save about 540ms from the operation. It's unlikely that combining these two will yield a 59ms operation, but we live in hope. Relates to #103369
Pinging @elastic/es-analytical-engine (Team:Analytics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the tests are fine in FieldExtractorTestCase
👍
I started digging and trying out different things. I looked at IndexResolverTests
from sql and I think we should port that one as well (in that one we have a text for fix in #100869 for example). For one of those tests I tried out - testMergeIncompatibleCapabilities
- I get a ql and esql didn't make the same resolution: types differ [INTEGER] != [UNSUPPORTED]
message.
It is because doc_values: false
, which means not aggregateable in one of the indices.
Going further and ignoring the awesome EsqlIndexResolver-IndexResolver checks you added in EsqlSession, for that specific mapping in testMergeIncompatibleCapabilities
and query from test* | stats max(emp_no)
I get Cannot use field [emp_no] due to ambiguities being mapped as aggregatable except in [test2]
which is ok..., BUT we have no tests for these types of errors :-(.
So:
IndexResolverTests
needs some of its tests ported to ESQL. I am not particularly fond of this class because for some tests it's loading the mappings "manually" (not actually relying on field_caps natural output) and then asserting behavior on them- we need some coverage for these mapping incompatibilities that reach the planning (the Analyzer in this case). I'll look into this today
|
||
/** | ||
* <pre> | ||
* "int_field": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is not a big deal, more a consistency issue: this is called int_field
, the test itself has integer_field
. The subfield here is text
while in the test is str
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll clean this one up. I always write int
instead of integer
when I work with that mapper. Not sure why. OTOH, I tend to use str
when I mean text|keyword
. Old habit. I'll double check it.
@@ -185,8 +185,7 @@ public void testIncompatibleMappingsErrors() throws IOException { | |||
assertException("from test_alias | where _size is not null | limit 1", "Unknown column [_size]"); | |||
assertException( | |||
"from test_alias | where message.hash is not null | limit 1", | |||
"Cannot use field [message.hash] due to ambiguities", | |||
"incompatible types: [integer] in [index2], [murmur3] in [index1]" | |||
"Cannot use field [message.hash] with unsupported type [murmur3]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how this works, but I don't think it's a killer for the approach, but we should decide. This example makes it look like using two field types will give you the ambiguities error. But in other examples the QL index resolution stuff will give you the unsupported error. We could do either one on the ESQL side, but I picked doing the unsupported
error because it's the first such error message I found. I'd be happy to switch it, but that'll break other tests which I'll have to coax back to grene.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests here are new (added with https://github.com/elastic/elasticsearch/pull/105544/files) because we didn't cover this in tests.
In any case, my personal opinion is that your approach is the better approach: tell the user early that there is field that's unsupported, rather than telling "these two fields have mixed incompatible types" and the user then trying to make them compatible (reindexing or whatever) and later to find out that the effort was in vain because the type they chose to make compatible is the wrong type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad you like it!
I've talked with @andreabucci and I'm going to port |
sorry, wrong @. I just typed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
private final PreAnalyzer preAnalyzer; | ||
private final FunctionRegistry functionRegistry; | ||
private final Mapper mapper; | ||
private final Metrics metrics; | ||
private final Verifier verifier; | ||
|
||
public PlanExecutor(IndexResolver indexResolver) { | ||
public PlanExecutor(IndexResolver indexResolver, EsqlIndexResolver esqlIndexResolver) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why keep IndexResolver around? To check the behavior is the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup
); | ||
String err = EntityUtils.toString(e.getResponse().getEntity()); | ||
assertThat(err, containsString("Cannot use field [f] with unsupported type [ip_range]")); | ||
assertThat(err, containsString("Cannot use field [f.raw] with unsupported type [ip_range]")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great points @astefan . Tl;dr is that ESQL already has and will have better handling of field extraction and thus conflict resolution than SQL and by having a different resolved, we should be able to evolve this better.
in SQL extraction is done through field caps hence why we expected a whole field hierarchy to match - it might not have been perfect but it got us pretty far.
In ESQL that assumption no longer holds - #100875 allowed subfields to allow the following mapping:
index 1 - field.name
keyword
index 2 - field
= keyword
In Index 1, field is object, as parent for field.name. This conflicts with field in index2 which is a keyword - using the SQL behavior we discarded field.name as unsupported. However there's no field.name
in index2 and thus there's no mapping conflict - in other words, field is problematic but field.name is not.
I think we're in a good enough place with this and considering the performance savings I suggest we move forward with it - there's a chance there might be some subtle cases we haven't covered yet and by changing the behavior we're going to run into them but right now we have enough testing in place to be confident about them.
Furthermore I would backport this behavior in 8.13 to test it as much as possible. Potentially introduce an internal flag to have the fallback behavior for a while when running into problems.
A flag conflicts with any behavior changes we do after this. At some point we'll want to diverge and then we'd have to be really careful with testing with the flag on and off. That's trick! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I'm really happy with this PR. @nik9000 thank you for the patience and the performance improvements this change brings.
Thanks for your patience with me too! |
💔 Backport failedThe backport operation could not be completed due to the following error:
You can use sqren/backport to manually backport by running |
oh. Well. That makes sense. I'll do it myself. |
Too late to backport I think. |
The field capabilities has an internal-only option to produce un-merged output. This expands that option to be available to any caller inside of Elasticsearch and uses it in ES|QL to speed up queries across many indices with many fields. Across 5,000 indices with a couple thousand fields each (metricbeat) the
FROM *
query went from 600ms to 60ms. Across 50,000 indices that went from 6600ms to 600ms. 600ms is still too slow for such a simple query, but one step at a time!This is faster because field capabilities wants to present a field-centric result but ES|QL actually needs a different flavor of field-centric result with some differences smoothed away. If we take over the merging process we can use a few tools that the field caps API uses internally to be fast - mostly the sha256 of the mapping - to save on doing work that wasn't available in the other view. Also, two merges is more expensive than one.
That 90% reduction in runtime doesn't banish field caps from the flamegraphs. You still see it, but it's now much less prominent. And you don't see the merging process at all. Now it's all data-node side operations field caps.
Relates to #103369