Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: Use faster field caps #105067

Merged
merged 29 commits into from
Feb 27, 2024
Merged

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Feb 2, 2024

The field capabilities has an internal-only option to produce un-merged output. This expands that option to be available to any caller inside of Elasticsearch and uses it in ES|QL to speed up queries across many indices with many fields. Across 5,000 indices with a couple thousand fields each (metricbeat) the FROM * query went from 600ms to 60ms. Across 50,000 indices that went from 6600ms to 600ms. 600ms is still too slow for such a simple query, but one step at a time!

This is faster because field capabilities wants to present a field-centric result but ES|QL actually needs a different flavor of field-centric result with some differences smoothed away. If we take over the merging process we can use a few tools that the field caps API uses internally to be fast - mostly the sha256 of the mapping - to save on doing work that wasn't available in the other view. Also, two merges is more expensive than one.

That 90% reduction in runtime doesn't banish field caps from the flamegraphs. You still see it, but it's now much less prominent. And you don't see the merging process at all. Now it's all data-node side operations field caps.

Relates to #103369

The field capabilities has an internal-only option to produce un-merged
output. This expands that option to be available to any caller inside of
Elasticsearch and uses it in ES|QL to speed up queries across many
indices with many fields. Across 5,000 indices with a couple thousand
fields each (metricbeat) the `FROM *` query went from 600ms to 60ms.
Across 50,000 indices that went from 6600ms to 600ms. 600ms is still too
slow for such a simple query, but one step at a time!

This is faster because field capabilities wants to present a
field-centric result but ES|QL actually needs a different flavor of
field-centric result with some differences smoothed away. If we take
over the merging process we can use a few tools that the field caps API
uses internally to be fast - mostly the sha256 of the mapping - to save
on doing work that wasn't available in the other view. Also, two merges
is more expensive than one.

That 90% reduction in runtime doesn't banish field caps from the
flamegraphs. You still see it, but it's now much less prominent. And you
don't see the merging process at all. Now it's all data-node side
operations field caps.

Relates to elastic#103369
@elasticsearchmachine
Copy link
Collaborator

Hi @nik9000, I've created a changelog YAML for you.

Copy link
Member Author

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine it'd be nice to port all of the resolution infrastructure to un-merged results but there is lot of it, and a lot of layers that I don't understand. So I did the simplest thing I could think of - I made my own version that i could be sure was fairly efficient and then asserted it had the same output as the original.

),
new EsqlIndexResolver(
services.client(),
EsqlDataTypeRegistry.INSTANCE
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm only keep the original IndexResolver so I can compare the results.

);
esqlIndexResolver.resolveAsMergedMapping(table.index(), fieldNames, false, Map.of(), listener);

// indexResolver.resolveAsMergedMapping(table.index(), fieldNames, false, Map.of(), new ActionListener<>() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been swapping the line above with these lines to make sure I output the same result.

@nik9000
Copy link
Member Author

nik9000 commented Feb 2, 2024

I'm not quit mimicing EsField#isAlias and, looking around, I'm not sure what it's used for. As far as I can tell we serialize it in ESQL and don't use it. And in EQL it's passed in and thrown out.

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Feb 2, 2024
Sometimes we produce output with a *ton* of `null`s and when we do a
significant amount of time is spent on tracking the memory usage of each
chunk of null colunms. Whole milliseconds in a request that takes dozens
of milliseconds. We can avoid all of this by sharing all constant-null
vectors produced for a single block.

When running `FROM *` in ESQL on a data-less coordinating node this cuts
reading reading pages from 15% of the time to 3.4% of the time, cutting
the entire operation from 634ms to 599ms. Note that elastic#105067 is
already open and should save about 540ms from the operation. It's
unlikely that combining these two will yield a 59ms operation, but we
live in hope.

Relates to elastic#103369
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like those numbers in performance improvements and it seems we are going in the right direction.

But, the most scary part about the IndexResolver removal is that it has some years of maintenance and fixes and some of the tests that are indirectly using it live in other projects. For example, one of the most important test suites I've created that deals with many edge cases is FieldExtractorTestCase in ES SQL.

If we are to use our own flavor of the IndexResolver (and I am in favor of this given the preliminary performance gains) we'd need to port some of the existent tests in SQL and QL projects here, in ESQL.
isAlias is being used to build the hierarchy of fields when the fieldcaps response is not telling us if the type of a field is actually an alias type. It is possible this was useful to ES SQL to extract data from _source when the fields API didn't exist but I am not 100% sure it is still needed now (FieldExtractorTestCase.testAliasFromDocValueField is a test that checks this type of hierarchy).

@nik9000
Copy link
Member Author

nik9000 commented Feb 6, 2024

I talked to @astefan and the plan now is to try and port some tests from SQL that have more complex mappings to ESQL. And to run both resolution mechanisms if assertions are enabled - basically to uncomment that big testing blob I have. He'll do a little more research on his side so he can review this in more depth. I'll port the test.

@nik9000
Copy link
Member Author

nik9000 commented Feb 9, 2024

For example, one of the most important test suites I've created that deals with many edge cases is FieldExtractorTestCase in ES SQL.

I've ported it!

@nik9000
Copy link
Member Author

nik9000 commented Feb 9, 2024

@astefan, I've ported the tests from SQL and everything seems to be passing. Could you have another look?

nik9000 added a commit that referenced this pull request Feb 9, 2024
Sometimes we produce output with a *ton* of `null`s and when we do a
significant amount of time is spent on tracking the memory usage of each
chunk of null colunms. Whole milliseconds in a request that takes dozens
of milliseconds. We can avoid all of this by sharing all constant-null
vectors produced for a single block.

When running `FROM *` in ESQL on a data-less coordinating node this cuts
reading reading pages from 15% of the time to 3.4% of the time, cutting
the entire operation from 634ms to 599ms. Note that #105067 is
already open and should save about 540ms from the operation. It's
unlikely that combining these two will yield a 59ms operation, but we
live in hope.

Relates to #103369
@nik9000 nik9000 marked this pull request as ready for review February 9, 2024 20:12
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Feb 9, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the tests are fine in FieldExtractorTestCase 👍

I started digging and trying out different things. I looked at IndexResolverTests from sql and I think we should port that one as well (in that one we have a text for fix in #100869 for example). For one of those tests I tried out - testMergeIncompatibleCapabilities - I get a ql and esql didn't make the same resolution: types differ [INTEGER] != [UNSUPPORTED] message.
It is because doc_values: false, which means not aggregateable in one of the indices.

Going further and ignoring the awesome EsqlIndexResolver-IndexResolver checks you added in EsqlSession, for that specific mapping in testMergeIncompatibleCapabilities and query from test* | stats max(emp_no) I get Cannot use field [emp_no] due to ambiguities being mapped as aggregatable except in [test2] which is ok..., BUT we have no tests for these types of errors :-(.

So:

  • IndexResolverTests needs some of its tests ported to ESQL. I am not particularly fond of this class because for some tests it's loading the mappings "manually" (not actually relying on field_caps natural output) and then asserting behavior on them
  • we need some coverage for these mapping incompatibilities that reach the planning (the Analyzer in this case). I'll look into this today


/**
* <pre>
* "int_field": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is not a big deal, more a consistency issue: this is called int_field, the test itself has integer_field. The subfield here is text while in the test is str.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll clean this one up. I always write int instead of integer when I work with that mapper. Not sure why. OTOH, I tend to use str when I mean text|keyword. Old habit. I'll double check it.

@@ -185,8 +185,7 @@ public void testIncompatibleMappingsErrors() throws IOException {
assertException("from test_alias | where _size is not null | limit 1", "Unknown column [_size]");
assertException(
"from test_alias | where message.hash is not null | limit 1",
"Cannot use field [message.hash] due to ambiguities",
"incompatible types: [integer] in [index2], [murmur3] in [index1]"
"Cannot use field [message.hash] with unsupported type [murmur3]"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how this works, but I don't think it's a killer for the approach, but we should decide. This example makes it look like using two field types will give you the ambiguities error. But in other examples the QL index resolution stuff will give you the unsupported error. We could do either one on the ESQL side, but I picked doing the unsupported error because it's the first such error message I found. I'd be happy to switch it, but that'll break other tests which I'll have to coax back to grene.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests here are new (added with https://github.com/elastic/elasticsearch/pull/105544/files) because we didn't cover this in tests.

In any case, my personal opinion is that your approach is the better approach: tell the user early that there is field that's unsupported, rather than telling "these two fields have mixed incompatible types" and the user then trying to make them compatible (reindexing or whatever) and later to find out that the effort was in vain because the type they chose to make compatible is the wrong type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad you like it!

@nik9000
Copy link
Member Author

nik9000 commented Feb 22, 2024

I've talked with @andreabucci and I'm going to port VerifierErrorMessagesTests - at least, as much of it as applies.

@nik9000
Copy link
Member Author

nik9000 commented Feb 22, 2024

I've talked with @andreabucci

sorry, wrong @. I just typed @and and hit enter. Sorry!

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

private final PreAnalyzer preAnalyzer;
private final FunctionRegistry functionRegistry;
private final Mapper mapper;
private final Metrics metrics;
private final Verifier verifier;

public PlanExecutor(IndexResolver indexResolver) {
public PlanExecutor(IndexResolver indexResolver, EsqlIndexResolver esqlIndexResolver) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why keep IndexResolver around? To check the behavior is the same?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup

);
String err = EntityUtils.toString(e.getResponse().getEntity());
assertThat(err, containsString("Cannot use field [f] with unsupported type [ip_range]"));
assertThat(err, containsString("Cannot use field [f.raw] with unsupported type [ip_range]"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great points @astefan . Tl;dr is that ESQL already has and will have better handling of field extraction and thus conflict resolution than SQL and by having a different resolved, we should be able to evolve this better.
in SQL extraction is done through field caps hence why we expected a whole field hierarchy to match - it might not have been perfect but it got us pretty far.

In ESQL that assumption no longer holds - #100875 allowed subfields to allow the following mapping:

index 1 - field.name keyword
index 2 - field = keyword

In Index 1, field is object, as parent for field.name. This conflicts with field in index2 which is a keyword - using the SQL behavior we discarded field.name as unsupported. However there's no field.name in index2 and thus there's no mapping conflict - in other words, field is problematic but field.name is not.

I think we're in a good enough place with this and considering the performance savings I suggest we move forward with it - there's a chance there might be some subtle cases we haven't covered yet and by changing the behavior we're going to run into them but right now we have enough testing in place to be confident about them.
Furthermore I would backport this behavior in 8.13 to test it as much as possible. Potentially introduce an internal flag to have the fallback behavior for a while when running into problems.

@nik9000
Copy link
Member Author

nik9000 commented Feb 22, 2024

Potentially introduce an internal flag to have the fallback behavior for a while when running into problems.

A flag conflicts with any behavior changes we do after this. At some point we'll want to diverge and then we'd have to be really careful with testing with the flag on and off. That's trick!

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'm really happy with this PR. @nik9000 thank you for the patience and the performance improvements this change brings.

@nik9000
Copy link
Member Author

nik9000 commented Feb 26, 2024

LGTM. I'm really happy with this PR. @nik9000 thank you for the patience and the performance improvements this change brings.

Thanks for your patience with me too!

@nik9000 nik9000 added auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) auto-backport-and-merge labels Feb 26, 2024
@elasticsearchmachine elasticsearchmachine merged commit 73a170b into elastic:main Feb 27, 2024
14 checks passed
@nik9000 nik9000 deleted the faster_field_caps branch February 27, 2024 12:18
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

The backport operation could not be completed due to the following error:

There are no branches to backport to. Aborting.

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 105067

@nik9000
Copy link
Member Author

nik9000 commented Feb 28, 2024

The backport operation could not be completed due to the following error:

oh. Well. That makes sense. I'll do it myself.

@nik9000
Copy link
Member Author

nik9000 commented Mar 4, 2024

Too late to backport I think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.14.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants