Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark shard failures caused by unsupported aggregations or queries against rolled up data so Kibana can identify them #89252

Conversation

salvatore-campagna
Copy link
Contributor

@salvatore-campagna salvatore-campagna commented Aug 10, 2022

When an unsupported aggregation is executed on a
aggregate_double_metric field an error with a specific
type field is generated. This allows clients like Kibana
to identify these errors and handle them properly.

Also we would like to fail on queries using a date histogram
aggregation with calendar_interval on rollup indices.
Date histogram aggregations are executed on rollup indices
only if using fixed_interval and a UTC timezone.

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v8.5.0 labels Aug 10, 2022
@salvatore-campagna salvatore-campagna added >non-issue :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data and removed needs:triage Requires assignment of a team area label labels Aug 10, 2022
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Aug 10, 2022
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

Date histogram aggregations must fail if 'calendar_interval' or
a non-utc time zone is used, only if the index the aggregation
is running on is a rollup index. For time series indices, date
histogram aggregations should work as expected without throwing
any error.
@@ -127,6 +128,15 @@ public class IndexMetadata implements Diffable<IndexMetadata>, ToXContentFragmen
EnumSet.of(ClusterBlockLevel.WRITE)
);

public boolean isRollupIndex() {
Copy link
Contributor Author

@salvatore-campagna salvatore-campagna Aug 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the feeling this is not the best I can do here. I am working on a better way to do this at the moment but I wanted to push this change because it might be good enough. If that feeling is shared but this is considered "good enough" we could proceed merging this to unblock Kibana folks experimenting with catching exceptions. Then I can create another issue to refactor this. N.B. I am on vacation next week and Christos is on vacation too and I would like to avoid Kibana folks being stuck waiting for this.

As a result, if required, I can work on this on another PR whose purpose would be to refactor this "isRollupIndex" logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's cool as a temporary thing. Maybe a TODO in the code so we know we thought it was temporary when we added it.

Copy link
Contributor

@flash1293 flash1293 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good to me, left a nit about another test we could add

- match: { error.root_cause.0.reason: "Field [total_memory_used] of type [aggregate_metric_double] is not supported for aggregation [percentiles]" }

---
"Top-metrics aggregation on aggregate_metric_double field":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether out of scope but what about a test for the mixed case (as this will be extremely common in practice)? Querying both indices at the same time, resulting in one of them succeeding and the other producing a shard failure so there's partial data in the response

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that is a good test I think...I am not sure what the result is in this case. I guess we hava a shard failure for the rollup index and partial results coming from the time series index...?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly

Copy link
Contributor Author

@salvatore-campagna salvatore-campagna Aug 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is not necessary for the errors we throw if the aggregation is not supported (aggregation on field of type agrgegate_metric_double). In that case we don't check if the index is a time series index or not...just the field type. Also when doing a rollup on a field, lat's say a histogram aggregation on a gauge field called test_field. In that case we create the rollup index and a new field that is called test_field_histogram. As a result it would not be possible to run a query on both indices on the same field because the field name in the rollup index would be different 'by construction' (original_field_name + aggregation_name).

Anyway, it is a good test for the date histogram case...in that case indeed the field is the timestamp field which is the same in both the source index and the rollup index.

Hitting both indices produces partial results coming from the
source time series index and a shard failure coming from the
rollup index.
calendar_interval: hour
min_doc_count: 1

- match: { _shards.total: 2 }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: here we rely on the fact that the rollup index is created with the same number of shards of the original index, which is set to 1 in the setup block.

@@ -0,0 +1,250 @@
setup:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to change this test adding the logic to create the rollup index from the source index because some settings and metadata are applied by the rollup operation and it is not possible to set them from the yaml test. The two settings I use are the source index name and the rollup status which I check to verify if the index is a rollup index or not.

@@ -127,6 +128,15 @@ public class IndexMetadata implements Diffable<IndexMetadata>, ToXContentFragmen
EnumSet.of(ClusterBlockLevel.WRITE)
);

public boolean isRollupIndex() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's cool as a temporary thing. Maybe a TODO in the code so we know we thought it was temporary when we added it.

String valuesSourceDescription,
String aggregationName
) {
if (indexSettings.getIndexMetadata().isRollupIndex() && DateIntervalWrapper.IntervalTypeEnum.CALENDAR.equals(intervalType)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want this test for all time rolled up indices. Not that there are any non-time series ones, but I think the addition of isRollupIndex means we don't need these new methods in IndexMode any more.

final boolean rollupSuccess = IndexMetadata.RollupTaskStatus.SUCCESS.name()
.toLowerCase(Locale.ROOT)
.equals(indexRollupStatus != null ? indexRollupStatus.toLowerCase(Locale.ROOT) : IndexMetadata.RollupTaskStatus.UNKNOWN);
return Strings.isNullOrEmpty(sourceIndex) == false && rollupSuccess;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a duplicate.

* Downsampling uses specific types while aggregating some fields (like 'aggregate_metric_double').
* Such field types do not support some aggregations.
*/
public class UnsupportedAggregationOnDownsampledField extends AggregationExecutionException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably worth leaving a note that the name of this class is part of a contract with Kibana so we shouldn't change it.

from: 401.0

- match: { status: 400 }
- match: { error.root_cause.0.type: unsupported_aggregation_on_downsampled_field }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably worth leaving a comment here that this type name is part of a contract with kibana. Just extra paranoia so folks don't change it.

Copy link
Contributor

@csoulios csoulios left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a very quick look at the code. I only left a small comment about naming the exception, because this is something we cannot change in the future.

* Downsampling uses specific types while aggregating some fields (like 'aggregate_metric_double').
* Such field types do not support some aggregations.
*/
public class UnsupportedAggregationOnDownsampledField extends AggregationExecutionException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we name the action "rollup" everywhere in the code, what about naming this exception UnsupportedAggregationOnRollupField?

Also, since this is more generic than a rollup field we can name it UnsupportedAggregationOnRollupIndex

@nik9000 nik9000 merged commit 4b92e1d into elastic:main Aug 22, 2022
@nik9000
Copy link
Member

nik9000 commented Aug 22, 2022

I've merged this so @flash1293 should be able to grab it sooner rather than later.

@flash1293
Copy link
Contributor

FYI @tsullivan @ppisljar - you should be able to wire this up in your warnings handling pr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>non-issue :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants