Change internal representation of bucket key of time_series agg #91407

martijnvg · 2022-11-08T14:32:47Z

Currently, the key is a map, which can make reducing large responses more memory intensive than it should be. Also, the data structures used during reduce are not backed by BigArrays, so their memory is not accounted for.

This commit changes how the key is represented internally, using BytesRef instead of Map. It doesn't change how the key is represented in the response. It also changes the reduce method to make use of the fact that bucket keys are now BytesRefs.

Relates to #74660

elasticsearchmachine · 2022-11-08T14:33:11Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

nik9000 · 2022-11-08T14:38:08Z

server/src/main/java/org/elasticsearch/search/aggregations/timeseries/InternalTimeSeries.java

+            .mapToInt(value -> value.getBuckets().size())
+            .max()
+            .getAsInt();
+        try (LongObjectPagedHashMap<List<InternalBucket>> tsKeyToBuckets = new LongObjectPagedHashMap<>(initialCapacity, bigArrays)) {


Do we send these back in sorted order? Should we? That way we don't need to do this collection thing - we can just walk the results in parallel and merge.

Yes, I think the result is sorted. We collect the tsids in order in BytesKeyedBucketOrds and then we emit the tsids from BytesKeyedBucketOrds in the same way it was inserted.

I think we probably should have a stronger guarantee - that the results are sorted by the bytes representation of the tsid. I think they already are. But we should, like, force it. Or at least assert it.

I will add asserts for this.

I've added asserts for this: 0d20460
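
A minimal sketch of the kind of ordering check discussed above, assuming tsid keys are compared as unsigned bytes (the order Lucene's `BytesRef.compareTo` uses). `TsidOrderCheck` and `isStrictlySorted` are hypothetical names, and plain `byte[]` stands in for `BytesRef`; this is not the code added in 0d20460.

```java
import java.util.Arrays;
import java.util.List;

final class TsidOrderCheck {
    /**
     * Hypothetical helper: returns true when every key is strictly greater
     * than the previous one under unsigned, byte-wise lexicographic order.
     */
    static boolean isStrictlySorted(List<byte[]> keys) {
        for (int i = 1; i < keys.size(); i++) {
            // compareUnsigned treats 0x80..0xFF as larger than 0x00..0x7F
            if (Arrays.compareUnsigned(keys.get(i - 1), keys.get(i)) >= 0) {
                return false;
            }
        }
        return true;
    }
}
```

A check like this could be wrapped in an `assert` both where the shard-level response is built and at the start of reduce, so it costs nothing when assertions are disabled.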

nik9000 · 2022-11-08T14:39:20Z

server/src/main/java/org/elasticsearch/search/aggregations/timeseries/TimeSeriesAggregator.java

                ordsEnum.readValue(spareKey);
                InternalTimeSeries.InternalBucket bucket = new InternalTimeSeries.InternalBucket(
-                    TimeSeriesIdFieldMapper.decodeTsid(spareKey),
+                    spareKey,


I don't think it's correct to call it a "spare" any more if you are sending it back; it's just the key.

nik9000 · 2022-11-09T15:17:37Z

...tions/src/main/java/org/elasticsearch/aggregations/bucket/timeseries/InternalTimeSeries.java

+            .max()
+            .getAsInt();
+
+        final List<IteratorAndCurrent<InternalBucket>> iterators = new ArrayList<>(aggregations.size());


If this were a Lucene PriorityQueue I think you'd get for free finding the earliest tsid. As a bonus it'd have slightly better complexity as the number of shards goes up.
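
A rough sketch of the priority-queue merge suggested here, under stated assumptions: it uses `java.util.PriorityQueue` as a stand-in for Lucene's `PriorityQueue`, plain `byte[]` keys in place of `BytesRef`, and a hypothetical `Bucket` class that only carries a doc count. Each shard's bucket list is assumed to be sorted by key already. It is not the PR's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class SortedBucketMerge {
    /** Hypothetical stand-in for a bucket: a tsid key plus a doc count. */
    static final class Bucket {
        final byte[] key;
        final long docCount;
        Bucket(byte[] key, long docCount) { this.key = key; this.docCount = docCount; }
    }

    /** Pairs an iterator with the bucket it currently points at. */
    private static final class Cursor {
        final Iterator<Bucket> it;
        Bucket current;
        Cursor(Iterator<Bucket> it) { this.it = it; this.current = it.next(); }
    }

    /** Merges per-shard bucket lists, each already sorted by key, summing equal keys. */
    static List<Bucket> merge(List<List<Bucket>> shardResults) {
        Comparator<Cursor> byKey = (a, b) -> Arrays.compareUnsigned(a.current.key, b.current.key);
        PriorityQueue<Cursor> pq = new PriorityQueue<>(byKey);
        for (List<Bucket> shard : shardResults) {
            if (!shard.isEmpty()) {
                pq.add(new Cursor(shard.iterator()));
            }
        }
        List<Bucket> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            byte[] key = pq.peek().current.key; // smallest key across all shards
            long docCount = 0;
            // Drain every cursor that is positioned on this key.
            while (!pq.isEmpty() && Arrays.equals(pq.peek().current.key, key)) {
                Cursor c = pq.poll();
                docCount += c.current.docCount;
                if (c.it.hasNext()) {
                    c.current = c.it.next();
                    pq.add(c); // re-insert only when the cursor advanced
                }
            }
            merged.add(new Bucket(key, docCount));
        }
        return merged;
    }
}
```

With k shards and n buckets in total, each bucket passes through the heap once, so the merge is O(n log k) and needs no intermediate map keyed by tsid. Lucene's queue exposes `top()`, `updateTop()` and `pop()` rather than `peek()`/`poll()`/`add()`, so the advance step would look different there.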

martijnvg · 2022-11-10T12:03:34Z

@elasticmachine run elasticsearch-ci/bwc

martijnvg · 2022-11-10T12:03:48Z

@elasticmachine run elasticsearch-ci/part-1

martijnvg · 2022-11-10T12:04:43Z

Restarting CI builds because of a network issue:

    Could not determine the dependencies of task ':x-pack:plugin:ml:explodedBundlePlugin'.
    [8.5.1] > Could not resolve all files for configuration ':x-pack:plugin:ml:nativeBundle'.
    [8.5.1]    > Could not download ml-cpp-8.5.1-SNAPSHOT-deps.zip (org.elasticsearch.ml:ml-cpp:8.5.1-SNAPSHOT)
    [8.5.1]       > Could not get resource 'https://artifacts-snapshot.elastic.co/ml-cpp/8.5.1-SNAPSHOT/downloads/ml-cpp/ml-cpp-8.5.1-SNAPSHOT-deps.zip'.
    [8.5.1]          > Premature end of Content-Length delimited message body (expected: 340,418,397; received: 1,343,488)

martijnvg added >non-issue :StorageEngine/TSDB You know, for Metrics labels Nov 8, 2022

elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.6.0 labels Nov 8, 2022

nik9000 reviewed Nov 8, 2022

nik9000 mentioned this pull request Nov 8, 2022

[CI] MixedClusterClientYamlTestSuiteIT test {p0=search.aggregation/110_max_metric/Merging results with unmapped fields} failing #89994

Closed

martijnvg added 4 commits November 8, 2022 17:12

rename variable

a845117

Merge remote-tracking branch 'es/main' into time_series_bucket_key

28e58d1

improve reduce logic and make use of the fact that buckets are sorted…

e92af11

… in tsid order in the shard level responses. This allows for merging and detecting same TSIDs in multiple shard responses without deduping tsids first in a data structure.

added assertions when building shard level agg response that buckets …

0d20460

…are sorted by tsid and during the reduce that the tsids are sorted by tsid as well.

martijnvg requested a review from nik9000 November 9, 2022 14:17

nik9000 reviewed Nov 9, 2022

martijnvg added 2 commits November 10, 2022 08:53

use priority queue

c45caec

make a deep copy of bytes ref,

dcbb71c

otherwise closing bucketOrds may corrupt the bucket keys

only update top if iterator advances,

10c4339

if pq is popped then update top isn't needed.
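
The deep-copy commit above (dcbb71c) guards against aliasing: if every bucket key points at one shared, reused buffer, later writes to that buffer change keys that were already handed out. The sketch below illustrates the hazard with plain `byte[]` and single-byte keys; `ReusedBufferCopy` and `collect` are hypothetical names, and the scratch array only simulates a buffer that a reader overwrites on each call. In Lucene, `BytesRef.deepCopyOf` is the usual way to take such a copy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ReusedBufferCopy {
    /**
     * Reads each single-byte source value into one reused scratch buffer and
     * collects either the scratch reference itself or a deep copy of it.
     */
    static List<byte[]> collect(byte[][] source, boolean deepCopy) {
        byte[] scratch = new byte[1];
        List<byte[]> keys = new ArrayList<>();
        for (byte[] value : source) {
            scratch[0] = value[0]; // simulates a read that overwrites the shared buffer
            keys.add(deepCopy ? Arrays.copyOf(scratch, scratch.length) : scratch);
        }
        return keys;
    }
}
```

Without the copy, every collected key ends up holding the last value read; with the copy, each key keeps its own bytes even after the shared buffer is reused or released.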

martijnvg requested a review from nik9000 November 10, 2022 13:46

nik9000 approved these changes Nov 10, 2022

martijnvg merged commit 3ece828 into elastic:main Nov 10, 2022
