Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change internal representation of bucket key of time_series agg #91407

Merged
merged 8 commits into from
Nov 10, 2022

Conversation

martijnvg
Copy link
Member

Currently, the key is a map, which can make reducing large response more memory intense then it should be also. Also data structures used during reduce are not back by bigarrays so not accounted for.

This commit changes how the key is represented internally. By using BytesRef instead of Map. This doesn't commit doesn't change how the key is represented in the response. It also changes the reduce method to make use of the bucket keys are now bytes refs.

Relates to #74660

Currently, the key is a map, which can make reducing large response
more memory intense then it should be also. Also data structures
used during reduce are not back by bigarrays so not accounted for.

This commit changes how the key is represented internally.
By using BytesRef instead of Map. This doesn't commit
doesn't change how the key is represented in the response.
It also changes the reduce method to make use of the bucket
keys are now bytes refs.

Relates to elastic#74660
@martijnvg martijnvg added >non-issue :StorageEngine/TSDB You know, for Metrics labels Nov 8, 2022
@elasticsearchmachine elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.6.0 labels Nov 8, 2022
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

.mapToInt(value -> value.getBuckets().size())
.max()
.getAsInt();
try (LongObjectPagedHashMap<List<InternalBucket>> tsKeyToBuckets = new LongObjectPagedHashMap<>(initialCapacity, bigArrays)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we send these back in sorted order? Should we? That way we don't need to do this collection thing - we can just walk the results in parallel and marge.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think the result is sorted. We collect the tsids in order in BytesKeyedBucketOrds and then we emit the tsids from BytesKeyedBucketOrds in the same way it was inserted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably should have a stronger guarantee - that the results are sorted by the bytes representation of the tsid. I think they already are. But we should, like, force it. Or at least assert it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add asserts for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added asserts for this: 0d20460

ordsEnum.readValue(spareKey);
InternalTimeSeries.InternalBucket bucket = new InternalTimeSeries.InternalBucket(
TimeSeriesIdFieldMapper.decodeTsid(spareKey),
spareKey,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's correct to call it a "spare" any more if you are sending it back- it's just the key.

… in tsid order in the shard level responses. This allows for merging and detecting same TSIDs in multiple shard responses without deduping tsids first in a data structure.
…are sorted by tsid and during the reduce that the tsids are sorted by tsid as well.
@martijnvg martijnvg requested a review from nik9000 November 9, 2022 14:17
.max()
.getAsInt();

final List<IteratorAndCurrent<InternalBucket>> iterators = new ArrayList<>(aggregations.size());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this were a Lucene PriorityQueue I think you'd get for free finding the earliest tsid. As a bonus it'd have slightly better complexity as the number of shards goes up.

otherwise closing bucketOrds
may corrupt the bucket keys
@martijnvg
Copy link
Member Author

@elasticmachine run elasticsearch-ci/bwc

@martijnvg
Copy link
Member Author

@elasticmachine run elasticsearch-ci/part-1

@martijnvg
Copy link
Member Author

Restarting CI builds, because of network issue:

Could not determine the dependencies of task ':x-pack:plugin:ml:explodedBundlePlugin'. |  
-- | --
  | [8.5.1] > Could not resolve all files for configuration ':x-pack:plugin:ml:nativeBundle'. |  
  | [8.5.1]    > Could not download ml-cpp-8.5.1-SNAPSHOT-deps.zip (org.elasticsearch.ml:ml-cpp:8.5.1-SNAPSHOT) |  
  | [8.5.1]       > Could not get resource 'https://artifacts-snapshot.elastic.co/ml-cpp/8.5.1-SNAPSHOT/downloads/ml-cpp/ml-cpp-8.5.1-SNAPSHOT-deps.zip'. |  
  | [8.5.1]          > Premature end of Content-Length delimited message body (expected: 340,418,397; received: 1,343,488)

if pq is popped then update top isn't needed.
@martijnvg martijnvg requested a review from nik9000 November 10, 2022 13:46
@martijnvg martijnvg merged commit 3ece828 into elastic:main Nov 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>non-issue :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.6.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants