Change internal representation of bucket key of time_series agg #91407
Conversation
Currently, the key is a map, which can make reducing large responses more memory intensive than it should be. The data structures used during the reduce are also not backed by big arrays, so they are not accounted for. This commit changes how the key is represented internally, using a BytesRef instead of a Map. It doesn't change how the key is represented in the response. It also changes the reduce method to take advantage of the fact that bucket keys are now bytes refs. Relates to elastic#74660
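A minimal sketch of the idea, assuming a bucket type that holds its tsid as a BytesRef and only decodes it to a map when the response is rendered. The class and method names here are illustrative, not the actual PR code:

```java
import java.util.Map;
import org.apache.lucene.util.BytesRef;

// Illustrative bucket: the tsid is kept as compact, directly comparable
// bytes internally; the map view is only materialized for the response.
class TsBucketSketch {
    final BytesRef tsid;   // internal key: encoded time-series id
    final long docCount;

    TsBucketSketch(BytesRef tsid, long docCount) {
        this.tsid = tsid;
        this.docCount = docCount;
    }

    // Stand-in for TimeSeriesIdFieldMapper.decodeTsid(...); called only
    // when rendering the response, never during the reduce.
    Map<String, Object> keyAsMap() {
        return decode(tsid);
    }

    private static Map<String, Object> decode(BytesRef tsid) {
        return Map.of(); // placeholder decoding for the sketch
    }
}
```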
Pinging @elastic/es-analytics-geo (Team:Analytics)
.mapToInt(value -> value.getBuckets().size())
.max()
.getAsInt();
try (LongObjectPagedHashMap<List<InternalBucket>> tsKeyToBuckets = new LongObjectPagedHashMap<>(initialCapacity, bigArrays)) {
Do we send these back in sorted order? Should we? That way we don't need to do this collection thing - we can just walk the results in parallel and merge.
Yes, I think the result is sorted. We collect the tsids in order in BytesKeyedBucketOrds and then we emit the tsids from BytesKeyedBucketOrds in the same order they were inserted.
I think we probably should have a stronger guarantee - that the results are sorted by the bytes representation of the tsid. I think they already are. But we should, like, force it. Or at least assert it.
I will add asserts for this.
I've added asserts for this: 0d20460
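A minimal sketch of what such an assertion could look like, assuming the buckets expose their tsids as BytesRef values. The helper below is hypothetical, not the code added in 0d20460:

```java
import java.util.List;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper: verifies tsids arrive in strictly increasing order.
// BytesRef.compareTo performs an unsigned byte-wise comparison, i.e. the
// "bytes representation" ordering discussed above.
final class TsidOrder {
    static boolean inOrder(List<BytesRef> tsids) {
        BytesRef previous = null;
        for (BytesRef current : tsids) {
            if (previous != null && previous.compareTo(current) >= 0) {
                return false;
            }
            previous = current;
        }
        return true;
    }
}
// Usage: assert TsidOrder.inOrder(tsids) : "buckets not sorted by tsid";
```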
ordsEnum.readValue(spareKey);
InternalTimeSeries.InternalBucket bucket = new InternalTimeSeries.InternalBucket(
    TimeSeriesIdFieldMapper.decodeTsid(spareKey),
    spareKey,
I don't think it's correct to call it a "spare" any more if you are sending it back; it's just the key.
… in tsid order in the shard level responses. This allows for merging and detecting the same TSIDs in multiple shard responses without first deduping tsids in a data structure.
…are sorted by tsid and during the reduce that the tsids are sorted by tsid as well.
.max()
.getAsInt();

final List<IteratorAndCurrent<InternalBucket>> iterators = new ArrayList<>(aggregations.size());
If this were a Lucene PriorityQueue, I think you'd get finding the earliest tsid for free. As a bonus, it'd have slightly better complexity as the number of shards goes up.
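A rough sketch of that suggestion, assuming per-shard bucket iterators that are already sorted by tsid bytes. The Slot wrapper and the merge signature are assumptions for illustration, not PR code; Lucene's PriorityQueue is sized up front and ordered by a lessThan override:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.PriorityQueue;

// Illustrative k-way merge of per-shard tsid streams, each already sorted.
final class TsidMergeSketch {
    // Wraps one shard's iterator together with its current tsid.
    private static final class Slot {
        final Iterator<BytesRef> iterator;
        BytesRef current;

        Slot(Iterator<BytesRef> iterator) {
            this.iterator = iterator;
            this.current = iterator.next();
        }
    }

    static List<BytesRef> merge(List<Iterator<BytesRef>> shards) {
        PriorityQueue<Slot> pq = new PriorityQueue<>(shards.size()) {
            @Override
            protected boolean lessThan(Slot a, Slot b) {
                // The slot holding the earliest tsid sits on top of the heap.
                return a.current.compareTo(b.current) < 0;
            }
        };
        for (Iterator<BytesRef> shard : shards) {
            if (shard.hasNext()) {
                pq.add(new Slot(shard));
            }
        }
        List<BytesRef> merged = new ArrayList<>();
        while (pq.size() > 0) {
            Slot top = pq.top();
            // Equal tsids from other shards would be reduced into one bucket
            // here before advancing; omitted to keep the sketch short.
            merged.add(top.current);
            if (top.iterator.hasNext()) {
                top.current = top.iterator.next();
                pq.updateTop(); // top's key changed: re-sift it into place
            } else {
                pq.pop();       // shard exhausted: remove its slot entirely
            }
        }
        return merged;
    }
}
```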
otherwise closing bucketOrds may corrupt the bucket keys
@elasticmachine run elasticsearch-ci/bwc
@elasticmachine run elasticsearch-ci/part-1
Restarting CI builds because of a network issue:
If the pq is popped, then updateTop isn't needed.
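(This is why the exhausted-iterator branch in the merge sketch above calls pop() without a following updateTop(): Lucene's pop() already re-heapifies as it removes the top slot, while updateTop() only exists to re-sift the top element after its sort key changes in place.)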