Add a TSID global ordinal to TimeSeriesIndexSearcher #90035

romseygeek · 2022-09-13T14:35:53Z

Rather than trying to compare BytesRefs in tsdb-related aggregations, it
will be much quicker if we can use a search-global ordinal to detect when
we have moved to a new TSID. This commit adds such an ordinal to the
aggregation execution context.
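
To illustrate the idea, here is a minimal, self-contained sketch. It is not the code in this PR: the class `TsidOrdinalSketch` and its methods are hypothetical, and TSIDs are modelled as plain `byte[]` values that arrive already sorted. The producer compares bytes once per document and hands out an int ordinal; consumers then detect a new TSID with a cheap int comparison.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model (hypothetical names): documents arrive sorted by TSID. */
final class TsidOrdinalSketch {

    /** Producer side: compare TSID bytes once per doc and assign an increasing ordinal. */
    static int[] assignOrdinals(List<byte[]> sortedTsids) {
        int[] ords = new int[sortedTsids.size()];
        int ord = -1;
        byte[] previous = null;
        for (int i = 0; i < ords.length; i++) {
            byte[] tsid = sortedTsids.get(i);
            if (previous == null || Arrays.equals(previous, tsid) == false) {
                ord++; // moved to a new TSID
            }
            ords[i] = ord;
            previous = tsid;
        }
        return ords;
    }

    /** Consumer side: detect a TSID change with an int comparison only. */
    static int countSeriesByOrdinal(int[] ords) {
        int series = 0;
        int lastOrd = -1;
        for (int ord : ords) {
            if (ord != lastOrd) {
                series++;
                lastOrd = ord;
            }
        }
        return series;
    }

    public static void main(String[] args) {
        List<byte[]> tsids = List.of(new byte[] { 1 }, new byte[] { 1 }, new byte[] { 2 });
        System.out.println(countSeriesByOrdinal(assignOrdinals(tsids)));
    }
}
```

In this toy version every aggregation that consumes the ordinal avoids its own byte comparison, which is where the saving would come from when several aggregators track TSID changes.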

elasticsearchmachine · 2022-09-13T14:36:18Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

elasticsearchmachine · 2022-09-13T14:36:19Z

Hi @romseygeek, I've created a changelog YAML for you.

martijnvg

Looks good. I left some questions.

martijnvg · 2022-09-13T15:07:27Z

server/src/main/java/org/elasticsearch/search/aggregations/timeseries/TimeSeriesAggregator.java

@@ -42,7 +42,7 @@ public TimeSeriesAggregator(
        CardinalityUpperBound bucketCardinality,
        Map<String, Object> metadata
    ) throws IOException {
-        super(name, factories, context, parent, bucketCardinality, metadata);
+        super(name, factories, context, parent, CardinalityUpperBound.MANY, metadata);


why is this hardcoded?

I think it should be MANY actually. Though this isn't related and I should have caught it earlier. MANY here means "my children will have an unbounded number of buckets". It's normal to do stuff like bucketCardinality.multiply(filters.size()) if, say, you were filters and knew precisely how many buckets you'd make. But we never do.

This is actually an unrelated change - I'll back it out. It's necessary for the rate agg to work but I'm not sure that it's the correct way to fix things.

... or I can leave it in if Nik prefers it that way?

FWIW I believe it's correct. I don't care if you keep it or take it out and do it in the rate agg.
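
A rough sketch of the reasoning in this thread, using a toy enum rather than the real Elasticsearch class. `ToyCardinality` and its `multiply` rules are simplified assumptions made for illustration only; they are not a copy of `CardinalityUpperBound`.

```java
/** Toy stand-in (hypothetical) for an upper bound on how many buckets a parent can produce. */
enum ToyCardinality {
    NONE, ONE, MANY;

    /** Combine the parent's bound with the number of buckets this aggregator creates per parent bucket. */
    ToyCardinality multiply(int bucketCount) {
        if (bucketCount < 0) {
            throw new IllegalArgumentException("bucketCount must be >= 0");
        }
        if (this == NONE || bucketCount == 0) {
            return NONE; // nothing can be collected
        }
        if (this == ONE && bucketCount == 1) {
            return ONE;  // still at most a single bucket
        }
        return MANY;     // anything else is treated as unbounded
    }
}
```

Under these assumed rules, an aggregator that knows its bucket count up front (like the filters example) could pass `parent.multiply(n)` to its children. An aggregator that cannot know the count in advance, as is argued here for time_series, would simply pass `MANY`.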

martijnvg · 2022-09-13T15:09:53Z

server/src/main/java/org/elasticsearch/search/aggregations/AggregationExecutionContext.java

@@ -21,18 +22,21 @@
 */
 public class AggregationExecutionContext {

-    private final Supplier<BytesRef> tsidProvider;
-    private final Supplier<Long> timestampProvider;
+    private final Supplier<BytesRef> tsidProvider;  // TODO remove this entirely?


Yes, I think we can do this. Maybe in a follow-up that also adjusts the users of this provider (the time_series aggregation and rollup).

martijnvg · 2022-09-13T15:11:57Z

.../src/main/java/org/elasticsearch/search/aggregations/timeseries/TimeSeriesIndexSearcher.java

@@ -65,6 +67,7 @@ public void search(Query query, BucketCollector bucketCollector) throws IOExcept
        int seen = 0;
        query = searcher.rewrite(query);
        Weight weight = searcher.createWeight(query, bucketCollector.scoreMode(), 1);
+        AtomicInteger tsidOrd = new AtomicInteger();


I don't think this has to be an AtomicInteger, because there is one thread here that does the time series search, right? It is just convenient to use AtomicInteger?

Just scanning, it looks like it can be an int. But I might be missing something. FWIW I prefer new int[1] if I need a mutable int box because it doesn't make the reader think "oh god, there are multiple threads now"

Yeah, it's just convenience. It could be a long[] returning [0] each time if you think the locking and stuff will slow things down unnecessarily?

And yes it can be int and not long of course because we're dealing with single indexes here.

Uh, sorry, does it need long because it's ordinals across all leaves? It's not likely to get that big, but global ordinals are a long for paranoia.

I don't think the locking will slow things down. It's uncontended so it'll get zapped. I think. I just think it's easier to read as an int because I never have to wonder where the threads are. Could you do () -> tsidOrd as the supplier and keep it as just int?

Even across all leaves maxDoc is an int so ordinals will only ever be int as well. A docvalues ordinal would be a long because you could have multiple values per doc, but we will only ever have at most one tsid per document.

We can't do () -> tsidOrd because it needs to be 'effectively final' so it would have to be an array reference. Which is fine.

👍 on int - we know that we have at most one tsid per doc.

Array reference is better for me. It's more code but I can read it faster....

I also lean toward using new int[1] here over AtomicInteger.
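
For readers following the thread, a minimal sketch of the `new int[1]` approach. The class name `MutableIntBoxSketch` and the method `demo` are hypothetical and not taken from the PR. A lambda cannot capture a local `int` that is later modified, because captured locals must be effectively final. The array reference never changes, so it can be captured while its single element is updated.

```java
import java.util.function.IntSupplier;

/** Hypothetical illustration of a one-element array used as a mutable int box. */
final class MutableIntBoxSketch {

    static int[] demo() {
        // The reference 'tsidOrd' is effectively final; only its element changes.
        final int[] tsidOrd = new int[1];
        IntSupplier ordSupplier = () -> tsidOrd[0];

        int before = ordSupplier.getAsInt();
        tsidOrd[0]++; // e.g. the search loop has moved to a new TSID
        int after = ordSupplier.getAsInt();
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        int[] seen = demo();
        System.out.println(seen[0] + " -> " + seen[1]);
    }
}
```

Unlike `AtomicInteger`, the array makes no suggestion of cross-thread access, which is the readability point raised above. It is only appropriate when a single thread touches the value.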

nik9000 · 2022-09-13T15:39:19Z

server/src/main/java/org/elasticsearch/search/aggregations/AggregationExecutionContext.java

@@ -21,18 +22,21 @@
 */
 public class AggregationExecutionContext {

-    private final Supplier<BytesRef> tsidProvider;
-    private final Supplier<Long> timestampProvider;
+    private final Supplier<BytesRef> tsidProvider;  // TODO remove this entirely?


nik9000 · 2022-09-13T15:39:38Z

server/src/main/java/org/elasticsearch/search/aggregations/AggregationExecutionContext.java

@@ -44,6 +48,10 @@ public BytesRef getTsid() {
    }

    public Long getTimestamp() {


Should this be long now?

nik9000 · 2022-09-13T15:41:43Z

server/src/main/java/org/elasticsearch/search/aggregations/timeseries/TimeSeriesAggregator.java

@@ -42,7 +42,7 @@ public TimeSeriesAggregator(
        CardinalityUpperBound bucketCardinality,
        Map<String, Object> metadata
    ) throws IOException {
-        super(name, factories, context, parent, bucketCardinality, metadata);
+        super(name, factories, context, parent, CardinalityUpperBound.MANY, metadata);


I think it should be MANY actually. Though this isn't related and I should have caught it earlier. MANY here means "my children will have an unbounded number of buckets". It's normal to do stuff like bucketCardinality.multiply(filters.size()) if, say, you were filters and knew precisely how many buckets you'd make. But we never do.

nik9000 · 2022-09-13T15:43:28Z

.../src/main/java/org/elasticsearch/search/aggregations/timeseries/TimeSeriesIndexSearcher.java

@@ -65,6 +67,7 @@ public void search(Query query, BucketCollector bucketCollector) throws IOExcept
        int seen = 0;
        query = searcher.rewrite(query);
        Weight weight = searcher.createWeight(query, bucketCollector.scoreMode(), 1);
+        AtomicInteger tsidOrd = new AtomicInteger();


Just scanning, it looks like it can be an int. But I might be missing something. FWIW I prefer new int[1] if I need a mutable int box because it doesn't make the reader think "oh god, there are multiple threads now"

romseygeek · 2022-09-13T16:22:46Z

@elasticmachine run elasticsearch-ci/part-3

romseygeek · 2022-09-14T09:58:14Z

This is ready for another round

This PR modifies the downsampling operation so that it uses a global ordinal to track tsid changes. It depends on the work done in #90035. Relates to #74660.

Add a TSID global ordinal to TimeSeriesIndexSearcher

c4f3868

romseygeek added >feature :StorageEngine/TSDB You know, for Metrics v8.5.0 labels Sep 13, 2022

romseygeek requested review from nik9000 and martijnvg September 13, 2022 14:35

romseygeek self-assigned this Sep 13, 2022

elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Sep 13, 2022

Update docs/changelog/90035.yaml

6108d72

romseygeek added 2 commits September 13, 2022 15:46

duh

d32dd9f

Merge remote-tracking branch 'romseygeek/tsdb/tsid-ord' into tsdb/tsi…

da3d4c4

…d-ord

martijnvg reviewed Sep 13, 2022

nik9000 reviewed Sep 13, 2022

deef

aabef82

nik9000 approved these changes Sep 14, 2022

romseygeek merged commit aed64a6 into elastic:main Sep 14, 2022

csoulios mentioned this pull request Sep 15, 2022

[TSDB] Improve downsampling performance by using tsid ordinals #90088

Merged
