Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a TSID global ordinal to TimeSeriesIndexSearcher #90035

Merged
merged 5 commits into from
Sep 14, 2022

Conversation

romseygeek
Copy link
Contributor

Rather than trying to compare BytesRefs in tsdb-related aggregations, it
will be much quicker if we can use a search-global ordinal to detect when
we have moved to a new TSID. This commit adds such an ordinal to the
aggregation execution context.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Sep 13, 2022
@elasticsearchmachine
Copy link
Collaborator

Hi @romseygeek, I've created a changelog YAML for you.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I left some questions.

@@ -42,7 +42,7 @@ public TimeSeriesAggregator(
CardinalityUpperBound bucketCardinality,
Map<String, Object> metadata
) throws IOException {
super(name, factories, context, parent, bucketCardinality, metadata);
super(name, factories, context, parent, CardinalityUpperBound.MANY, metadata);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this hardcoded?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be MANY actually. Though this isn't related and I should have caught it earlier. MANY here means "my children will an unbounded number of buckets". It's normal to do stuff like bucketCardinality.multiply(filters.size()) if, say, you were filters and knew precisely how many buckets you'd make. But we never do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually an unrelated change - I'll back it out. It's necessary for the rate agg to work but I'm not sure that it's the correct way to fix things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or I can leave it in if Nik prefers it that way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I believe it's correct. I don't care if you keep it or take it out and do it in the rate agg.

@@ -21,18 +22,21 @@
*/
public class AggregationExecutionContext {

private final Supplier<BytesRef> tsidProvider;
private final Supplier<Long> timestampProvider;
private final Supplier<BytesRef> tsidProvider; // TODO remove this entirely?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we can do this. Maybe in a followup when at the same time adjusting the users of this provider (time_series aggregation and rollup)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

@@ -65,6 +67,7 @@ public void search(Query query, BucketCollector bucketCollector) throws IOExcept
int seen = 0;
query = searcher.rewrite(query);
Weight weight = searcher.createWeight(query, bucketCollector.scoreMode(), 1);
AtomicInteger tsidOrd = new AtomicInteger();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this has to be an AtomicInteger, because there is one thread here that does the time series search, right? It is just convenient to use AtomicInteger?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just scanning, it looks like it can be an int. But I might be missing something. FWIW I prefer new int[1] if I need a mutable int box because it doesn't make the reader think "oh god, there are multiple threads now"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's just convenience. It could be a long[] returning [0] each time if you think the locking and stuff will slow things down unnecessarily?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes it can be int and not long of course because we're dealing with single indexes here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uh, sorry, does it need long because it's ordinals across all leaves? it's not likely to get that big, but global ordinals a long for paranoia.

I don't think the locking will slow things down. it's uncontended so it'll get zapped. I think. I just think it's easier to read as an int because I never have to wonder where the threads are. Could you do () -> tsidOrd as the supplier and keep it as just int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even across all leaves maxDoc is an int so ordinals will only ever be int as well. A docvalues ordinal would be a long because you could have multiple values per doc, but we will only ever have at most one tsid per document.

We can't do () -> tsidOrd because it needs to be 'effectively final' so it would have to be an array reference. Which is fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on int - we know that we have at most one tsid per doc.

Array reference is better for me. It's more code but I can read it faster....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also lean toward using new int[1] here over AtomicInteger.

@@ -21,18 +22,21 @@
*/
public class AggregationExecutionContext {

private final Supplier<BytesRef> tsidProvider;
private final Supplier<Long> timestampProvider;
private final Supplier<BytesRef> tsidProvider; // TODO remove this entirely?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

@@ -44,6 +48,10 @@ public BytesRef getTsid() {
}

public Long getTimestamp() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be long now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

@@ -42,7 +42,7 @@ public TimeSeriesAggregator(
CardinalityUpperBound bucketCardinality,
Map<String, Object> metadata
) throws IOException {
super(name, factories, context, parent, bucketCardinality, metadata);
super(name, factories, context, parent, CardinalityUpperBound.MANY, metadata);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be MANY actually. Though this isn't related and I should have caught it earlier. MANY here means "my children will an unbounded number of buckets". It's normal to do stuff like bucketCardinality.multiply(filters.size()) if, say, you were filters and knew precisely how many buckets you'd make. But we never do.

@@ -65,6 +67,7 @@ public void search(Query query, BucketCollector bucketCollector) throws IOExcept
int seen = 0;
query = searcher.rewrite(query);
Weight weight = searcher.createWeight(query, bucketCollector.scoreMode(), 1);
AtomicInteger tsidOrd = new AtomicInteger();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just scanning, it looks like it can be an int. But I might be missing something. FWIW I prefer new int[1] if I need a mutable int box because it doesn't make the reader think "oh god, there are multiple threads now"

@romseygeek
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/part-3

@romseygeek
Copy link
Contributor Author

This is ready for another round

@romseygeek romseygeek merged commit aed64a6 into elastic:main Sep 14, 2022
csoulios added a commit that referenced this pull request Sep 15, 2022
This PR modifies downsampling operation so that it uses global ordinal to track tsid changes

PR depends on the work done in #90035

Relates to #74660
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants