Implement query pushdown for a subset of aggregations #4917
Conversation
Nice, good start! Looks generally good, but I don't think topk/bottomk can be pushed down easily 🤔 Essentially the top from node X might not be the top from part Y.
pkg/store/prometheus.go
Outdated
```go
Method: "GET",
}

matrix, _, err := p.client.QueryRange(ctx, p.base, r.ToPromQL(), r.MinTime, r.MaxTime, r.QueryHints.Step/1000, queryOpts)
```
Is it always a range query?
I think it needs to be because of lookback, but I can double check.
Looks like we can always use a range query; we just need to use a positive step for instant queries.
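A sketch of the step handling this implies (the function and field names are illustrative, not the PR's actual code):

```go
package main

import "fmt"

// effectiveStep converts a step hint in milliseconds into the seconds value a
// range-query API expects. Instant queries carry no step (0), so we substitute
// a positive value; a single-point range only needs the step to be valid.
// Illustrative sketch, not the PR's actual code.
func effectiveStep(stepMillis int64) int64 {
	step := stepMillis / 1000
	if step <= 0 {
		step = 1
	}
	return step
}

func main() {
	fmt.Println(effectiveStep(30000)) // 30: step from a range-query hint
	fmt.Println(effectiveStep(0))     // 1: instant query, fall back to a positive step
}
```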
cc @GiedriusS what if we start slowly with this in main?
We could also add a flag to the sidecar to toggle this on and off, something like a feature flag.
I think we might want to add some tests as well to make sure we don't introduce a regression. I'll try to do that by EOW.
Related to #305
Could we reuse the same `--enable-feature` pattern for this instead of a separate flag?
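For illustration, the `--enable-feature` pattern usually boils down to parsing a comma-separated list into a set that the rest of the process consults. A minimal sketch (the flag and the `evaluate-queries` feature name come from this PR; the helper itself is hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// parseFeatures splits a comma-separated --enable-feature value into a set.
// Hypothetical helper mirroring the repeatable feature-flag pattern.
func parseFeatures(raw string) map[string]bool {
	feats := map[string]bool{}
	for _, f := range strings.Split(raw, ",") {
		if f = strings.TrimSpace(f); f != "" {
			feats[f] = true
		}
	}
	return feats
}

func main() {
	feats := parseFeatures("evaluate-queries")
	fmt.Println(feats["evaluate-queries"]) // true
	fmt.Println(feats["some-other-feature"]) // false
}
```

The advantage over a dedicated boolean flag is that later experimental features can be added without growing the CLI surface.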
I love this work - solid PR. LGTM, just small nits. Also we are missing docs, but we can add them in a separate PR!
```go
return err
}

matrix = make(model.Matrix, 0, len(vector))
```
TODO to myself: Add issue about remote read understanding aggr or streamed PromQL
Played around with this locally - looks good modulo @bwplotka's comments. I'm not sure how I dismissed Bartek's review with a commit that I didn't even author or push 😂 Let's play with it behind a feature flag and see how we can further improve upon this base 👍
pkg/store/prometheus.go
Outdated
```go
}
matrix = result
} else {
vector, _, err := p.client.QueryInstant(s.Context(), p.base, r.ToPromQL(), time.Unix(r.MaxTime/1000, 0), opts)
```
Suggested change:

```diff
- vector, _, err := p.client.QueryInstant(s.Context(), p.base, r.ToPromQL(), time.Unix(r.MaxTime/1000, 0), opts)
+ vector, _, err := p.client.QueryInstant(s.Context(), p.base, r.ToPromQL(), timestamp.Time(r.MaxTime), opts)
```
@fpetkovski any update on this? I'd like to merge this as a good base so that we can continue further pushdown-related work.
Ah sorry, I saw it was approved and didn't see the comments that followed. I'll address them today so that we can merge it 👍
@GiedriusS all comments should be addressed. The e2e tests seem to be failing, but I don't think the failure is related to this PR.
Certain aggregations can be executed safely on leaf nodes without worrying about data duplication or overlap. One such example is the max function, which can be computed on local data by the leaves before it is computed globally by the querier.

This commit implements local aggregation in the Prometheus sidecar for all functions which are safe to execute locally. The feature can be enabled by passing the `--enable-feature evaluate-queries` flag to the sidecar.

Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
💪🏽
```go
"max_over_time",
"min",
"min_over_time",
"group",
```
Why not `count()` btw? I guess some others would also work, but I guess you're just starting with a few...
Because then the PromQL engine goes ahead and calculates `count()` again on the already-correct, pushed-down results, thus corrupting them. I have suggested something here but it got no attention, so I haven't pursued it further. An alternative is to edit the AST of the query, but that was rejected as well.
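The underlying issue is that `max` is idempotent under re-aggregation while `count` is not, so the engine's second pass is harmless for one and corrupting for the other. A toy illustration with hypothetical per-leaf values (`maxOf` and `count` stand in for the PromQL aggregations):

```go
package main

import "fmt"

// maxOf stands in for the PromQL max aggregation.
func maxOf(vals []float64) float64 {
	m := vals[0]
	for _, v := range vals[1:] {
		if v > m {
			m = v
		}
	}
	return m
}

// count stands in for the PromQL count aggregation.
func count(vals []float64) float64 { return float64(len(vals)) }

func main() {
	leafA := []float64{4, 7} // hypothetical series values on one leaf
	leafB := []float64{9, 2} // ...and on another

	// max is safe: re-applying it over the pushed-down per-leaf maxima
	// still yields the global maximum.
	fmt.Println(maxOf([]float64{maxOf(leafA), maxOf(leafB)})) // 9, the true global max

	// count is not: the engine counts the two per-leaf counts and gets 2,
	// when the correct merge of pushed-down counts would be sum (2+2=4).
	fmt.Println(count([]float64{count(leafA), count(leafB)})) // 2, not 4
}
```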
Ah right, I had assumed there was already some facility for changing the outer operation (in the `count` case, to `sum`), but that's not required for min/max/group.

The comment on prometheus/prometheus#10101 is interesting though. Indeed, if the same series can exist on two leaves (I think this is only the case for HA deduplication in Thanos, right?), then pushing `rate()` or even `count()` down in a correct way would be impossible: if you do the count across two HA replicas and get 10 on each, you don't know whether you really counted the same 10 series both times, counted entirely different ones, or some overlapping sets.
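A small sketch of that ambiguity (with hypothetical series names): two pairs of replicas report identical per-replica counts, yet their deduplicated counts differ, so no merge function over the leaf counts alone can be correct.

```go
package main

import "fmt"

// dedupedCount counts distinct series names across two HA replicas,
// which is what count() over deduplicated data would see.
func dedupedCount(a, b []string) int {
	seen := map[string]struct{}{}
	for _, s := range a {
		seen[s] = struct{}{}
	}
	for _, s := range b {
		seen[s] = struct{}{}
	}
	return len(seen)
}

func main() {
	// Both scenarios look identical to the querier: each replica reports a count of 2.
	fmt.Println(dedupedCount([]string{"a", "b"}, []string{"a", "b"})) // 2: replicas hold the same series
	fmt.Println(dedupedCount([]string{"a", "b"}, []string{"c", "d"})) // 4: replicas hold disjoint series
}
```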
```go
"max",
"max_over_time",
"min",
"min_over_time",
```
May I ask why `avg_over_time` is not safe? Is it because in Thanos we do deduplication at query time? If we did write-time deduplication like Cortex, this would seem fine.
Yes, that's the reason. I believe it should be safe if you have unique data.
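A sketch of why query-time dedup breaks this (sample values are hypothetical): each replica averages only its own samples, so a replica with a scrape gap computes a locally consistent but globally wrong answer that the querier can no longer repair.

```go
package main

import "fmt"

// avg stands in for avg_over_time over one series' samples.
func avg(vals []float64) float64 {
	var sum float64
	for _, v := range vals {
		sum += v
	}
	return sum / float64(len(vals))
}

func main() {
	replica1 := []float64{1, 2, 4}      // missed the scrape that produced 100
	replica2 := []float64{1, 2, 100, 4} // has all samples

	// Query-time dedup would stitch the full sample stream together first,
	// then average it; that is the correct answer.
	deduped := []float64{1, 2, 100, 4}
	fmt.Println(avg(deduped)) // 26.75

	// Pushed down, replica 1 averages only its own gappy samples (~2.33),
	// and replica 2 gets a different answer again; the querier cannot tell
	// which, if either, matches the deduplicated series.
	fmt.Println(avg(replica1), avg(replica2))
}
```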