Query Pushdown #305
Hey @jacksontj, thanks for all of these! I think for the Thanos querier we have a minor advantage: we know exactly what "types" of underlying leaves we have. It will be either:
I totally see this problem; however, I think we avoid it, because with Thanos you can have really lightweight Prometheus scrapers (scraper = sidecar + Prometheus). Since all data is uploaded as soon as it appears on disk, there is no need for special optimizations for a large number of samples on the Prometheus side, and we can keep metric retention relatively short (24h or even 12h). BTW you did not finish the sentence:
I totally see that use case. For remote query: I'm not sure how your deduplication code works, but for Thanos I would be worried about deduplication correctness. It's a bit fuzzy on raw data; with aggregated data I can imagine it being way worse or even impossible (we basically have less evidence of what happened with the scrapes). How did you solve that? For Thanos we don't have
Presumably the sidecar has the same (or similar) performance issues as Prometheus itself? From looking at the code it seems that the sidecar only implements the storage API, which means it has to ship all of the raw data to the queriers for the querier to calculate an answer.
Presumably the same issue exists wherever the long-term storage lives? Good to know that Thanos mitigates the Prometheus issues, but I'd definitely suggest looking into the marshaling performance (and memory consumption) of whatever your long-term storage system is.
I didn't, that's embarrassing :) I've edited the original comment and pasted the sentence down here: In my experience this was the most important feature, not so much because of the performance but because of the stability; without it you can easily exhaust memory on a remote Prometheus host with a seemingly "safe" query.
The way I wrote the current one, it is an API to hook up functions that know how to mutate the tree. My expectation is that the functions that do the replacing (such as the one in promxy) would expose their structs as packages for others to use. But at the end of the day, if you want more sophisticated control there is always a chance for more bugs; the key with the implementation I have is that it's completely optional.
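For concreteness, here is roughly the shape such a hook could take. This is an illustrative guess based on the description in this thread (using the modern promql/parser package), not the actual API from the promxy fork:

```go
package replacer

import (
	"context"

	"github.com/prometheus/prometheus/promql/parser"
)

// NodeReplacerFunc is a hypothetical hook signature: it is invoked for nodes
// of the parsed query and may return a replacement subtree (for example, a
// selector pre-populated with data fetched from a downstream Prometheus), or
// nil to leave the node untouched.
type NodeReplacerFunc func(ctx context.Context, node parser.Node, path []parser.Node) (parser.Node, error)
```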
It's fairly simple. The details are documented in the
Correct, this would require PromQL at the leaves. Alternatively you could just send the request to Prometheus itself, but I'm not 100% clear on the distinction between the Prometheus host and the sidecar/leaf (it sounds like it's mostly to do with metric retention).
Thought about it, but with the
Exactly that! I am only saying that this level of optimization would be required for the component that operates with long-term storage, which is
Ok, but it is done at your aggregated query level, right? So if you push all queries down, this
Basically:
Yeah, I cannot, because we use a gRPC service instead. But also, again, the main place where "query pushdown" or marshalling improvements would be beneficial is our store gateway, which is a fully remote service from the Prometheus server's perspective.
Yeah, but that's only the case when you actually do query pushdown, which we don't do. You do, so I understand you 100% here. To sum up:
EDIT: query pushdown might be tricky on time boundaries though, for
Ah, I missed that you actually push down only some sub-queries... hm
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@bwplotka still need this?
Starting formal design doc, feel free to collaborate (: https://docs.google.com/document/d/1ajMPwVJYnedvQ1uJNW2GDBY6Q91rKAw3kgA5fykIRrg/edit#
Hello 👋 Looks like there was no activity on this issue for the last two months.
Still valid.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Still valid.
Hello 👋 Looks like there was no activity on this issue for the last two months.
As mentioned on Slack, I suggest unblocking the work on this by starting with a simple case: we can push down any kind of query to a StoreAPI node if its time range doesn't overlap with any others. This should cover a lot already, since the most recent data is usually covered by Receiver/Rule/Sidecar and the historical part is covered by a load-balanced Thanos Store. The most recent data typically only spans a day or two, so IMHO it is fine to retrieve everything from there. Thus, with query-frontend splitting a query, we can already save a lot of data transfer over the wire. We already do many tricks for downsampled data in iterators, and naturally Select() is used for retrieving data (not the results), so it seems to me that a different solution is needed. TBH I like the NodeReplacer idea. I saw a PR where you were trying to contribute this to Prometheus, but maybe a simple API to do this would be accepted now that there are more use cases for it? Or maybe we could make it even simpler: pass everything down via gRPC to another PromQL engine and then just return its results as our results (on the querier's side), i.e. let's only have one engine? So, in a sense, Thanos Query would become a simple reverse proxy that forwards the query it gets to a StoreAPI node if that node is the only one covering the time range.
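As a rough illustration of that first iteration (not Thanos' actual implementation), here is a minimal Go sketch of the eligibility check described above: a query can only be pushed down to a store that fully covers the query window and whose time range does not overlap any other store's. The `storeRange` type and `pushdownTarget` function are hypothetical names used only for this sketch.

```go
package main

import "fmt"

// storeRange is a hypothetical stand-in for the time range a StoreAPI node
// advertises (min/max timestamps in milliseconds).
type storeRange struct {
	Name             string
	MinTime, MaxTime int64
}

// overlaps reports whether two inclusive time ranges share any instant.
func overlaps(a, b storeRange) bool {
	return a.MinTime <= b.MaxTime && b.MinTime <= a.MaxTime
}

// pushdownTarget returns a store that fully covers [qMin, qMax] and does not
// overlap any other store, i.e. the case where the whole query could be
// evaluated remotely. It returns nil when no such store exists, meaning the
// querier still has to fetch raw series and deduplicate/evaluate locally.
func pushdownTarget(stores []storeRange, qMin, qMax int64) *storeRange {
	for i := range stores {
		s := &stores[i]
		if qMin < s.MinTime || qMax > s.MaxTime {
			continue // store does not cover the whole query window
		}
		exclusive := true
		for j := range stores {
			if j != i && overlaps(*s, stores[j]) {
				exclusive = false // overlapping data, dedup still needed upstream
				break
			}
		}
		if exclusive {
			return s
		}
	}
	return nil
}

func main() {
	stores := []storeRange{
		{Name: "sidecar", MinTime: 9_000_000, MaxTime: 10_000_000},
		{Name: "store-gateway", MinTime: 0, MaxTime: 8_999_999},
	}
	if t := pushdownTarget(stores, 1_000_000, 5_000_000); t != nil {
		fmt.Println("push the whole query down to", t.Name)
	} else {
		fmt.Println("fall back to merging raw series in the querier")
	}
}
```

With query-frontend time splitting, each split then either hits exactly one store (and can be pushed down whole) or falls back to today's merge-and-evaluate path.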
BTW, we are finally moving on this work item 🤗 I made a quick PoC here: #4901. We can work on a first iteration; in the meantime we are working on the formal proposal.
Hello 👋 Looks like there was no activity on this issue for the last two months. |
I think this is still needed. |
Hello 👋 Looks like there was no activity on this issue for the last two months. |
Still needed. |
Hello 👋 Looks like there was no activity on this issue for the last two months. |
Still needed |
Hello Thanos people! I just ran into your project the other day and wanted to chat with someone about some potential performance improvements for the `Querier` nodes. Having read a little through the docs, it sounds like the querier nodes basically just aggregate data from N Prometheus servers into a single endpoint. I have a project, https://github.com/jacksontj/promxy, which does exactly that. I started it back in October of last year and have picked up a lot of perf learnings since then (specifically mechanisms to reduce the query load on the prom system).

I'll start by stating some context to help center the rest of the feedback. The first version of Promxy that I wrote was a simple implementation of the `Storage` interface for prom. This was then attached to an engine and the API and, tada, a remote querier. The only real magic in that first version was how to merge datasets between hosts etc.
From my reading of the code and docs it seems that Thanos' Query service does the same thing.
There are a variety of optimizations I've made since that first version, but the main two improvements are:
1. Marshaling Performance

In the case of a remote querier, the amount of data being marshaled and shipped across the wire is a major concern. Since you are starting with the RemoteRead API, I expect the marshaling latency isn't going to be a huge issue (it's fast and code-generated). The main concern here is the memory pressure this puts on Prometheus itself. On the JSON side this was really bad, but the issue is similar on the proto side of things: when you ask for N bytes of data, that consumes at minimum 2*N bytes of memory on the Prometheus server (in the JSON case it was 14x, so proto is in a much better place). Solving this on the proto side would require implementing streaming to an io.Writer in the proto libraries. In my experience this was the most important feature, not so much because of the performance but because of the stability: without this feature you can easily exhaust memory on a remote Prometheus host with a seemingly "safe" query.
EDIT(@bwplotka): This was done already.
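For readers less familiar with the problem, here is a minimal sketch of the general pattern behind the fix: marshal and write each series message as its own length-delimited frame, so peak memory is bounded by one frame rather than the whole response. This is only an illustration of the frame-per-message idea, not Prometheus' actual streamed remote-read code; the `marshaler` interface and the helper names below are assumptions for this sketch (gogo/protobuf-generated types, as used by Prometheus and Thanos, provide a compatible `Marshal()` method).

```go
package main

import (
	"bufio"
	"encoding/binary"
	"io"
	"os"
)

// marshaler matches what protobuf-generated types provide; the concrete
// types below are placeholders for this sketch.
type marshaler interface {
	Marshal() ([]byte, error)
}

// writeDelimited writes one length-prefixed frame to w. Because every series
// (or small batch of series) is encoded and written on its own, peak memory
// stays proportional to one frame instead of the entire response.
func writeDelimited(w io.Writer, m marshaler) error {
	body, err := m.Marshal()
	if err != nil {
		return err
	}
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(body)))
	if _, err := w.Write(lenBuf[:n]); err != nil {
		return err
	}
	_, err = w.Write(body)
	return err
}

// streamSeries drains an iterator-style source frame by frame rather than
// materializing the full response before marshaling it.
func streamSeries(w io.Writer, next func() (marshaler, bool)) error {
	bw := bufio.NewWriter(w)
	for {
		m, ok := next()
		if !ok {
			break
		}
		if err := writeDelimited(bw, m); err != nil {
			return err
		}
	}
	return bw.Flush()
}

// fakeSeries is a stand-in for a generated series message.
type fakeSeries struct{ payload []byte }

func (f fakeSeries) Marshal() ([]byte, error) { return f.payload, nil }

func main() {
	series := []fakeSeries{{[]byte("series-1")}, {[]byte("series-2")}}
	i := 0
	next := func() (marshaler, bool) {
		if i >= len(series) {
			return nil, false
		}
		s := series[i]
		i++
		return s, true
	}
	if err := streamSeries(os.Stdout, next); err != nil {
		panic(err)
	}
}
```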
2. NodeReplacer API
By implementing the `Storage` API, the querier needs to fetch all data in a window to calculate the response to the PromQL query. This is fine for single points or small windows, but when you get into the range of some dashboard queries (thousands of timeseries across days of duration) it becomes overwhelmingly large. In some cases I've seen this "fetch all the data" step take ~10x the time of running the query (not to mention the CPU and network utilization on the remote end). The solution I came up with was the `NodeReplacer` API. This API allows you to hook a function into the PromQL parsing to "replace" sections of the query as you see fit. The API is quite generic, but my use case for it is actually relatively simple: take the large PromQL query and break it down into sub-queries that can be farmed out to other prom boxes, so they return an "aggregated" set of data over the wire. This will be easier to explain with an example. PromQL:
sum(irate(http_request_duration_microseconds[1m]))
This (effectively) gets parsed into an `AggregateExpr` -> `Call` -> `VectorSelector`. If we were going solely through the Querier API, we'd have to return all datapoints for timeseries that match `http_request_duration_microseconds`. With the `NodeReplacer` API we can instead ask the downstream prom node for `irate(http_request_duration_microseconds[1m])` and sum the results together. This is done by replacing the `AggregateExpr.Expr` with the fetched results in the form of a `VectorSelector`.

TL;DR: this means we can dramatically reduce the amount of data that needs to be transferred for the vast majority of queries, without changing the outcome of the query itself. I had actually submitted this as a PR to upstream prom last year (prometheus/prometheus#3631) but it was rejected, so as of now it is implemented in my fork on GitHub (https://github.com/jacksontj/prometheus/tree/release-2.2_fork_promxy).
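To make the tree shape concrete, here is a small sketch using Prometheus' PromQL parser (the modern `promql/parser` package; the 2018-era code exposed the same node types directly under `promql`). It only parses the example query and prints which subtree a replacer could hand off to a downstream node; the actual `NodeReplacer` hook lives in the fork linked above and is not reproduced here.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/promql/parser"
)

func main() {
	expr, err := parser.ParseExpr(`sum(irate(http_request_duration_microseconds[1m]))`)
	if err != nil {
		panic(err)
	}

	// The tree is AggregateExpr(sum) -> Call(irate) -> MatrixSelector -> VectorSelector.
	agg, ok := expr.(*parser.AggregateExpr)
	if !ok {
		fmt.Println("not an aggregation; would fall back to a plain Select()")
		return
	}
	call, ok := agg.Expr.(*parser.Call)
	if !ok {
		fmt.Println("aggregating something other than a function call")
		return
	}

	// A node replacer would evaluate call.String() on the downstream
	// Prometheus and splice the returned samples back in where agg.Expr
	// currently sits, leaving only the outer sum() to run locally.
	fmt.Printf("push down %q and only run the outer aggregation locally\n", call.String())
}
```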
There are definitely a lot more things for us to chat about, but I think these are a good starting point to work from. Looking forward to making a better prom ecosystem!