Query Pushdown #305
Hey @jacksontj, thanks for all of these! I think for the Thanos querier we have a minor advantage: we know exactly what "types" of underlying leaves we have. It will be either:
I totally see this problem; however, I think we avoid it, because with Thanos you can have really lightweight Prometheus scrapers (scraper = sidecar + Prometheus). Since all data is uploaded as soon as it appears on disk, there is no need for special optimizations for a large number of samples on the Prometheus side, and we can keep metric retention relatively short (24h or even 12h). BTW you did not finish the sentence:
I totally see that use case. For remote query: I'm not sure how your deduplication code works, but for Thanos I would be worried about deduplication correctness. It's a bit fuzzy on raw data; with aggregated data I can imagine it being way worse or even impossible (we basically have less evidence of what happened with the scrapes). How did you solve that? For Thanos we don't have
Presumably the sidecar has the same (or similar) performance issues as Prometheus itself? From looking at the code it seems that the sidecar only implements the storage API, which means it has to ship all of the raw data to the queriers for the querier to calculate an answer.
Presumably the same issue exists wherever the long-term storage lives? Good to know that Thanos mitigates the Prometheus issues, but I'd definitely suggest looking into the marshaling performance (and memory consumption) of whatever your long-term storage system is.
I didn't, that's embarrassing :) I've edited the original comment and pasted the sentence down here: In my experience this was the most important feature, not so much because of the performance but because of the stability; without it you can easily exhaust memory on a remote Prometheus host with a seemingly "safe" query.
The way I wrote the current one, it is an API to hook up functions that know how to mutate the tree. My expectation is that the functions that do the replacing (such as the one in promxy) would expose their structs as packages for others to use. But at the end of the day, if you want more sophisticated control there is always a chance for more bugs; the key with the implementation I have is that it's completely optional.
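For concreteness, here is roughly the shape such a hook could take. This is an illustrative guess based on the description in this thread (using the modern promql/parser package), not the actual API from the promxy fork:

```go
package replacer

import (
	"context"

	"github.com/prometheus/prometheus/promql/parser"
)

// NodeReplacerFunc is a hypothetical hook signature: it is invoked for nodes
// of the parsed query and may return a replacement subtree (for example, a
// selector pre-populated with data fetched from a downstream Prometheus), or
// nil to leave the node untouched.
type NodeReplacerFunc func(ctx context.Context, node parser.Node, path []parser.Node) (parser.Node, error)
```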
It's fairly simple. The details are documented in the
Correct, this would require PromQL at the leaves. Alternatively you could just send the request to Prometheus itself, but I'm not 100% clear on the distinction between the Prometheus host and the sidecar/leaf (it sounds like it's mostly to do with metric retention).
Thought about it, but with the
Exactly that! I am only saying that this level of optimization would be required for the component that operates with long-term storage, which is
Ok, but it is done at your aggregated query level, right? So if you push all queries down, this
Basically:
Yeah, I cannot, because we use a gRPC service instead. But also, again, the main place where "query pushdown" or marshalling improvements would be beneficial is our store gateway, which is a fully remote service from the Prometheus server's perspective.
Yeah, but that's only the case when you actually do query pushdown, which we don't do. You do, so I understand you 100% here. To sum up:
EDIT: query pushdown might be tricky on time boundaries though, for
Ah, I missed that you actually push down only some sub-queries... hm
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@bwplotka still need this?
Starting formal design doc, feel free to collaborate (: https://docs.google.com/document/d/1ajMPwVJYnedvQ1uJNW2GDBY6Q91rKAw3kgA5fykIRrg/edit#
Hello 👋 Looks like there was no activity on this issue for the last two months.
Still valid.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Still valid.
Hello 👋 Looks like there was no activity on this issue for the last two months.
As mentioned on Slack, I suggest unblocking the work on this by starting with a simple case: we can push down any kind of query to a StoreAPI node if its time range doesn't overlap with any others. This should cover a lot already, since the most recent data is usually covered by Receiver/Rule/Sidecar and the historical part is covered by a load-balanced Thanos Store. The most recent data typically only spans a day or two, so IMHO it is fine to retrieve everything from there. Thus, with query-frontend splitting a query, we can already save a lot of data transfer over the wire. We already do many tricks for downsampled data in iterators, and naturally Select() is used for retrieving data (not the results), so it seems to me that a different solution is needed. TBH I like the NodeReplacer idea. I saw a PR where you were trying to contribute this to Prometheus, but maybe a simple API to do this would be accepted now that there are more use cases for it? Or maybe we could make it even simpler: pass everything down via gRPC to another PromQL engine and then just return its results as our results (on the querier's side), i.e. let's only have one engine? So, in a sense, Thanos Query would become a simple reverse proxy that forwards the query it gets to a StoreAPI node if that node is the only one covering the time range.
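As a rough illustration of that first iteration (not Thanos' actual implementation), here is a minimal Go sketch of the eligibility check described above: a query can only be pushed down to a store that fully covers the query window and whose time range does not overlap any other store's. The `storeRange` type and `pushdownTarget` function are hypothetical names used only for this sketch.

```go
package main

import "fmt"

// storeRange is a hypothetical stand-in for the time range a StoreAPI node
// advertises (min/max timestamps in milliseconds).
type storeRange struct {
	Name             string
	MinTime, MaxTime int64
}

// overlaps reports whether two inclusive time ranges share any instant.
func overlaps(a, b storeRange) bool {
	return a.MinTime <= b.MaxTime && b.MinTime <= a.MaxTime
}

// pushdownTarget returns a store that fully covers [qMin, qMax] and does not
// overlap any other store, i.e. the case where the whole query could be
// evaluated remotely. It returns nil when no such store exists, meaning the
// querier still has to fetch raw series and deduplicate/evaluate locally.
func pushdownTarget(stores []storeRange, qMin, qMax int64) *storeRange {
	for i := range stores {
		s := &stores[i]
		if qMin < s.MinTime || qMax > s.MaxTime {
			continue // store does not cover the whole query window
		}
		exclusive := true
		for j := range stores {
			if j != i && overlaps(*s, stores[j]) {
				exclusive = false // overlapping data, dedup still needed upstream
				break
			}
		}
		if exclusive {
			return s
		}
	}
	return nil
}

func main() {
	stores := []storeRange{
		{Name: "sidecar", MinTime: 9_000_000, MaxTime: 10_000_000},
		{Name: "store-gateway", MinTime: 0, MaxTime: 8_999_999},
	}
	if t := pushdownTarget(stores, 1_000_000, 5_000_000); t != nil {
		fmt.Println("push the whole query down to", t.Name)
	} else {
		fmt.Println("fall back to merging raw series in the querier")
	}
}
```

With query-frontend time splitting, each split then either hits exactly one store (and can be pushed down whole) or falls back to today's merge-and-evaluate path.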
BTW, we are finally moving on this work item 🤗 I made a quick PoC here: #4901. We can work on a first iteration; in the meantime we are working on the formal proposal.
Hello 👋 Looks like there was no activity on this issue for the last two months. |
I think this is still needed. |
Hello 👋 Looks like there was no activity on this issue for the last two months. |
Still needed. |
Hello 👋 Looks like there was no activity on this issue for the last two months. |
Still needed |
Hello Thanos people! I just ran into your project the other day and wanted to chat with someone about some potential performance improvements for the `Querier` nodes. Having read a little through the docs, it sounds like the querier nodes basically just aggregate data from N Prometheus servers into a single endpoint. I have a project, https://github.com/jacksontj/promxy, which does exactly that. I started it back in October of last year and have picked up a lot of perf learnings since then (specifically mechanisms to reduce the query load on the prom system).

I'll start by stating some context to help center the rest of the feedback. The first version of Promxy that I wrote was a simple implementation of the `Storage` interface for prom. This was then attached to an engine and the API and, tada, a remote querier. The only real magic in that first version was how to merge datasets between hosts etc.
From my reading of the code and docs it seems that Thanos' Query service does the same thing.
There are a variety of optimizations I've made since that first version, but the main two improvements are:
1. Marshaling Performance

In the case of a remote querier, the amount of data being marshaled and shipped across the wire is a major concern. Since you are starting with the RemoteRead API, I expect the marshaling latency isn't going to be a huge issue (it's fast and code-generated). The main concern here is the memory pressure this puts on Prometheus itself. On the JSON side this was really bad, but the issue is similar on the proto side of things: when you ask for N bytes of data, that consumes at minimum 2*N bytes of memory on the Prometheus server (in the JSON case it was 14x, so proto is in a much better place). Solving this on the proto side would require implementing streaming to an io.Writer in the proto libraries. In my experience this was the most important feature, not so much because of the performance but because of the stability: without this feature you can easily exhaust memory on a remote Prometheus host with a seemingly "safe" query.
EDIT(@bwplotka): This was done already.
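For readers less familiar with the problem, here is a minimal sketch of the general pattern behind the fix: marshal and write each series message as its own length-delimited frame, so peak memory is bounded by one frame rather than the whole response. This is only an illustration of the frame-per-message idea, not Prometheus' actual streamed remote-read code; the `marshaler` interface and the helper names below are assumptions for this sketch (gogo/protobuf-generated types, as used by Prometheus and Thanos, provide a compatible `Marshal()` method).

```go
package main

import (
	"bufio"
	"encoding/binary"
	"io"
	"os"
)

// marshaler matches what protobuf-generated types provide; the concrete
// types below are placeholders for this sketch.
type marshaler interface {
	Marshal() ([]byte, error)
}

// writeDelimited writes one length-prefixed frame to w. Because every series
// (or small batch of series) is encoded and written on its own, peak memory
// stays proportional to one frame instead of the entire response.
func writeDelimited(w io.Writer, m marshaler) error {
	body, err := m.Marshal()
	if err != nil {
		return err
	}
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(body)))
	if _, err := w.Write(lenBuf[:n]); err != nil {
		return err
	}
	_, err = w.Write(body)
	return err
}

// streamSeries drains an iterator-style source frame by frame rather than
// materializing the full response before marshaling it.
func streamSeries(w io.Writer, next func() (marshaler, bool)) error {
	bw := bufio.NewWriter(w)
	for {
		m, ok := next()
		if !ok {
			break
		}
		if err := writeDelimited(bw, m); err != nil {
			return err
		}
	}
	return bw.Flush()
}

// fakeSeries is a stand-in for a generated series message.
type fakeSeries struct{ payload []byte }

func (f fakeSeries) Marshal() ([]byte, error) { return f.payload, nil }

func main() {
	series := []fakeSeries{{[]byte("series-1")}, {[]byte("series-2")}}
	i := 0
	next := func() (marshaler, bool) {
		if i >= len(series) {
			return nil, false
		}
		s := series[i]
		i++
		return s, true
	}
	if err := streamSeries(os.Stdout, next); err != nil {
		panic(err)
	}
}
```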
2. NodeReplacer API
By implementing the `Storage` API, the querier needs to fetch all data in a window to calculate the response to the PromQL query. This is fine for single points or small windows, but when you get into the range of some dashboard queries (thousands of timeseries across days of duration) it becomes overwhelmingly large. In some cases I've seen this "fetch all the data" step take ~10x the time of running the query (not to mention the CPU and network utilization on the remote end). The solution I came up with was the `NodeReplacer` API. This API allows you to hook a function into the PromQL parsing to "replace" sections of the query as you see fit. The API is quite generic, but my use case for it is actually relatively simple: take the large PromQL query and break it down into sub-queries that can be farmed out to other prom boxes, so they return an "aggregated" set of data over the wire. This will be easier to explain with an example. PromQL:
sum(irate(http_request_duration_microseconds[1m]))
This (effectively) gets parsed into an `AggregateExpr` -> `Call` -> `VectorSelector`. If we were going solely through the Querier API, we'd have to return all datapoints for timeseries that match `http_request_duration_microseconds`. With the `NodeReplacer` API we can instead ask the downstream prom node for `irate(http_request_duration_microseconds[1m])` and sum the results together. This is done by replacing the `AggregateExpr.Expr` with the fetched results in the form of a `VectorSelector`.

TL;DR: this means we can dramatically reduce the amount of data that needs to be transferred for the vast majority of queries, without changing the outcome of the query itself. I had actually submitted this as a PR to upstream prom last year (prometheus/prometheus#3631) but it was rejected, so as of now it is implemented in my fork on GitHub (https://github.com/jacksontj/prometheus/tree/release-2.2_fork_promxy).
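To make the tree shape concrete, here is a small sketch using Prometheus' PromQL parser (the modern `promql/parser` package; the 2018-era code exposed the same node types directly under `promql`). It only parses the example query and prints which subtree a replacer could hand off to a downstream node; the actual `NodeReplacer` hook lives in the fork linked above and is not reproduced here.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/promql/parser"
)

func main() {
	expr, err := parser.ParseExpr(`sum(irate(http_request_duration_microseconds[1m]))`)
	if err != nil {
		panic(err)
	}

	// The tree is AggregateExpr(sum) -> Call(irate) -> MatrixSelector -> VectorSelector.
	agg, ok := expr.(*parser.AggregateExpr)
	if !ok {
		fmt.Println("not an aggregation; would fall back to a plain Select()")
		return
	}
	call, ok := agg.Expr.(*parser.Call)
	if !ok {
		fmt.Println("aggregating something other than a function call")
		return
	}

	// A node replacer would evaluate call.String() on the downstream
	// Prometheus and splice the returned samples back in where agg.Expr
	// currently sits, leaving only the outer sum() to run locally.
	fmt.Printf("push down %q and only run the outer aggregation locally\n", call.String())
}
```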
There are definitely a lot more things for us to chat about, but I think these are a good starting point to work from. Looking forward to making a better prom ecosystem!