
Thanos (sidecar) returns no results with promxy used as Prometheus remote_read endpoint #355

Closed
frebib opened this issue Oct 7, 2020 · 2 comments

@frebib
Contributor

frebib commented Oct 7, 2020

Related to #350 and #351

For our use case, we are trying to build an asymmetric Prometheus/Thanos setup using Promxy as a datacentre aggregator.
Here is a simplified view of the setup we're aiming for. The problem part is the blue line from thanos query to promxy. The remote_read API call works as of #351 but returns no data. Cutting out promxy and targeting one of the Prometheus instances (or rather its sidecar) directly works as expected and returns the relevant rows.

[Diagram: thanos-promxy-setup]

To replicate this, you'll need promxy and prometheus running, plus two thanos instances, one for the sidecar and one for the query webui.

thanos query --store=<grpc-addr-of-sidecar> --http-address=<bind-address-for-webui> --log.level=debug
thanos sidecar --grpc-address=<grpc-bind-address> --prometheus.url=<promxy-http-url> --log.level=debug

In trying to debug this, it seems that when pointing sidecar directly at Prometheus, it logs

level=debug ts=2020-10-07T12:06:30.476199814Z caller=prometheus.go:259 msg="started handling ReadRequest_STREAMED_XOR_CHUNKS streamed read response."
level=debug ts=2020-10-07T12:06:30.585904858Z caller=prometheus.go:335 msg="handled ReadRequest_STREAMED_XOR_CHUNKS request." frames=5816 series=5816

but when pointing sidecar at promxy, it instead logs

level=debug ts=2020-10-07T13:22:18.354275818Z caller=prometheus.go:214 msg="started handling ReadRequest_SAMPLED response type."
level=debug ts=2020-10-07T13:22:18.397907404Z caller=prometheus.go:254 msg="handled ReadRequest_SAMPLED request." series=5812
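
To compare the two runs side by side, the "handled" lines above can be reduced to the response type and the series count. This is only a rough sketch: it assumes the sidecar's debug output is plain logfmt as quoted above, and the helper names `parse_logfmt` and `summarize` are made up for illustration and are not part of Thanos or promxy.

```python
import re

# The two "handled" lines quoted above, copied verbatim.
DIRECT = ('level=debug ts=2020-10-07T12:06:30.585904858Z caller=prometheus.go:335 '
          'msg="handled ReadRequest_STREAMED_XOR_CHUNKS request." frames=5816 series=5816')
VIA_PROMXY = ('level=debug ts=2020-10-07T13:22:18.397907404Z caller=prometheus.go:254 '
              'msg="handled ReadRequest_SAMPLED request." series=5812')

# A logfmt pair is either key=value or key="quoted value".
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Split one logfmt line into a dict of string fields (hypothetical helper)."""
    fields = {}
    for key, value in PAIR.findall(line):
        # Strip the surrounding quotes from quoted values.
        fields[key] = value[1:-1] if value.startswith('"') else value
    return fields


def summarize(line):
    """Pull the response type and series count out of a 'handled' line."""
    fields = parse_logfmt(line)
    match = re.search(r'ReadRequest_(\w+)', fields.get("msg", ""))
    return {
        "response_type": match.group(1) if match else None,
        "series": int(fields["series"]) if "series" in fields else None,
    }


if __name__ == "__main__":
    for name, line in (("direct", DIRECT), ("via promxy", VIA_PROMXY)):
        print(name, summarize(line))
```

Going by these two lines alone, the sidecar reports a non-zero series count in both cases (5816 directly, 5812 via promxy), so the response type is the most visible difference between the runs.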

After reading your comment yesterday (#352 (comment)) and finding this change, prometheus/prometheus@48b2c9c, I'm wondering if Thanos expects the remote_read reply in STREAMED_XOR_CHUNKS format instead of SAMPLED. (edit: It appears Thanos accepts both, although the STREAMED code path is certainly better tested now that Prometheus uses it by default: https://github.com/thanos-io/thanos/blob/a7b2a449ce9aa77cc225a699c1f399a3528d97b3/pkg/store/prometheus.go#L206-L216). It's entirely possible that's not the cause, but it is one difference I observed. This bug may also be fixed by #352.

Before I start digging deep into the Prometheus/Thanos/Promxy code again, is there anything that jumps to mind that could cause this behaviour?

Thanks

@jacksontj
Owner

This definitely looks like a reasonable objective (local Prometheus with recent data, remote Thanos with more data). Based on the diagram above I'd expect it to work, although it would have been broken until that PR yesterday, and as mentioned in #350 I'm not aware of anyone using remote_read into promxy.

One thing I'd suggest looking into as an efficiency improvement is putting promxy in front of the thanos querier. Promxy can sub out a query to many different nodes and requires significantly fewer resources to get the answer. I added an example explaining this a bit here, but TL;DR: remote_read is an inefficient interface for queries. So if promxy could sit at the front of the stack, some queries (those over "recent" data) could be served using the regular query interface through promxy, which is significantly cheaper (this would make alerting dramatically cheaper, since it acts on recent data).

I did see #352, but that issue seems to be a go.mod issue; in reality promxy is currently based on a Prometheus 2.10 fork, so that should be a non-issue. That does mean we aren't new enough to have the STREAMED_XOR_CHUNKS option, and it's also possible that Prometheus 2.10 had a bug in the SAMPLED interface (it wouldn't surprise me; all the remote_read/write stuff is "unsupported" or "experimental", so bugs turn up there with some regularity). So I'd suggest trying your setup with Prometheus 2.10; if you see the same problem there, then it's likely an issue in the Prometheus dep (which means it'd be time to update again).
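
The difference in response type seen in the logs can be pictured as a simple negotiation. The sketch below is only a model, not the Prometheus, Thanos or promxy code: it assumes the client sends an ordered list of response types it accepts and the server answers with the first one it supports, falling back to SAMPLED when the list is empty. The function name `negotiate` and the constants are hypothetical, and the type names follow the log lines quoted earlier.

```python
# Response type names as they appear in the sidecar logs (assumed labels).
SAMPLED = "SAMPLED"
STREAMED_XOR_CHUNKS = "STREAMED_XOR_CHUNKS"


def negotiate(accepted, supported):
    """Return the first accepted response type that the server supports.

    accepted:  response types in the client's order of preference
    supported: set of response types the server can produce
    """
    if not accepted:
        # Assumption: a request that lists nothing is treated as asking for SAMPLED.
        accepted = [SAMPLED]
    for response_type in accepted:
        if response_type in supported:
            return response_type
    raise ValueError(f"no common response type: {accepted} vs {sorted(supported)}")


if __name__ == "__main__":
    client_prefers = [STREAMED_XOR_CHUNKS, SAMPLED]

    # A server that knows both types answers with the streamed one.
    print(negotiate(client_prefers, {SAMPLED, STREAMED_XOR_CHUNKS}))

    # A server built on an older code base that only knows SAMPLED
    # ends up answering with SAMPLED, as in the promxy log above.
    print(negotiate(client_prefers, {SAMPLED}))
```

Under this model, a SAMPLED reply from promxy is the expected outcome of the older dependency rather than a fault in itself, which is why testing against a plain Prometheus 2.10 is a useful way to separate the two explanations.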

@jacksontj
Owner

There have been no updates to this issue, so I'm going to close it out. If there is more to discuss or there are additional questions, feel free to re-open!
