sidecar: exceeded sample limit #1191

Closed
jjneely opened this issue May 29, 2019 · 3 comments · Fixed by #1619

Comments

@jjneely
Contributor

jjneely commented May 29, 2019

When running queries via the Query component, we often get partial results and errors back from various Sidecars that look like the screenshot below. The failing endpoints are definitely Prometheus/Sidecar IPs and not Store components.

[screenshot: thanos-query-20190529]

We broke out tcpdump to figure out the exact problem. The Prometheus remote read API is returning an HTTP 400 status code to the Sidecar, which propagates the error up to Thanos. The actual error message is below:

HTTP/1.1 400 Bad Request
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Wed, 29 May 2019 14:41:01 GMT
Content-Length: 33

exceeded sample limit (50000000)
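
For reference, this limit appears to come from Prometheus itself rather than from Thanos: the 50,000,000 matches the default of Prometheus's --storage.remote.read-sample-limit flag. Raising it on the Prometheus side, along these lines, would only paper over the problem:

prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.remote.read-sample-limit=100000000   # double the default; a workaround, not a fix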

As we are trying to build a transition path from direct Prometheus queries to querying data via Thanos, we have not yet lowered the retention of data on our Prometheus VMs. So they still have 30 days of data (and 60 in some cases). It appears that some of our Prometheus instances are large enough that Prometheus remote read queries end up returning a large amount of data and at times exceed this limit. It also seems to vastly slow down query performance.

The preferred thing to do here, obviously, is to run the Prometheus VMs with limited retention, since the data will exist in GCS/Store. We did this in our devel environment and the problems went away, query performance increased, and using downsampled results worked much better. However, I find myself in an interesting position: I don't feel comfortable calling my Thanos Query service "production" yet, and I don't feel comfortable reducing retention on teams' Prometheus VMs without a production Thanos Query service running. There may also be other reasons folks wish to keep more than a day or three of retention on their Prometheus VMs.

I'm wondering if this can be solved by adding a command line option to the Sidecar that takes a duration like 3d. The running Sidecar would then advertise and use the later of the requested mint or now-3d as the mint argument for Prometheus's remote read API, effectively limiting Sidecar queries against the Prometheus DB to the last 3 days. The remaining long-term data would be served from GCS. This would give us the best of both worlds: it limits queries against Prometheus while letting us keep retention until teams have moved over to the Thanos endpoints.
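
For illustration, the clamping itself would be roughly the following. This is a minimal sketch in Go, assuming a hypothetical limit duration passed in from a new flag; the names are made up and this is not actual Thanos code.

package main

import (
	"fmt"
	"time"
)

// clampMinTime returns the later of the requested mint and now-limit
// (both in milliseconds since epoch), so remote read requests sent to
// Prometheus never reach further back than `limit`.
func clampMinTime(requestMintMs int64, limit time.Duration, now time.Time) int64 {
	earliestAllowedMs := now.Add(-limit).UnixNano() / int64(time.Millisecond)
	if requestMintMs < earliestAllowedMs {
		return earliestAllowedMs
	}
	return requestMintMs
}

func main() {
	now := time.Now()
	// A query asking for the last 30 days gets clamped to the last 3 days.
	mint := now.AddDate(0, 0, -30).UnixNano() / int64(time.Millisecond)
	fmt.Println(clampMinTime(mint, 3*24*time.Hour, now))
}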

Thanos, Prometheus and Golang version used

Thanos 0.4.0 with Golang 1.12.1.

What happened

The Prometheus remote read API returns 400 errors, while the exact same query runs on the native Prometheus VM without issue and quite quickly.

What you expected to happen

Queries to return correctly.

How to reproduce it (as minimally and precisely as possible):

Use a query that will search through most or all of the TSDB blocks on a Prometheus VM, such as count(up) by (job), and execute it through Thanos Query. If the TSDB blocks on the Prometheus VM are large enough, combined with enough retention, Thanos Query will start producing this error as the time range is increased. Sometimes it's 5 or 6 days, sometimes it's 1 to 4 weeks for me.
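
For example, something like the following against Thanos Query's Prometheus-compatible HTTP API should trigger it once the range is wide enough (host, port, and dates are placeholders; 10902 is the component's default HTTP port):

curl -G 'http://thanos-query.example.com:10902/api/v1/query_range' \
  --data-urlencode 'query=count(up) by (job)' \
  --data-urlencode 'start=2019-05-01T00:00:00Z' \
  --data-urlencode 'end=2019-05-29T00:00:00Z' \
  --data-urlencode 'step=300'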

Anything else we need to know

Asking before I start coding. Adding a CLI option for this looks pretty simple.

@brancz
Member

brancz commented Jun 5, 2019

Probably worth getting @bwplotka's opinion, but I think this is generally reasonable.

@FUSAKLA
Member

FUSAKLA commented Jun 29, 2019

IIRC I've seen this requirement mentioned elsewhere with the same context: migrating to Thanos while still holding the data on the Prometheus instances.
So this might be worth adding?

@bwplotka
Member

bwplotka commented Oct 3, 2019

Sorry for the late answer.

  1. With streaming remote read (Thanos 0.7.0+ and Prometheus 2.13+), this issue should now be largely mitigated.
  2. Time slicing the sidecar's requests might still make sense, but we need to be explicit that there are minor differences vs. the store gateway. I think a storeapi.min-time limit as mentioned by @povilasv (just min-time, without a max time) might be ok; a rough invocation is sketched below. There was already an attempt in sidecar: time limit requests to Prometheus remote read api #1267. Happy to merge it once the comments are addressed.
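
For illustration, with such a limit the sidecar invocation could look roughly like this (the exact flag name and the relative-duration syntax are assumptions based on this discussion and the linked PR, not a confirmed interface):

thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/var/prometheus \
  --min-time=-3d   # assumed flag: serve only the last 3 days of local data via the Store API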
