sidecar: exceeded sample limit #1191
Probably worth getting @bwplotka's opinion, but I think this is generally reasonable.
This was referenced Jun 19, 2019
I saw this requirement mentioned even somewhere else IIRC with the same context: migrating to Thanos but still holding the data on Prometheus instances.
Sorry for the late answer.
When running queries via the Query component, we often get partial results and errors back from various sidecars that look like the one below. The source is definitely a Prometheus/Sidecar IP and not a Store component. See graphic.
We broke out tcpdump to figure out the exact problem. The Prometheus remote read API is returning an HTTP 400 status code to the Sidecar, which propagates the error to Thanos. The actual error message is below:
As we are trying to build a transition path from direct Prometheus queries to querying data via Thanos, we have not yet lowered the retention of data on our Prometheus VMs, so they still have 30 days of data (60 in some cases). It appears that some of our Prometheus instances are large enough that remote read queries return a large amount of data and at times exceed Prometheus's sample limit. It also seems to vastly slow down query performance.
The preferred thing to do here, obviously, is to run Prometheus VMs with limited retention, as the data will exist in GCS/Store. We did this in our devel environment: the problems went away, query performance increased, and using downsampled results seemed to work much better. However, I find myself in an interesting position. I don't feel comfortable calling my Thanos Query service "production" yet, and I do not feel comfortable lowering retention on teams' Prometheus VMs without a production Thanos Query service running. It also feels like there may be other reasons folks may wish to keep more than a day or three of retention on their Prometheus VMs.
I'm wondering if this can be solved by adding a command-line option to the Sidecar that takes a duration like `3d`. While running, the Sidecar would advertise and use the more recent of the query's `mint` or `now-3d` as the `mint` argument for Prometheus's remote read API. This would effectively limit queries from the Sidecar to the Prometheus DB to the last 3 days; the remaining long-term data would be provided by GCS. This gives the best of both worlds: it limits queries against Prometheus while allowing us to keep retention until we have teams moved over to the Thanos endpoints.

Thanos, Prometheus and Golang version used
Thanos 0.4.0 with Golang 1.12.1.
What happened
Prometheus's remote read API returns 400 errors when the exact same query runs on the native Prometheus VM without issue, and quite quickly.
What you expected to happen
Queries to return correctly.
How to reproduce it (as minimally and precisely as possible):
Use a query that will search through most/all of the TSDB blocks on a Prometheus VM, like `count (up) by (job)`, and execute it through Thanos Query. If the TSDB blocks on the Prometheus VM are large enough, combined with enough retention, Thanos Query will start producing this error as the time range is increased. Sometimes it's 5 or 6 days, sometimes it's 1 to 4 weeks for me.

Anything else we need to know
Asking questions before I start coding. Adding a CLI option here for this looks pretty simple.