
promxy remote_read api endpoint hangs forever #350

Closed · frebib opened this issue Oct 6, 2020 · 3 comments · Fixed by #351

frebib (Contributor) commented Oct 6, 2020

I discovered this when trying to attach thanos-sidecar to promxy, to leverage the remote_read API to serve the most recent 2 hours of metrics for a Thanos query.

It seems the hang starts here: https://github.com/prometheus/prometheus/blob/8dbfa14607a5cb138df43638f06c1a938c73df84/web/api/v1/api.go#L1218, where the remoteReadGate blocks because no slots are available. The gate is configured with remoteReadConcurrencyLimit slots, as provided by the web.Options initialised in promxy's main.go. That field is never set, so it defaults to 0 (the zero value for an unset Go int, which makes sense). When the handler tries to obtain a slot from the gate, it hangs until the context expires, because the send into the gate's channel can never complete (there is no buffer, and no readers): https://github.com/prometheus/prometheus/blob/8dbfa14607a5cb138df43638f06c1a938c73df84/pkg/gate/gate.go#L33.
The puzzling part to me is the contradictory documentation from prometheus --help:

--storage.remote.read-concurrent-limit=10
                           Maximum number of concurrent remote read calls. 0 means no limit.

This implies that 0 would never block, but that can't be true: a send on a channel with a buffer of 0 blocks until another goroutine receives, which never happens here. Either way, setting this value >0 allows remote_read to work again, as sketched below.
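
To illustrate, here is a minimal self-contained sketch modelled on prometheus/pkg/gate (the type and function names are my own, not the real ones): with a limit of 0 the channel is unbuffered, and since a reader only ever appears after a successful send, start can never return until the context expires.

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // gate limits concurrency with a buffered channel: each in-flight
    // call holds one slot in ch until it calls done.
    type gate struct{ ch chan struct{} }

    func newGate(limit int) *gate { return &gate{ch: make(chan struct{}, limit)} }

    // start blocks until a slot is free or the context expires. With
    // limit == 0 the channel is unbuffered and nothing ever receives,
    // so the send can never succeed.
    func (g *gate) start(ctx context.Context) error {
        select {
        case g.ch <- struct{}{}:
            return nil
        case <-ctx.Done():
            return ctx.Err()
        }
    }

    // done releases a slot once a request finishes.
    func (g *gate) done() { <-g.ch }

    func main() {
        g := newGate(0) // the unset remoteReadConcurrencyLimit
        ctx, cancel := context.WithTimeout(context.Background(), time.Second)
        defer cancel()
        fmt.Println(g.start(ctx)) // prints "context deadline exceeded"
    }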

Have you experienced this? Has remote_read ever worked through promxy? Did something change on the Prometheus side?

frebib added a commit to frebib/promxy that referenced this issue Oct 6, 2020
Previously this value was unset, defaulting to zero. A zero value
initialised a channel with no buffer, causing all writes to the channel
to hang, and therefore all remote_read queries to hang until they
timed out. This contradicts the Prometheus documentation, which
claims that a zero value allows unlimited concurrent reads.

Fixes: jacksontj#350
Signed-off-by: Joe Groocock <jgroocock@cloudflare.com>
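
The shape of the fix, as a hedged sketch (I have not reproduced the PR's exact diff, but web.Options.RemoteReadConcurrencyLimit is the field the issue identifies as unset, and 10 mirrors the default of --storage.remote.read-concurrent-limit):

    package main

    import "github.com/prometheus/prometheus/web"

    func main() {
        opts := &web.Options{
            // Previously left at the zero value, which created an
            // unbuffered gate channel; set it explicitly instead.
            RemoteReadConcurrencyLimit: 10,
            // ...promxy's other existing web options unchanged...
        }
        _ = opts
    }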
jacksontj (Owner) commented:
> Have you experienced this? Has remote_read ever worked through promxy? Did something change on the Prometheus side?

remote_read has worked in the past, but this read concurrency limit is newer than promxy's remote_read support -- so it likely stopped working when the limit was introduced? I am a bit confused, as we have remote_read tests in CI which seem to be passing (https://github.com/jacksontj/promxy/blob/master/test/promql_test.go#L44).

Either way, your linked PR seems reasonable; I'll get that merged in!

jacksontj pushed a commit that referenced this issue Oct 6, 2020
(same commit message as above; Fixes: #350)
frebib (Contributor, Author) commented Oct 6, 2020

> we have remote_read tests in CI which seem to be passing

That test enables remote_read from promxy -> the backend Prometheus, doesn't it?
The problem I'm seeing here is a client calling /api/v1/read -> promxy -> prometheus, where promxy hangs.

jacksontj (Owner) commented Oct 8, 2020

OIC; with that being the case, we need a test case added for remote_read from promxy (which is a supported use-case; it's just a lot less efficient than a promql query :) -- which is likely why no one has noticed!).

To expand a bit on that: the reason it's less efficient is that the remote_read interface requires sending all the datapoints for a given timeseries, not the result. As an example, if you wanted sum(foo), remote_read would ask for all the data of foo and then sum it remotely; if there were 100 series in foo, we'd have to serialize and deserialize 100x the points. With the regular promql interface, we can simply be asked for the sum and return the sum (a single point).
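
A self-contained sketch of that cost difference (the types and numbers here are illustrative only, not promxy's actual wire format):

    package main

    import "fmt"

    // sample stands in for one datapoint of one timeseries.
    type sample struct {
        series string
        value  float64
    }

    func main() {
        // remote_read path: all 100 series of foo cross the wire...
        var raw []sample
        for i := 0; i < 100; i++ {
            raw = append(raw, sample{series: fmt.Sprintf("foo{instance=\"%d\"}", i), value: 1})
        }
        // ...and the client computes sum(foo) locally from the raw points.
        var sum float64
        for _, s := range raw {
            sum += s.value
        }
        fmt.Printf("remote_read: %d samples transferred, sum=%v\n", len(raw), sum)

        // promql path: promxy evaluates sum(foo); one sample is transferred.
        fmt.Println("promql: 1 sample transferred, sum=100")
    }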

Created #356 to add that test.
