RPC retry logic in nomad client's rpc.go is incorrect #9265
Labels: stage/accepted (confirmed, and intend to work on; no timeline commitment though), theme/client, type/bug
Background
Nomad exposes HTTP APIs on the servers, and also on the clients. However, to get most answers, the request must eventually be served by a server, usually the leader. So, if one makes an HTTP request to a client, they can expect the request to be forwarded to a server over RPC.
A Nomad API client may connect to one of the Nomad Agent Clients to make its request. If a server becomes unavailable during the processing of the request, the Nomad Agent Client contains retry logic to attempt to hide the availability problem from the API client.
At Cloudflare, we run Nomad on many thousands of machines and rely heavily on blocking queries which involve persistent connections, and therefore a high likelihood a node will restart during an outstanding TCP connection.
In doing so we've found several flaws with how the Nomad Agent Client attempts to retry during server availability issues. We have fixed them, and have been running them on our nodes for a few weeks with success.
We would like to contribute the fixes back, but we have not gotten much traction with some of the Consul library changes and need help.
Credit to Pierre Cauchois for running most of this down.
Current code
Most of that magic is in `client/rpc.go`.
A snippet with some markup
Problems
Why didn't the Nomad client retry for us, and why are there two different types of errors?
Almost certainly because `canRetry` returned false in the above code, so let's take a quick look at what it looked like before we got our hands in there. For simplicity, let's state these facts:
IsEOF and Wrapped Errors
The first problem we found was that `lib.IsErrEOF` returned false even though the error was in fact an EOF error. There are two reasons for this in Nomad, plus a bonus problem in Consul.
We have not had any luck getting the fixes merged into Consul, but they are very straightforward.
err == io.EOF is wrong - Fix Not Merged
`err == io.EOF` is not enough to solve the problem here, because Nomad returns the following wrapped error (from PR 8632 to Nomad):
In this case, `err == io.EOF` is false, but `errors.Is(err, io.EOF)` is true. The fix is to change this conditional.

The error is a non-wrapped string in Consul - Fix Not Merged
Related, it seems that Consul has very similar code to Nomad, but without the PR that wrapped the error. In Consul, then, neither `==` nor `errors.Is` will work. The fix is to port the Nomad fix over.

Even with wrapping, some errors are strings - Fix Not Merged
`errors.Is` solves `Unexpected response code: 500 (rpc error: EOF)` - this is the error we get when the server that the client is connected to disconnects - a single EOF. However, in the case where we are connected to a server that forwards the request to ANOTHER server, what happens if the OTHER server disconnects? Instead of the client getting an EOF, the FIRST server gets an EOF and wraps it, sending it back to the client as a STRING.
At this point all wrapping has been lost, eaten by the type-destroying mess that is gRPC error handling. The fix is
Long-poll requests and timeouts - Nomad Fix Merged - Consul Fix Not Merged
The next problem we had was with blocking requests. This is a feature of both the Nomad and Consul APIs where an HTTP request can be made with an `?index` query parameter to specify that the client is only interested in updates after a particular raft index. In the case that the state is currently prior to the provided raft index, the HTTP request will block until the first of: the state advancing past the provided index, or the maximum query time (`MaxQueryTime`) elapsing.
The retry logic does not appear to have accounted for these types of requests at all. Let's take a look at the code:
This indicates that it should not retry if more than `c.config.RPCHoldTimeout` (5s) has elapsed. This is a provision to keep from retrying forever, but instead only as long as it might reasonably take for a new leader to be elected, so that a new, healthy connection chain to a leader can be established.

Unfortunately, in the case of a long-poll request, this condition is almost always false. If we have made a request with a maximum query time of 5 minutes (the default) then, unless our index was already up-to-date when we made it, we will almost certainly have made the request more than 5 seconds ago, and so a retry would not occur.
The fix for this is to take into account the `MaxQueryTime` for long-poll requests and allow them to be retried for at least that long.
Long Poll Bug - Nomad fix not merged - Consul fix is the same, waiting...
The fix which we merged into Nomad for the above issue is a good one - without it retries cannot possibly work, but it does have a flaw.
Consider:

1. A blocking request is made with a `MaxQueryTime` of 5 minutes.
2. At the 4-minute mark, the connection fails with a retryable error.
3. The client retries the request.

What you would expect is that the second request, #3, has a timeout of 1 minute. Unfortunately, that is not what happens. The request is made again from the top - with a timeout of 5 minutes just like the first - and so the entire request could take as much as 9 minutes, even though the client asked for 5 minutes.
What is even worse about this is that if another retryable error occurs at the 7 minute mark, the code will not retry because it will realize it has been 7 minutes with a timeout of 5 and so the client will see an EOF. What it should have seen was a timeout at the 5 minute mark instead.
The correct fix for this is clear: step 3 should make a request to the server with a timeout of 1 minute (the original timeout, minus the time already elapsed). However, how to implement that is not so straightforward.
Because the RPC helpers bury the request time in interfaces and re-infer defaults on the server, there is no easy way for the RPC helper to change the request time to reflect the elapsed time.
One way to do that is with reflect. This is an inelegant solution that fixes the bug with a hammer. To fix it correctly, the client needs a way to tell the server the remaining timeout in a way that is not invisible and type-lost to the RPC function; however, the complexity and risk of such a change does not seem appropriate.