Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: UI dead if internal net.Pipe closes #42828

Closed
ajwerner opened this issue Nov 27, 2019 · 3 comments · Fixed by #48369
Closed

server: UI dead if internal net.Pipe closes #42828

ajwerner opened this issue Nov 27, 2019 · 3 comments · Fixed by #48369
Labels
A-webui-general Issues on the DB Console that span multiple areas or don't have another clear category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@ajwerner
Copy link
Contributor

Describe the problem

The cockroach server creates a singleton net.Pipe through which it connects the status server to the local gRPC server.

cockroach/pkg/server/server.go

Lines 1264 to 1265 in f97dc13

// Setup HTTP<->gRPC handlers.
c1, c2 := net.Pipe()

It wraps one side of this pipe in a singleListener on which it serves the gRPC server:

type singleListener struct {

This listener no-ops on close and always returns the same side of the pipe. This is problematic if the underlying Conn is closed. I can't speak exactly to why the Conn is closed, but it seems to line up with a period of unavailability of the remote side. My best guess is that there was a client timeout somewhere.

The error on authentication requests looks like:

W191122 05:33:26.328330 56 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {ip-172-31-10-214:26257 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: io: read/write on closed pipe". Reconnecting...

The observable behavior is that login requests display System Unavailable. If you perform the login request with Accept: application/json you get the following response:

{
  "error": "all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: authentication handshake failed: io: read/write on closed pipe\"",
  "message": "all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: authentication handshake failed: io: read/write on closed pipe\"",
  "code": 14,
  "details": [
  ]
}

To Reproduce

Not sure how exactly to repro. It seems easy enough to contrive the situation. Coming up with a realistic repro is probably worthwhile. The situation occurred during a rolling restart and upgrade of a four node cluster. At some point during the upgrade a large number of ranges were transiently unavailable as they submitted crash reports (specifically this one: #42802).

The upgrade was from 19.1.5 (maybe 19.1.4) to 19.2.1.

Expected behavior

The expected behavior is that the admin UI remain available.

Possible solution

We should detect the close on the net.Pipe and create a new one. This can all happen underneath the listener. The dialer we inject into the client should then look up the most up-to-date conn from the listener. Probably we should rate limit all of this so we don't get caught in a tight loop.

@vilterp vilterp added the A-webui-general Issues on the DB Console that span multiple areas or don't have another clear category. label Nov 27, 2019
@knz knz added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label May 4, 2020
@knz
Copy link
Contributor

knz commented May 4, 2020

Found live in 19.2.6 still. Assuming not fixed in 20.1 as well.

@bdarnell
Copy link
Contributor

bdarnell commented May 4, 2020

We should detect the close on the net.Pipe and create a new one.

That doesn't seem quite right - we may have handed out the same pipe endpoint to multiple callers and this would still allow the ones that didn't close the connection to get broken. Instead we should either create a new pipe pair for each "connection" (to more closely mimic real connections) or wrap the pipe endpoint in an object that turns Close into a no-op.

@knz
Copy link
Contributor

knz commented May 4, 2020

we should either create a new pipe pair for each "connection" (to more closely mimic real connections) or wrap the pipe endpoint in an object that turns Close into a no-op.

Agreed. I tracked there are multiple cases where the conn get closed in the current implementation (in particular if the conn handshake ever lasts more than the RPC conn timeout, there's a SetDeadline at the top of handleRawConn).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-webui-general Issues on the DB Console that span multiple areas or don't have another clear category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants