Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[24.0 backport] ssh: fix error on commandconn close, add ping and default timeout #4395

Merged

Conversation

thaJeztah
Copy link
Member

backports:


- Related

- TLDR

During CLI initialization we ping the configured host. Initially, we set no context timeout for these connections, and after b397391 we set timeouts but only for TCP connections (which then got changed in #3722 to only specifically exclude SSH connections and not sockets).

This looks to be due to errory looking logs from cli/connhelper/commandconn/commandconn.go on Close() – which gets called by http.Transport when the request context times out.
However, these logs are being thrown erroneously – commandconn.kill() states that it will return nil if the command terminates regardless of it's exit status:

// kill returns nil if the command terminated, regardless to the exit status.

But the implementation does not hold up to this contract:
if wExitErr.ProcessState.Exited() {

since ProcessState.Exited() returns true only if the program exited due to calling exit and not if the program was terminated by a signal (this probably does works on Windows since here ProcessState.Exited() returns true if the program has exited either way, but not on POSIXy OSs).

If commandconn.kill() is fixed to always return nil when the command has terminated, setting timeouts on contexts passed to SSH connections works fine and we no longer need special handling.

Other than that, I also added -o ConnectTimeout=30 to the created SSH command, which serves to align with the timeout from the client returned for non-SSH connections in

Timeout: 30 * time.Second,
since it will be used only for "dialing" i.e., ssh command is unable to establish a connection within 30s it will error out, but if it can establish a connection it will keep running indefinitely, so commands like docker stats or an attach won't time out after 30s.

During this I also found a racy call to cmd.Wait() between Close() and onEOF(), since calling Close will close the command's stdin/out pipes, which will throw an EOF to pending reads, triggering a call to onEOF(), but afterwards Close will also call kill() which will also call cmd.Wait()

- What I did

  • cli/connhelper/commandconn/commandconn.go:
    • kill(): Fix implementation so we don't return an error when process is successfully closed
    • Write()/Read(): Prevent race by checking if connection is being closed
  • cli/command/cli.go:
    • initializeFromClient(): remove SSH special handling so we also add timeout for SSH connections
  • cli/context/docker/load.go:
    • ClientOpts(): add 30s timeout to align with non-SSH client

- How to verify it

  • DOCKER_HOST=ssh://example.com docker: no longer hangs forever, now it returns after the default 2s CLI init timeout
  • DOCKER_HOST=ssh://example.com docker version: no longer hangs forever, now it will timeout after 30s (can be more if the host resolves to more than 1 IP address)

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

image

laurazard and others added 2 commits June 30, 2023 19:23
---
commandconn: fix race on `Close()`

During normal operation, if a `Read()` or `Write()` call results
in an EOF, we call `onEOF()` to handle the terminating command,
and store it's exit value.

However, if a Read/Write call was blocked while `Close()` is called
the in/out pipes are immediately closed which causes an EOF to be
returned. Here, we shouldn't call `onEOF()`, since the reason why
we got an EOF is because we're already terminating the connection.
This also prevents a race between two calls to the commands `Wait()`,
in the `Close()` call and `onEOF()`

---
Add CLI init timeout to SSH connections

---
connhelper: add 30s ssh default dialer timeout

(same as non-ssh dialer)

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
(cherry picked from commit a5ebe22)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Changes the `Read` and `Write` error handling
logic to return the original error while closing
the connection. We still skip calling `handleEOF`
if already closing the connection.

Fixes the flaky `TestCloseWhileWriting` and
`TestCloseWhileReading` tests.

Co-authored-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Laura Brehm <laurabrehm@hey.com>
(cherry picked from commit d5f564a)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@codecov-commenter
Copy link

Codecov Report

Merging #4395 (2d5f041) into 24.0 (fad718c) will increase coverage by 0.21%.
The diff coverage is 84.05%.

Additional details and impacted files
@@            Coverage Diff             @@
##             24.0    #4395      +/-   ##
==========================================
+ Coverage   59.07%   59.29%   +0.21%     
==========================================
  Files         287      288       +1     
  Lines       24789    24804      +15     
==========================================
+ Hits        14644    14707      +63     
+ Misses       9263     9211      -52     
- Partials      882      886       +4     

@neersighted neersighted merged commit 3713ee1 into docker:24.0 Jun 30, 2023
74 checks passed
@thaJeztah thaJeztah deleted the 24.0_backport_fix-connhelper-docker-example branch June 30, 2023 18:26
@thaJeztah
Copy link
Member Author

/cc @crazy-max @tonistiigi in case you needed the updated dependency with this fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants