Fix ssh process killed when context is done #3900

Merged (1 commit) on Dec 8, 2022

Conversation

@pdaig (Contributor) commented Dec 5, 2022

- What I did
This solves the issue described in docker/compose#9448

I've also reproduced the issue with the docker compose down command (basically, all commands that are run concurrently over ssh might be affected).

- How I did it
Please see my comment on issue docker/compose#9448 for details on what causes the issue.

When errgroup.Wait() is called, it cancels the Context regardless of whether there was an error or not.
This in turn kills the ssh process and causes an error when go's http.Client tries to reuse this net.Conn (commandConn).
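A tiny standalone demo of that failure mode (generic Go, not the commandconn code itself; it assumes a Unix-like host with sleep on PATH): a child process whose lifetime is tied to a context is killed with SIGKILL as soon as that context is done, which is the same "signal: killed" reported in docker/compose#9448.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// A child process whose lifetime is tied to ctx...
	child := exec.CommandContext(ctx, "sleep", "10")
	if err := child.Start(); err != nil {
		panic(err)
	}

	// ...is killed the moment ctx is done, e.g. when errgroup.Wait() cancels
	// its derived context even though every goroutine succeeded.
	cancel()
	fmt.Println(child.Wait()) // prints "signal: killed"
}
```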

Not passing down the Context might seem counter-intuitive, but in this case, the lifetime of the process should be managed by the http.Client, not the caller's Context.

If the caller cancels the Context due to an error, (1) the net.Conn has no way to communicate that back to the http.Client (ie: mark itself as "errored") and (2) even in case of an error, the established ssh connection can still be reused for other requests, so there's really no need to shut it down.

The http.Client will automatically Close() the connection when it no longer needs it. The number of concurrent and idle connections it can keep can be configured in http.Client's options.

Sidenotes:
[1] Currently it's possible that the program exits before the http.Client has a chance to close all idle connections, because the docker Client object itself doesn't seem to be closed (see #3899). A clean shutdown would be preferable, but I've not seen any ssh process left behind even when that cleanup isn't called (go and/or the OS probably kills them for us on exit).

[2] I've experimented a little bit with the MaxConnsPerHost option of http.Transport (can be added here). I only have anecdotal evidence, I've not run any meaningful benchmark so your experience might be different, but I've found that it's sometimes noticeably slower to start many ssh connections in parallel instead of just funneling all requests sequentially through one connection. Probably due to the added overhead of starting many ssh processes and each process having to establish its own connection from scratch. However, for long-running tasks like pulling many large images, parallel may be faster (again depending on a variety of factors like the number of images, their sizes, network conditions, whether the image is already downloaded, etc.). In short, I see an optimization opportunity here for docker over ssh. Requests that are expected to be fast, like getting metadata on an image/container, could be run sequentially while long-running requests could run concurrently. Although it's greatly out of scope for this PR, I think it might be an interesting performance improvement. I would have liked to volunteer to work on this but at least for the foreseeable future, I unfortunately can't give it the proper attention it deserves to get it done. If someone else wants to get started on this, please go ahead (and let me know :)).

- How to verify it

  1. Check out and build https://github.com/pdaig/docker-compose/tree/test-fix-ssh-killed
  2. cd test-pull
  3. DOCKER_HOST=ssh://__USER__@__HOST__ ../bin/build/docker-compose pull

The pull should succeed. If you revert the fix (last commit on the branch), there's a chance the pull will fail. If you're (un)lucky, you might have to try it a few times to reproduce the issue.

Without the fix, I'm able to reproduce the issue about 9/10 times. Others have also confirmed the problem on their end (see docker/compose#9448).

With the fix, I've never had the problem again.

- Description for the changelog
Fix intermittent ssh error when docker compose runs some operations concurrently (like pulling multiple images)

- A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Patrick Daigle <114765035+pdaig@users.noreply.github.com>
@thaJeztah (Member) commented
I'm somewhat concerned that with this we would never take the context into account (?); would this mean that cancelling the context would keep the process running on the daemon side?

Wondering if something like moby/moby#44520 would work for these. I know there was also another PR #2132 to improve the SSH handling, but we had to (temporarily) revert (#2303), and probably should be revisited.

/cc @corhere @vvoland (I know you both dug into "context fun")

@pdaig (Contributor, Author) commented Dec 5, 2022

@thaJeztah You're right about this not taking the context into account. Initially, I had the same concern until I realized that due to how go's http.Client works, the connection should actually keep running. I know it's a bit counter-intuitive so let me try to clarify. First, it's the ssh client that'll keep running. By extension, the daemon process (dial-stdio) will also run through the ssh connection but only for the duration of the connection (which will end when docker cli exits).

My understanding is that go's http.Client is built to reuse existing net.Conn (in this case a commandConn). If the ssh process started by commandConn gets killed due to the caller's context getting cancelled/done, it can't reuse the connection and it causes the [...] has exited with signal: killed error described in docker/compose#9448.

You're probably wondering why we would want the http.Client and the associated connections to keep working when the caller cancelled the context. Well that's because in some cases, the context gets 'cancelled' or 'done' when there's actually no error. For example this function that gets called during a pull. After all goroutines complete successfully, errgroup.Wait() [1] will still cancel the context because it's 'done' but this has the effect of killing the ssh process that was still running. The net.Conn doesn't get closed when the http request completes.
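For reference, here's a tiny self-contained demo of that behaviour (nothing docker-specific, just golang.org/x/sync/errgroup):

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

func main() {
	eg, ctx := errgroup.WithContext(context.Background())
	eg.Go(func() error { return nil }) // succeeds; no error anywhere

	err := eg.Wait() // returns nil, but Wait still cancels the derived context
	fmt.Println("group error:", err)

	<-ctx.Done()                         // does not block: already cancelled
	fmt.Println("ctx.Err():", ctx.Err()) // context.Canceled
}
```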

Even when there's an actual error, it could be non-fatal (error handled by the caller). For example, getting an error 404 when checking if an image exists (this is just an example for illustration purposes; I'm not sure it's entirely accurate, but I remember coming across something similar to this).
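As a purely hypothetical illustration (not the exact compose code path), this is the shape of a "handled" error that shouldn't poison the pooled connection:

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	apiClient, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}
	defer apiClient.Close()

	// A 404 here is an expected, handled outcome, not a fatal error; the
	// connection that carried the request should stay reusable for the pull.
	_, _, err = apiClient.ImageInspectWithRaw(ctx, "example.com/app:latest")
	if client.IsErrNotFound(err) {
		fmt.Println("image not found locally; pulling instead")
		rc, err := apiClient.ImagePull(ctx, "example.com/app:latest", types.ImagePullOptions{})
		if err != nil {
			panic(err)
		}
		defer rc.Close()
	}
}
```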

Regarding moby/moby#44520, I'm not sure I understand everything going on, but if I understand correctly, the idea would be to batch a group of requests into one (for instance, pull a list of images at once instead of multiple individual pulls). I think this could also work, but it would need to be done on a case-by-case basis and would therefore be harder to maintain. The main issue is the pull, but I've also reproduced the same issue with the down command. In theory, it could also affect other commands.

Multiplexing the ssh connection (#2132) would help improve the speed, but I don't think it'd help with the current issue because even if the connection is multiplexed, it still spawns an ssh process on the client which will also get killed when the context is cancelled. However, this would be a better solution to the point I raised about the possible speed optimization.

Footnotes

  1. See https://pkg.go.dev/golang.org/x/sync/errgroup#WithContext which states:
    "The derived Context is canceled the first time a function passed to Go returns a non-nil error or the first time Wait returns, whichever occurs first." (emphasis mine). This description also fits what I experienced.

@thaJeztah (Member) commented
Thanks for the extra details. I want to dig into this a bit further, as these things tend to get complicated fast. My thinking is that:

  • if the main context is cancelled, everything should be cancelled (including these SSH connections); otherwise there would be (I think?) the potential for runaway ssh processes?
  • but (IIUC) the issue here is that in the compose code, the errgroup cancels the (main) context when it's done, or when an error occurred.

So I wonder if in that case the errgroup should actually be using its own (sub)context, for example using context.WithCancel.

WithCancel returns a copy of parent with a new Done channel. The returned context's Done channel is closed when the returned cancel function is called or when the parent context's Done channel is closed, whichever happens first.
Canceling this context releases resources associated with it, so code should call cancel as soon as the operations running in this Context complete.

We would have to verify if that works, but the things I'm interested in from the above are:

  • It's a new context which gets cancelled when the errgroup has an error, or when the errgroup finishes
  • but the main (parent) context (if I'm correct) won't be cancelled in that case
  • but the reverse is still true; so if the main (parent) context is cancelled, the new context is also cancelled; this means that there's still a way to "cancel everything" if we want to terminate everything.
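A minimal sketch of what that could look like on the compose side (pullOne is a hypothetical stand-in for the real per-image pull; as discussed further down, this route wasn't pursued in the end):

```go
package compose

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// pullAll runs the pulls under a context derived from the caller's ctx:
// cancelling the group (on error, or when Wait returns) does not cancel the
// parent ctx, while cancelling the parent still cancels the group.
func pullAll(ctx context.Context, images []string, pullOne func(context.Context, string) error) error {
	groupCtx, cancel := context.WithCancel(ctx)
	defer cancel()

	eg, groupCtx := errgroup.WithContext(groupCtx)
	for _, img := range images {
		img := img // capture the loop variable (pre Go 1.22)
		eg.Go(func() error { return pullOne(groupCtx, img) })
	}
	return eg.Wait()
}
```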

Another thought is about the "Even when there's an actual error, it could be non-fatal (error handled by the caller)"

The errgroup WithContext looks to be designed to consider everything in the group as a single unit of work; and because of that (by design) cancels all that work if anything in the group errors? If that's not the desired behaviour ("continue doing the other work if anything fails"); I wonder if compose should use WithContext for that errgroup, or if it should use one without; https://pkg.go.dev/golang.org/x/sync/errgroup#example-Group-JustErrors
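For comparison, the "JustErrors" shape is simply a group without a shared context, so one failed pull can't cancel its siblings (hypothetical sketch, reusing the pullOne stand-in from above):

```go
package compose

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// pullAllIndependently uses no derived context: a failing pull does not
// cancel the others, and Wait still reports the first error encountered.
func pullAllIndependently(ctx context.Context, images []string, pullOne func(context.Context, string) error) error {
	var eg errgroup.Group
	for _, img := range images {
		img := img // capture the loop variable (pre Go 1.22)
		eg.Go(func() error { return pullOne(ctx, img) })
	}
	return eg.Wait()
}
```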

(we can add code to not cancel in error-cases that are "ok", but I guess that's something in general for the error-group)

@corhere (Contributor) commented Dec 6, 2022

https://pkg.go.dev/net#Dialer.DialContext

If the context expires before the connection is complete, an error is returned. Once successfully connected, any expiration of the context will not affect the connection.

(emphasis mine.)

I think that the change in this PR is the correct solution. While not explicitly stated in the http.Transport godoc, any function assigned to an (http.Transport).DialContext is expected to have the same semantics as the (*net.Dialer).DialContext implementation. Tracing through the net/http source, the request context is used when dialing a new connection if there is no existing available connection to reuse. Connection pooling/reuse therefore only works if the context used when dialing a conn does not affect an established conn. It would be nice to have connhelper.New() respect context cancelation before the function returns, but I think that could wait until Go 1.20 is released with os/exec improvements which would make it much more straightforward to implement.
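A generic sketch of that contract (rawDial is a hypothetical blocking dial function, not the commandconn implementation): the context can abort an in-flight dial, but an established connection is unaffected by later cancellation.

```go
package sketch

import (
	"context"
	"net"
)

func dialWithContext(ctx context.Context, rawDial func() (net.Conn, error)) (net.Conn, error) {
	type result struct {
		conn net.Conn
		err  error
	}
	ch := make(chan result, 1)
	go func() {
		c, err := rawDial()
		ch <- result{c, err}
	}()
	select {
	case <-ctx.Done():
		// Cancelled before the dial finished: abandon the attempt and make
		// sure a late-arriving connection gets closed.
		go func() {
			if r := <-ch; r.conn != nil {
				r.conn.Close()
			}
		}()
		return nil, ctx.Err()
	case r := <-ch:
		// Established (or failed) before cancellation: from here on, only
		// Close(), called by the pool when it's done with it, ends its life.
		return r.conn, r.err
	}
}
```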

As far as leaking SSH processes goes, I don't think that would be a problem. The implementation appears to go to great lengths to ensure the SSH process is gone after Close() is called. And the parent process crashing or neglecting to explicitly clean up before exiting likely won't leak an SSH process either: it's probably safe to assume that the SSH client will handle EOF reading from STDIN—or receiving a SIGPIPE signal when writing to STDOUT—by gracefully exiting.

@pdaig (Contributor, Author) commented Dec 7, 2022

@thaJeztah

As @corhere mentioned, it's also my understanding that the http.Client uses a connection pool. That's really the key to the change I'm proposing. If you think of it as such, hopefully it'll make more sense.

When a commandConn is created, it's put into the connection pool. If the caller cancels the context, the connection seems to remain in the pool and that's what's causing the issue. The next request will try to reuse the existing connection that has been killed because the context was cancelled earlier.

However, you're absolutely right that "if the main context is cancelled, everything should be cancelled"
... and that's what'll happen when the http.Client cleans up its connections. It'll call commandConn.Close() and that'll kill the process. If the error/cancellation bubbles up to the main context, at some point it'll trigger the http.Client shutdown/cleanup. If the error/cancellation doesn't reach the "top" level, the program continues, the connections are still usable and don't need to be killed.

We still have some form of control over the connection pool. We can limit the number of concurrent connections, the maximum number of idle connections to keep around, how long to keep them, etc (see http.Transport and load.go). We can also force idle connections to close by calling CloseIdleConnections. That's what's called in docker's Client.Close() and that's why I also opened #3899 (even though the ssh processes are still cleaned up without it).
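To illustrate (the values are arbitrary and the dialer is a hypothetical stand-in; the real wiring lives in the load.go mentioned above):

```go
package sketch

import (
	"context"
	"net"
	"net/http"
	"time"
)

// newPooledClient shows the knobs that bound the connection pool (and thus
// the number of ssh processes) behind the docker API client.
func newPooledClient(sshDial func(ctx context.Context, network, addr string) (net.Conn, error)) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext:         sshDial,          // hypothetical ssh dialer (e.g. built on commandconn)
			MaxConnsPerHost:     4,                // cap concurrent connections (and ssh processes) per host
			MaxIdleConnsPerHost: 2,                // idle connections kept around for reuse
			IdleConnTimeout:     30 * time.Second, // how long idle connections are kept before closing
		},
	}
}

// closeAll is what an explicit cleanup (cf. docker's Client.Close() and #3899)
// boils down to: closing idle pooled connections also ends their ssh processes.
func closeAll(c *http.Client) {
	c.CloseIdleConnections()
}
```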

I've never witnessed a ssh process leak due to that. I presume that's because the go runtime or the os does some additional cleanup. It seems like a plausible explanation to what I witnessed but maybe someone else could confirm, just to be sure.

All that being said, I still want to address your suggestion of using context.WithCancel because you might be on to something. When I first started to investigate the problem, I was looking specifically at the compose code I pointed out earlier. If eg.Wait() would NOT cancel the context when everything succeeded, the commandConn could continue running after the goroutines have returned and we wouldn't have the issue. In this case, it still seems desirable to cancel the whole group if one goroutine fails and to tie the group to the parent context. If we can "recover" from an error and continue however, that could still be an issue, but I'm not sure that's the case. The pull is just one case though. There is at least one other command affected, possibly more.

Thank you both for taking the time to look into this issue. I know it's complex and time consuming.

@corhere (Contributor) commented Dec 7, 2022

When I first started to investigate the problem, I was looking specifically at the compose code I pointed out earlier. If eg.Wait() would NOT cancel the context when everything succeeded, the commandConn could continue running after the goroutines have returned and we wouldn't have the issue. In this case, it still seems desirable to cancel the whole group if one goroutine fails and to tie the group to the parent context.

Contexts are for carrying request-scoped values and cancelation signals through a chain of function calls. Deriving a context is semantically beginning a new (sub-)request scope. Cancelation signals that the request has been completed, irrespective of the reason: the go vet linter enforces this. Given that contexts are scoped, it follows that the cancelation of a context associated with one request should not affect the execution of an independent request. As a corollary, resources should be bounded by a context iff their lifetime is scoped to a request. The net.Conn resources in an http.Client's connection pool are not scoped to any particular request, therefore a connection being put into a failure state when a context is canceled is a bug in the net.Conn implementation.

Arranging to have eg.Wait() not cancel its context on success is a non-idiomatic workaround for the actual bug. And it is an incomplete workaround. The context would still be canceled if an error is encountered or the parent context is canceled, potentially breaking (read: killing) one of the pooled connections. In fact, any context passed into any API client method anywhere has the potential to break a pooled connection. The only complete "workaround" would be to never ever cancel any API request!

If I haven't yet convinced you that the context.WithCancel suggestion is not worth pursuing, maybe this will. (In general, a function which takes a context argument and returns a context is probably returning a derived context.)

@pdaig (Contributor, Author) commented Dec 8, 2022

@corhere

The net.Conn resources in an http.Client's connection pool are not scoped to any particular request, therefore a connection being put into a failure state when a context is canceled is a bug in the net.Conn implementation.

So the context that's passed to the dialer function shouldn't be tied to a particular request (the context commandConn.New receives comes from Transport.DialContext). If that's the case, then the issue may be on Go's side. I didn't go digging in go's source code to see what's going on under the hood.

If I haven't yet convinced you that the context.WithCancel suggestion is not worth pursuing

Well to be honest, I didn't have much faith it'd work either, but I initially still wanted to give it a go, just in case I had misunderstood something or maybe I could've come up with a slightly different combination that would've worked. Now I think the issue is clearer and I agree it's probably not worth pursuing.

What do you think would be the next step to fix it? Should we contact the Go team to see what they have to say on this, if it's by design or a bug they could fix?
Do you have other suggestions on docker's side we could try?

@corhere (Contributor) left a comment

@pdaig net.Conn is an interface. The buggy implementation I was referring to is commandconn.New().

So the context that's passed to the dialer function shouldn't be tied to a particular request

No; quite the opposite. The context passed to the dialer function should be tied to a particular request as the process of dialing is itself request-scoped. The connection would be closed if the request is canceled before completion anyway, so there is no point in wasting the resources completing the dial operation either. The root cause of docker/compose#9448 is that cancelation of the context passed into commandconn.New() affects the returned net.Conn. You have identified and fixed that bug.

@pdaig (Contributor, Author) commented Dec 8, 2022

Understood, thanks @corhere !

Sorry, I thought you meant earlier that you wanted the CommandConn to still be tied to the context (implying the dialer should receive a context that's not tied to a request, but to another, longer-living context).

Glad we're now on the same page. Thanks for taking the time to clear up everything :)

@vvoland (Collaborator) left a comment

Thank you @corhere for your insight!

@thaJeztah (Member) commented
Discussed this with @corhere - let's get this change in (thanks all!)

We were discussing the code on the compose side, as (to our understanding) the image pulls in the example linked are more intended to be "pulls to happen in parallel" (i.e., failing to pull one image shouldn't fail pulling the others); we were wondering whether a waitgroup (perhaps with a multi-error or slice of errors) would be the right solution for that.

@thaJeztah (Member) left a comment

LGTM
