Fix ssh process killed when context is done #3900
Conversation
Signed-off-by: Patrick Daigle <114765035+pdaig@users.noreply.github.com>
I'm somewhat concerned that with this we would never take the context into account(?); would this mean that cancelling the context would keep the process running on the daemon side? Wondering if something like moby/moby#44520 would work for these. I know there was also another PR (#2132) to improve the SSH handling, but we had to (temporarily) revert it (#2303), and it probably should be revisited. /cc @corhere @vvoland (I know you both dug into "context fun")
@thaJeztah You're right about this not taking the context into account. Initially, I had the same concern, until I realized that due to how go's `http.Client` manages its connection pool, the connection (and the ssh process behind it) doesn't outlive the client: the `http.Client` closes the connection once it no longer needs it.

You're probably wondering why we would want the connection to outlive the caller's `Context`. Even when there's an actual error, it could be non-fatal (an error handled by the caller). For example, getting a 404 error when checking if an image exists (this is just an example for illustration purposes; I'm not sure it's entirely accurate, but I remember coming across something similar to this).

Regarding moby/moby#44520, I'm not sure I understand everything going on, but if I understand correctly, the idea would be to batch a group of requests into one (for instance, pull a list of images at once instead of multiple individual pulls). I think this could also work, but it would need to be done on a case-by-case basis and would therefore be harder to maintain. The main issue is the pull, but I've also reproduced the same issue with the `docker compose down` command.

Multiplexing the ssh connection (#2132) would help improve the speed, but I don't think it'd help with the current issue, because even if the connection is multiplexed, it still spawns an ssh process on the client which will also get killed when the context is cancelled. However, this would be a better solution to the point I raised about the possible speed optimization.
Thanks for the extra details. I want to dig into this a bit further, as these things tend to get complicated fast.

So I wonder if, in that case, the cancellation should be handled on the compose side instead. We would have to verify if that works, but that's the thing I'm interested in in the above.

Another thought is about the "Even when there's an actual error, it could be non-fatal (error handled by the caller)" part. The errgroup `WithContext` looks to be designed to consider everything in the group as a single unit of work, and because of that it (by design) cancels all that work if anything in the group errors. If that's not the desired behaviour ("continue doing the other work if anything fails"), I wonder if compose should use a plain waitgroup instead (we can add code to not cancel in error cases that are "ok", but I guess that's something in general for the error group).
https://pkg.go.dev/net#Dialer.DialContext
(Emphasis mine.) I think that the change in this PR is the correct solution. While not explicitly stated in the docs, the same contract should apply to any dialer: the context bounds the dial operation, not the lifetime of the resulting connection.

As far as leaking SSH processes goes, I don't think that would be a problem. The implementation appears to go to great lengths to ensure the SSH process is gone after the connection is closed.
As @corhere mentioned, it's also my understanding that the `http.Client` manages the lifetime of its connections: when a connection is no longer needed, the client closes it. However, you're absolutely right that "if the main context is cancelled, everything should be cancelled".

We still have some form of control over the connection pool. We can limit the number of concurrent connections, the maximum number of idle connections to keep around, how long to keep them, etc. (see `http.Transport`'s options).

I've never witnessed an ssh process leak due to that. I presume that's because the go runtime or the OS does some additional cleanup. It seems like a plausible explanation for what I witnessed, but maybe someone else could confirm, just to be sure.

All that being said, I still want to address your suggestion of using a waitgroup on the compose side.

Thank you both for taking the time to look into this issue. I know it's complex and time consuming.
Contexts are for carrying request-scoped values and cancelation signals through a chain of function calls. Deriving a context is semantically beginning a new (sub-)request scope. Cancelation signals that the request has been completed, irrespective of the reason: the result of the work is simply no longer needed.

Arranging to have the connection's lifetime follow a request context would conflate two distinct scopes, the request and the connection. If I haven't yet convinced you that the errgroup's cancel-on-`Wait` behaviour is working as designed, consider that the group is just closing its own request scope.
So the context that's passed to the dialer function shouldn't be tied to a particular request.
Well, to be honest, I didn't have much faith it'd work either, but I initially still wanted to give it a go, just in case I had misunderstood something, or maybe I could've come up with a slightly different combination that would've worked. Now I think the issue is clearer and I agree it's probably not worth pursuing. What do you think would be the next step to fix it? Should we contact the Go team to see what they have to say on this, and whether it's by design or a bug they could fix?
@pdaig `net.Conn` is an interface. The buggy implementation I was referring to is `commandconn.New()`.

> So the context that's passed to the dialer function shouldn't be tied to a particular request

No; quite the opposite. The context passed to the dialer function *should* be tied to a particular request, as the process of dialing is itself request-scoped. The connection would be closed if the request is canceled before completion anyway, so there is no point in wasting the resources completing the dial operation either. The root cause of docker/compose#9448 is that cancelation of the context passed into `commandconn.New()` affects the returned `net.Conn`. You have identified and fixed that bug.
Understood, thanks @corhere! Sorry, I misread what you meant earlier. Glad we're now on the same page. Thanks for taking the time to clear everything up :)
Thank you @corhere for your insight!
Discussed this with @corhere - let's get this change in (thanks all!)

We were discussing the code on the compose side, as (to our understanding) the image pulls in the example linked are more intended to be "pulls that happen in parallel" (i.e., failing to pull one image shouldn't fail pulling the others); we were wondering whether a waitgroup (perhaps with a multi-error or slice of errors) would be the right solution for that.
LGTM
- What I did
This solves the issue described in docker/compose#9448.
I've also reproduced the issue with the `docker compose down` command (basically, all commands that are run concurrently over ssh might be affected).
- How I did it
Please see my comment on issue docker/compose#9448 for details on what causes the issue.
When `errgroup.Wait()` is called, it cancels the `Context` regardless of whether there was an error or not. This in turn kills the ssh process and causes an error when go's `http.Client` tries to reuse this `net.Conn` (`commandConn`).

Not passing down the `Context` might seem counter-intuitive, but in this case, the lifetime of the process should be managed by the `http.Client`, not the caller's `Context`.

If the caller cancels the `Context` due to an error, (1) the `net.Conn` has no way to communicate that back to the `http.Client` (i.e., mark itself as "errored") and (2) even in case of an error, the established ssh connection can still be reused for other requests, so there's really no need to shut it down.

The `http.Client` will automatically `Close()` the connection when it no longer needs it. The number of concurrent and idle connections it can keep can be configured in the `http.Client`'s options.

Sidenotes:
[1] Currently it's possible that the program exits before the `http.Client` has a chance to close all idle connections, because the docker Client object itself doesn't seem to be closed (see #3899). A clean shutdown would be preferable, but I've not seen any ssh process left behind because of it (go and/or the OS probably kills them for us on exit).

[2] I've experimented a little bit with the `MaxConnsPerHost` option of `http.Transport` (it can be added here). I only have anecdotal evidence (I've not run any meaningful benchmark, so your experience might be different), but I've found that it's sometimes noticeably slower to start many ssh connections in parallel instead of just funneling all requests sequentially through one connection, probably due to the added overhead of starting many ssh processes, each of which has to establish its own connection from scratch. However, for long-running tasks like pulling many large images, parallel may be faster (again depending on a variety of factors like the number of images, their sizes, network conditions, whether the image is already downloaded, etc.).

In short, I see an optimization opportunity here for docker over ssh: requests that are expected to be fast, like getting metadata on an image/container, could be run sequentially, while long-running requests could run concurrently. Although it's greatly out of scope for this PR, I think it might be an interesting performance improvement. I would have liked to volunteer to work on this, but at least for the foreseeable future, I unfortunately can't give it the proper attention it deserves to get it done. If someone else wants to get started on this, please go ahead (and let me know :)).
- How to verify it
cd test-pull
DOCKER_HOST=ssh://__USER__@__HOST__ ../bin/build/docker-compose pull
The pull should succeed. If you revert the fix (last commit on the branch), there's a chance the pull will fail. If you're (un)lucky, you might have to try it a few times to reproduce the issue.
Without the fix, I'm able to reproduce the issue about 9/10 times. Others have also confirmed the problem on their end (see docker/compose#9448).
With the fix, I've never had the problem again.
- Description for the changelog
Fix intermittent ssh error when docker compose runs some operations concurrently (like pulling multiple images)
- A picture of a cute animal (not mandatory but encouraged)