ensure context is cancelled to prevent goroutine leaks from grpc.newClientStream #1316
Conversation
Please sign your commits following these rules:

$ git clone -b "goroutine-leak" git@github.com:coryb/buildkit.git somewhere
$ cd somewhere
$ git commit --amend -s --no-edit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.
ensure context is cancelled to prevent goroutine leaks from grpc.newClientStream
Signed-off-by: Cory Bennett <cbennett@netflix.com>
While the fix is OK and it addresses an important issue, I think this is a sign of wrong internal behavior inside the function called by flightcontrol, which should be fixed as well. If a worker function returns, it should not leak goroutines, even if the passed context was not canceled. It would be good to look more into why this goroutine doesn't go away when the parent function returns before we add this cleanup behavior at a higher level.
Hey @tonistiigi, thanks for the comment. I completely agree that my change is just a band-aid; I'm just not sure where a better fix would go. I think the responsibility should lie with the code that initiates the grpc.NewStream calls in the first place, since that is what has the strict requirements to prevent leaks. I'm just not familiar enough with the codebase to identify the sources of the grpc.NewStream calls. The call stacks I found deal with opaque interfaces, so I could not easily figure out which interface implementations initiated the grpc connections. You probably have a better idea of where to look, so I can dig further if you can suggest some pointers on where to start.
@coryb Do you have a small reproducer so I can get a trace for this? Or does this happen with any build?
Sorry for the delay, I will try to reproduce with simple dockerfile frontend and get back to you. |
@tonistiigi I reproduced it with simple dockerfile llb. The Dockerfile is just:
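The actual Dockerfile did not survive this copy of the thread. Any trivial, fully cacheable Dockerfile exercises the same path; a hypothetical minimal example (contents are an assumption, not the original):

```dockerfile
# Hypothetical minimal Dockerfile; any fully-cached build reproduces the loop.
FROM alpine:3.10
RUN echo hello > /hello
```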
Run like:
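The exact invocation was also not captured. A typical repro loop against a TCP buildkitd would look something like the following, where the address and context directory are placeholders, not the reporter's actual values:

```shell
# Hypothetical reproduction loop; --addr and the context dir are placeholders.
while true; do
  buildctl --addr tcp://127.0.0.1:1234 build \
    --frontend dockerfile.v0 --local context=. --local dockerfile=.
done
```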
To easily see the goroutine count, I just added this file to the buildkitd main package:
Then with the
Weirdly, I can't repro this. I let it run for a couple of minutes, and when I cancel the script, the goroutine count goes down to 19. Is there anything specific in your setup, e.g. are you using a tcp connection?
@tonistiigi We are using TCP, the
Odd, I double-checked that the issue persists in our setup when I revert our deployment to use only the current buildkitd. As @hinshun said, we are using tcp. Here is the buildkitd.toml file we are using; not sure how our setup might be triggering this:
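The attached buildkitd.toml did not survive the copy. A TCP listener with TLS in buildkitd.toml is generally shaped like the following sketch, where the address and file paths are placeholders rather than the reporter's actual configuration:

```toml
# Hypothetical buildkitd.toml fragment; address and paths are placeholders.
[grpc]
  address = ["tcp://0.0.0.0:1234"]
  [grpc.tls]
    cert = "/etc/buildkit/tls.crt"
    key = "/etc/buildkit/tls.key"
    ca = "/etc/buildkit/ca.crt"
```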
We connect using the corresponding tls cert/keys over the tcp network.
Narrowed this down, it only leaks when using the containerd worker, the default runc worker seems fine. So with this config I was able to see the leak:
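The config snippet did not survive the copy. Enabling the containerd worker while disabling the default runc (OCI) worker in buildkitd.toml is along these lines (a hedged sketch, not the reporter's exact file):

```toml
# Hypothetical worker selection: containerd worker on, runc worker off.
[worker.oci]
  enabled = false

[worker.containerd]
  enabled = true
```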
I can repro now. Let's merge so we can get this into the 0.6 release, and create an issue to track the additional fix.
When running a local buildkitd cluster I noticed a goroutine leak: multiple goroutines were leaked on every buildkit solve. Here is a graph of a 4-hour window while running a single buildkit solve in a loop with the results entirely cached (from 20:10 to 22:20); the goroutine count resets around 23:40 because I restarted buildkitd:
I was able to collect the stack traces from the call sites of the goroutines in grpc.newClientStream; the leaks came from 2 distinct stack traces:
and
The common ancestor was flightcontrol.(*call).run. It seems that both end up calling ReadAt on the grpc stream, which does not drain/close the connection completely, so the context needs to be canceled to ensure that the grpc stream is cleaned up. With this patch applied I ran the same test again over an hour; the goroutine count stayed steady and recovered on its own after I stopped my testing:
[cc @hinshun]