ensure context is cancelled to prevent goroutine leaks from grpc.newClientStream #1316

Merged 1 commit on Jan 17, 2020

Conversation

@coryb (Collaborator) commented Jan 7, 2020

While running a local buildkitd cluster I noticed a goroutine leak; multiple goroutines appear to be leaked for every buildkit solve. Here is a graph of a 4-hour window while running a single buildkit solve in a loop (results entirely cached) from 20:10 to 22:20; the goroutine count resets around 23:40 because I restarted buildkitd:
[image: goroutine count graph]

I was able to collect stack traces from the call sites of the goroutines in grpc.newClientStream; the leak comes from two distinct stack traces:

goroutine 725970 [running]:
github.com/moby/buildkit/vendor/google.golang.org/grpc.newClientStream(0x17bcda0, 0xc01ed1b8c0, 0x2201e00, 0xc00032ce00, 0x1641b3a, 0x2c, 0xc00025c9a0, 0x2, 0x2, 0x0, ...)
        /go/src/github.com/moby/buildkit/vendor/google.golang.org/grpc/stream.go:327 +0x9d1
github.com/moby/buildkit/vendor/github.com/containerd/containerd.namespaceInterceptor.stream(0xc00016f8c0, 0x8, 0x17bcda0, 0xc0222b6750, 0x2201e00, 0xc00032ce00, 0x1641b3a, 0x2c, 0x1662038, 0xc00025c9a0, ...)
        /go/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/grpc.go:44 +0xcb
github.com/moby/buildkit/vendor/google.golang.org/grpc.(*ClientConn).NewStream(0xc00032ce00, 0x17bcda0, 0xc0222b6750, 0x2201e00, 0x1641b3a, 0x2c, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/moby/buildkit/vendor/google.golang.org/grpc/stream.go:154 +0xe3
github.com/moby/buildkit/vendor/github.com/containerd/containerd/api/services/content/v1.(*contentClient).Read(0xc0000c0058, 0x17bcda0, 0xc0222b6750, 0xc0237d5200, 0x0, 0x0, 0x0, 0xc01ed1b7a0, 0xc023db1930, 0xc023db18d0, ...)
        /go/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/api/services/content/v1/content.pb.go:1140 +0xb7
github.com/moby/buildkit/vendor/github.com/containerd/containerd/content/proxy.(*remoteReaderAt).ReadAt(0xc0237d5180, 0xc010337d40, 0x210, 0x210, 0x0, 0x0, 0x17b4460, 0xc0237d5180)
        /go/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/content/proxy/content_reader.go:43 +0xae
github.com/moby/buildkit/util/imageutil.DetectManifestMediaType(0x17b4460, 0xc0237d5180, 0xc01ed1a900, 0x0, 0x0, 0xc019bb49d4)
        /go/src/github.com/moby/buildkit/util/imageutil/config.go:171 +0x8f
github.com/moby/buildkit/util/pull.(*Puller).Resolve.func1()
        /go/src/github.com/moby/buildkit/util/pull/pull.go:60 +0x48a
sync.(*Once).Do(0xc010b4a6f8, 0xc0078f3af8)
        /.newt-cache/go-version/1.12.3/src/sync/once.go:44 +0xb3
github.com/moby/buildkit/util/pull.(*Puller).Resolve(0xc010b4a690, 0x17bcda0, 0xc01ed1a900, 0x17b69a0, 0xc01ed1abd0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/moby/buildkit/util/pull/pull.go:47 +0x8d
github.com/moby/buildkit/source/containerimage.(*puller).CacheKey(0xc01f4bcd80, 0x17bcda0, 0xc01ed1a900, 0x0, 0xc01f4bcd80, 0x0, 0x0, 0xc0078f3e00, 0x10055c1)
        /go/src/github.com/moby/buildkit/source/containerimage/pull.go:148 +0x64
github.com/moby/buildkit/solver/llbsolver/ops.(*sourceOp).CacheMap(0xc01da5b3b0, 0x17bcda0, 0xc01ed1a900, 0x0, 0xc0078f3f48, 0x17bcda0, 0xc01ed1aa50, 0xc0100f7d40)
        /go/src/github.com/moby/buildkit/solver/llbsolver/ops/source.go:65 +0xa8
github.com/moby/buildkit/solver.(*sharedOp).CacheMap.func1(0x17bd2a0, 0xc019e1b3c0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/moby/buildkit/solver/jobs.go:635 +0x283
github.com/moby/buildkit/util/flightcontrol.(*call).run(0xc01f4bccf0)
        /go/src/github.com/moby/buildkit/util/flightcontrol/flightcontrol.go:119 +0x65
sync.(*Once).Do(0xc01f4bcd50, 0xc018522810)
        /.newt-cache/go-version/1.12.3/src/sync/once.go:44 +0xb3
created by github.com/moby/buildkit/util/flightcontrol.(*call).wait
        /go/src/github.com/moby/buildkit/util/flightcontrol/flightcontrol.go:146 +0x18b

and

goroutine 3129 [running]:
github.com/moby/buildkit/vendor/google.golang.org/grpc.newClientStream(0x17bcdc0, 0xc0019f1320, 0x2201e00, 0xc00016e000, 0x1641b5a, 0x2c, 0xc0000b18e0, 0x2, 0x2, 0x0, ...)
        /go/src/github.com/moby/buildkit/vendor/google.golang.org/grpc/stream.go:327 +0x9d1
github.com/moby/buildkit/vendor/github.com/containerd/containerd.namespaceInterceptor.stream(0xc0001708f0, 0x8, 0x17bcdc0, 0xc0019b7d10, 0x2201e00, 0xc00016e000, 0x1641b5a, 0x2c, 0x1662058, 0xc0000b18e0, ...)
        /go/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/grpc.go:44 +0xcb
github.com/moby/buildkit/vendor/google.golang.org/grpc.(*ClientConn).NewStream(0xc00016e000, 0x17bcdc0, 0xc0019b7d10, 0x2201e00, 0x1641b5a, 0x2c, 0x0, 0x0, 0x0, 0xc003e56a00, ...)
        /go/src/github.com/moby/buildkit/vendor/google.golang.org/grpc/stream.go:154 +0xe3
github.com/moby/buildkit/vendor/github.com/containerd/containerd/api/services/content/v1.(*contentClient).Read(0xc000010038, 0x17bcdc0, 0xc0019b7d10, 0xc005a19900, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc006eb7a18, ...)
        /go/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/api/services/content/v1/content.pb.go:1140 +0xb7
github.com/moby/buildkit/vendor/github.com/containerd/containerd/content/proxy.(*remoteReaderAt).ReadAt(0xc005a198c0, 0xc003e56a00, 0x6b9, 0x6b9, 0x0, 0xc003321040, 0x47, 0x6b9)
        /go/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/content/proxy/content_reader.go:43 +0xae
github.com/moby/buildkit/vendor/github.com/containerd/containerd/content.ReadBlob(0x17bcdc0, 0xc0019cf110, 0x7fc9828030a0, 0xc00025e080, 0xc005915710, 0x2e, 0xc003321040, 0x47, 0x6b9, 0x0, ...)
        /go/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/content/helpers.go:58 +0x179
github.com/moby/buildkit/util/imageutil.Config(0x17bcdc0, 0xc0019cf110, 0xc0029f1240, 0x3b, 0x17b69c0, 0xc0019ddf50, 0x7fc97c0145c0, 0xc00025e080, 0x17c4ca0, 0xc00025e040, ...)
        /go/src/github.com/moby/buildkit/util/imageutil/config.go:109 +0x5e3
github.com/moby/buildkit/source/containerimage.(*imageSource).ResolveImageConfig.func1(0x17bd2c0, 0xc007497e00, 0xc000de19e0, 0x0, 0x0, 0x1)
        /go/src/github.com/moby/buildkit/source/containerimage/pull.go:79 +0x185
github.com/moby/buildkit/util/flightcontrol.(*call).run(0xc003bfa5a0)
        /go/src/github.com/moby/buildkit/util/flightcontrol/flightcontrol.go:119 +0x65
sync.(*Once).Do(0xc003bfa600, 0xc00087f250)
        /.newt-cache/go-version/1.12.3/src/sync/once.go:44 +0xb3
created by github.com/moby/buildkit/util/flightcontrol.(*call).wait
        /go/src/github.com/moby/buildkit/util/flightcontrol/flightcontrol.go:146 +0x18b
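(Aside: goroutine dumps like the two above can be collected from buildkitd's debug address, assuming the debugAddress listener serves the standard net/http/pprof handlers. A minimal sketch that fetches the full dump, using the 127.0.0.1:6060 address from the config shown later in this thread:)

package main

import (
    "io"
    "net/http"
    "os"
)

func main() {
    // debug=2 asks pprof for full stack traces of every live goroutine,
    // which is the format of the dumps pasted above.
    resp, err := http.Get("http://127.0.0.1:6060/debug/pprof/goroutine?debug=2")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    io.Copy(os.Stdout, resp.Body)
}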

The common ancestor is flightcontrol.(*call).run. Both paths end up calling ReadAt on the gRPC stream, which does not fully drain or close the connection, so it appears the context needs to be cancelled to ensure the gRPC stream is cleaned up.
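The general shape of the fix is to derive a cancellable context before invoking the worker function and to cancel it once that function returns, so any gRPC streams opened under it are torn down. A minimal sketch of the pattern (runWithCancel is an illustrative name, not the actual diff):

package main

import "context"

// runWithCancel shows the cleanup pattern: hand the worker function a
// cancellable context and cancel it as soon as the function returns, so any
// grpc.newClientStream goroutines tied to that context are shut down even if
// the callee never drains or closes its streams.
func runWithCancel(ctx context.Context, fn func(context.Context) error) error {
    ctx, cancel := context.WithCancel(ctx)
    defer cancel()
    return fn(ctx)
}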

With this patch applied I ran the same test again for over an hour; the goroutine count stayed steady and recovered on its own after I stopped testing:
[image: goroutine count graph after the patch]

[cc @hinshun]

@GordonTheTurtle (Collaborator)

Please sign your commits following these rules:
https://github.com/moby/moby/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "goroutine-leak" git@github.com:coryb/buildkit.git somewhere
$ cd somewhere
$ git commit --amend -s --no-edit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

Commit: ensure context is cancelled to prevent goroutine leaks from grpc.newClientStream

Signed-off-by: Cory Bennett <cbennett@netflix.com>
@tonistiigi (Member) left a comment

While the fix is OK and addresses an important issue, I think this is a sign of incorrect internal behavior inside the function called by flightcontrol, which should be fixed as well. If a worker function returns, it should not leak goroutines, even if the passed context was not cancelled. It would be good to look more into why this goroutine doesn't go away when the parent function returns before we add this cleanup behavior at a higher level.
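To make that distinction concrete, here is a hedged sketch of callee-side cleanup: the function doing the remote read derives its own context and cancels it before returning, instead of relying on its caller. readConfig and open are illustrative names only, not the buildkit or containerd code paths in the traces.

package main

import (
    "context"
    "io"
)

// readConfig is a hypothetical callee that performs a remote read. It cancels
// its own derived context and closes the reader on return, so any streams it
// opened are cleaned up regardless of whether the caller ever cancels the
// context it passed in.
func readConfig(ctx context.Context, open func(context.Context) (io.ReadCloser, error)) ([]byte, error) {
    ctx, cancel := context.WithCancel(ctx)
    defer cancel()

    rc, err := open(ctx)
    if err != nil {
        return nil, err
    }
    defer rc.Close()
    return io.ReadAll(rc)
}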

@coryb (Collaborator, Author) commented Jan 10, 2020

Hey @tonistiigi, thanks for the comment. I completely agree my change is just a band-aid; I'm just not sure where a better fix would go. I think the responsibility likely belongs with the code that initiates grpc.NewStream in the first place, since that is what has the strict requirements to prevent leaks.

I am just not familiar enough with the codebase to identify the sources of the grpc.NewStream calls. The call stacks I found deal with opaque interfaces, so I could not easily figure out which interface implementations initiated the gRPC connections. You probably have a better idea of where to look, so I can dig further if you can suggest some pointers on where to start.
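For background on those strict requirements: grpc-go documents that a stream returned by ClientConn.NewStream is only released when the connection is closed, the stream's context is cancelled, or the caller keeps calling RecvMsg until it returns a non-nil error. A minimal sketch of that drain-or-cancel contract (drainStream and newMsg are illustrative names):

package main

import (
    "io"

    "google.golang.org/grpc"
)

// drainStream illustrates the caller-side contract for a gRPC client stream:
// keep calling RecvMsg until it returns a non-nil error (io.EOF on a clean
// end of stream), or cancel the stream's context. A caller that reads a
// single message and returns, as the ReadAt path above appears to do, leaves
// the stream's goroutines alive.
func drainStream(stream grpc.ClientStream, newMsg func() interface{}) error {
    for {
        if err := stream.RecvMsg(newMsg()); err != nil {
            if err == io.EOF {
                return nil // stream fully consumed and cleaned up
            }
            return err
        }
    }
}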

@tonistiigi (Member)

@coryb Do you have a small reproducer so I can get a trace for this? Or does this happen with any build?

@coryb (Collaborator, Author) commented Jan 14, 2020

Sorry for the delay; I will try to reproduce it with the simple Dockerfile frontend and get back to you.

@coryb (Collaborator, Author) commented Jan 16, 2020

@tonistiigi I reproduced it with simple Dockerfile LLB. The Dockerfile is just:

FROM alpine

Run like:

while true; do 
    buildctl build --frontend=dockerfile.v0 --local context=.  --local dockerfile=. || break
done

To easily see the goroutine count, I just added this file to the buildkitd main package:

package main

import (
    "runtime"
    "time"

    "github.com/sirupsen/logrus"
)

// log the goroutine count every 10 seconds so the leak is easy to see
func init() {
    go func() {
        for {
            <-time.After(10 * time.Second)
            logrus.Infof("GoRoutines: %d", runtime.NumGoroutine())
        }
    }()
}

Then, with the while loop running, I see something like:

time="2020-01-16T01:09:03Z" level=info msg="GoRoutines: 20"
time="2020-01-16T01:09:13Z" level=info msg="GoRoutines: 96"
time="2020-01-16T01:09:23Z" level=info msg="GoRoutines: 206"
time="2020-01-16T01:09:33Z" level=info msg="GoRoutines: 300"
time="2020-01-16T01:09:43Z" level=info msg="GoRoutines: 407"
time="2020-01-16T01:09:53Z" level=info msg="GoRoutines: 451"
time="2020-01-16T01:10:03Z" level=info msg="GoRoutines: 536"
time="2020-01-16T01:10:13Z" level=info msg="GoRoutines: 644"
time="2020-01-16T01:10:23Z" level=info msg="GoRoutines: 732"
time="2020-01-16T01:10:33Z" level=info msg="GoRoutines: 803"
time="2020-01-16T01:10:43Z" level=info msg="GoRoutines: 888"

@tonistiigi (Member)

Weirdly, I can't repro this. I let it run for a couple of minutes, and when I cancel the script the goroutine count goes down to 19. Is there anything specific in your setup, e.g. are you using a TCP connection?

@hinshun (Collaborator) commented Jan 16, 2020

@tonistiigi We are using TCP; buildkitd and buildctl are on different machines.

@coryb (Collaborator, Author) commented Jan 16, 2020

Odd; I double-checked that the issue persists in our setup even when I revert our deployment to the current buildkitd master branch.

As @hinshun said, we are using TCP. Here is the buildkitd.toml file we are using; I'm not sure how our setup might be triggering this:

debug = true
root = "/mnt/buildkit"
[grpc]
  address = [ "tcp://0.0.0.0:8980" ]
  debugAddress = "127.0.0.1:6060"
  uid = 0
  gid = 0
  [grpc.tls]
    cert = "/certificates/server.crt"
    key = "/certificates/server.key"
    ca = "/certificates/server.trust.pem"
[worker.oci]
  enabled = false
[worker.containerd]
  enabled = true
  address = "/mnt/containerd/containerd.sock"
  platforms = [ "linux/amd64" ]
  namespace = "buildkit"
  gc = false

We connect using the corresponding TLS cert/keys over TCP.

@coryb (Collaborator, Author) commented Jan 16, 2020

I narrowed this down: it only leaks when using the containerd worker; the default runc worker seems fine. With this config I was able to see the leak:

root = "/mnt/buildkit"
[grpc]
  address = [ "unix:///tmp/buildkitd.sock" ]
  uid = 0
  gid = 0
[worker.oci]
  enabled = false
[worker.containerd]
  enabled = true
  address = "/mnt/containerd/containerd.sock"
  platforms = [ "linux/amd64" ]
  namespace = "buildkit"
  gc = false

@tonistiigi (Member)

I can repro now. Let's merge so we can get this into the 0.6 release, and create an issue to track the additional fix.

@tonistiigi (Member)

#1334
