restart eBPF tracking on error #2735
Conversation
probe/endpoint/ebpf.go
Outdated
@@ -287,10 +310,33 @@ func (t *EbpfTracker) isDead() bool {

func (t *EbpfTracker) stop() {
	t.Lock()
	defer t.Unlock()
probe/endpoint/ebpf.go
Outdated
	tracer.Start()

	return nil
}
Force-pushed from 9efe6fc to 1de8096
I updated the code. It seems to restart fine. But in one instance of my test, Scope was taking 100% of the CPU in the gobpf code when I tested the fallback to proc parsing. I got the stack of the goroutine taking all the CPU with the following command. The stack changes all the time since it is running.

The integration test needs fixing too: https://circleci.com/gh/kinvolk/scope/990
Force-pushed from 64b8ca0 to 58b3361
I fixed the regression in integration test 313 and added a new integration test 315 that restarts EbpfTracker.
probe/endpoint/ebpf.go
Outdated
	closedDuringInit: map[fourTuple]struct{}{},

var debugBPF bool
if os.Getenv("SCOPE_DEBUG_BPF") != "" {
	log.Warnf("ebpf tracker started in debug mode")
probe/endpoint/ebpf_test.go
Outdated
@@ -220,6 +221,8 @@ func TestInvalidTimeStampDead(t *testing.T) {
	if cnt != 2 {
		t.Errorf("walkConnections found %v instead of 2 connections", cnt)
	}
	// EbpfTracker is marked as dead asynchronously.
	time.Sleep(100 * time.Millisecond)
probe/endpoint/connection_tracker.go
Outdated
}
t.ebpfFailureTime = time.Now()

if t.ebpfFailureCount == 0 {
probe/endpoint/connection_tracker.go
Outdated
@@ -29,6 +30,9 @@ type connectionTracker struct {
	flowWalker      flowWalker // Interface
	ebpfTracker     *EbpfTracker
	reverseResolver *reverseResolver

	ebpfFailureCount int
	ebpfFailureTime  time.Time
t.tracer.Stop()

// Do not call tracer.Stop() in this thread, otherwise tracer.Stop() will
// deadlock waiting for this thread to pick up the next event.
probe/endpoint/connection_tracker.go
Outdated
t.ebpfFailureCount++
err := t.ebpfTracker.restart()
if err == nil {
	go t.getInitialState()
Force-pushed from 8050079 to 5575556
Branch updated.

Tests: it seemed to work fine. But then I had a look at the goroutine stacks with pprof after Scope fell back on proc parsing:
Fix in gobpf, tested on GCE: https://gist.github.com/alban/cb4b899e9558e231d080ec7e7b3abbc5 It seems to resolve the leaking goroutine.
PerfMap's PollStop() documentation says the channel should be closed after calling PollStop(): https://github.com/iovisor/gobpf/blob/d0a3e1b/elf/perf.go#L241-L242 This solves a leaking goroutine issue in Scope: weaveworks/scope#2735 (comment)
I updated the vendoring with weaveworks/tcptracer-bpf#49 and marked this PR as WIP because the dependency is not merged yet.
EbpfTracker can die when the tcp events are received out of order. This can happen with a buggy kernel or apparently in other cases, see: weaveworks#2650

As a workaround, restart EbpfTracker when an event is received out of order. This does not seem to happen often, but as a precaution, EbpfTracker will not restart if the last failure is less than 5 minutes ago.

This is not easy to test but I added instrumentation to trigger a restart:
- Start Scope with:
  $ sudo WEAVESCOPE_DOCKER_ARGS="-e SCOPE_DEBUG_BPF=1" ./scope launch
- Request a stop with:
  $ echo stop | sudo tee /proc/$(pidof scope-probe)/root/var/run/scope/debug-bpf
Force-pushed from 82d597c to 760fd2c
I updated the PR (pending on weaveworks/tcptracer-bpf#50) and it works for me now. I no longer see leaking goroutines or panics. I tested several times, both waiting 5 minutes between the ebpf tracker crashes and without waiting.
This includes:
- iovisor/gobpf#70 perf: close go channels idiomatically
- iovisor/gobpf#70 close channels on the sender side & fix closing race
- weaveworks/tcptracer-bpf#50 vendor: update gobpf
Force-pushed from 760fd2c to 93ca8b8
@2opremio new review welcome :)
This is a workaround for #2650

Tested manually the following way:
- Start Scope with:
  $ sudo WEAVESCOPE_DOCKER_ARGS="-e SCOPE_DEBUG_BPF=1" ./scope launch
- Simulate EbpfTracker failure with:
  $ echo stop | sudo tee /proc/$(pidof scope-probe)/root/var/run/scope/debug-bpf

TODO:
- pending on tracer: close channels on Stop() tcptracer-bpf#49