restart eBPF tracking on error #2735

Merged
merged 3 commits into weaveworks:master from alban/bpf-restart on Aug 18, 2017

Conversation

@alban (Contributor) commented Jul 20, 2017

This is a workaround for #2650

Tested manually the following way:

Start Scope with:

    sudo WEAVESCOPE_DOCKER_ARGS="-e SCOPE_DEBUG_BPF=1" ./scope launch

Simulate EbpfTracker failure with:

    echo foo | sudo tee /proc/$(pidof scope-probe)/root/var/run/scope/debug-bpf
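
For illustration, a minimal, hypothetical sketch of how such a trigger could be wired up (only SCOPE_DEBUG_BPF and the /var/run/scope/debug-bpf path come from this PR; the helper and callback names are made up):

    // Hypothetical sketch of a debug hook that simulates an EbpfTracker failure
    // when something is written to the debug file.
    package main

    import (
        "io/ioutil"
        "log"
        "os"
        "time"
    )

    // watchDebugBPF polls path once per second and calls onFailure when the file
    // becomes non-empty, removing it so the trigger fires only once.
    func watchDebugBPF(path string, onFailure func()) {
        go func() {
            for {
                time.Sleep(time.Second)
                data, err := ioutil.ReadFile(path)
                if err != nil || len(data) == 0 {
                    continue
                }
                os.Remove(path)
                onFailure()
                return
            }
        }()
    }

    func main() {
        if os.Getenv("SCOPE_DEBUG_BPF") == "" {
            return // the hook is only active when debugging is requested
        }
        watchDebugBPF("/var/run/scope/debug-bpf", func() {
            log.Print("simulating EbpfTracker failure")
        })
        time.Sleep(10 * time.Second) // keep the demo process alive
    }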

TODO:

    @@ -287,10 +310,33 @@ func (t *EbpfTracker) isDead() bool {

    func (t *EbpfTracker) stop() {
        t.Lock()
        defer t.Unlock()


        tracer.Start()

        return nil
    }


@alban force-pushed the alban/bpf-restart branch 2 times, most recently from 9efe6fc to 1de8096 on July 24, 2017 at 15:22
@alban (Contributor Author) commented Jul 24, 2017

I updated the code. It seems to restart fine, but in one instance of my test Scope was taking 100% of the CPU in the gobpf code when I tested the fallback to proc parsing.

I got the stack of the goroutine taking all the CPU with the commands below; the stack changes constantly since the goroutine keeps running.

    sudo WEAVESCOPE_DOCKER_ARGS="-e SCOPE_DEBUG_BPF=1" ./scope launch --probe.http.listen :4041
    http://localhost:4041/debug/pprof/goroutine?debug=2
goroutine 4442 [runnable, locked to thread]:
github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf._C2func_poll(0xc42385aea0, 0x4, 0x1f4, 0x4, 0x0, 0x0)
	github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf/_obj/_cgo_gotypes.go:313 +0x68
github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf.perfEventPoll(0xc4218f0e40, 0x4, 0x4, 0xc42385ae00, 0xc423857510)
	/go/src/github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf/perf.go:258 +0xf6
github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf.(*PerfMap).PollStart.func1(0xc42026c910, 0xc4245df1c0, 0xc4245db180)
	/go/src/github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf/perf.go:171 +0x695
created by github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf.(*PerfMap).PollStart
	/go/src/github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf/perf.go:238 +0xf1

The integration test also needs fixing: https://circleci.com/gh/kinvolk/scope/990
integration/313_container_to_container_edge_with_ebpf_proc_fallback_test.sh

@alban force-pushed the alban/bpf-restart branch 2 times, most recently from 64b8ca0 to 58b3361 on July 25, 2017 at 13:00
@alban (Contributor Author) commented Jul 25, 2017

I fixed the regression in integration test 313 and added a new integration test 315 that restarts EbpfTracker.

@alban changed the title from "[WIP] restart eBPF tracking on error" to "restart eBPF tracking on error" on Jul 25, 2017
@rade requested a review from @2opremio on July 25, 2017 at 18:57

    closedDuringInit: map[fourTuple]struct{}{},
    var debugBPF bool
    if os.Getenv("SCOPE_DEBUG_BPF") != "" {
        log.Warnf("ebpf tracker started in debug mode")


    @@ -220,6 +221,8 @@ func TestInvalidTimeStampDead(t *testing.T) {
        if cnt != 2 {
            t.Errorf("walkConnections found %v instead of 2 connections", cnt)
        }
        // EbpfTracker is marked as dead asynchronously.
        time.Sleep(100 * time.Millisecond)
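
Side note (not from the PR): a fixed sleep can be flaky on slow CI machines. One common alternative is to poll with a deadline, roughly like the helper below; the package name and helper are illustrative, and isDead() is the accessor shown in the diff above.

    // Illustrative helper: poll a condition with a deadline instead of sleeping
    // for a fixed 100ms. A test could call waitUntil(t, time.Second, tracker.isDead).
    package endpoint

    import (
        "testing"
        "time"
    )

    func waitUntil(t *testing.T, timeout time.Duration, cond func() bool) {
        t.Helper()
        deadline := time.Now().Add(timeout)
        for !cond() {
            if time.Now().After(deadline) {
                t.Fatal("condition not met before timeout")
            }
            time.Sleep(10 * time.Millisecond)
        }
    }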


    }
    t.ebpfFailureTime = time.Now()

    if t.ebpfFailureCount == 0 {


    @@ -29,6 +30,9 @@ type connectionTracker struct {
        flowWalker      flowWalker // Interface
        ebpfTracker     *EbpfTracker
        reverseResolver *reverseResolver

        ebpfFailureCount int
        ebpfFailureTime  time.Time


    t.tracer.Stop()

    // Do not call tracer.Stop() in this thread, otherwise tracer.Stop() will
    // deadlock waiting for this thread to pick up the next event.
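
To illustrate the deadlock this comment guards against, here is a self-contained sketch with made-up types (not the tcptracer-bpf API): Stop() waits for the event-handling goroutine to finish, so the handler must not call Stop() synchronously.

    package main

    import (
        "fmt"
        "time"
    )

    type tracer struct {
        events chan string
        done   chan struct{}
    }

    // newTracer starts a goroutine that hands every event to handle, loosely
    // mirroring how a tracer delivers TCP events to a callback.
    func newTracer(handle func(t *tracer, ev string)) *tracer {
        tr := &tracer{events: make(chan string), done: make(chan struct{})}
        go func() {
            for ev := range tr.events {
                handle(tr, ev)
            }
            close(tr.done)
        }()
        return tr
    }

    // Stop closes the event channel and waits for the handler goroutine to exit.
    // Calling Stop() from inside handle() would deadlock: Stop() waits on done,
    // and done is only closed after handle() returns.
    func (tr *tracer) Stop() {
        close(tr.events)
        <-tr.done
    }

    func main() {
        tr := newTracer(func(t *tracer, ev string) {
            fmt.Println("bad event:", ev)
            go t.Stop() // stop from another goroutine, as the comment above suggests
        })
        tr.events <- "out-of-order tcp event"
        time.Sleep(100 * time.Millisecond)
        fmt.Println("stopped without deadlock")
    }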


    t.ebpfFailureCount++
    err := t.ebpfTracker.restart()
    if err == nil {
        go t.getInitialState()
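
Putting the pieces together, the failure handling amounts to roughly the sketch below. The five-minute window comes from the commit message quoted later in this thread; everything except ebpfFailureCount, ebpfFailureTime, restart() and getInitialState() is invented for illustration and differs from the real Scope code.

    // Standalone sketch of the restart policy: restart eBPF tracking on failure,
    // but give up (fall back to /proc parsing) if the previous failure happened
    // less than five minutes ago.
    package main

    import (
        "fmt"
        "time"
    )

    const restartBackoff = 5 * time.Minute

    type fakeEbpfTracker struct{}

    func (ft *fakeEbpfTracker) restart() error { return nil }

    type connectionTracker struct {
        ebpfTracker      *fakeEbpfTracker
        ebpfFailureCount int
        ebpfFailureTime  time.Time
    }

    func (t *connectionTracker) getInitialState() {}

    // onEbpfFailure decides between restarting the tracker and falling back.
    func (t *connectionTracker) onEbpfFailure() {
        if t.ebpfFailureCount > 0 && time.Since(t.ebpfFailureTime) < restartBackoff {
            fmt.Println("previous failure too recent, falling back to /proc parsing")
            return
        }
        t.ebpfFailureTime = time.Now()
        t.ebpfFailureCount++
        if err := t.ebpfTracker.restart(); err == nil {
            go t.getInitialState()
            fmt.Println("EbpfTracker restarted")
        }
    }

    func main() {
        t := &connectionTracker{ebpfTracker: &fakeEbpfTracker{}}
        t.onEbpfFailure() // first failure: restart
        t.onEbpfFailure() // second failure within five minutes: fall back
        time.Sleep(10 * time.Millisecond)
    }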


@alban force-pushed the alban/bpf-restart branch from 8050079 to 5575556 on August 8, 2017 at 12:30
@alban (Contributor Author) commented Aug 8, 2017

Branch updated.

@alban (Contributor Author) commented Aug 8, 2017

Tests:

  • CircleCI passes
  • tested manually on my laptop (simulating EbpfTracker failure)
  • tested on the GCE instance with the test script to reproduce the bug

It seemed to work fine.

But then I had a look at the goroutine stacks with pprof after Scope fell back to proc parsing:

goroutine 26 [chan send, 4 minutes]:
github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf.(*PerfMap).PollStart.func1(0xc4202c0370, 0xc422dbbf00, 0xc423f508a0)
	/go/src/github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf/perf.go:229 +0x4f7
created by github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf.(*PerfMap).PollStart
	/go/src/github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf/perf.go:238 +0xf1

@alban (Contributor Author) commented Aug 8, 2017

Fix in gobpf, tested on GCE: https://gist.github.com/alban/cb4b899e9558e231d080ec7e7b3abbc5
Explanation: https://github.com/iovisor/gobpf/blob/d0a3e1b/elf/perf.go#L241-L242

It seems to resolve the leaking goroutine. I'll open the PR tomorrow: weaveworks/tcptracer-bpf#49
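
For context on the stack above: it is the classic leak of a producer goroutine blocked on a channel send after its consumer has gone away. The sketch below is generic Go, not the gobpf code, and only illustrates the pattern the linked fix addresses.

    package main

    import (
        "fmt"
        "runtime"
        "time"
    )

    // startProducer sends events until stop is closed. Without the stop case the
    // goroutine would block on `out <- i` forever once the reader goes away,
    // which is the "chan send" state seen in the goroutine dump above.
    func startProducer(out chan<- int, stop <-chan struct{}) {
        go func() {
            for i := 0; ; i++ {
                select {
                case out <- i:
                case <-stop:
                    return
                }
            }
        }()
    }

    func main() {
        events := make(chan int)
        stop := make(chan struct{})
        startProducer(events, stop)

        fmt.Println("first event:", <-events) // consume one event, then stop reading
        close(stop)                           // let the producer exit instead of leaking it

        time.Sleep(50 * time.Millisecond)
        fmt.Println("goroutines left:", runtime.NumGoroutine())
    }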

alban added a commit to kinvolk-archives/tcptracer-bpf that referenced this pull request Aug 8, 2017
PerfMap's PollStop() documentation says the channel should be closed
after calling PollStop():
https://github.com/iovisor/gobpf/blob/d0a3e1b/elf/perf.go#L241-L242

This solves a leaking goroutine issue in Scope:
weaveworks/scope#2735 (comment)
alban added a commit to kinvolk-archives/tcptracer-bpf that referenced this pull request Aug 9, 2017
PerfMap's PollStop() documentation says the channel should be closed
after calling PollStop():
https://github.com/iovisor/gobpf/blob/d0a3e1b/elf/perf.go#L241-L242

This solves a leaking goroutine issue in Scope:
weaveworks/scope#2735 (comment)
@alban changed the title from "restart eBPF tracking on error" to "[WIP] restart eBPF tracking on error" on Aug 9, 2017
@alban (Contributor Author) commented Aug 9, 2017

I updated the vendoring with weaveworks/tcptracer-bpf#49 and marked this PR as WIP because the dependency has not been merged yet.

alban added 2 commits August 17, 2017 16:39
EbpfTracker can die when TCP events are received out of order. This
can happen with a buggy kernel or apparently in other cases, see:
weaveworks#2650

As a workaround, restart EbpfTracker when an event is received out of
order. This does not seem to happen often, but as a precaution,
EbpfTracker will not restart if the previous failure happened less than
5 minutes ago.

This is not easy to test, but I added instrumentation to trigger a
restart:

- Start Scope with:
    $ sudo WEAVESCOPE_DOCKER_ARGS="-e SCOPE_DEBUG_BPF=1" ./scope launch

- Request a stop with:
    $ echo stop | sudo tee /proc/$(pidof scope-probe)/root/var/run/scope/debug-bpf
@alban force-pushed the alban/bpf-restart branch from 82d597c to 760fd2c on August 17, 2017 at 15:19
@alban (Contributor Author) commented Aug 17, 2017

I updated the PR (pending on weaveworks/tcptracer-bpf#50) and it works for me now. I no longer see leaking goroutines or panics. I tested several times, both with and without waiting 5 minutes between the EbpfTracker crashes.

This includes:

- iovisor/gobpf#70
  perf: close go channels idiomatically

- iovisor/gobpf#70
  close channels on the sender side & fix closing race

- weaveworks/tcptracer-bpf#50
  vendor: update gobpf
@alban force-pushed the alban/bpf-restart branch from 760fd2c to 93ca8b8 on August 17, 2017 at 15:57
@alban changed the title from "[WIP] restart eBPF tracking on error" to "restart eBPF tracking on error" on Aug 17, 2017
@alban (Contributor Author) commented Aug 17, 2017

@2opremio new review welcome :)

@rade merged commit 8fe3538 into weaveworks:master on Aug 18, 2017