restart eBPF tracking on error #2735

Merged
merged 3 commits into weaveworks:master from alban/bpf-restart on Aug 18, 2017

Conversation

@alban (Contributor) commented Jul 20, 2017

This is a workaround for #2650

Tested manually the following way:

Start Scope with:

    sudo WEAVESCOPE_DOCKER_ARGS="-e SCOPE_DEBUG_BPF=1" ./scope launch

Simulate EbpfTracker failure with:

    echo foo | sudo tee /proc/$(pidof scope-probe)/root/var/run/scope/debug-bpf
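
For illustration, a minimal, hypothetical sketch of how such a trigger could be wired up (only SCOPE_DEBUG_BPF and the /var/run/scope/debug-bpf path come from this PR; the helper and callback names are made up):

    // Hypothetical sketch of a debug hook that simulates an EbpfTracker failure
    // when something is written to the debug file.
    package main

    import (
        "io/ioutil"
        "log"
        "os"
        "time"
    )

    // watchDebugBPF polls path once per second and calls onFailure when the file
    // becomes non-empty, removing it so the trigger fires only once.
    func watchDebugBPF(path string, onFailure func()) {
        go func() {
            for {
                time.Sleep(time.Second)
                data, err := ioutil.ReadFile(path)
                if err != nil || len(data) == 0 {
                    continue
                }
                os.Remove(path)
                onFailure()
                return
            }
        }()
    }

    func main() {
        if os.Getenv("SCOPE_DEBUG_BPF") == "" {
            return // the hook is only active when debugging is requested
        }
        watchDebugBPF("/var/run/scope/debug-bpf", func() {
            log.Print("simulating EbpfTracker failure")
        })
        time.Sleep(10 * time.Second) // keep the demo process alive
    }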

TODO:

    @@ -287,10 +310,33 @@ func (t *EbpfTracker) isDead() bool {

    func (t *EbpfTracker) stop() {
        t.Lock()
        defer t.Unlock()


        tracer.Start()

        return nil
    }


@alban force-pushed the alban/bpf-restart branch 2 times, most recently from 9efe6fc to 1de8096 on July 24, 2017 at 15:22
@alban (Contributor Author) commented Jul 24, 2017

I updated the code. It seems to restart fine, but in one instance of my test Scope was taking 100% of the CPU in the gobpf code when I tested the fallback to proc parsing.

I got the stack of the goroutine taking all the CPU with the commands below; the stack changes constantly since the goroutine keeps running.

    sudo WEAVESCOPE_DOCKER_ARGS="-e SCOPE_DEBUG_BPF=1" ./scope launch --probe.http.listen :4041
    http://localhost:4041/debug/pprof/goroutine?debug=2
goroutine 4442 [runnable, locked to thread]:
github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf._C2func_poll(0xc42385aea0, 0x4, 0x1f4, 0x4, 0x0, 0x0)
	github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf/_obj/_cgo_gotypes.go:313 +0x68
github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf.perfEventPoll(0xc4218f0e40, 0x4, 0x4, 0xc42385ae00, 0xc423857510)
	/go/src/github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf/perf.go:258 +0xf6
github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf.(*PerfMap).PollStart.func1(0xc42026c910, 0xc4245df1c0, 0xc4245db180)
	/go/src/github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf/perf.go:171 +0x695
created by github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf.(*PerfMap).PollStart
	/go/src/github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf/perf.go:238 +0xf1

The integration test also needs fixing: https://circleci.com/gh/kinvolk/scope/990
integration/313_container_to_container_edge_with_ebpf_proc_fallback_test.sh

@alban force-pushed the alban/bpf-restart branch 2 times, most recently from 64b8ca0 to 58b3361 on July 25, 2017 at 13:00
@alban (Contributor Author) commented Jul 25, 2017

I fixed the regression in integration test 313 and added a new integration test 315 that restarts EbpfTracker.

@alban changed the title from "[WIP] restart eBPF tracking on error" to "restart eBPF tracking on error" on Jul 25, 2017
@rade requested a review from @2opremio on July 25, 2017 at 18:57

    closedDuringInit: map[fourTuple]struct{}{},
    var debugBPF bool
    if os.Getenv("SCOPE_DEBUG_BPF") != "" {
        log.Warnf("ebpf tracker started in debug mode")


    @@ -220,6 +221,8 @@ func TestInvalidTimeStampDead(t *testing.T) {
        if cnt != 2 {
            t.Errorf("walkConnections found %v instead of 2 connections", cnt)
        }
        // EbpfTracker is marked as dead asynchronously.
        time.Sleep(100 * time.Millisecond)
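
Side note (not from the PR): a fixed sleep can be flaky on slow CI machines. One common alternative is to poll with a deadline, roughly like the helper below; the package name and helper are illustrative, and isDead() is the accessor shown in the diff above.

    // Illustrative helper: poll a condition with a deadline instead of sleeping
    // for a fixed 100ms. A test could call waitUntil(t, time.Second, tracker.isDead).
    package endpoint

    import (
        "testing"
        "time"
    )

    func waitUntil(t *testing.T, timeout time.Duration, cond func() bool) {
        t.Helper()
        deadline := time.Now().Add(timeout)
        for !cond() {
            if time.Now().After(deadline) {
                t.Fatal("condition not met before timeout")
            }
            time.Sleep(10 * time.Millisecond)
        }
    }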


    }
    t.ebpfFailureTime = time.Now()

    if t.ebpfFailureCount == 0 {


    @@ -29,6 +30,9 @@ type connectionTracker struct {
        flowWalker      flowWalker // Interface
        ebpfTracker     *EbpfTracker
        reverseResolver *reverseResolver

        ebpfFailureCount int
        ebpfFailureTime  time.Time


    t.tracer.Stop()

    // Do not call tracer.Stop() in this thread, otherwise tracer.Stop() will
    // deadlock waiting for this thread to pick up the next event.
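
To illustrate the deadlock this comment guards against, here is a self-contained sketch with made-up types (not the tcptracer-bpf API): Stop() waits for the event-handling goroutine to finish, so the handler must not call Stop() synchronously.

    package main

    import (
        "fmt"
        "time"
    )

    type tracer struct {
        events chan string
        done   chan struct{}
    }

    // newTracer starts a goroutine that hands every event to handle, loosely
    // mirroring how a tracer delivers TCP events to a callback.
    func newTracer(handle func(t *tracer, ev string)) *tracer {
        tr := &tracer{events: make(chan string), done: make(chan struct{})}
        go func() {
            for ev := range tr.events {
                handle(tr, ev)
            }
            close(tr.done)
        }()
        return tr
    }

    // Stop closes the event channel and waits for the handler goroutine to exit.
    // Calling Stop() from inside handle() would deadlock: Stop() waits on done,
    // and done is only closed after handle() returns.
    func (tr *tracer) Stop() {
        close(tr.events)
        <-tr.done
    }

    func main() {
        tr := newTracer(func(t *tracer, ev string) {
            fmt.Println("bad event:", ev)
            go t.Stop() // stop from another goroutine, as the comment above suggests
        })
        tr.events <- "out-of-order tcp event"
        time.Sleep(100 * time.Millisecond)
        fmt.Println("stopped without deadlock")
    }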


    t.ebpfFailureCount++
    err := t.ebpfTracker.restart()
    if err == nil {
        go t.getInitialState()
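
Putting the pieces together, the failure handling amounts to roughly the sketch below. The five-minute window comes from the commit message quoted later in this thread; everything except ebpfFailureCount, ebpfFailureTime, restart() and getInitialState() is invented for illustration and differs from the real Scope code.

    // Standalone sketch of the restart policy: restart eBPF tracking on failure,
    // but give up (fall back to /proc parsing) if the previous failure happened
    // less than five minutes ago.
    package main

    import (
        "fmt"
        "time"
    )

    const restartBackoff = 5 * time.Minute

    type fakeEbpfTracker struct{}

    func (ft *fakeEbpfTracker) restart() error { return nil }

    type connectionTracker struct {
        ebpfTracker      *fakeEbpfTracker
        ebpfFailureCount int
        ebpfFailureTime  time.Time
    }

    func (t *connectionTracker) getInitialState() {}

    // onEbpfFailure decides between restarting the tracker and falling back.
    func (t *connectionTracker) onEbpfFailure() {
        if t.ebpfFailureCount > 0 && time.Since(t.ebpfFailureTime) < restartBackoff {
            fmt.Println("previous failure too recent, falling back to /proc parsing")
            return
        }
        t.ebpfFailureTime = time.Now()
        t.ebpfFailureCount++
        if err := t.ebpfTracker.restart(); err == nil {
            go t.getInitialState()
            fmt.Println("EbpfTracker restarted")
        }
    }

    func main() {
        t := &connectionTracker{ebpfTracker: &fakeEbpfTracker{}}
        t.onEbpfFailure() // first failure: restart
        t.onEbpfFailure() // second failure within five minutes: fall back
        time.Sleep(10 * time.Millisecond)
    }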


@alban force-pushed the alban/bpf-restart branch from 8050079 to 5575556 on August 8, 2017 at 12:30
@alban (Contributor Author) commented Aug 8, 2017

Branch updated.

@alban (Contributor Author) commented Aug 8, 2017

Tests:

  • CircleCI passes
  • tested manually on my laptop (simulating EbpfTracker failure)
  • tested on the GCE instance with the test script to reproduce the bug

It seemed to work fine.

But then I had a look at the goroutine stacks with pprof after Scope fell back to proc parsing:

goroutine 26 [chan send, 4 minutes]:
github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf.(*PerfMap).PollStart.func1(0xc4202c0370, 0xc422dbbf00, 0xc423f508a0)
	/go/src/github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf/perf.go:229 +0x4f7
created by github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf.(*PerfMap).PollStart
	/go/src/github.com/weaveworks/scope/vendor/github.com/weaveworks/tcptracer-bpf/vendor/github.com/iovisor/gobpf/elf/perf.go:238 +0xf1

@alban (Contributor Author) commented Aug 8, 2017

Fix in gobpf, tested on GCE: https://gist.github.com/alban/cb4b899e9558e231d080ec7e7b3abbc5
Explanation: https://github.com/iovisor/gobpf/blob/d0a3e1b/elf/perf.go#L241-L242

It seems to resolve the leaking goroutine. I'll open the PR tomorrow: weaveworks/tcptracer-bpf#49
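
For context on the stack above: it is the classic leak of a producer goroutine blocked on a channel send after its consumer has gone away. The sketch below is generic Go, not the gobpf code, and only illustrates the pattern the linked fix addresses.

    package main

    import (
        "fmt"
        "runtime"
        "time"
    )

    // startProducer sends events until stop is closed. Without the stop case the
    // goroutine would block on `out <- i` forever once the reader goes away,
    // which is the "chan send" state seen in the goroutine dump above.
    func startProducer(out chan<- int, stop <-chan struct{}) {
        go func() {
            for i := 0; ; i++ {
                select {
                case out <- i:
                case <-stop:
                    return
                }
            }
        }()
    }

    func main() {
        events := make(chan int)
        stop := make(chan struct{})
        startProducer(events, stop)

        fmt.Println("first event:", <-events) // consume one event, then stop reading
        close(stop)                           // let the producer exit instead of leaking it

        time.Sleep(50 * time.Millisecond)
        fmt.Println("goroutines left:", runtime.NumGoroutine())
    }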

alban added a commit to kinvolk-archives/tcptracer-bpf that referenced this pull request Aug 8, 2017
PerfMap's PollStop() documentation says the channel should be closed
after calling PollStop():
https://github.com/iovisor/gobpf/blob/d0a3e1b/elf/perf.go#L241-L242

This solves a leaking goroutine issue in Scope:
weaveworks/scope#2735 (comment)
alban added a commit to kinvolk-archives/tcptracer-bpf that referenced this pull request Aug 9, 2017
PerfMap's PollStop() documentation says the channel should be closed
after calling PollStop():
https://github.com/iovisor/gobpf/blob/d0a3e1b/elf/perf.go#L241-L242

This solves a leaking goroutine issue in Scope:
weaveworks/scope#2735 (comment)
@alban changed the title from "restart eBPF tracking on error" to "[WIP] restart eBPF tracking on error" on Aug 9, 2017
@alban (Contributor Author) commented Aug 9, 2017

I updated the vendoring with weaveworks/tcptracer-bpf#49 and marked this PR as WIP because the dependency has not been merged yet.

alban added 2 commits August 17, 2017 16:39
EbpfTracker can die when TCP events are received out of order. This
can happen with a buggy kernel or apparently in other cases, see:
weaveworks#2650

As a workaround, restart EbpfTracker when an event is received out of
order. This does not seem to happen often, but as a precaution,
EbpfTracker will not restart if the previous failure happened less than
5 minutes ago.

This is not easy to test, but I added instrumentation to trigger a
restart:

- Start Scope with:
    $ sudo WEAVESCOPE_DOCKER_ARGS="-e SCOPE_DEBUG_BPF=1" ./scope launch

- Request a stop with:
    $ echo stop | sudo tee /proc/$(pidof scope-probe)/root/var/run/scope/debug-bpf
@alban force-pushed the alban/bpf-restart branch from 82d597c to 760fd2c on August 17, 2017 at 15:19
@alban (Contributor Author) commented Aug 17, 2017

I updated the PR (pending on weaveworks/tcptracer-bpf#50) and it works for me now. I no longer see leaking goroutines or panics. I tested several times, both with and without waiting 5 minutes between the EbpfTracker crashes.

This includes:

- iovisor/gobpf#70
  perf: close go channels idiomatically

- iovisor/gobpf#70
  close channels on the sender side & fix closing race

- weaveworks/tcptracer-bpf#50
  vendor: update gobpf
@alban force-pushed the alban/bpf-restart branch from 760fd2c to 93ca8b8 on August 17, 2017 at 15:57
@alban changed the title from "[WIP] restart eBPF tracking on error" to "restart eBPF tracking on error" on Aug 17, 2017
@alban (Contributor Author) commented Aug 17, 2017

@2opremio new review welcome :)

@rade merged commit 8fe3538 into weaveworks:master on Aug 18, 2017