
Large number of context switches #2594

Closed
whyrusleeping opened this issue Apr 20, 2016 · 5 comments
Labels
topic/libp2p Topic libp2p

Comments

@whyrusleeping
Member

IRC user A124 reported significant lag caused by an excess of context switches in the ipfs daemon.

When IPFS is running, there are CPU context-switch spikes, from an average of 3k for the whole system up to 16k; they are fairly regular in pattern (once every few seconds).
08:13   <A124>  There is also problem in the CPU scheduler, which is (finally after years) put to light and finally patch should go in. That does consume a lot of cpu when context switches happen to different cores and stuff.
08:14   <A124>  So combination of those spikes + bad scheduler (the default CFS on most systems) results in spikes and jitter in realtime applications plus wastes a lot of CPU.
08:23   <A124>  Yeah, I should find the reference. The CPU scheduler is presented at this year's conference. But older kernels plus bad application would still suffer.
08:24   <A124>  Their measured performance hit was 10-20% usually, 20% for databases and up to 138x for scientific corner cases.
08:24   <A124>  www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
08:24   < *>    A124 still had it open since last week.
08:29   < *>    A124 notes, that he wanted to do multithreaded client for his application, but after getting more background, decided using polling, as a lot of connections and threads, hence context switching, makes the whole application both slower and more demanding on system and CPU. ... Depending on type of application and scenario, other people also use Threads + Fibers combination. (Few threads, each with few fibers.) Fibers are not pre-emptive, so less scheduling overhead.
08:30   < *>    A124 notes According to https://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf page 7, Software execution and Context switching are fields for opportunity. (The CPU side being solved by the mentioned patches.)

He also built some graphs of the issue: https://imgur.com/uN0yF9d&wMp6t3y

This issue is to track this problem.

@whyrusleeping
Member Author

whyrusleeping commented Apr 20, 2016

whyrusleeping's general perf debugging guidelines for go-ipfs:

When you see ipfs doing something (using lots of cpu, memory, or otherwise
being weird), the first thing you want to do is gather all the relevant
profiling information.

  • goroutine dump
    • curl localhost:5001/debug/pprof/goroutine\?debug=2 > ipfs.stacks
  • 30 second cpu profile
    • curl localhost:5001/debug/pprof/profile > ipfs.cpuprof
  • heap trace dump
    • curl localhost:5001/debug/pprof/heap > ipfs.heap
  • system information
    • ipfs diag sys > ipfs.sysinfo

Bundle all that up and include a copy of the ipfs binary that you are running
(having the exact same binary is important, it contains debug info).

You can investigate yourself if you feel intrepid:

Analysing the stack dump.

The first thing I look for is hung goroutines: any goroutine that's been stuck
for over a minute will note that in the trace. It looks something like:

goroutine 2306090 [semacquire, 458 minutes]:
sync.runtime_Semacquire(0xc8222fd3e4)
    /home/whyrusleeping/go/src/runtime/sema.go:47 +0x26
sync.(*Mutex).Lock(0xc8222fd3e0)
    /home/whyrusleeping/go/src/sync/mutex.go:83 +0x1c4
gx/ipfs/QmedFDs1WHcv3bcknfo64dw4mT1112yptW1H65Y2Wc7KTV/yamux.(*Session).Close(0xc8222fd340, 0x0, 0x0)
    /home/whyrusleeping/gopkg/src/gx/ipfs/QmedFDs1WHcv3bcknfo64dw4mT1112yptW1H65Y2Wc7KTV/yamux/session.go:205 +0x55
gx/ipfs/QmWSJzRkCMJFHYUQZxKwPX8WA7XipaPtfiwMPARP51ymfn/go-stream-muxer/yamux.(*conn).Close(0xc8222fd340, 0x0, 0x0)
    /home/whyrusleeping/gopkg/src/gx/ipfs/QmWSJzRkCMJFHYUQZxKwPX8WA7XipaPtfiwMPARP51ymfn/go-stream-muxer/yamux/yamux.go:39 +0x2d
gx/ipfs/QmZK81vcgMhpb2t7GNbozk7qzt6Rj4zFqitpvsWT9mduW8/go-peerstream.(*Conn).Close(0xc8257a2000, 0x0, 0x0)
    /home/whyrusleeping/gopkg/src/gx/ipfs/QmZK81vcgMhpb2t7GNbozk7qzt6Rj4zFqitpvsWT9mduW8/go-peerstream/conn.go:156 +0x1f2
created by gx/ipfs/QmZK81vcgMhpb2t7GNbozk7qzt6Rj4zFqitpvsWT9mduW8/go-peerstream.(*Conn).GoClose
    /home/whyrusleeping/gopkg/src/gx/ipfs/QmZK81vcgMhpb2t7GNbozk7qzt6Rj4zFqitpvsWT9mduW8/go-peerstream/conn.go:131 +0xab

At the top, you can see that this goroutine (number 2306090) has been waiting
to acquire a semaphore for 458 minutes. That seems bad. Looking at the rest of
the trace, we see the exact line it's waiting on is line 47 of runtime/sema.go.
That's not particularly helpful, so we move on. Next, we see that the call was
made from line 205 of yamux/session.go, in the Close method of yamux.Session.
This one appears to be the issue.

Given that information, I would look for another goroutine that might be
holding the semaphore in question in the rest of the stack dump. I won't go
into that here.

There are a few different reasons that goroutines can be hung: semacquire
means we're waiting to take a lock or semaphore; select means the goroutine is
hanging in a select statement and none of the cases are yielding anything;
chan receive and chan send are waiting for a channel to be received from or
sent on, respectively; IO wait generally means that we are waiting on a socket
to read or write data, although it can also mean we are waiting on a very slow
filesystem. If you see any of those tags without a ", X minutes" suffix, that
generally means there isn't a problem; you just caught that goroutine in the
middle of a short wait for something. If the wait time is over a few minutes,
that means either that goroutine doesn't do much, or something is pretty wrong.

Analysing the CPU Profile

The Go team wrote an excellent article on profiling Go programs. If you've
already gathered the above information, you can skip down to where they start
talking about go tool pprof. My go-to method of analysing these profiles is to
run the 'web' command, which generates an svg dotgraph and opens it in your
browser. This is the quickest way to easily point out where the hot spots in
the code are.

Other

If you have any questions, or want me to analyse some weird go-ipfs behaviour,
just let me know, and be sure to include all the profiling information
mentioned at the top.

@RichardLitt
Member

@whyrusleeping This would be great to have in a doc! That way you can just point to it. Want me to do that?

@whyrusleeping
Member Author

I remember now what happened the last time we ran into this, last time it was reuseport. To test this, set IPFS_REUSEPORT to false and run the daemon. If that solves the problem, then we really should pay attention to the reuseport impl. There were some updates here: golang/go#9661

@whyrusleeping
Member Author

@RichardLitt if you want to make that into a doc then by all means please do :)

RichardLitt added a commit that referenced this issue Apr 29, 2016
Stolen from this comment #2594 (comment)

License: MIT
Signed-off-by: Richard Littauer <richard.littauer@gmail.com>
@whyrusleeping whyrusleeping added the topic/libp2p Topic libp2p label Aug 19, 2016
@Stebalien
Member

Closing as this issue isn't directly actionable. It's a symptom with many complicated causes.
