Description
This issue reopens #18023.
There it was observed that locking a server goroutine to an OS thread imposes a big performance penalty compared to the same server code without the handler locked to an OS thread. The relevant golang-nuts thread discusses this and notes that, in the case where runtime.LockOSThread was used, the number of context switches is 10x (ten times, not 1000x) higher than without OS thread locking. #18023 (comment) notices that the context switch can happen because e.g. futex_wake() in the kernel can move the woken process to a different CPU.
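As a side note, the context-switch counts themselves are easy to observe on Linux, since they are exported in /proc/<pid>/status. Below is a minimal Go sketch; readCtxtSwitches is a hypothetical helper of mine, not part of the benchmark, and note that /proc/self/status reports the main thread only, while per-thread counters live under /proc/self/task/<tid>/status:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// readCtxtSwitches returns the voluntary and nonvoluntary context-switch
// counters Linux reports for this process in /proc/self/status.
func readCtxtSwitches() (voluntary, nonvoluntary string) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "?", "?"
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		line := s.Text()
		if v := strings.TrimPrefix(line, "voluntary_ctxt_switches:"); v != line {
			voluntary = strings.TrimSpace(v)
		}
		if v := strings.TrimPrefix(line, "nonvoluntary_ctxt_switches:"); v != line {
			nonvoluntary = strings.TrimSpace(v)
		}
	}
	return voluntary, nonvoluntary
}

func main() {
	v, nv := readCtxtSwitches()
	fmt.Printf("voluntary: %s  nonvoluntary: %s\n", v, nv)
}

Sampling these counters before and after a benchmark run makes the 10x difference directly visible.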
Moreover, it was found that lockOSThread is used internally by the Go runtime at essentially every CGo call:
https://github.com/golang/go/blob/ab401077/src/runtime/cgocall.go#L107
so even if user code does not use LockOSThread but makes CGo calls on the server side, there are grounds to expect a similar kind of slowdown.
With the above in mind, #18023 (comment) shows a dirty patch that spins a bit in notesleep() before going to the kernel with futex_wait(). This way it is shown that 1) a large fraction of the performance penalty related to LockOSThread can go away, and 2) the case of CGo calls on the server can also receive a visible speedup:
name old time/op new time/op delta
Unlocked-4 485ns ± 0% 483ns ± 1% ~ (p=0.188 n=9+10)
Locked-4 5.22µs ± 1% 1.32µs ± 5% -74.64% (p=0.000 n=9+10)
CGo-4 581ns ± 1% 556ns ± 0% -4.27% (p=0.000 n=10+10)
CGo10-4 2.20µs ± 6% 1.23µs ± 0% -44.32% (p=0.000 n=10+9)
The patch is surely not completely right (and probably far from being right), as always spinning unconditionally will sometimes bring harm instead of good. But it shows that with proper scheduler tuning it is possible to avoid context switches and perform better.
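To make the idea concrete outside the runtime, here is a minimal user-level sketch of the same spin-then-block pattern. This is not the runtime patch itself: spinWait and maxSpin are hypothetical names, and runtime.Gosched stands in for the futexsleep fallback.

package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// spinWait polls *flag for up to maxSpin iterations, hoping the wakeup
// arrives soon; only then does it fall back to giving up the CPU
// (the runtime patch falls back to futexsleep instead).
func spinWait(flag *uint32, maxSpin int) {
	for spin := 0; atomic.LoadUint32(flag) == 0; spin++ {
		if spin < maxSpin {
			continue // spin a bit more
		}
		runtime.Gosched() // no luck -> yield
	}
}

func main() {
	var flag uint32
	go func() {
		time.Sleep(time.Millisecond)
		atomic.StoreUint32(&flag, 1) // the "wakeup"
	}()
	spinWait(&flag, 10000)
	fmt.Println("woken")
}

Just like the patch, this only pays off when the wakeup usually arrives within the spin budget; otherwise the spinning burns a CPU for nothing, which is why the real fix needs conditions on when to spin.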
I attach my original post here for completeness.
Thanks,
Kirill
/cc @rsc, @ianlancetaylor, @dvyukov, @aclements, @bcmills
Let me chime in a bit. On Linux the context switch can happen, if my reading of futex_wake() is correct (which it may well not be), because e.g. wake_up_q(), via wake_up_process() -> try_to_wake_up() -> select_task_rq(), can select another CPU

	cpu = cpumask_any(&p->cpus_allowed);

for the woken process.
The Go runtime calls futex_wake() in notewakeup() to wake up an M that was previously stopped via stopm() -> notesleep() (the latter calls futexsleep()).
When LockOSThread is used, an M is dedicated to a G, so when that G blocks, e.g. on a chan send, that M, if I understand correctly, has a high chance of stopping. And if it stops, it goes to futexsleep, and then a context switch happens when someone wakes it up because e.g. something was sent to the G via the channel.
With this in mind, the following patch:
diff --git a/src/runtime/lock_futex.go b/src/runtime/lock_futex.go
index 9d55bd129c..418fe1b845 100644
--- a/src/runtime/lock_futex.go
+++ b/src/runtime/lock_futex.go
@@ -146,7 +157,13 @@ func notesleep(n *note) {
 		// Sleep for an arbitrary-but-moderate interval to poll libc interceptors.
 		ns = 10e6
 	}
-	for atomic.Load(key32(&n.key)) == 0 {
+	for spin := 0; atomic.Load(key32(&n.key)) == 0; spin++ {
+		// spin a bit hoping we'll get the wakeup soon
+		if spin < 10000 {
+			continue
+		}
+
+		// no luck -> go to the kernel for a heavy sleep
 		gp.m.blocked = true
 		futexsleep(key32(&n.key), 0, ns)
 		if *cgo_yield != nil {
makes BenchmarkLocked much faster on my computer:
name old time/op new time/op delta
Unlocked-4 485ns ± 0% 483ns ± 1% ~ (p=0.188 n=9+10)
Locked-4 5.22µs ± 1% 1.32µs ± 5% -74.64% (p=0.000 n=9+10)
I also looked around and found that lockOSThread is used at essentially every CGo call:
https://github.com/golang/go/blob/ab401077/src/runtime/cgocall.go#L107
With this in mind I modified the benchmark a bit so that LockOSThread is not used explicitly, but the server performs 1 and 10 simple C calls per request:
CGo-4 581ns ± 1% 556ns ± 0% -4.27% (p=0.000 n=10+10)
CGo10-4 2.20µs ± 6% 1.23µs ± 0% -44.32% (p=0.000 n=10+9)
which shows the change brings a quite visible speedup.
I'm not saying my patch is right, but at least it shows that much can be improved. So I suggest reopening the issue.
Thanks beforehand,
Kirill
/cc @dvyukov, @aclements, @bcmills
full benchmark source:
tmp_test.go:
package tmp
import (
"runtime"
"testing"
)
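// in is a request to the server; the reply is sent back on c.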
type in struct {
c chan *out
arg int
}
type out struct {
ret int
}
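// client sends one request to the server over c and waits for the reply.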
func client(c chan *in, arg int) int {
rc := make(chan *out)
c <- &in{
c: rc,
arg: arg,
}
ret := <-rc
return ret.ret
}
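// _server serves requests from c, computing each reply with argadjust.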
func _server(c chan *in, argadjust func(int) int) {
for r := range c {
r.c <- &out{ret: argadjust(r.arg)}
}
}
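// server with pure Go handler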
func server(c chan *in) {
_server(c, func(arg int) int {
return 3 + arg
})
}
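// server with its goroutine locked to an OS thread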
func lockedServer(c chan *in) {
runtime.LockOSThread()
server(c)
runtime.UnlockOSThread()
}
// server with 1 C call per request
func cserver(c chan *in) {
_server(c, cargadjust)
}
// server with 10 C calls per request
func cserver10(c chan *in) {
_server(c, func(arg int) int {
for i := 0; i < 10; i++ {
arg = cargadjust(arg)
}
return arg
})
}
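// benchmark measures one client -> server -> client round-trip per iteration.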
func benchmark(b *testing.B, srv func(chan *in)) {
inc := make(chan *in)
go srv(inc)
for i := 0; i < b.N; i++ {
client(inc, i)
}
close(inc)
}
func BenchmarkUnlocked(b *testing.B) { benchmark(b, server) }
func BenchmarkLocked(b *testing.B) { benchmark(b, lockedServer) }
func BenchmarkCGo(b *testing.B) { benchmark(b, cserver) }
func BenchmarkCGo10(b *testing.B) { benchmark(b, cserver10) }
tmp.go:
package tmp
// int argadjust(int arg) { return 3 + arg; }
import "C"
// XXX this lives here because cgo ("C") cannot be used in test files directly
func cargadjust(arg int) int {
return int(C.argadjust(C.int(arg)))
}
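For reference, the tables above are benchstat output (benchstat comes from golang.org/x/perf/cmd/benchstat). Assuming both files sit in one package and a toolchain with the patch applied is available, the comparison can be reproduced roughly like this (old.txt/new.txt are arbitrary names):

go test -bench . -count 10 > old.txt
(rebuild Go with the patch applied)
go test -bench . -count 10 > new.txt
benchstat old.txt new.txt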