Description
This issue reopens #18023.
There it was observed that locking a server goroutine to an OS thread imposes a big performance penalty compared to the same server code without the handler locked to an OS thread. The relevant golang-nuts thread discusses this and notes that, in the case where runtime.LockOSThread was used, the number of context switches is 10x (ten times, not 1000x) higher than without OS thread locking. #18023 (comment) notices that the context switch can happen because e.g. futex_wake() in the kernel can move the woken process to a different CPU.
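As a side note, the context-switch counts themselves are easy to observe on Linux, since they are exported in /proc/<pid>/status. Below is a minimal Go sketch; readCtxtSwitches is a hypothetical helper of mine, not part of the benchmark, and note that /proc/self/status reports the main thread only, while per-thread counters live under /proc/self/task/<tid>/status:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// readCtxtSwitches returns the voluntary and nonvoluntary context-switch
// counters Linux reports for this process in /proc/self/status.
func readCtxtSwitches() (voluntary, nonvoluntary string) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "?", "?"
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		line := s.Text()
		if v := strings.TrimPrefix(line, "voluntary_ctxt_switches:"); v != line {
			voluntary = strings.TrimSpace(v)
		}
		if v := strings.TrimPrefix(line, "nonvoluntary_ctxt_switches:"); v != line {
			nonvoluntary = strings.TrimSpace(v)
		}
	}
	return voluntary, nonvoluntary
}

func main() {
	v, nv := readCtxtSwitches()
	fmt.Printf("voluntary: %s  nonvoluntary: %s\n", v, nv)
}

Sampling these counters before and after a benchmark run makes the 10x difference directly visible.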
Moreover, it was found that lockOSThread is used internally by the Go runtime at essentially every CGo call:
https://github.com/golang/go/blob/ab401077/src/runtime/cgocall.go#L107
so even if user code does not use LockOSThread but makes CGo calls on the server side, there are grounds to expect a similar kind of slowdown.
With the above in mind, #18023 (comment) shows a dirty patch that spins a bit in notesleep() before going to the kernel with futex_wait(). This way it is shown that 1) a large fraction of the performance penalty related to LockOSThread can go away, and 2) the case of CGo calls on the server can also receive a visible speedup:
name old time/op new time/op delta
Unlocked-4 485ns ± 0% 483ns ± 1% ~ (p=0.188 n=9+10)
Locked-4 5.22µs ± 1% 1.32µs ± 5% -74.64% (p=0.000 n=9+10)
CGo-4 581ns ± 1% 556ns ± 0% -4.27% (p=0.000 n=10+10)
CGo10-4 2.20µs ± 6% 1.23µs ± 0% -44.32% (p=0.000 n=10+9)
The patch is surely not completely right (and probably far from being right), as always spinning unconditionally will sometimes bring harm instead of good. But it shows that with proper scheduler tuning it is possible to avoid context switches and perform better.
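To make the idea concrete outside the runtime, here is a minimal user-level sketch of the same spin-then-block pattern. This is not the runtime patch itself: spinWait and maxSpin are hypothetical names, and runtime.Gosched stands in for the futexsleep fallback.

package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// spinWait polls *flag for up to maxSpin iterations, hoping the wakeup
// arrives soon; only then does it fall back to giving up the CPU
// (the runtime patch falls back to futexsleep instead).
func spinWait(flag *uint32, maxSpin int) {
	for spin := 0; atomic.LoadUint32(flag) == 0; spin++ {
		if spin < maxSpin {
			continue // spin a bit more
		}
		runtime.Gosched() // no luck -> yield
	}
}

func main() {
	var flag uint32
	go func() {
		time.Sleep(time.Millisecond)
		atomic.StoreUint32(&flag, 1) // the "wakeup"
	}()
	spinWait(&flag, 10000)
	fmt.Println("woken")
}

Just like the patch, this only pays off when the wakeup usually arrives within the spin budget; otherwise the spinning burns a CPU for nothing, which is why the real fix needs conditions on when to spin.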
I attach my original post here for completeness.
Thanks,
Kirill
/cc @rsc, @ianlancetaylor, @dvyukov, @aclements, @bcmills
Let me chime in a bit. On Linux the context switch can happen, if my reading of futex_wake() is correct (which it may well not be), because e.g. wake_up_q(), via wake_up_process() -> try_to_wake_up() -> select_task_rq(), can select another CPU

	cpu = cpumask_any(&p->cpus_allowed);

for the woken process.
The Go runtime calls futex_wake() in notewakeup() to wake up an M that was previously stopped via stopm() -> notesleep() (the latter calls futexsleep()).
When LockOSThread is used, an M is dedicated to a G, so when that G blocks, e.g. on a chan send, that M, if I understand correctly, has a high chance of stopping. And if it stops, it goes to futexsleep, and then a context switch happens when someone wakes it up because e.g. something was sent to the G via the channel.
With this in mind, the following patch:
diff --git a/src/runtime/lock_futex.go b/src/runtime/lock_futex.go
index 9d55bd129c..418fe1b845 100644
--- a/src/runtime/lock_futex.go
+++ b/src/runtime/lock_futex.go
@@ -146,7 +157,13 @@ func notesleep(n *note) {
 		// Sleep for an arbitrary-but-moderate interval to poll libc interceptors.
 		ns = 10e6
 	}
-	for atomic.Load(key32(&n.key)) == 0 {
+	for spin := 0; atomic.Load(key32(&n.key)) == 0; spin++ {
+		// spin a bit hoping we'll get the wakeup soon
+		if spin < 10000 {
+			continue
+		}
+
+		// no luck -> go to the kernel for a heavy sleep
 		gp.m.blocked = true
 		futexsleep(key32(&n.key), 0, ns)
 		if *cgo_yield != nil {
makes BenchmarkLocked much faster on my computer:
name old time/op new time/op delta
Unlocked-4 485ns ± 0% 483ns ± 1% ~ (p=0.188 n=9+10)
Locked-4 5.22µs ± 1% 1.32µs ± 5% -74.64% (p=0.000 n=9+10)
I also looked around and found that lockOSThread is used at essentially every CGo call:
https://github.com/golang/go/blob/ab401077/src/runtime/cgocall.go#L107
With this in mind I modified the benchmark a bit so that LockOSThread is not used explicitly, but the server performs 1 and 10 simple C calls per request:
CGo-4 581ns ± 1% 556ns ± 0% -4.27% (p=0.000 n=10+10)
CGo10-4 2.20µs ± 6% 1.23µs ± 0% -44.32% (p=0.000 n=10+9)
which shows the change brings a quite visible speedup.
I'm not saying my patch is right, but at least it shows that much can be improved. So I suggest reopening the issue.
Thanks beforehand,
Kirill
/cc @dvyukov, @aclements, @bcmills
full benchmark source:
tmp_test.go:
package tmp
import (
"runtime"
"testing"
)
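// in is a request to the server; the reply is sent back on c.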
type in struct {
c chan *out
arg int
}
type out struct {
ret int
}
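// client sends one request to the server over c and waits for the reply.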
func client(c chan *in, arg int) int {
rc := make(chan *out)
c <- &in{
c: rc,
arg: arg,
}
ret := <-rc
return ret.ret
}
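// _server serves requests from c, computing each reply with argadjust.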
func _server(c chan *in, argadjust func(int) int) {
for r := range c {
r.c <- &out{ret: argadjust(r.arg)}
}
}
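// server with pure Go handler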
func server(c chan *in) {
_server(c, func(arg int) int {
return 3 + arg
})
}
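// server with its goroutine locked to an OS thread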
func lockedServer(c chan *in) {
runtime.LockOSThread()
server(c)
runtime.UnlockOSThread()
}
// server with 1 C call per request
func cserver(c chan *in) {
_server(c, cargadjust)
}
// server with 10 C calls per request
func cserver10(c chan *in) {
_server(c, func(arg int) int {
for i := 0; i < 10; i++ {
arg = cargadjust(arg)
}
return arg
})
}
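// benchmark measures one client -> server -> client round-trip per iteration.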
func benchmark(b *testing.B, srv func(chan *in)) {
inc := make(chan *in)
go srv(inc)
for i := 0; i < b.N; i++ {
client(inc, i)
}
close(inc)
}
func BenchmarkUnlocked(b *testing.B) { benchmark(b, server) }
func BenchmarkLocked(b *testing.B) { benchmark(b, lockedServer) }
func BenchmarkCGo(b *testing.B) { benchmark(b, cserver) }
func BenchmarkCGo10(b *testing.B) { benchmark(b, cserver10) }
tmp.go:
package tmp
// int argadjust(int arg) { return 3 + arg; }
import "C"
// XXX this lives here because cgo ("C") cannot be used in test files directly
func cargadjust(arg int) int {
return int(C.argadjust(C.int(arg)))
}
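For reference, the tables above are benchstat output (benchstat comes from golang.org/x/perf/cmd/benchstat). Assuming both files sit in one package and a toolchain with the patch applied is available, the comparison can be reproduced roughly like this (old.txt/new.txt are arbitrary names):

go test -bench . -count 10 > old.txt
(rebuild Go with the patch applied)
go test -bench . -count 10 > new.txt
benchstat old.txt new.txt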