
runtime: big performance penalty with runtime.LockOSThread #21827

Open
@navytux

Description

This issue reopens #18023.

There it was observed that if a server goroutine is locked to an OS thread, this locking imposes a big performance penalty compared to the same server code without the handler being locked to an OS thread. The relevant golang-nuts thread discusses this and notes that, for the case when runtime.LockOSThread was used, the number of context switches is 10x (ten times, not 1000x) higher compared to the case without OS thread locking. #18023 (comment) notes that the context switch can happen because e.g. futex_wake() in the kernel can move the woken process to a different CPU.

Moreover, it was found that essentially every CGo call uses lockOSThread internally in the Go runtime:

https://github.com/golang/go/blob/ab401077/src/runtime/cgocall.go#L107

so even if user code does not use LockOSThread itself but makes CGo calls on the server side, there is reason to expect a similar kind of slowdown.

With the above in mind, #18023 (comment) shows a dirty patch that spins a bit in notesleep() before going to the kernel via futex_wait(). It shows that 1) a large fraction of the performance penalty related to LockOSThread can go away, and 2) the case of CGo calls on the server side can also receive a visible speedup:

name        old time/op  new time/op  delta
Unlocked-4   485ns ± 0%   483ns ± 1%     ~     (p=0.188 n=9+10)
Locked-4    5.22µs ± 1%  1.32µs ± 5%  -74.64%  (p=0.000 n=9+10)
CGo-4        581ns ± 1%   556ns ± 0%   -4.27%  (p=0.000 n=10+10)
CGo10-4     2.20µs ± 6%  1.23µs ± 0%  -44.32%  (p=0.000 n=10+9)

The patch is certainly not completely right (and probably far from being right), as spinning unconditionally will sometimes do harm instead of good. But it shows that with proper scheduler tuning it is possible to avoid the context switches and perform better.
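
To make the spin-then-block idea concrete outside of runtime internals, here is a rough user-level analogue of what the patched notesleep()/notewakeup() pair does. This is only an illustration added for clarity, not runtime API: the names below are made up, and the channel stands in for the futex word the runtime actually sleeps on.

package spinwait

import "sync/atomic"

// note mimics the runtime's note: a one-shot binary semaphore.
type note struct {
        key  atomic.Uint32 // set to 1 by wake, polled by sleep
        done chan struct{} // stands in for the futex we would sleep on
}

func newNote() *note { return &note{done: make(chan struct{})} }

// wake is the analogue of notewakeup: publish the flag first, then do the
// "expensive" wakeup (closing the channel stands in for futexwakeup).
func (n *note) wake() {
        n.key.Store(1)
        close(n.done)
}

// sleep is the analogue of the patched notesleep: spin for a bounded number
// of iterations hoping the wakeup arrives soon, and only fall back to a
// blocking wait (the futexsleep analogue) if it does not.
func (n *note) sleep() {
        for spin := 0; spin < 10000; spin++ {
                if n.key.Load() != 0 {
                        return // wakeup arrived while spinning; no blocking needed
                }
        }
        <-n.done
}

The trade-off is the same as in the patch: if the wakeup usually arrives within the spin window, the waiter never blocks and the kernel never gets a chance to migrate it to another CPU; if it does not, the spin is wasted CPU time, which is why spinning unconditionally is not a complete solution.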

I attach my original post here for completeness.

Thanks,
Kirill

/cc @rsc, @ianlancetaylor, @dvyukov, @aclements, @bcmills


#18023 (comment):

Let me chime in a bit. On Linux the context switch can happen, if my reading of futex_wake() is correct (which it may well not be), because e.g. wake_up_q(), via wake_up_process() -> try_to_wake_up() -> select_task_rq(), can select another CPU

                cpu = cpumask_any(&p->cpus_allowed);

for the woken process.

The Go runtime calls futex_wake() in notewakeup() to wake up an M that was previously stopped via stopm() -> notesleep() (the latter calls futexsleep()).

When LockOSThread is used, an M is dedicated to a G, so when that G blocks, e.g. on a channel send, that M, if I understand correctly, has a high chance of stopping. And if it stops, it goes into futexsleep, and then a context switch happens when someone wakes it up because e.g. something was sent to the G via a channel.

With this in mind, the following patch:

diff --git a/src/runtime/lock_futex.go b/src/runtime/lock_futex.go
index 9d55bd129c..418fe1b845 100644
--- a/src/runtime/lock_futex.go
+++ b/src/runtime/lock_futex.go
@@ -146,7 +157,13 @@ func notesleep(n *note) {
                // Sleep for an arbitrary-but-moderate interval to poll libc interceptors.
                ns = 10e6
        }
-       for atomic.Load(key32(&n.key)) == 0 {
+       for spin := 0; atomic.Load(key32(&n.key)) == 0; spin++ {
+               // spin a bit hoping we'll get the wakeup soon
+               if spin < 10000 {
+                       continue
+               }
+
+               // no luck -> fall back to sleeping in the kernel
                gp.m.blocked = true
                futexsleep(key32(&n.key), 0, ns)
                if *cgo_yield != nil {

makes BenchmarkLocked much faster on my computer:

name        old time/op  new time/op  delta
Unlocked-4   485ns ± 0%   483ns ± 1%     ~     (p=0.188 n=9+10)
Locked-4    5.22µs ± 1%  1.32µs ± 5%  -74.64%  (p=0.000 n=9+10)

I also looked around and found that essentially every CGo call uses lockOSThread:

https://github.com/golang/go/blob/ab401077/src/runtime/cgocall.go#L107

With this in mind I modified the benchmark a bit so that LockOSThread is not used explicitly, but the server performs 1 and 10 simple C calls for every request:

CGo-4        581ns ± 1%   556ns ± 0%   -4.27%  (p=0.000 n=10+10)
CGo10-4     2.20µs ± 6%  1.23µs ± 0%  -44.32%  (p=0.000 n=10+9)

which shows the change brings quite a visible speedup.

I'm not saying my patch is right, but at least it shows that much can be improved. So I suggest reopening the issue.

Thanks in advance,
Kirill

/cc @dvyukov, @aclements, @bcmills


full benchmark source:

(tmp_test.go)

package tmp

import (
        "runtime"
        "testing"
)

type in struct {
        c   chan *out
        arg int
}

type out struct {
        ret int
}

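// client sends one request to the server via c and waits for the reply.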
func client(c chan *in, arg int) int {
        rc := make(chan *out)
        c <- &in{
                c:   rc,
                arg: arg,
        }
        ret := <-rc
        return ret.ret
}

func _server(c chan *in, argadjust func(int) int) {
        for r := range c {
                r.c <- &out{ret: argadjust(r.arg)}
        }
}

func server(c chan *in) {
        _server(c, func(arg int) int {
                return 3 + arg
        })
}

func lockedServer(c chan *in) {
        runtime.LockOSThread()
        server(c)
        runtime.UnlockOSThread()
}

// server with 1 C call per request
func cserver(c chan *in) {
        _server(c, cargadjust)
}

// server with 10 C calls per request
func cserver10(c chan *in) {
        _server(c, func(arg int) int {
                for i := 0; i < 10; i++ {
                        arg = cargadjust(arg)
                }
                return arg
        })
}

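// benchmark starts srv in its own goroutine and measures one
// request/response round trip per iteration.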
func benchmark(b *testing.B, srv func(chan *in)) {
        inc := make(chan *in)
        go srv(inc)
        for i := 0; i < b.N; i++ {
                client(inc, i)
        }
        close(inc)
}

func BenchmarkUnlocked(b *testing.B)    { benchmark(b, server) }
func BenchmarkLocked(b *testing.B)      { benchmark(b, lockedServer) }
func BenchmarkCGo(b *testing.B)         { benchmark(b, cserver) }
func BenchmarkCGo10(b *testing.B)       { benchmark(b, cserver10) }

(tmp.go)

package tmp

// int argadjust(int arg) { return 3 + arg; }
import "C"

// XXX defined here because cgo ("C") cannot be used directly in test files.
func cargadjust(arg int) int {
        return int(C.argadjust(C.int(arg)))
}

Labels

NeedsInvestigation: Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
compiler/runtime: Issues related to the Go compiler and/or runtime.
