Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: unexpectedly large slowdown with runtime.LockOSThread #18023

Closed
ianlancetaylor opened this issue Nov 22, 2016 · 5 comments
Closed

runtime: unexpectedly large slowdown with runtime.LockOSThread #18023

ianlancetaylor opened this issue Nov 22, 2016 · 5 comments
Milestone

Comments

@ianlancetaylor
Copy link
Member

The benchmark https://play.golang.org/p/nI4LO1wu17 shows a significant slowdown when using runtime.LockOSThread.

On amd64 at current tip I see this output from go test -test.cpu=1,2,4,8 -test.bench=.:

BenchmarkUnlocked     	 3000000	       500 ns/op
BenchmarkUnlocked-2   	 3000000	       638 ns/op
BenchmarkUnlocked-4   	 2000000	       749 ns/op
BenchmarkUnlocked-8   	 2000000	       911 ns/op
BenchmarkLocked       	  200000	      7887 ns/op
BenchmarkLocked-2     	  200000	      8048 ns/op
BenchmarkLocked-4     	  200000	      8235 ns/op
BenchmarkLocked-8     	  200000	      7882 ns/op

The problem may be entirely due to the extra thread context switches required by runtime.LockOSThread. When the thread is not locked, the channel communication may be done in practice by a simple goroutine switch. When using runtime.LockOSThread a full thread context switch is required.

Still, we should make sure we have the tools to confirm that.

@ianlancetaylor ianlancetaylor added this to the Go1.9 milestone Nov 22, 2016
@rsc
Copy link
Contributor

rsc commented Nov 22, 2016

The golang-nuts thread about this says that using 'perf stat' shows approximately 1000X more thread context switches with LockOSThread than without. Is that enough confirmation?

@ianlancetaylor
Copy link
Member Author

Somehow I don't see that, but, yes, there is probably nothing that can be done here. Sorry for the noise.

@navytux
Copy link
Contributor

navytux commented Sep 8, 2017

Let me chime in a bit. On Linux the context switch can happen, if my reading of futex_wake() is correct (which is probably not), because e.g. wake_up_q() via calling wake_up_process() -> try_to_wake_up() -> select_task_rq() can select another cpu

                cpu = cpumask_any(&p->cpus_allowed);

for woken process.

The Go runtime calls futex_wake() in notewakeup() to wake up an M that was previously stopped via stopm() -> notesleep() (the latter calls futexwait()).

When LockOSThread is used an M is dedicated to G so when that G blocks, e.g. on chan send, that M, if I undestand correctly, has high chances to stop. And if it stops it goes to futexwait and then context switch happens when someone wakes it up because e.g. something was sent to the G via channel.

With this thinking the following patch:

diff --git a/src/runtime/lock_futex.go b/src/runtime/lock_futex.go
index 9d55bd129c..418fe1b845 100644
--- a/src/runtime/lock_futex.go
+++ b/src/runtime/lock_futex.go
@@ -146,7 +157,13 @@ func notesleep(n *note) {
                // Sleep for an arbitrary-but-moderate interval to poll libc interceptors.
                ns = 10e6
        }
-       for atomic.Load(key32(&n.key)) == 0 {
+       for spin := 0; atomic.Load(key32(&n.key)) == 0; spin++ {
+               // spin a bit hoping we'll get wakup soon
+               if spin < 10000 {
+                       continue
+               }
+
+               // no luck -> go to sleep heavily to kernel
                gp.m.blocked = true
                futexsleep(key32(&n.key), 0, ns)
                if *cgo_yield != nil {

makes BenchmarkLocked much faster on my computer:

name        old time/op  new time/op  delta
Unlocked-4   485ns ± 0%   483ns ± 1%     ~     (p=0.188 n=9+10)
Locked-4    5.22µs ± 1%  1.32µs ± 5%  -74.64%  (p=0.000 n=9+10)

I also looked around and found: essentially at every CGo call lockOSThread is used:

https://github.com/golang/go/blob/ab401077/src/runtime/cgocall.go#L107

With this in mind I modified the benchmark a bit so that no LockOSThread is explicitly used, but server performs 1 and 10 simple C calls for every request:

CGo-4        581ns ± 1%   556ns ± 0%   -4.27%  (p=0.000 n=10+10)
CGo10-4     2.20µs ± 6%  1.23µs ± 0%  -44.32%  (p=0.000 n=10+9)

which shows the change brings quite visible speedup.

This way I'm not saying my patch is right, but at least it shows that much can be improved. So I suggest to reopen the issue.

Thanks beforehand,
Kirill

/cc @dvyukov, @aclements, @bcmills


full benchmark source:

(tmp_test.go)

package tmp

import (
        "runtime"
        "testing"
)

type in struct {
        c   chan *out
        arg int
}

type out struct {
        ret int
}

func client(c chan *in, arg int) int {
        rc := make(chan *out)
        c <- &in{
                c:   rc,
                arg: arg,
        }
        ret := <-rc
        return ret.ret
}

func _server(c chan *in, argadjust func(int) int) {
        for r := range c {
                r.c <- &out{ret: argadjust(r.arg)}
        }
}

func server(c chan *in) {
        _server(c, func(arg int) int {
                return 3 + arg
        })
}

func lockedServer(c chan *in) {
        runtime.LockOSThread()
        server(c)
        runtime.UnlockOSThread()
}

// server with 1 C call per request
func cserver(c chan *in) {
        _server(c, cargadjust)
}

// server with 10 C calls per request
func cserver10(c chan *in) {
        _server(c, func(arg int) int {
                for i := 0; i < 10; i++ {
                        arg = cargadjust(arg)
                }
                return arg
        })
}

func benchmark(b *testing.B, srv func(chan *in)) {
        inc := make(chan *in)
        go srv(inc)
        for i := 0; i < b.N; i++ {
                client(inc, i)
        }
        close(inc)
}

func BenchmarkUnlocked(b *testing.B)    { benchmark(b, server) }
func BenchmarkLocked(b *testing.B)      { benchmark(b, lockedServer) }
func BenchmarkCGo(b *testing.B)         { benchmark(b, cserver) }
func BenchmarkCGo10(b *testing.B)       { benchmark(b, cserver10) }

(tmp.go)

package tmp

// int argadjust(int arg) { return 3 + arg; }
import "C"

// XXX here because cannot use C in tests directly
func cargadjust(arg int) int {
        return int(C.argadjust(C.int(arg)))
}

@ianlancetaylor
Copy link
Member Author

@navytux This issue is closed. If you want to discuss your patch, please open a new issue or raise a topic on the golang-dev mailing list. Thanks.

@navytux
Copy link
Contributor

navytux commented Sep 10, 2017

@ianlancetaylor thanks for feedback. I've opened #21827 and posted update to original thread on golang-nuts.

@golang golang locked and limited conversation to collaborators Sep 10, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants