runtime: sudog leak #9110

Closed · rsc opened this issue Nov 16, 2014 · 5 comments

@rsc (Contributor) commented Nov 16, 2014

The SudoG recycling code is very sloppy about zeroing fields that are no longer needed.
We have already fixed elem and selectdone, but next and waitlink are basically never
cleared either. In real programs this leads to an arbitrarily large number of SudoGs
leaking.

Specifically, the waitlink field is set during select but not cleared until the SudoG is
next reused for a channel send/recv or select. Uses of a SudoG for sync.Cond don't set
or use waitlink at all. Similarly, the next field is left intact on return from
releaseSudog, and worse, it is left intact when the cache head is nil'ed out.
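
To make the field roles concrete, here is a simplified model of the relevant
pieces. The field names match the real runtime's sudog, but the cache handling
is paraphrased for illustration; this is not the actual runtime source.

package main

import "unsafe"

// Simplified model of the runtime's sudog (illustrative only).
type sudog struct {
    elem       unsafe.Pointer // data element; already being zeroed on release
    selectdone *uint32        // already being zeroed on release
    next       *sudog         // wait-queue link, reused as the cache free list
    prev       *sudog         // wait-queue back link
    waitlink   *sudog         // g.waiting list built up by select
}

// Pre-fix release path: s is pushed onto the cache free list, but
// waitlink (and prev) still hold whatever the last operation left there.
func releaseSudog(s *sudog, cache **sudog) {
    s.next = *cache
    *cache = s // s.waitlink is never cleared
}

func main() {}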

The effect of this is that after a select the entries added to the SudoG cache have a
.waitlink list pointing in the opposite direction of the .next list in the cache.
Operations by sync.Cond can reorder entries on the cache .next list without affecting
the .waitlink list, creating the possibility of a .waitlink pointing at basically the
rest of the list. Then if that entry is being used by a sync.Cond.Wait when a GC happens
and c.sudogcache is set to nil, nothing is collected, because the entry blocked in
sync.Cond.Wait has a waitlink pointing at the whole list that was just
"discarded". Even worse, when that sync.Cond.Wait finishes and puts that SudoG
into c.sudogcache, the .waitlink will still be dangling back to the earlier list,
pinning the whole thing in memory. 
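
The pinning is ordinary GC reachability, nothing exotic. A toy model (not the
runtime's types) shows the shape of it:

package main

type node struct {
    next     *node // the cache's free list
    waitlink *node // stale cross-link left behind by select
}

func main() {
    // Cache free list: a -> b -> c.
    c := &node{}
    b := &node{next: c}
    a := &node{next: b}
    cache := a

    // A node parked in a hypothetical Cond.Wait still points into the
    // list through its stale waitlink.
    parked := &node{waitlink: b}

    // "Drop the cache on the floor": the head is cleared, but b and c
    // remain reachable through parked.waitlink, so the GC keeps them.
    cache = nil
    _, _ = cache, parked
}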

Programs with the right mix of non-trivial selects and sync.Cond operations end up
with allocated SudoGs that are heavily cross-linked by all these dangling pointers.
Setting c.sudogcache = nil does make sure that the SudoGs stop being used, but chances
are very good that there is a dangling .waitlink (probably many) pointing into the list
that keeps a large fraction of it from being collected. And then on the next collection,
chances are very good that the SudoG keeping the previous generation from being
collected itself finds a way not to get collected, causing the leaks to pile up
arbitrarily.

Below is a simple program that leaks basically all of its SudoGs in each iteration of
the outer for {} loop.

Attached is a picture of the SudoGs for the first iteration, showing the next and
waitlink links and how nothing gets collected.

This is an abstracted version of something I observed in a Google production server
testing Go 1.4.

package main

import (
    "runtime"
    "runtime/debug"
    "sync"
    "time"
)

func main() {
    debug.SetGCPercent(1000000) // only GC when we ask for GC

    release := func() {}
    for {
        c := make(chan int)
        for i := 0; i < 1000; i++ {
            go func() {
                select {
                case <-c:
                case <-c:
                case <-c:
                }
            }()
        }
        time.Sleep(10 * time.Millisecond)
        release()

        close(c) // let select put its sudog's into the cache
        time.Sleep(10 * time.Millisecond)

        // pick up top sudog
        var cond1 sync.Cond
        var mu1 sync.Mutex
        cond1.L = &mu1
        go func() {
            mu1.Lock()
            cond1.Wait()
            mu1.Unlock()
        }()
        time.Sleep(1 * time.Millisecond)

        // pick up next sudog
        var cond2 sync.Cond
        var mu2 sync.Mutex
        cond2.L = &mu2
        go func() {
            mu2.Lock()
            cond2.Wait()
            mu2.Unlock()
        }()
        time.Sleep(1 * time.Millisecond)

        // put top sudog back
        cond1.Broadcast()
        time.Sleep(1 * time.Millisecond)

        // drop cache on floor
        runtime.GC()

        // release cond2 after select has gotten to run
        release = func() {
            cond2.Broadcast()
            time.Sleep(1 * time.Millisecond)
        }
    }
}
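
One way to watch the leak (not part of the original report; runtime.MemStats is
standard library) is to record the live-object count right after the
runtime.GC() call above. Without the bug the count stays roughly flat; with the
bug it climbs every iteration:

        // Hypothetical instrumentation, placed just after runtime.GC() above;
        // println is the builtin, so no extra import is needed.
        var ms runtime.MemStats
        runtime.ReadMemStats(&ms)
        println("live objects:", ms.HeapObjects)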

Attachments:

  1. IMG_1907.JPG (377479 bytes)
@gopherbot (Contributor)

Comment 1:

CL https://golang.org/cl/177870043 mentions this issue.

@rsc (Contributor, Author) commented Nov 16, 2014

Comment 2:

This issue was closed by revision b3932ba.

Status changed to Fixed.

@rsc rsc added fixed labels Nov 16, 2014
@rsc rsc self-assigned this Nov 16, 2014
@odysseus9672

The test for this issue fails on my Mac OS X 10.10 machines when GOMAXPROCS is set to any value but 1. I've observed this failure on the master branch for a few months now.

go run run.go -- fixedbugs/issue9110.go

incorrect output
BUG: object leak: 0 -> 227 -> 253

FAIL fixedbugs/issue9110.go 0.853s
35.89 real 185.50 user 34.97 sys
2015/03/22 16:48:52 Failed: exit status 1

echo $GOMAXPROCS
2

@odysseus9672

Further info: the problem goes away if I unset GOMAXPROCS. Should I report this as a new issue?

@bradfitz (Contributor)

Yes, open a new bug. This is an old bug and is closed.

@rsc rsc added this to the Go1.4 milestone Apr 14, 2015
@rsc rsc removed the release-go1.4 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 25, 2016
wheatman pushed a commit to wheatman/go-akaros that referenced this issue Jun 25, 2018
The SudoG used to sit on the stack, so it was cheap to allocate
and didn't need to be cleaned up when finished.

For the conversion to Go, we had to move sudog off the stack
for a few reasons, so we added a cache of recently used sudogs
to keep allocation cheap. But we didn't add any of the necessary
cleanup before adding a SudoG to the new cache, and so the cached
SudoGs had stale pointers inside them that have caused all sorts
of awful, hard to debug problems.

CL 155760043 made sure SudoG.elem is cleaned up.
CL 150520043 made sure SudoG.selectdone is cleaned up.

This CL makes sure SudoG.next, SudoG.prev, and SudoG.waitlink
are cleaned up. I should have done this when I did the other two
fields; instead I wasted a week tracking down a leak they caused.

A dangling SudoG.waitlink can point into a sudogcache list that
has been "forgotten" in order to let the GC collect it, but that
dangling .waitlink keeps the list from being collected.
And then the list holding the SudoG with the dangling waitlink
can find itself in the same situation, and so on. We end up
with lists of lists of unusable SudoGs that are still linked into
the object graph and never collected (given the right mix of
non-trivial selects and non-channel synchronization).

More details in golang.org/issue/9110.

Fixes golang#9110.

LGTM=r
R=r
CC=dvyukov, golang-codereviews, iant, khr
https://golang.org/cl/177870043
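
The shape of the fix, paraphrased (a simplified sketch of what CL 177870043
does, continuing the toy sudog model sketched earlier; the real runtime also
deals with the per-M cache, and in Go 1.4 spells throw as gothrow): every path
that takes a SudoG off a wait list nils its links before releasing it, and
releaseSudog crashes loudly if any field is left dirty.

// Toy stand-in for the runtime's throw/gothrow, so the sketch compiles.
func throw(s string) { panic(s) }

// Post-fix release path: refuse to cache a sudog with stale pointers.
func releaseSudog(s *sudog, cache **sudog) {
    if s.elem != nil {
        throw("runtime: sudog with non-nil elem")
    }
    if s.next != nil {
        throw("runtime: sudog with non-nil next")
    }
    if s.prev != nil {
        throw("runtime: sudog with non-nil prev")
    }
    if s.waitlink != nil {
        throw("runtime: sudog with non-nil waitlink")
    }
    // Only now is it safe to reuse next as the cache free-list link.
    s.next = *cache
    *cache = s
}
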
@rsc rsc removed their assignment Jun 23, 2022