runtime: "program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400 #21559

vielmetti · 2017-08-22T20:56:59Z

What version of Go are you using (`go version`)?

go version go1.9rc2 linux/arm64
bootstrap compiler is go version go1.6.2 linux/arm64

Machine is

$ uname -a
Linux lab-centriq 4.7.0-2-generic #5~pdaw1.1+bandera.6-Ubuntu SMP Wed Feb 8 23:10:54 UTC 2017 aarch64 aarch64 aarch64 GNU/Linux

Does this issue reproduce with the latest release?

This issue does not reproduce on the Cavium ThunderX hardware, which passes all tests without flaws.

What operating system and processor architecture are you using (`go env`)?

./go env
GOARCH="arm64"
GOBIN=""
GOEXE=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/emv/go"
GORACE=""
GOROOT="/home/emv/src/go.googlesource.com/go"
GOTOOLDIR="/home/emv/src/go.googlesource.com/go/pkg/tool/linux_arm64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build997434454=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

git checkout go1.9rc2
cd src
./all.bash

If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.

What did you expect to see?

All tests pass

What did you see instead?

runtime: program exceeds 50-thread limit
fatal error: thread exhaustion
FAIL    os      0.509s 
Failed: exit status 1

Complete output at https://gist.github.com/017dbb7fa20d184d66a8d9ea4bf61225

The text was updated successfully, but these errors were encountered:

vielmetti · 2017-08-22T21:13:37Z

From reading the test code, I think the error is showing up in

// Test that reading from a pipe doesn't use up a thread.
func TestPipeThreads(t *testing.T) {

vielmetti · 2017-08-22T21:22:50Z

Note that all tests pass on the Cavium ThunderX, but this one test fails to pass on the Centriq system. Exploring further to see if I can get different results.

ianlancetaylor · 2017-08-24T00:00:46Z

Is this easily repeatable? If so, please run go test -c os to get an os.test binary, then run ./os.test --testing.short to verify that the problem still occurs, and attach the output of strace -f ./os.test --testing.short. Thanks.

vielmetti · 2017-08-24T01:40:45Z

Odd. It's reproducible from the ./all.bash shell script, but when I make and run the the os.test binary, it passes.

vielmetti · 2017-08-24T03:55:35Z

When I download the binaries for go 1.9 rc2 they pass the os.test test, but when I use those binaries to bootstrap the go 1.9 rc2 sources, the all.bash script gets this failure - only on the Centriq system.

ianlancetaylor · 2017-08-25T05:25:00Z

Is there any difference in the output of ulimit -u on the succeeding and failing systems?

vielmetti · 2017-08-25T08:56:06Z

ulimit -u on the ThunderX: 515386

ulimit -u on Centriq: 385905

Both should be plenty high for any reason.

ianlancetaylor · 2017-08-25T21:41:39Z

I haven't been able to think of any reason why this test would behave differently when invoked from all.bash. I can't recall seeing any other reports of this problem. The test is fairly simple, and is basically testing that the pipes are handled by the internal poller. If we could capture it standalone, then running the test under strace -f would likely show the problem. But running all.bash under strace -f will produce too much irrelevant information.

vielmetti · 2017-08-25T22:03:47Z

I'm also puzzled. The only think I can possibly think of is that this machine is running a unique kernel, and that we're planning to update the kernel to something closer to mainline. I'll put this on hold for now, and rerun it the next time there's a new kernel.

valyala · 2017-08-27T10:16:31Z

This may be related to #21621 and #21550 during compilation of a package with many files. As I uderstand, currently each file may be concurrently opened and read in a separate OS thread if the corresponding file I/O syscalls (cgo calls) are slow.

ianlancetaylor · 2017-08-28T05:39:57Z

@valyala In this case the calls should not be slow, because when using a pipe all I/O should be done using non-blocking calls.

vielmetti · 2017-08-28T05:59:09Z

OK, adding to the puzzle:

git checkout go1.9
cd src
./all.bash

just worked perfectly on the same machine, no errors; building master also gave me no errors. But building go1.9rc2 did again error out on os just the same way as it did originally.

Curious!

orivej · 2017-09-13T06:56:52Z

I can easily reproduce this on a 56-threaded Intel Xeon E5-2660 v4 Linux. (Out of 100 consecutive runs of os.test --test.short built from go master, 9 panicked with this error.) Unfortunately, the problem vanishes under strace.

orivej · 2017-09-13T07:07:47Z

Here is a test panic from that machine, if it may be of any use: os.txt.

orivej · 2017-09-13T15:37:18Z

In fact this reproduces only when CPU is under some load. Stressing CPU makes it reproduce practically every time, whereas without any load the test may always succeed. I used the following script to bisect go history, c05b06a is the first bad commit.

#!/bin/bash
set -e
cd src
./make.bash
cd os
../../bin/go test -a -c os
stress-ng -t 60 -c 60 &
trap 'killall stress-ng || true' EXIT
for i in `seq 100`; do
    ./os.test --test.short
done

ianlancetaylor · 2017-09-13T17:10:14Z

Thanks for the bisect. That is the commit that introduced the failing test in the first place.

ianlancetaylor · 2017-09-13T17:20:02Z

Can you generate a failure with GOTRACEBACK=system set in the environment and show us the stack dump? Thanks.

orivej · 2017-09-13T17:52:50Z

Here is one: os.txt (go is at the current master, 9593b74).

gopherbot · 2017-09-13T18:17:32Z

Change https://golang.org/cl/63650 mentions this issue: os: avoid crashing with a thundering herd in TestPipeThreads

ianlancetaylor · 2017-09-13T18:17:51Z

Sorry for missing that you attached that before.

The stack trace suggests that the problem is not with the test itself, it is with the cleanup when the test is done. It looks like we can get a thundering herd as we stop all the goroutines.

Can you see if https://golang.org/cl/63650 fixes the problem for you? Thanks.

orivej · 2017-09-13T18:18:46Z

When I repeat the test under load (and it fails every time), the number of goroutines in the dump typically is either 8-10, or 107-108. Above was an example with 10 goroutines, and here is one with 108:
os.txt

I'll test your patch now.

orivej · 2017-09-13T18:23:24Z

The patch fixes the issue. Thanks!

ianlancetaylor · 2017-09-13T18:39:44Z

Thanks for testing it.

Marking issue as 1.9.1 to consider bringing the test fix into 1.9.1 to avoid spurious build failure reports.

ianlancetaylor · 2017-09-13T22:54:43Z

Reopening to consider a backport to 1.9.1.

rsc · 2017-10-13T18:31:35Z

I'm confused. The test says "Test that reading from a pipe doesn't use up a thread." and it sounds like it was using up a thread on the breaking system. The rewrite of the test may just be making the test not as precise about finding problems, essentially papering over a real problem.

Can you explain why the old test was wrong? I just don't see it. The old channel+waitgroup looks functionally equivalent to the new channel+channel, and I don't see why putting Close into the goroutine (instead of a later cleanup pass) would change anything, unless Close is happening to cause extra serialization that defeats the test.

If you don't mind just disabling the test for Go 1.9.2 because you think it's a bad test for some reason, that's fine. But it seems like a worthwhile test and maybe pointing out a real problem.

ianlancetaylor · 2017-10-13T18:53:02Z

The test is intended to test that reading from a pipe when there is no data to read doesn't use up a thread. When data arrives, a thread is required to actually do the read. With the original test, all the goroutines would queue up to read, and all is well. Then we would write to all of them at once, and we would blow past the thread limit.

But, yeah, my fix is not well written. I think it may work because of synchronization on the cdone channel. cdone ought to be an unbuffered channel.

For 1.9 we should probably just disable the test.

rsc · 2017-10-13T19:17:02Z

OK. Disabling test OK for Go 1.9.2.
TODO: Send a CL on the release branch, but do not submit it. (I want to try automating the actual release branch commits.)

ianlancetaylor · 2017-10-13T19:23:02Z

Sent https://golang.org/cl/70771 for 1.9 branch.

gopherbot · 2017-10-13T19:27:34Z

Change https://golang.org/cl/70771 mentions this issue: [release-branch.go1.9] os: skip TestPipeThreads as flaky for 1.9

rsc · 2017-10-14T00:24:09Z

CL 70771 OK for Go 1.9.2.

Updates #21559 Change-Id: I90fa8b4ef97c4251440270491ac4c833d76ee872 Reviewed-on: https://go-review.googlesource.com/70771 Run-TryBot: Russ Cox <rsc@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org>

rsc · 2017-10-26T21:09:08Z

go1.9.2 has been packaged and includes:

CL 70771

The release is posted at golang.org/dl.

— golang.org/x/build/cmd/releasebot, Oct 26 21:09:08 UTC

golang/go#19608 https://hydra.nixos.org/build/64329767/nixlog/1 https://hydra.nixos.org/build/64244716/nixlog/1 Remove the patch for golang/go#21559 because it is skipped as flaky since Go 1.9.2.

vielmetti changed the title ~~"runtime: program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2~~ "runtime: program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 Aug 22, 2017

vielmetti changed the title ~~"runtime: program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64~~ "runtime: program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400 Aug 22, 2017

ianlancetaylor added this to the Go1.10 milestone Aug 24, 2017

ianlancetaylor added Testing An issue that has been verified to require only test changes, not just a test failure. release-blocker labels Aug 24, 2017

ianlancetaylor changed the title ~~"runtime: program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400~~ runtime: "program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400 Aug 25, 2017

ianlancetaylor added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Aug 25, 2017

ianlancetaylor removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Sep 13, 2017

ianlancetaylor modified the milestones: Go1.10, Go1.9.1 Sep 13, 2017

gopherbot closed this as completed in 29415eb Sep 13, 2017

ianlancetaylor reopened this Sep 13, 2017

orivej mentioned this issue Sep 17, 2017

go: fix tests and impurity NixOS/nixpkgs#29489

Merged

8 tasks

rsc modified the milestones: Go1.9.1, Go1.9.2 Oct 4, 2017

rsc added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Oct 13, 2017

rsc removed the Testing An issue that has been verified to require only test changes, not just a test failure. label Oct 13, 2017

rsc added CherryPickApproved Used during the release process for point releases and removed release-blocker labels Oct 14, 2017

rsc mentioned this issue Oct 15, 2017

all: Go 1.9.2 release status #22277

Closed

rsc closed this as completed Oct 26, 2017

golang locked and limited conversation to collaborators Oct 26, 2018

gopherbot added the FrozenDueToAge label Oct 26, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: "program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400 #21559

runtime: "program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400 #21559

vielmetti commented Aug 22, 2017 •

edited

Loading

vielmetti commented Aug 22, 2017

vielmetti commented Aug 22, 2017

ianlancetaylor commented Aug 24, 2017

vielmetti commented Aug 24, 2017

vielmetti commented Aug 24, 2017 •

edited

Loading

ianlancetaylor commented Aug 25, 2017

vielmetti commented Aug 25, 2017

ianlancetaylor commented Aug 25, 2017

vielmetti commented Aug 25, 2017

valyala commented Aug 27, 2017

ianlancetaylor commented Aug 28, 2017

vielmetti commented Aug 28, 2017

orivej commented Sep 13, 2017 •

edited

Loading

orivej commented Sep 13, 2017

orivej commented Sep 13, 2017

ianlancetaylor commented Sep 13, 2017

ianlancetaylor commented Sep 13, 2017

orivej commented Sep 13, 2017

gopherbot commented Sep 13, 2017

ianlancetaylor commented Sep 13, 2017

orivej commented Sep 13, 2017

orivej commented Sep 13, 2017

ianlancetaylor commented Sep 13, 2017

ianlancetaylor commented Sep 13, 2017

rsc commented Oct 13, 2017

ianlancetaylor commented Oct 13, 2017

rsc commented Oct 13, 2017

ianlancetaylor commented Oct 13, 2017

gopherbot commented Oct 13, 2017

rsc commented Oct 14, 2017

rsc commented Oct 26, 2017

runtime: "program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400 #21559

runtime: "program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400 #21559

Comments

vielmetti commented Aug 22, 2017 • edited Loading

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

vielmetti commented Aug 22, 2017

vielmetti commented Aug 22, 2017

ianlancetaylor commented Aug 24, 2017

vielmetti commented Aug 24, 2017

vielmetti commented Aug 24, 2017 • edited Loading

ianlancetaylor commented Aug 25, 2017

vielmetti commented Aug 25, 2017

ianlancetaylor commented Aug 25, 2017

vielmetti commented Aug 25, 2017

valyala commented Aug 27, 2017

ianlancetaylor commented Aug 28, 2017

vielmetti commented Aug 28, 2017

orivej commented Sep 13, 2017 • edited Loading

orivej commented Sep 13, 2017

orivej commented Sep 13, 2017

ianlancetaylor commented Sep 13, 2017

ianlancetaylor commented Sep 13, 2017

orivej commented Sep 13, 2017

gopherbot commented Sep 13, 2017

ianlancetaylor commented Sep 13, 2017

orivej commented Sep 13, 2017

orivej commented Sep 13, 2017

ianlancetaylor commented Sep 13, 2017

ianlancetaylor commented Sep 13, 2017

rsc commented Oct 13, 2017

ianlancetaylor commented Oct 13, 2017

rsc commented Oct 13, 2017

ianlancetaylor commented Oct 13, 2017

gopherbot commented Oct 13, 2017

rsc commented Oct 14, 2017

rsc commented Oct 26, 2017

vielmetti commented Aug 22, 2017 •

edited

Loading

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

vielmetti commented Aug 24, 2017 •

edited

Loading

orivej commented Sep 13, 2017 •

edited

Loading