Description
Coordinator sometimes restarts unexpectedly, causing build failures and gomote sessions to be terminated.
Trends:
- I have always observed this under load (many builders building at once such that we are at our max threshold of GCP CPUs, 500)
- When it does restart, it will happen in bursts (many restarts over a duration of hours).
- There is a correlation between coordinator restarts and Go releases (usage of golang.org/x/build/cmd/release). Though, that could be because the release tool generates load.
- There will be no indication in the logs that coordinator restarted due to any program error - there are no panics, or any other indication of any error on the coordinator side. Just start-up logs. Eg:
2017/10/25 18:59:36 Creating Kubernetes pod "buildlet-linux-kubestd-rn68e80e3" for host-linux-kubestd
2017/10/25 18:59:36 Registering reverse buildlet "go-builder-2" (10.240.0.20:46624) for host type host-linux-arm5spacemonkey
2017/10/25 18:59:36 Buildlet go-builder-2/10.240.0.20:46624: {Version:15} for [host-linux-arm5spacemonkey]
# restart here
2017/10/25 18:59:38 coordinator version "d55be8da83e622767c00a19031b29cb701fcc744" starting
The rest of this issue is dedicated to documenting specific instances of when I observed these restarts.
1. 08/11/2017
Maintner went down, many builders queued. (see #21383).
When we got maintner back up, all queued builders tried to run, and many (most?) failed.
Seemed to me that the failure had to be caused by all the builders sharing some resource, and hitting some limit.
@aclements helped me debug a bit; we concluded that we were probably hitting a disk write IOPS limit.
Cannot remember if we observed coordinator restarts.
2. 09/22/2017
The second instance of this that I've seen is @rsc golang.org/x/build/cmd/release
(binary named build.release
below) failures.
$ ./build.release boring18
...
2017/09/22 16:45:28 linux-amd64: Start.
2017/09/22 16:45:28 linux-amd64: Creating buildlet.
2017/09/22 16:47:13 linux-amd64: Error: Post https://farmer.golang.org:443/buildlet/create: EOF
$ ./build.release boring18
2017/09/22 16:49:32 linux-amd64: Start.
2017/09/22 16:49:32 linux-amd64: Creating buildlet.
2017/09/22 16:49:47 linux-amd64: Error: Post https://farmer.golang.org:443/buildlet/create: dial tcp 107.178.219.46:443: getsockopt: connection timed out
ls: cannot access go1.8.3b4.linux-amd64.tar.gz: No such file or directory
$ ./build.release boring18
2017/09/22 16:52:48 linux-amd64: Start.
2017/09/22 16:52:48 linux-amd64: Creating buildlet.
2017/09/22 16:55:05 linux-amd64: Pushing source to buildlet.
2017/09/22 16:55:22 linux-amd64: Writing VERSION file.
2017/09/22 16:55:22 linux-amd64: Cleaning goroot (pre-build).
2017/09/22 16:55:22 linux-amd64: Building.
2017/09/22 16:59:24 linux-amd64: Error: error copying response: unexpected EOF
$ ./build.release boring18
2017/09/22 17:02:11 linux-amd64: Start.
2017/09/22 17:02:11 linux-amd64: Creating buildlet.
2017/09/22 17:04:24 linux-amd64: Pushing source to buildlet.
2017/09/22 17:04:40 linux-amd64: Writing VERSION file.
2017/09/22 17:04:40 linux-amd64: Cleaning goroot (pre-build).
2017/09/22 17:04:40 linux-amd64: Building.
2017/09/22 17:11:32 linux-amd64: Error: error copying response: unexpected EOF
The last 3 failures were due to coordinator restarting in the middle of handling the requests.
There is no indication that coordinator restarted due to any program error - there are no panics, or any other indication of any error on the coordinator side.
Buildlet logs were also interspersed with non-program builder failures (just exit status 1
) at the same time, eg:
13:45:31.000
2017/09/22 20:45:31 {windows-amd64-2008 f2ab41b02d4aa71fe472c56b21f7004071bd42ab } failed: tests failed: dist test failed: go_test:os: exit status 1
13:45:51.000
2017/09/22 20:45:51 [build windows-386-2008 f2ab41b02d4aa71fe472c56b21f7004071bd42ab]: Execution error running go_test:path/filepath on http://10.240.0.71 GCE VM: buildlet-windows-amd64-2008-rnbeb0f3b: buildlet: Client closed (numFails = 1)
13:46:22.000
2017/09/22 20:46:22 failed to extract snapshot for helper 10.0.3.52:80: Post http://10.0.3.52/writetgz?dir=go: EOF
13:46:33.000
2017/09/22 20:46:33 {linux-amd64-alpine 93c869ec1678e2cbe3baf4c2e48f7c3482012ae2 } failed: tests failed: dist test failed: go_test:runtime: exit status 1
13:46:37.000
2017/09/22 20:46:37 [build linux-amd64-alpine 39d24b6b6d3e25b4e3a5ad6c7b79891fc3020319]: Execution error running runtime:cpu124 on http://10.0.5.111 Kube Pod: buildlet-linux-x86-alpine-rn34e9c04: buildlet: Client closed (numFails = 1)
13:46:47.000
2017/09/22 20:46:47 {plan9-amd64-9front 42cb4dcdb59a0be3ad52bf6cf33489301d039e08 } failed: tests failed: dist test failed: go_test:net: exit status: 'go 29782: 1'
@rsc tried again a few days later to release and everything worked fine; no changes to the builder pipeline (that I am aware of).
3. 10/25/2017
timestamps (eastern)
15:06, 14:59, 13:56
shadams$ kubectl get po
NAME READY STATUS RESTARTS AGE
coordinator-deployment-1785484075-5gcw0 1/1 Running 3 1h