kindest/node entrypoint sometimes fails with exit code 141 (SIGPIPE) #2178

Closed
sbueringer opened this issue Apr 2, 2021 · 5 comments · Fixed by #2179
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@sbueringer
Member

sbueringer commented Apr 2, 2021

What happened:

Some context:

  • we're using kind for e2e tests in cluster-api
  • tests are using CAPD (ClusterAPI Provider Docker) to spin up other Docker-based clusters
  • tests are executed in Prow
  • tests are running in parallel
  • we're using the kindest/node image to spin up our own Docker-based clusters

Most of it shouldn't matter, but it might be helpful context. From time to time we're seeing that the kindest/node image fails on startup. According to docker inspect it fails with exit code 141 (logs). docker logs shows the following:

  • regular log:
"INFO: ensuring we can execute mount/umount even with userns-remap"
"INFO: remounting /sys read-only"
"INFO: making mounts shared"
"INFO: fix cgroup mounts for all subsystems"
  • log with set -x:
+ local docker_cgroup cgroup_subsystems subsystem
++ head -n 1
++ cut '-d ' -f 4
++ echo '6217 6216 0:29 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:923 - cgroup cgroup rw,xattr,name=systemd
6287 6216 0:32 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:924 - cgroup cgroup rw,cpu,cpuacct
6288 6216 0:33 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:925 - cgroup cgroup rw,net_cls,net_prio
6289 6216 0:34 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:926 - cgroup cgroup rw,memory
6290 6216 0:35 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:927 - cgroup cgroup rw,freezer
6291 6216 0:36 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:928 - cgroup cgroup rw,hugetlb
6320 6216 0:37 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:929 - cgroup cgroup rw,pids
6322 6216 0:39 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:931 - cgroup cgroup rw,blkio
6323 6216 0:40 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:932 - cgroup cgroup rw,cpuset
6324 6216 0:41 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:933 - cgroup cgroup rw,perf_event
6325 6216 0:42 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:934 - cgroup cgroup rw,devices'
+ docker_cgroup=/docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556

To me it looks like it fails here: https://github.com/kubernetes-sigs/kind/blob/v0.9.0/images/base/files/usr/local/bin/entrypoint#L82

The only reasonable explanation I could come up with is:

  • echo sends the docker_cgroup_mounts into the pipe (maybe line by line?)
  • head -n 1 exits after the first line
  • the pipe (and thus the whole script) fails because echo cannot send the other lines

I suspect it happens only rarely because it requires a CPU context switch (or something like that?) at exactly the wrong moment, which is more likely with the high concurrency we have during our tests.
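
A minimal sketch of the failure mode outside of kind (just an illustration, not the entrypoint itself; seq stands in for echoing the multi-line ${docker_cgroup_mounts} value):

set -o pipefail
# with enough output the writer is still writing when head exits, so it is
# killed by SIGPIPE and its status becomes 141 = 128 + 13
seq 1 200000 | head -n 1 | cut -d' ' -f 4
echo "pipeline status: $?"   # typically 141 here; with errexit the script would abort instead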

What you expected to happen:
kindest/node should not fail on startup.

How to reproduce it (as minimally and precisely as possible):
Really hard to reproduce. In our case we can only be reasonably sure it's gone after about 10-20 successful test runs in a row.

Anything else we need to know?:

I changed the following and this seems to fix the issue in my case:

  • old code:
docker_cgroup=$(echo "${docker_cgroup_mounts}" | head -n 1 | cut -d' ' -f 4)
  • "fixed" code:
docker_cgroup=$( < <(echo "${docker_cgroup_mounts}") head -n 1 | cut -d' ' -f 4)

The idea is to move the echo out of the pipeline, so a SIGPIPE in echo can no longer trip pipefail and fail the whole script.
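
A rough way to compare the two forms in plain bash (again only an illustration; seq just stands in for a long mount list):

set -o pipefail
big="$(seq 1 200000)"         # stand-in for a long ${docker_cgroup_mounts} value

echo "${big}" | head -n 1 >/dev/null
echo "pipe form:        $?"   # can be 141 once echo outgrows the pipe buffer

< <(echo "${big}") head -n 1 >/dev/null
echo "redirection form: $?"   # 0 - echo is no longer a pipeline member, so pipefail ignores it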

The code in kind on the main branch has changed a bit since v0.9, which we're currently using, but there's one line left that should have the same issue:
https://github.com/kubernetes-sigs/kind/blob/main/images/base/files/usr/local/bin/entrypoint#L232

For reference, the respective cluster-api issue: kubernetes-sigs/cluster-api#4405

Environment:

  • kind version: (use kind version): 0.9
  • Kubernetes version: (use kubectl version): 1.19.1
  • Docker version: (use docker info):
Client: Docker Engine - Community
 Version: 20.10.3
 API version: 1.40
 Go version: go1.13.15
 Git commit: 48d30b5
 Built: Fri Jan 29 14:33:25 2021
 OS / Arch: linux / amd64
 Context: default
 Experimental: true

Server: Docker Engine - Community
 Engine:
  Version: 19.03.15
  API version: 1.40 (minimum version 1.12)
  Go version: go1.13.15
  Git commit: 99e3ed8919
  Built: Sat Jan 30 03:15:34 2021
  OS / Arch: linux / amd64
  Experimental: false
 containerd:
  Version: 1.4.3
  GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version: 1.0.0-rc92
  GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version: 0.18.0
  GitCommit: fec3683
  • OS (e.g. from /etc/os-release): not sure but the image we're using in the ProwJob is: gcr.io/k8s-testimages/kubekins-e2e:v20210330-fadf59c-go-canary
@sbueringer sbueringer added the kind/bug Categorizes issue or PR as related to a bug. label Apr 2, 2021
@BenTheElder
Member

we've seen something like this before in Kubernetes (kubernetes/kubernetes#57805); I'm guessing piping to head has the same issue.

we should use the heredoc approach here as well; it's a bit cheaper / simpler
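
For illustration, a here-string (the one-line form of a heredoc) would look roughly like this; see #2179 for the actual change:

# the variable is fed to head via a redirection instead of a pipeline stage,
# so no pipeline member is left to be killed by SIGPIPE
docker_cgroup=$(head -n 1 <<<"${docker_cgroup_mounts}" | cut -d' ' -f 4)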

@BenTheElder
Member

thanks for tracking this down and filing the very detailed report :-)
/assign

@sbueringer
Member Author

sbueringer commented Apr 2, 2021

> thanks for tracking this down and filing the very detailed report :-)
> /assign

You're welcome :)

How/when is the new version of the script included in the images?

I assume it's something like:

  • new images after the merge on main, or
  • new images after the next release

Thx for the fast fix

@BenTheElder
Member

> How/when is the new version of the script included in the images?

we push images as needed; the base image is updated in the fix PR. I'd push a node image, but we're already working on revamped node images in #2176, which will hopefully be ready to go soon. If that isn't ready, we'll push one anyhow; we keep one fairly up to date for HEAD.

A full set of images is pushed on release. We were hoping for a release next week, but with people out it may slip another week (I am out all week, and it looks like others are out some as well).

@sbueringer
Member Author

Ah perfect. I think we'll wait for the full set, but a few weeks is what I hoped for so that's great.

We're currently using 1.19 and for upgrade tests also a 1.18 image (I'll have to check why not 1.20 and 1.19).
