bpm/1.4.2 fails on a bosh-lite #176

Open
abg opened this issue Oct 29, 2024 · 5 comments

Comments

@abg
Member

abg commented Oct 29, 2024

Yesterday our pipelines picked up bpm/1.4.2, which bumps runc to 1.2.0, and environments using a bosh-lite configuration started failing.

The initial deployment is successful but cleaning up jobs later fails.

# bpm stop test-server
Error: failed to cleanup job-process: exit status 1

bpm seems to get into a bad state if I have multiple deployments and restart a couple of times. Here's a reproduction using the bpm-release bosh-lite.yml test manifest.

$ bosh -n -d bpm deploy manifests/bosh-lite.yml
...success...
$ export BOSH_DEPLOYMENT=bpm-$(uuidgen)
$ bosh -n deploy manifests/bosh-lite.yml -o <(echo '[{"type":"replace","path":"/name","value":"((deployment_name))"}]') -v deployment_name=$BOSH_DEPLOYMENT
...success...
$ bosh -n restart
...success...
$ bosh -n restart
...
Task 20 | 14:31:59 | L starting jobs: bpm/33f58def-3dac-467e-bc7d-715e4a890b54 (0) (canary) (00:02:33)
                   L Error: 'bpm/33f58def-3dac-467e-bc7d-715e4a890b54 (0)' is not running after update. Review logs for failed jobs: test-server, alt-test-server
...

$ bosh ssh 
# bpm list
Name                        Pid Status
test-errand                 -   stopped
test-server                 -   failed
test-server.alt-test-server -   failed
# bpm start test-server
Error: failed to clean up stale job-process: exit status 1
# bpm stop test-server
Error: failed to clean up stale job-process: exit status 1

This may be related:

# /var/vcap/packages/bpm/bin/runc --root /var/vcap/sys/run/bpm-runc delete --force bpm-test-server
ERRO[0002] unable to destroy container: unable to remove container's cgroup: Failed to remove paths: map[net_cls:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server net_prio:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server]

I couldn't reproduce this on a bbl environment. I couldn't reproduce this with bpm/1.4.1.

Rolling back to bpm/1.4.1 (and runc/1.1.15) seems to resolve this issue for us.
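
For anyone digging into the failing paths above: on cgroup v1, rmdir on a cgroup directory fails while processes are still attached to it or while child cgroups exist beneath it. The sketch below reports both conditions for a given directory. The function name `cgroup_busy_report` is hypothetical, and it is not confirmed that either condition is what keeps these cgroups busy on a bosh-lite.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of bpm or runc): report the two common reasons
# rmdir fails on a cgroup v1 directory.
# Returns 0 if nothing was found, 1 if PIDs or children exist, 2 if the path is missing.
cgroup_busy_report() {
  local cg="$1" busy=0 pid child
  if [ ! -d "$cg" ]; then
    echo "missing: $cg"
    return 2
  fi
  # cgroup.procs lists the attached processes, one PID per line.
  if [ -r "$cg/cgroup.procs" ]; then
    while read -r pid || [ -n "$pid" ]; do
      [ -n "$pid" ] || continue
      echo "pid: $pid"
      busy=1
    done < "$cg/cgroup.procs"
  fi
  # Child cgroups appear as subdirectories.
  for child in "$cg"/*/; do
    [ -d "$child" ] || continue
    echo "child: ${child%/}"
    busy=1
  done
  if [ "$busy" -eq 0 ]; then
    echo "empty: $cg"
  fi
  return "$busy"
}
```

It could be pointed at one of the paths from the error, for example `cgroup_busy_report /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server`. An "empty" result would suggest the cause lies elsewhere.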

@abg
Member Author

abg commented Oct 31, 2024

Poking at this a little this morning, I see runc-1.1.15 ran into the same container / cgroup teardown issue, but seemingly ignored it. runc-1.2.0 seems to hard stop when it cannot clean up a cgroup.

# /var/vcap/packages/bpm/bin/runc --version
runc version 1.2.0
commit: unknown
spec: 1.2.0
go: go1.23.2
libseccomp: 2.5.1
# /var/vcap/packages/bpm/bin/runc --root /var/vcap/sys/run/bpm-runc/ delete bpm-loggr-forwarder-agent
ERRO[0002] unable to destroy container: unable to remove container's cgroup: Failed to remove paths: map[net_cls:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent net_prio:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent]
# echo $?
1

# /var/vcap/packages/bpm/bin/runc-1.1.15 --version
runc version 1.1.15
commit: unknown
spec: 1.0.2-dev
go: go1.23.2
libseccomp: 2.5.3

# /var/vcap/packages/bpm/bin/runc-1.1.15 --root /var/vcap/sys/run/bpm-runc/ delete bpm-loggr-forwarder-agent
WARN[0000] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent: device or resource busy"
WARN[0000] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent: device or resource busy"
ERRO[0000] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent: device or resource busy"
ERRO[0000] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent: device or resource busy"
ERRO[0000] Failed to remove paths: map[net_cls:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent net_prio:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent]
# echo $?
0

@jpalermo jpalermo moved this from Inbox to Pending Review | Discussion in Foundational Infrastructure Working Group Oct 31, 2024
@jpalermo jpalermo moved this from Pending Review | Discussion to Pending Merge | Prioritized in Foundational Infrastructure Working Group Oct 31, 2024
@selzoc
Member

selzoc commented Oct 31, 2024

We found the changes in runc that led to this behavioral change: opencontainers/runc@a6f4081 and opencontainers/runc@7396ca9

Essentially, errors from removing cgroups were being ignored. Now they're taking a "fast fail" approach.

It's unclear at the moment what leads to the errors removing cgroups on bosh lites.
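
The difference between the two behaviors can be sketched in a few lines of shell. This is only an illustration of the pattern, not runc's code: `remove_dir`, `cleanup_lenient`, and `cleanup_strict` are hypothetical names, and a plain rmdir on a non-empty directory stands in for the failing cgroup removal.

```shell
#!/usr/bin/env bash
# Stand-in for the removal step: rmdir fails on a non-empty directory,
# much as it fails with "device or resource busy" on a busy cgroup.
remove_dir() {
  rmdir "$1" 2>/dev/null
}

# Older behavior as observed in this thread: log the failure, still exit 0.
cleanup_lenient() {
  if ! remove_dir "$1"; then
    echo "WARN: failed to remove $1 (ignored)" >&2
  fi
  return 0
}

# Newer behavior as observed in this thread: the failure becomes the exit status.
cleanup_strict() {
  if ! remove_dir "$1"; then
    echo "ERRO: unable to remove $1" >&2
    return 1
  fi
}
```

With the strict variant, any caller that checks the exit status sees the failure, which matches the "exit status 1" that bpm reports in the original description.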

@selzoc
Member

selzoc commented Oct 31, 2024

Note that this only appears to be a problem with bosh lites using the warden cpi, not the docker cpi.

selzoc added a commit to cloudfoundry/bosh-deployment that referenced this issue Nov 19, 2024
We've had reports such as
cloudfoundry/bpm-release#176 where running
bosh lite with the latest runc will fail to deploy or restart.  While we
haven't been able to find an absolute root cause, it is the case that
garden-runc-release switched the default of this property a few months
back: cloudfoundry/garden-runc-release#315.

Garden has some issues with running itself under bpm
(see https://github.com/cloudfoundry/garden-runc-release/blob/develop/docs/BPM_support.md),
so we postulate that doing the reverse (running bpm under Garden) has
some similar issues.

We have not been able to reproduce the issue in
cloudfoundry/bpm-release#176 with
containerd_mode set to false.
@selzoc
Member

selzoc commented Nov 19, 2024

FYI I've opened a PR for bosh-deployment to resolve this issue: cloudfoundry/bosh-deployment#479

@abg
Member Author

abg commented Dec 11, 2024

Swinging back around to this, I unpinned bpm in one of our pipelines now that we are pulling in the cloudfoundry/bosh-deployment#479 change.

My pipeline failed - random jobs failed to start on redeploys or monit stop / start operations.

It seems the containerd_mode: false property is set, but in some configurations jobs still don't restart cleanly.

$ bosh env
Name               bosh-lite
UUID               45dc22bd-4972-459d-93ac-93a048e71e1b
Version            280.1.13 (00000000)
Director Stemcell  -/1.651
CPI                warden_cpi
Features           config_server: enabled
                   local_dns: enabled
                   snapshots: disabled
User               admin

$ bosh -n restart
...
Task 149 | 19:06:13 | L starting jobs: bpm/7792fd64-c12f-4034-b129-b10eed0a3946 (0) (canary) (00:02:48)
                    L Error: 'bpm/7792fd64-c12f-4034-b129-b10eed0a3946 (0)' is not running after update. Review logs for failed jobs: test-server, alt-test-server

$ bosh ssh
$ sudo -i
# bpm version
1.4.6
# bpm list
Name                        Pid Status
test-errand                 -   stopped
test-server                 -   failed
test-server.alt-test-server -   failed

# /var/vcap/packages/bpm/bin/runc --root /var/vcap/sys/run/bpm-runc/ delete bpm-test-server
ERRO[0002] unable to destroy container: unable to remove container's cgroup: Failed to remove paths: map[net_cls:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server net_prio:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server]

$ ssh -i /tmp/director.key jumpbox@${director_ip}
bosh/0:~$ sudo -i
bosh/0:~# grep -ri containerd /var/vcap/jobs/garden/monit
bosh/0:~# 
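
The grep above only covers the monit file. A broader check, sketched below, searches every rendered file under a job directory and prints the names of the files that mention a pattern. The helper name `find_setting` is hypothetical, and which file under /var/vcap/jobs/garden (if any) carries the containerd setting is not confirmed in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list files under a job directory that mention a pattern.
# Exits 0 if at least one file matches, 1 if none do.
find_setting() {
  local job_dir="$1" pattern="$2"
  # -r: recurse, -i: ignore case, -l: print only the names of matching files.
  grep -ril -- "$pattern" "$job_dir" 2>/dev/null
}
```

On the director this could be run as `find_setting /var/vcap/jobs/garden containerd`. No output would suggest the property is not rendered into any file under that job directory.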
