bpm/1.4.2 fails on a bosh-lite #176
Comments
Poking at this a little this morning, I see runc-1.1.15 ran into the same container / cgroup teardown issue but seemingly ignored it. runc-1.2.0 seems to hard stop when it cannot clean up a cgroup.
We found the change in runc that led to this behavioral change: opencontainers/runc@a6f4081. Essentially, errors from removing cgroups were previously being ignored; now runc takes a "fail fast" approach and stops on them. It's unclear at the moment what leads to the errors removing cgroups on bosh lites.
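To make the difference concrete, here is a minimal Python sketch of the two error-handling styles described above. It is not runc's code (runc is written in Go). The function names are hypothetical, and a non-empty temporary directory stands in for a cgroup that refuses to be removed.

```python
import os
import tempfile


def remove_dirs_lenient(paths):
    """Older, tolerant style: attempt every removal and swallow failures."""
    for path in paths:
        try:
            os.rmdir(path)
        except OSError:
            pass  # the failure is dropped; the caller never learns about it


def remove_dirs_strict(paths):
    """Fail-fast style: the first failed removal aborts the whole teardown."""
    for path in paths:
        os.rmdir(path)  # OSError propagates to the caller


def make_busy_dir(root, name):
    """Hypothetical stand-in for an unremovable cgroup: a non-empty directory."""
    path = os.path.join(root, name)
    os.mkdir(path)
    open(os.path.join(path, "child"), "w").close()
    return path


def demo():
    """Return (left_behind_after_lenient, strict_raised) for one busy directory."""
    with tempfile.TemporaryDirectory() as root:
        busy = make_busy_dir(root, "busy")

        remove_dirs_lenient([busy])  # returns normally despite the failure
        left_behind = os.path.isdir(busy)

        try:
            remove_dirs_strict([busy])
            strict_raised = False
        except OSError:
            strict_raised = True

        return left_behind, strict_raised
```

In the lenient style the teardown "succeeds" but leaves the directory behind. In the strict style the same condition surfaces as an error, which is consistent with jobs that deployed under runc-1.1.15 now failing during cleanup under runc-1.2.0.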
Note that this only appears to be a problem with bosh lites using the
We've had reports such as cloudfoundry/bpm-release#176 where bosh lite with the latest runc fails to deploy or restart. While we haven't been able to find an absolute root cause, garden-runc-release did switch the default of this property a few months back: cloudfoundry/garden-runc-release#315. Garden has some issues with running itself under bpm (see https://github.com/cloudfoundry/garden-runc-release/blob/develop/docs/BPM_support.md), so we postulate that doing the reverse (running bpm under Garden) has similar issues. We have not been able to reproduce the issue in cloudfoundry/bpm-release#176 with containerd_mode set to false.
FYI I've opened a PR for bosh-deployment to resolve this issue: cloudfoundry/bosh-deployment#479
Swinging back around to this, I unpinned bpm in one of our pipelines since we are now pulling in the cloudfoundry/bosh-deployment#479 change. My pipeline failed: random jobs failed to start on redeploys or on monit stop / start operations. It seems like the
Yesterday our pipelines picked up bpm/1.4.2, which bumps runc to 1.2.0, and environments using a bosh-lite configuration started failing.
The initial deployment is successful, but cleaning up jobs later fails.
bpm seems to get into a bad state if I have multiple deployments and restart a couple of times. Here's a reproduction using the bpm-release bosh-lite.yml test manifest.
This may be related:
I couldn't reproduce this on a bbl environment. I couldn't reproduce this with bpm/1.4.1.
Rolling back to bpm/1.4.1 (and runc/1.1.15) seems to resolve this issue for us.