Bug 1745004: baremetal: Use podman inspect to check ironic service status #2249
Conversation
Marking WIP pending feedback from those running into these issues; I've not been able to reproduce this reliably in my environment |
Why are you not running these services using static pods? They have liveness probes and such? |
The patch appeared to work for me, in that the changes got me past the part they were meant to fix, but it appears now that I'm stuck in the link-machine-and-node.sh loop:
I'm not sure if this is related, but a common error from all of the nodes is:
(Not sure why my markdown above isn't working; here's the actual error: Aug 21 19:42:09 master-0 hyperkube[1236]: E0821 19:42:09.096126 1236 kubelet.go:1648] Failed creating a mirror pod for "mdns-publisher-master-0_openshift-kni-infra(ea93e1102ed250148867ccbe693731e5)": namespaces "openshift-kni-infra" not found) I'll redeploy and test more. |
Failed again, looks like it's having issues finding the IPA images:
This part may be related to #2234 |
Great success. Deployment works on my end, with these changes rebased on top of #2234 |
Yeah, I think static pods may be a better approach - I started out copying the systemd/script approach used for the DNS integration (which has since moved to static pods launched via the MCO). These services are specific to the bootstrap VM (the same services are launched on the cluster via the MAO as part of bringing up the baremetal-operator), so I guess I can define a static pod via the installer assets, similar to how we're creating this systemd service? I removed the WIP since Dan confirmed this works around his immediate blocker issues, and I raised #2251 to track moving to a static pod - I'll investigate and propose that as a follow-up if that's OK with @abhinavdahiya |
These errors are temporary and expected; we loop waiting for the images to get downloaded by the downloader containers. |
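For context, that wait amounts to a simple polling loop. A hedged sketch of the idea (the shared path and image file name are illustrative assumptions, not the actual template contents):

```bash
# Illustrative only: block until the downloader containers have published
# the IPA images into the shared volume; the path and file name are assumptions.
IMAGE=/shared/html/images/ironic-python-agent.initramfs
until [ -s "$IMAGE" ]; do
    echo "Waiting for the IPA images to be downloaded..."
    sleep 5
done
```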
works for me (kills the service when needed), see "error getting container from store..." |
data/data/bootstrap/baremetal/files/usr/local/bin/startironic.sh.template
/lgtm since #2251 was filed to track moving this to a static pod |
Note that @dhiggins figured out the root cause is a crio restart, ref #2251 (comment) so moving to staticpods will likely be the long term fix, but this has been confirmed by several people as working around the immediate blocker issue |
Can confirm - I run into the problem this PR is solving in roughly 50% of my builds. Big LGTM from me. Thx! |
Also seeing a rather odd one which I suspect is because we're not handling the pod errors without this patch:
I have to manually restart Ironic for this to work properly without the patch here. |
As discussed, this is another manifestation of the same issue: the conductor service got killed, so the db-sync didn't happen. This restart approach should work around it until we get the podman-crio restart issues worked out and/or switch to a static pod |
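Until then, the manual workaround mentioned above is just restarting the unit on the bootstrap VM:

```bash
# Manual recovery when the conductor dies before db-sync completes:
sudo systemctl restart ironic.service
# Confirm the containers came back up:
sudo podman ps
```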
@hardys: This pull request references Bugzilla bug 1745004, which is invalid:
Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@hardys: This pull request references Bugzilla bug 1745004, which is valid. The bug has been moved to the POST state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/lgtm |
Can someone please give this PR some love? For baremetal deployments it is quite exhausting to reinstall if you forget to restart the ironic service, given the time the machines take to reboot. |
data/data/bootstrap/baremetal/files/usr/local/bin/startironic.sh.template
Some people are hitting issues where the containers appear to be running in podman ps output but are in fact unresponsive, and podman exec/inspect commands fail. This may be a libpod bug (I'm looking for related issues), but as a workaround we can check the inspect status, which should let us detect zombie containers and restart ironic.service, which appears to resolve the issue. Related: openshift-metal3/dev-scripts#753
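For illustration, a minimal sketch of the kind of check this describes (a watchdog-style loop; the container names and the restart mechanism are assumptions, not the exact code in this PR):

```bash
#!/bin/bash
# Hedged sketch: catch "zombie" containers that podman ps still lists as
# running but that podman inspect cannot query (or reports as not running),
# then bounce ironic.service. Container names here are illustrative.
for name in ironic-api ironic-conductor ironic-inspector; do
    state="$(sudo podman inspect "$name" --format '{{.State.Status}}' 2>/dev/null)"
    if [ "$state" != "running" ]; then
        echo "$name unresponsive (state: ${state:-unknown}); restarting ironic.service"
        sudo systemctl restart ironic.service
        exit 1
    fi
done
```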
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abhinavdahiya, hardys, russellb The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment |
/retest Please review the full test history for this PR and help us cut down flakes. |
7 similar comments
@hardys: The following tests failed, say /retest to rerun them all:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@hardys: All pull requests linked via external trackers have merged. Bugzilla bug 1745004 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This comment should have been adjusted following code-review updates in openshift#2249, but I missed it; now that we are using the --format option, clarify the comment to explain the multiple templating