
Bug 1745004: baremetal: Use podman inspect to check ironic service status #2249

Merged
merged 1 commit into openshift:master on Aug 28, 2019

Conversation

hardys
Contributor

@hardys hardys commented Aug 21, 2019

Some people are hitting issues where the containers appear running in
podman ps output but are in fact unresponsive, and podman exec/inspect
CLI operations fail.

This may be a libpod bug (looking for related issues), but as a workaround
we can check the inspect status, which should mean we can detect zombie
containers and restart ironic.service, which appears to resolve the
issue.

Related: openshift-metal3/dev-scripts#753
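
For reference, a minimal sketch of the kind of check described here, reconstructed from the startironic.sh traces quoted later in the thread (the container names, grep pattern, and restart-on-failure behaviour are taken from those logs; the merged script later switched to the podman inspect --format option, so treat this as illustrative only):

#!/bin/bash
# Sketch only: not the exact script in this PR.
for name in ironic-api ironic-conductor ironic-inspector dnsmasq httpd mariadb; do
    # "podman ps" can still list a zombie container as running, so rely on
    # "podman inspect" instead: it fails (or reports a non-running state)
    # once the container is no longer usable.
    if ! podman inspect "$name" | grep -q '"Status": "running"'; then
        echo "ERROR: Unexpected service status for $name"
        podman inspect "$name"
        # Exiting non-zero makes systemd mark ironic.service as failed and,
        # with Restart= configured, schedule a restart that recreates the
        # containers (see the journal excerpts later in this thread).
        exit 1
    fi
done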

@hardys hardys changed the title from "baremetal: Use podman inspect to check ironic service status" to "WIP: baremetal: Use podman inspect to check ironic service status" on Aug 21, 2019
@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 21, 2019
@hardys
Contributor Author

hardys commented Aug 21, 2019

Marking WIP pending feedback from those running into these issues; I've not been able to reproduce it reliably in my environment.

@abhinavdahiya
Contributor

Why are you not running these services using static pods? They have liveness probes and such?

@dantrainor

dantrainor commented Aug 21, 2019

The patch appeared to work for me, in that the changes got me past the part they were meant to fix, but it now appears that I'm stuck in the link-machine-and-node.sh loop:

+ echo -n 'Waiting for openshift-master-0 to stabilize ... '
Waiting for openshift-master-0 to stabilize ... + time_diff=-1566410097
+ [[ -1566410097 -gt '' ]]
++ curl -s -X GET http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts/openshift-master-0/status -H 'Accept: application/json' -H 'Content-Type: application/json' -H 'User-Agent: link-machine-and-node'
++ jq .status.provisioning.state
++ sed 's/"//g'
+ state=null
+ echo null
null
+ '[' null = 'externally provisioned' ']'
+ sleep 5

I'm not sure if this is related, but a common error from all of the nodes is:

(not sure why my markdown above isn't working; here's the actual error: Aug 21 19:42:09 master-0 hyperkube[1236]: E0821 19:42:09.096126 1236 kubelet.go:1648] Failed creating a mirror pod for "mdns-publisher-master-0_openshift-kni-infra(ea93e1102ed250148867ccbe693731e5)": namespaces "openshift-kni-infra" not found)

There's no .status.provisioning.state present when querying the link machine URL, and I think it's because bits of openshift-kni-infra never get started, so there's no data to report.

I'll redeploy and test more.
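
For reference, the state being polled can be queried directly with something like this (a sketch assuming the same localhost:8001 API proxy that link-machine-and-node.sh uses):

curl -s http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts/openshift-master-0/status \
  | jq -r '.status.provisioning.state'
# Prints "null" until the baremetal-operator has populated the status;
# the loop above only exits once this reports "externally provisioned".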

@dantrainor

dantrainor commented Aug 21, 2019

Failed again, looks like it's having issues finding the IPA images:

Aug 21 20:24:14 localhost openshift.sh[1621]: kubectl create --filename ./90_metal3_baremetalhost_crd.yaml failed. Retrying in 5 seconds...
Aug 21 20:24:14 localhost startironic.sh[2054]: + curl --fail --head http://localhost/images/ironic-python-agent.initramfs
Aug 21 20:24:14 localhost startironic.sh[2054]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Aug 21 20:24:14 localhost startironic.sh[2054]:                                  Dload  Upload   Total   Spent    Left  Speed
Aug 21 20:24:14 localhost startironic.sh[2054]: [158B blob data]
Aug 21 20:24:14 localhost startironic.sh[2054]: curl: (22) The requested URL returned error: 404 Not Found
Aug 21 20:24:14 localhost startironic.sh[2054]: + sleep 1
Aug 21 20:24:15 localhost startironic.sh[2054]: + curl --fail --head http://localhost/images/ironic-python-agent.initramfs
Aug 21 20:24:15 localhost startironic.sh[2054]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Aug 21 20:24:15 localhost startironic.sh[2054]:                                  Dload  Upload   Total   Spent    Left  Speed
Aug 21 20:24:15 localhost startironic.sh[2054]: [158B blob data]
Aug 21 20:24:15 localhost startironic.sh[2054]: curl: (22) The requested URL returned error: 404 Not Found
Aug 21 20:24:15 localhost startironic.sh[2054]: + sleep 1
Aug 21 20:24:16 localhost startironic.sh[2054]: + curl --fail --head http://localhost/images/ironic-python-agent.initramfs
Aug 21 20:24:16 localhost startironic.sh[2054]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Aug 21 20:24:16 localhost startironic.sh[2054]:                                  Dload  Upload   Total   Spent    Left  Speed
Aug 21 20:24:16 localhost startironic.sh[2054]: [158B blob data]
Aug 21 20:24:16 localhost startironic.sh[2054]: curl: (22) The requested URL returned error: 404 Not Found
Aug 21 20:24:16 localhost startironic.sh[2054]: + sleep 1
Aug 21 20:24:17 localhost startironic.sh[2054]: + curl --fail --head http://localhost/images/ironic-python-agent.initramfs
Aug 21 20:24:17 localhost startironic.sh[2054]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Aug 21 20:24:17 localhost startironic.sh[2054]:                                  Dload  Upload   Total   Spent    Left  Speed
Aug 21 20:24:17 localhost startironic.sh[2054]: [158B blob data]
Aug 21 20:24:17 localhost startironic.sh[2054]: curl: (22) The requested URL returned error: 404 Not Found

This part may be related to #2234

@dantrainor

Great success. Deployment works on my end, with these changes rebased on top of #2234.

@hardys hardys changed the title from "WIP: baremetal: Use podman inspect to check ironic service status" to "baremetal: Use podman inspect to check ironic service status" on Aug 22, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 22, 2019
@hardys
Contributor Author

hardys commented Aug 22, 2019

Why are you not running these services using static pods? They have liveness probes and such?

Yeah, I think static pods may be a better approach. I started out copying the systemd/script approach used for the DNS integration (which has since moved to static pods launched via the MCO).

These services are specific to the bootstrap VM (the same services are launched on the cluster via the MAO as part of bringing up the baremetal-operator), so I guess I can define a static pod via the installer assets, similar to how we're creating this systemd service?

I removed the WIP since Dan confirmed this works around his immediate blocker issues, and I raised #2251 to track moving to a static pod. I'll investigate and propose that as a follow-up if that's OK with @abhinavdahiya.

@hardys
Contributor Author

hardys commented Aug 22, 2019

Failed again, looks like it's having issues finding the IPA images:

These errors are temporary and expected; we loop waiting for the images to be downloaded by the downloader containers.
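
A sketch of the wait loop implied by the trace above (the URL and sleep interval are taken from the log; the surrounding structure is an assumption):

# Keep polling until httpd serves the IPA image; 404s are expected while
# the downloader containers are still fetching it into the web root.
until curl --fail --head http://localhost/images/ironic-python-agent.initramfs; do
    sleep 1
done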

@derekhiggins
Contributor

Works for me (it kills the service when needed); see the "error getting container from store..." errors below:

Aug 22 06:44:39 localhost startironic.sh[2221]: + for name in ironic-api ironic-conductor ironic-inspector dnsmasq httpd mariadb
Aug 22 06:44:39 localhost startironic.sh[2221]: + grep -q '"Status": "running"'
Aug 22 06:44:39 localhost startironic.sh[2221]: + podman inspect dnsmasq
Aug 22 06:44:39 localhost startironic.sh[2221]: Error: error getting libpod container inspect data dffc1656d3e965bb691bfee3047071973f9d7669b9fd6d68c3e13fb697a86c43: error getting container from store "dffc1656d>
Aug 22 06:44:39 localhost startironic.sh[2221]: + echo 'ERROR: Unexpected service status for dnsmasq'
Aug 22 06:44:39 localhost startironic.sh[2221]: ERROR: Unexpected service status for dnsmasq
Aug 22 06:44:39 localhost startironic.sh[2221]: + podman inspect dnsmasq
Aug 22 06:44:39 localhost startironic.sh[2221]: Error: error getting libpod container inspect data dffc1656d3e965bb691bfee3047071973f9d7669b9fd6d68c3e13fb697a86c43: error getting container from store "dffc1656d>
Aug 22 06:44:39 localhost systemd[1]: ironic.service: Main process exited, code=exited, status=125/n/a
Aug 22 06:44:39 localhost systemd[1]: ironic.service: Failed with result 'exit-code'.
Aug 22 06:44:49 localhost systemd[1]: ironic.service: Service RestartSec=10s expired, scheduling restart.
Aug 22 06:44:49 localhost systemd[1]: ironic.service: Scheduled restart job, restart counter is at 1.
Aug 22 06:44:49 localhost systemd[1]: Stopped Baremetal Deployment Ironic Services.
Aug 22 06:44:49 localhost systemd[1]: Started Baremetal Deployment Ironic Services.

@russellb
Member

/lgtm since #2251 was filed to track moving this to a static pod

@hardys
Contributor Author

hardys commented Aug 22, 2019

Note that @dhiggins figured out that the root cause is a crio restart (ref #2251 (comment)), so moving to static pods will likely be the long-term fix, but this has been confirmed by several people as working around the immediate blocker issue.
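
A quick way to confirm the zombie-container symptom by hand (a sketch; dnsmasq is just one of the affected containers, and this assumes you run it on the bootstrap host):

podman ps --filter name=dnsmasq    # can still show the container as Up
podman inspect dnsmasq             # fails with "error getting container from store"
                                   # once libpod's state has been disturbed by the crio restart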

@rdoxenham
Contributor

Can confirm: I suspect I run into the problem this PR is solving in about 50% of my builds. Big LGTM from me. Thx!

rdoxenham pushed a commit to rdoxenham/installer that referenced this pull request Aug 23, 2019
@rdoxenham
Contributor

Also seeing a rather odd one which I suspect is because we're not handling the pod errors without this patch:

[kni@provisioner ~]$ openstack baremetal node list
(pymysql.err.ProgrammingError) (1146, u"Table 'ironic.nodes' doesn't exist") [SQL: u'SELECT anon_1.nodes_created_at AS anon_1_nodes_created_at, anon_1.nodes_updated_at AS anon_1_nodes_updated_at, anon_1.nodes_version AS anon_1_nodes_version, anon_1.nodes_id AS anon_1_nodes_id, anon_1.nodes_uuid AS anon_1_nodes_uuid, anon_1.nodes_instance_uuid AS anon_1_nodes_instance_uuid, anon_1.nodes_name AS anon_1_nodes_name, anon_1.nodes_chassis_id AS anon_1_nodes_chassis_id, anon_1.nodes_power_state AS anon_1_nodes_power_state, anon_1.nodes_target_power_state AS anon_1_nodes_target_power_state, anon_1.nodes_provision_state AS anon_1_nodes_provision_state, anon_1.nodes_target_provision_state AS anon_1_nodes_target_provision_state, anon_1.nodes_provision_updated_at AS anon_1_nodes_provision_updated_at, anon_1.nodes_last_error AS anon_1_nodes_last_error, anon_1.nodes_instance_info AS anon_1_nodes_instance_info, anon_1.nodes_properties AS anon_1_nodes_properties, anon_1.nodes_driver AS anon_1_nodes_driver, anon_1.nodes_driver_info AS anon_1_nodes_driver_info, anon_1.nodes_driver_internal_info AS anon_1_nodes_driver_internal_info, anon_1.nodes_clean_step AS anon_1_nodes_clean_step, anon_1.nodes_deploy_step AS anon_1_nodes_deploy_step, anon_1.nodes_resource_class AS anon_1_nodes_resource_class, anon_1.nodes_raid_config AS anon_1_nodes_raid_config, anon_1.nodes_target_raid_config AS anon_1_nodes_target_raid_config, anon_1.nodes_reservation AS anon_1_nodes_reservation, anon_1.nodes_conductor_affinity AS anon_1_nodes_conductor_affinity, anon_1.nodes_conductor_group AS anon_1_nodes_conductor_group, anon_1.nodes_maintenance AS anon_1_nodes_maintenance, anon_1.nodes_maintenance_reason AS anon_1_nodes_maintenance_reason, anon_1.nodes_fault AS anon_1_nodes_fault, anon_1.nodes_console_enabled AS anon_1_nodes_console_enabled, anon_1.nodes_inspection_finished_at AS anon_1_nodes_inspection_finished_at, anon_1.nodes_inspection_started_at AS anon_1_nodes_inspection_started_at, anon_1.nodes_extra AS anon_1_nodes_extra, anon_1.nodes_automated_clean AS anon_1_nodes_automated_clean, anon_1.nodes_protected AS anon_1_nodes_protected, anon_1.nodes_protected_reason AS anon_1_nodes_protected_reason, anon_1.nodes_owner AS anon_1_nodes_owner, anon_1.nodes_allocation_id AS anon_1_nodes_allocation_id, anon_1.nodes_description AS anon_1_nodes_description, anon_1.nodes_bios_interface AS anon_1_nodes_bios_interface, anon_1.nodes_boot_interface AS anon_1_nodes_boot_interface, anon_1.nodes_console_interface AS anon_1_nodes_console_interface, anon_1.nodes_deploy_interface AS anon_1_nodes_deploy_interface, anon_1.nodes_inspect_interface AS anon_1_nodes_inspect_interface, anon_1.nodes_management_interface AS anon_1_nodes_management_interface, anon_1.nodes_network_interface AS anon_1_nodes_network_interface, anon_1.nodes_raid_interface AS anon_1_nodes_raid_interface, anon_1.nodes_rescue_interface AS anon_1_nodes_rescue_interface, anon_1.nodes_storage_interface AS anon_1_nodes_storage_interface, anon_1.nodes_power_interface AS anon_1_nodes_power_interface, anon_1.nodes_vendor_interface AS anon_1_nodes_vendor_interface, node_tags_1.created_at AS node_tags_1_created_at, node_tags_1.updated_at AS node_tags_1_updated_at, node_tags_1.version AS node_tags_1_version, node_tags_1.node_id AS node_tags_1_node_id, node_tags_1.tag AS node_tags_1_tag, node_traits_1.created_at AS node_traits_1_created_at, node_traits_1.updated_at AS node_traits_1_updated_at, node_traits_1.version AS node_traits_1_version, node_traits_1.node_id AS node_traits_1_node_id, 
node_traits_1.trait AS node_traits_1_trait \nFROM (SELECT nodes.created_at AS nodes_created_at, nodes.updated_at AS nodes_updated_at, nodes.version AS nodes_version, nodes.id AS nodes_id, nodes.uuid AS nodes_uuid, nodes.instance_uuid AS nodes_instance_uuid, nodes.name AS nodes_name, nodes.chassis_id AS nodes_chassis_id, nodes.power_state AS nodes_power_state, nodes.target_power_state AS nodes_target_power_state, nodes.provision_state AS nodes_provision_state, nodes.target_provision_state AS nodes_target_provision_state, nodes.provision_updated_at AS nodes_provision_updated_at, nodes.last_error AS nodes_last_error, nodes.instance_info AS nodes_instance_info, nodes.properties AS nodes_properties, nodes.driver AS nodes_driver, nodes.driver_info AS nodes_driver_info, nodes.driver_internal_info AS nodes_driver_internal_info, nodes.clean_step AS nodes_clean_step, nodes.deploy_step AS nodes_deploy_step, nodes.resource_class AS nodes_resource_class, nodes.raid_config AS nodes_raid_config, nodes.target_raid_config AS nodes_target_raid_config, nodes.reservation AS nodes_reservation, nodes.conductor_affinity AS nodes_conductor_affinity, nodes.conductor_group AS nodes_conductor_group, nodes.maintenance AS nodes_maintenance, nodes.maintenance_reason AS nodes_maintenance_reason, nodes.fault AS nodes_fault, nodes.console_enabled AS nodes_console_enabled, nodes.inspection_finished_at AS nodes_inspection_finished_at, nodes.inspection_started_at AS nodes_inspection_started_at, nodes.extra AS nodes_extra, nodes.automated_clean AS nodes_automated_clean, nodes.protected AS nodes_protected, nodes.protected_reason AS nodes_protected_reason, nodes.owner AS nodes_owner, nodes.allocation_id AS nodes_allocation_id, nodes.description AS nodes_description, nodes.bios_interface AS nodes_bios_interface, nodes.boot_interface AS nodes_boot_interface, nodes.console_interface AS nodes_console_interface, nodes.deploy_interface AS nodes_deploy_interface, nodes.inspect_interface AS nodes_inspect_interface, nodes.management_interface AS nodes_management_interface, nodes.network_interface AS nodes_network_interface, nodes.raid_interface AS nodes_raid_interface, nodes.rescue_interface AS nodes_rescue_interface, nodes.storage_interface AS nodes_storage_interface, nodes.power_interface AS nodes_power_interface, nodes.vendor_interface AS nodes_vendor_interface \nFROM nodes ORDER BY nodes.id ASC \n LIMIT %(param_1)s) AS anon_1 LEFT OUTER JOIN node_tags AS node_tags_1 ON node_tags_1.node_id = anon_1.nodes_id LEFT OUTER JOIN node_traits AS node_traits_1 ON node_traits_1.node_id = anon_1.nodes_id ORDER BY anon_1.nodes_id ASC'] [parameters: {u'param_1': 1000}] (Background on this error at: http://sqlalche.me/e/f405) (HTTP 500)

I have to manually restart Ironic for this to work properly without the patch here.

@hardys
Contributor Author

hardys commented Aug 23, 2019

Also seeing a rather odd one which I suspect is because we're not handling the pod errors without this patch:

As discussed, this is another manifestation of the same issue: the conductor service got killed, so the db-sync didn't happen. This restart approach should work around it until we get the podman/crio restart issues worked out and/or switch to a static pod.
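
One way to check for this particular failure mode (a sketch: it assumes the database container is named mariadb, as in the service-check loop, and that any credentials your deployment requires are supplied):

# If the conductor's db-sync never ran, the ironic schema has no nodes table,
# which matches the pymysql error quoted above.
podman exec mariadb mysql -e 'SHOW TABLES FROM ironic;'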

@hardys hardys changed the title from "baremetal: Use podman inspect to check ironic service status" to "Bug 1745004: baremetal: Use podman inspect to check ironic service status" on Aug 23, 2019
@openshift-ci-robot
Contributor

@hardys: This pull request references Bugzilla bug 1745004, which is invalid:

  • expected the bug to target the "4.2.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1745004: baremetal: Use podman inspect to check ironic service status

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Aug 23, 2019
@hardys hardys changed the title from "baremetal: Use podman inspect to check ironic service status" to "Bug 1745004: baremetal: Use podman inspect to check ironic service status" on Aug 23, 2019
@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 23, 2019
@openshift-ci-robot
Contributor

@hardys: This pull request references Bugzilla bug 1745004, which is valid. The bug has been moved to the POST state.

In response to this:

Bug 1745004: baremetal: Use podman inspect to check ironic service status

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@russellb
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 23, 2019
@e-minguez
Contributor

Can someone please give this PR some love? For baremetal deployments it's quite exhausting to reinstall if you forget to restart the ironic service, given how long the machines take to reboot.
Thanks!!!
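
Until this merges, the manual workaround mentioned in the thread is to restart the systemd unit on the bootstrap VM (unit name taken from the journal excerpts above):

sudo systemctl restart ironic.service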

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Aug 27, 2019
@abhinavdahiya
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 27, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, hardys, russellb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 27, 2019
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

7 similar comments

@openshift-ci-robot
Contributor

@hardys: The following tests failed, say /retest to rerun them all:

Test name             | Commit  | Details | Rerun command
ci/prow/e2e-libvirt   | 7ce0f5c | link    | /test e2e-libvirt
ci/prow/e2e-openstack | 7ce0f5c | link    | /test e2e-openstack

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 6318c72 into openshift:master Aug 28, 2019
@openshift-ci-robot
Contributor

@hardys: All pull requests linked via external trackers have merged. Bugzilla bug 1745004 has been moved to the MODIFIED state.

In response to this:

Bug 1745004: baremetal: Use podman inspect to check ironic service status

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

hardys pushed a commit to hardys/installer that referenced this pull request Aug 28, 2019
This comment should have been adjusted following the code-review
updates in openshift#2249, but I missed it; now that we are using the --format
option, clarify the comment to explain the multiple templating
jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019
This comment should have been adjusted following the code-review
updates in openshift#2249, but I missed it; now that we are using the --format
option, clarify the comment to explain the multiple templating
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
lgtm: Indicates that a PR is ready to be merged.
size/XS: Denotes a PR that changes 0-9 lines, ignoring generated files.