
Bug 1745004: baremetal: Use podman inspect to check ironic service status #2249

Merged
merged 1 commit into openshift:master on Aug 28, 2019

Conversation

hardys
Contributor

@hardys hardys commented Aug 21, 2019

Some people are hitting issues where the containers appear running in
podman ps output but are in fact unresponsive, and podman exec/inspect
CLI operations fail.

This may be a libpod bug (looking for related issues), but as a workaround
we can check the inspect status, which should mean we can detect zombie
containers and restart ironic.service, which appears to resolve the
issue.

Related: openshift-metal3/dev-scripts#753
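
For reference, a minimal sketch of the kind of check described here, reconstructed from the startironic.sh traces quoted later in the thread (the container names, grep pattern, and restart-on-failure behaviour are taken from those logs; the merged script later switched to the podman inspect --format option, so treat this as illustrative only):

#!/bin/bash
# Sketch only: not the exact script in this PR.
for name in ironic-api ironic-conductor ironic-inspector dnsmasq httpd mariadb; do
    # "podman ps" can still list a zombie container as running, so rely on
    # "podman inspect" instead: it fails (or reports a non-running state)
    # once the container is no longer usable.
    if ! podman inspect "$name" | grep -q '"Status": "running"'; then
        echo "ERROR: Unexpected service status for $name"
        podman inspect "$name"
        # Exiting non-zero makes systemd mark ironic.service as failed and,
        # with Restart= configured, schedule a restart that recreates the
        # containers (see the journal excerpts later in this thread).
        exit 1
    fi
done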

@hardys hardys changed the title from "baremetal: Use podman inspect to check ironic service status" to "WIP: baremetal: Use podman inspect to check ironic service status" on Aug 21, 2019
@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 21, 2019
@hardys
Contributor Author

hardys commented Aug 21, 2019

Marking WIP pending feedback from those running into these issues; I've not been able to reproduce it reliably in my environment.

@abhinavdahiya
Contributor

Why are you not running these services using static pods? They have liveness probes and such?

@dantrainor

dantrainor commented Aug 21, 2019

The patch appeared to work for me, in that the changes got me past the part they were meant to fix, but it now appears that I'm stuck in the link-machine-and-node.sh loop:

+ echo -n 'Waiting for openshift-master-0 to stabilize ... '
Waiting for openshift-master-0 to stabilize ... + time_diff=-1566410097
+ [[ -1566410097 -gt '' ]]
++ curl -s -X GET http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts/openshift-master-0/status -H 'Accept: application/json' -H 'Content-Type: application/json' -H 'User-Agent: link-machine-and-node'
++ jq .status.provisioning.state
++ sed 's/"//g'
+ state=null
+ echo null
null
+ '[' null = 'externally provisioned' ']'
+ sleep 5

I'm not sure if this is related, but a common error from all of the nodes is:

(not sure why my markdown above isn't working; here's the actual error: Aug 21 19:42:09 master-0 hyperkube[1236]: E0821 19:42:09.096126 1236 kubelet.go:1648] Failed creating a mirror pod for "mdns-publisher-master-0_openshift-kni-infra(ea93e1102ed250148867ccbe693731e5)": namespaces "openshift-kni-infra" not found)

There's no .status.provisioning.state present when querying the link machine URL, and I think it's because bits of openshift-kni-infra never get started, so there's no data to report.

I'll redeploy and test more.
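
For reference, the state being polled can be queried directly with something like this (a sketch assuming the same localhost:8001 API proxy that link-machine-and-node.sh uses):

curl -s http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts/openshift-master-0/status \
  | jq -r '.status.provisioning.state'
# Prints "null" until the baremetal-operator has populated the status;
# the loop above only exits once this reports "externally provisioned".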

@dantrainor

dantrainor commented Aug 21, 2019

Failed again, looks like it's having issues finding the IPA images:

Aug 21 20:24:14 localhost openshift.sh[1621]: kubectl create --filename ./90_metal3_baremetalhost_crd.yaml failed. Retrying in 5 seconds...
Aug 21 20:24:14 localhost startironic.sh[2054]: + curl --fail --head http://localhost/images/ironic-python-agent.initramfs
Aug 21 20:24:14 localhost startironic.sh[2054]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Aug 21 20:24:14 localhost startironic.sh[2054]:                                  Dload  Upload   Total   Spent    Left  Speed
Aug 21 20:24:14 localhost startironic.sh[2054]: [158B blob data]
Aug 21 20:24:14 localhost startironic.sh[2054]: curl: (22) The requested URL returned error: 404 Not Found
Aug 21 20:24:14 localhost startironic.sh[2054]: + sleep 1
Aug 21 20:24:15 localhost startironic.sh[2054]: + curl --fail --head http://localhost/images/ironic-python-agent.initramfs
Aug 21 20:24:15 localhost startironic.sh[2054]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Aug 21 20:24:15 localhost startironic.sh[2054]:                                  Dload  Upload   Total   Spent    Left  Speed
Aug 21 20:24:15 localhost startironic.sh[2054]: [158B blob data]
Aug 21 20:24:15 localhost startironic.sh[2054]: curl: (22) The requested URL returned error: 404 Not Found
Aug 21 20:24:15 localhost startironic.sh[2054]: + sleep 1
Aug 21 20:24:16 localhost startironic.sh[2054]: + curl --fail --head http://localhost/images/ironic-python-agent.initramfs
Aug 21 20:24:16 localhost startironic.sh[2054]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Aug 21 20:24:16 localhost startironic.sh[2054]:                                  Dload  Upload   Total   Spent    Left  Speed
Aug 21 20:24:16 localhost startironic.sh[2054]: [158B blob data]
Aug 21 20:24:16 localhost startironic.sh[2054]: curl: (22) The requested URL returned error: 404 Not Found
Aug 21 20:24:16 localhost startironic.sh[2054]: + sleep 1
Aug 21 20:24:17 localhost startironic.sh[2054]: + curl --fail --head http://localhost/images/ironic-python-agent.initramfs
Aug 21 20:24:17 localhost startironic.sh[2054]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Aug 21 20:24:17 localhost startironic.sh[2054]:                                  Dload  Upload   Total   Spent    Left  Speed
Aug 21 20:24:17 localhost startironic.sh[2054]: [158B blob data]
Aug 21 20:24:17 localhost startironic.sh[2054]: curl: (22) The requested URL returned error: 404 Not Found

This part may be related to #2234

@dantrainor

Great success. Deployment works on my end, with these changes rebased on top of #2234.

@hardys hardys changed the title from "WIP: baremetal: Use podman inspect to check ironic service status" to "baremetal: Use podman inspect to check ironic service status" on Aug 22, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 22, 2019
@hardys
Contributor Author

hardys commented Aug 22, 2019

Why are you not running these services using static pods? They have liveness probes and such?

Yeah, I think static pods may be a better approach. I started out copying the systemd/script approach used for the DNS integration (which has since moved to static pods launched via the MCO).

These services are specific to the bootstrap VM (the same services are launched on the cluster via the MAO as part of bringing up the baremetal-operator), so I guess I can define a static pod via the installer assets, similar to how we're creating this systemd service?

I removed the WIP since Dan confirmed this works around his immediate blocker issues, and I raised #2251 to track moving to a static pod. I'll investigate and propose that as a follow-up if that's OK with @abhinavdahiya.

@hardys
Contributor Author

hardys commented Aug 22, 2019

Failed again, looks like it's having issues finding the IPA images:

These errors are temporary and expected; we loop waiting for the images to be downloaded by the downloader containers.
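
A sketch of the wait loop implied by the trace above (the URL and sleep interval are taken from the log; the surrounding structure is an assumption):

# Keep polling until httpd serves the IPA image; 404s are expected while
# the downloader containers are still fetching it into the web root.
until curl --fail --head http://localhost/images/ironic-python-agent.initramfs; do
    sleep 1
done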

@derekhiggins
Contributor

Works for me (it kills the service when needed); see the "error getting container from store..." errors below:

Aug 22 06:44:39 localhost startironic.sh[2221]: + for name in ironic-api ironic-conductor ironic-inspector dnsmasq httpd mariadb
Aug 22 06:44:39 localhost startironic.sh[2221]: + grep -q '"Status": "running"'
Aug 22 06:44:39 localhost startironic.sh[2221]: + podman inspect dnsmasq
Aug 22 06:44:39 localhost startironic.sh[2221]: Error: error getting libpod container inspect data dffc1656d3e965bb691bfee3047071973f9d7669b9fd6d68c3e13fb697a86c43: error getting container from store "dffc1656d>
Aug 22 06:44:39 localhost startironic.sh[2221]: + echo 'ERROR: Unexpected service status for dnsmasq'
Aug 22 06:44:39 localhost startironic.sh[2221]: ERROR: Unexpected service status for dnsmasq
Aug 22 06:44:39 localhost startironic.sh[2221]: + podman inspect dnsmasq
Aug 22 06:44:39 localhost startironic.sh[2221]: Error: error getting libpod container inspect data dffc1656d3e965bb691bfee3047071973f9d7669b9fd6d68c3e13fb697a86c43: error getting container from store "dffc1656d>
Aug 22 06:44:39 localhost systemd[1]: ironic.service: Main process exited, code=exited, status=125/n/a
Aug 22 06:44:39 localhost systemd[1]: ironic.service: Failed with result 'exit-code'.
Aug 22 06:44:49 localhost systemd[1]: ironic.service: Service RestartSec=10s expired, scheduling restart.
Aug 22 06:44:49 localhost systemd[1]: ironic.service: Scheduled restart job, restart counter is at 1.
Aug 22 06:44:49 localhost systemd[1]: Stopped Baremetal Deployment Ironic Services.
Aug 22 06:44:49 localhost systemd[1]: Started Baremetal Deployment Ironic Services.

@russellb
Member

/lgtm since #2251 was filed to track moving this to a static pod

@hardys
Contributor Author

hardys commented Aug 22, 2019

Note that @dhiggins figured out that the root cause is a crio restart (ref #2251 (comment)), so moving to static pods will likely be the long-term fix, but this has been confirmed by several people as working around the immediate blocker issue.
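
A quick way to confirm the zombie-container symptom by hand (a sketch; dnsmasq is just one of the affected containers, and this assumes you run it on the bootstrap host):

podman ps --filter name=dnsmasq    # can still show the container as Up
podman inspect dnsmasq             # fails with "error getting container from store"
                                   # once libpod's state has been disturbed by the crio restart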

@rdoxenham
Contributor

Can confirm: I suspect I run into the problem this PR is solving in about 50% of my builds. Big LGTM from me. Thx!

rdoxenham pushed a commit to rdoxenham/installer that referenced this pull request Aug 23, 2019
@rdoxenham
Contributor

Also seeing a rather odd one which I suspect is because we're not handling the pod errors without this patch:

[kni@provisioner ~]$ openstack baremetal node list
(pymysql.err.ProgrammingError) (1146, u"Table 'ironic.nodes' doesn't exist") [SQL: u'SELECT anon_1.nodes_created_at AS anon_1_nodes_created_at, anon_1.nodes_updated_at AS anon_1_nodes_updated_at, anon_1.nodes_version AS anon_1_nodes_version, anon_1.nodes_id AS anon_1_nodes_id, anon_1.nodes_uuid AS anon_1_nodes_uuid, anon_1.nodes_instance_uuid AS anon_1_nodes_instance_uuid, anon_1.nodes_name AS anon_1_nodes_name, anon_1.nodes_chassis_id AS anon_1_nodes_chassis_id, anon_1.nodes_power_state AS anon_1_nodes_power_state, anon_1.nodes_target_power_state AS anon_1_nodes_target_power_state, anon_1.nodes_provision_state AS anon_1_nodes_provision_state, anon_1.nodes_target_provision_state AS anon_1_nodes_target_provision_state, anon_1.nodes_provision_updated_at AS anon_1_nodes_provision_updated_at, anon_1.nodes_last_error AS anon_1_nodes_last_error, anon_1.nodes_instance_info AS anon_1_nodes_instance_info, anon_1.nodes_properties AS anon_1_nodes_properties, anon_1.nodes_driver AS anon_1_nodes_driver, anon_1.nodes_driver_info AS anon_1_nodes_driver_info, anon_1.nodes_driver_internal_info AS anon_1_nodes_driver_internal_info, anon_1.nodes_clean_step AS anon_1_nodes_clean_step, anon_1.nodes_deploy_step AS anon_1_nodes_deploy_step, anon_1.nodes_resource_class AS anon_1_nodes_resource_class, anon_1.nodes_raid_config AS anon_1_nodes_raid_config, anon_1.nodes_target_raid_config AS anon_1_nodes_target_raid_config, anon_1.nodes_reservation AS anon_1_nodes_reservation, anon_1.nodes_conductor_affinity AS anon_1_nodes_conductor_affinity, anon_1.nodes_conductor_group AS anon_1_nodes_conductor_group, anon_1.nodes_maintenance AS anon_1_nodes_maintenance, anon_1.nodes_maintenance_reason AS anon_1_nodes_maintenance_reason, anon_1.nodes_fault AS anon_1_nodes_fault, anon_1.nodes_console_enabled AS anon_1_nodes_console_enabled, anon_1.nodes_inspection_finished_at AS anon_1_nodes_inspection_finished_at, anon_1.nodes_inspection_started_at AS anon_1_nodes_inspection_started_at, anon_1.nodes_extra AS anon_1_nodes_extra, anon_1.nodes_automated_clean AS anon_1_nodes_automated_clean, anon_1.nodes_protected AS anon_1_nodes_protected, anon_1.nodes_protected_reason AS anon_1_nodes_protected_reason, anon_1.nodes_owner AS anon_1_nodes_owner, anon_1.nodes_allocation_id AS anon_1_nodes_allocation_id, anon_1.nodes_description AS anon_1_nodes_description, anon_1.nodes_bios_interface AS anon_1_nodes_bios_interface, anon_1.nodes_boot_interface AS anon_1_nodes_boot_interface, anon_1.nodes_console_interface AS anon_1_nodes_console_interface, anon_1.nodes_deploy_interface AS anon_1_nodes_deploy_interface, anon_1.nodes_inspect_interface AS anon_1_nodes_inspect_interface, anon_1.nodes_management_interface AS anon_1_nodes_management_interface, anon_1.nodes_network_interface AS anon_1_nodes_network_interface, anon_1.nodes_raid_interface AS anon_1_nodes_raid_interface, anon_1.nodes_rescue_interface AS anon_1_nodes_rescue_interface, anon_1.nodes_storage_interface AS anon_1_nodes_storage_interface, anon_1.nodes_power_interface AS anon_1_nodes_power_interface, anon_1.nodes_vendor_interface AS anon_1_nodes_vendor_interface, node_tags_1.created_at AS node_tags_1_created_at, node_tags_1.updated_at AS node_tags_1_updated_at, node_tags_1.version AS node_tags_1_version, node_tags_1.node_id AS node_tags_1_node_id, node_tags_1.tag AS node_tags_1_tag, node_traits_1.created_at AS node_traits_1_created_at, node_traits_1.updated_at AS node_traits_1_updated_at, node_traits_1.version AS node_traits_1_version, node_traits_1.node_id AS node_traits_1_node_id, 
node_traits_1.trait AS node_traits_1_trait \nFROM (SELECT nodes.created_at AS nodes_created_at, nodes.updated_at AS nodes_updated_at, nodes.version AS nodes_version, nodes.id AS nodes_id, nodes.uuid AS nodes_uuid, nodes.instance_uuid AS nodes_instance_uuid, nodes.name AS nodes_name, nodes.chassis_id AS nodes_chassis_id, nodes.power_state AS nodes_power_state, nodes.target_power_state AS nodes_target_power_state, nodes.provision_state AS nodes_provision_state, nodes.target_provision_state AS nodes_target_provision_state, nodes.provision_updated_at AS nodes_provision_updated_at, nodes.last_error AS nodes_last_error, nodes.instance_info AS nodes_instance_info, nodes.properties AS nodes_properties, nodes.driver AS nodes_driver, nodes.driver_info AS nodes_driver_info, nodes.driver_internal_info AS nodes_driver_internal_info, nodes.clean_step AS nodes_clean_step, nodes.deploy_step AS nodes_deploy_step, nodes.resource_class AS nodes_resource_class, nodes.raid_config AS nodes_raid_config, nodes.target_raid_config AS nodes_target_raid_config, nodes.reservation AS nodes_reservation, nodes.conductor_affinity AS nodes_conductor_affinity, nodes.conductor_group AS nodes_conductor_group, nodes.maintenance AS nodes_maintenance, nodes.maintenance_reason AS nodes_maintenance_reason, nodes.fault AS nodes_fault, nodes.console_enabled AS nodes_console_enabled, nodes.inspection_finished_at AS nodes_inspection_finished_at, nodes.inspection_started_at AS nodes_inspection_started_at, nodes.extra AS nodes_extra, nodes.automated_clean AS nodes_automated_clean, nodes.protected AS nodes_protected, nodes.protected_reason AS nodes_protected_reason, nodes.owner AS nodes_owner, nodes.allocation_id AS nodes_allocation_id, nodes.description AS nodes_description, nodes.bios_interface AS nodes_bios_interface, nodes.boot_interface AS nodes_boot_interface, nodes.console_interface AS nodes_console_interface, nodes.deploy_interface AS nodes_deploy_interface, nodes.inspect_interface AS nodes_inspect_interface, nodes.management_interface AS nodes_management_interface, nodes.network_interface AS nodes_network_interface, nodes.raid_interface AS nodes_raid_interface, nodes.rescue_interface AS nodes_rescue_interface, nodes.storage_interface AS nodes_storage_interface, nodes.power_interface AS nodes_power_interface, nodes.vendor_interface AS nodes_vendor_interface \nFROM nodes ORDER BY nodes.id ASC \n LIMIT %(param_1)s) AS anon_1 LEFT OUTER JOIN node_tags AS node_tags_1 ON node_tags_1.node_id = anon_1.nodes_id LEFT OUTER JOIN node_traits AS node_traits_1 ON node_traits_1.node_id = anon_1.nodes_id ORDER BY anon_1.nodes_id ASC'] [parameters: {u'param_1': 1000}] (Background on this error at: http://sqlalche.me/e/f405) (HTTP 500)

I have to manually restart Ironic for this to work properly without the patch here.

@hardys
Contributor Author

hardys commented Aug 23, 2019

Also seeing a rather odd one which I suspect is because we're not handling the pod errors without this patch:

As discussed, this is another manifestation of the same issue: the conductor service got killed, so the db-sync didn't happen. This restart approach should work around it until we get the podman/crio restart issues worked out and/or switch to a static pod.
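
One way to check for this particular failure mode (a sketch: it assumes the database container is named mariadb, as in the service-check loop, and that any credentials your deployment requires are supplied):

# If the conductor's db-sync never ran, the ironic schema has no nodes table,
# which matches the pymysql error quoted above.
podman exec mariadb mysql -e 'SHOW TABLES FROM ironic;'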

@hardys hardys changed the title from "baremetal: Use podman inspect to check ironic service status" to "Bug 1745004: baremetal: Use podman inspect to check ironic service status" on Aug 23, 2019
@openshift-ci-robot
Contributor

@hardys: This pull request references Bugzilla bug 1745004, which is invalid:

  • expected the bug to target the "4.2.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1745004: baremetal: Use podman inspect to check ironic service status

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Aug 23, 2019
@hardys hardys changed the title from "baremetal: Use podman inspect to check ironic service status" to "Bug 1745004: baremetal: Use podman inspect to check ironic service status" on Aug 23, 2019
@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 23, 2019
@openshift-ci-robot
Contributor

@hardys: This pull request references Bugzilla bug 1745004, which is valid. The bug has been moved to the POST state.

In response to this:

Bug 1745004: baremetal: Use podman inspect to check ironic service status

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@russellb
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 23, 2019
@e-minguez
Contributor

Can someone please give this PR some love? For baremetal deployments it's quite exhausting to reinstall if you forget to restart the ironic service, given how long the machines take to reboot.
Thanks!!!
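
Until this merges, the manual workaround mentioned in the thread is to restart the systemd unit on the bootstrap VM (unit name taken from the journal excerpts above):

sudo systemctl restart ironic.service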

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Aug 27, 2019
@abhinavdahiya
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 27, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, hardys, russellb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 27, 2019
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

7 similar comments

@openshift-ci-robot
Contributor

@hardys: The following tests failed, say /retest to rerun them all:

Test name             | Commit  | Details | Rerun command
ci/prow/e2e-libvirt   | 7ce0f5c | link    | /test e2e-libvirt
ci/prow/e2e-openstack | 7ce0f5c | link    | /test e2e-openstack

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 6318c72 into openshift:master Aug 28, 2019
@openshift-ci-robot
Contributor

@hardys: All pull requests linked via external trackers have merged. Bugzilla bug 1745004 has been moved to the MODIFIED state.

In response to this:

Bug 1745004: baremetal: Use podman inspect to check ironic service status

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

hardys pushed a commit to hardys/installer that referenced this pull request Aug 28, 2019
This comment should have been adjusted following the code-review
updates in openshift#2249, but I missed it; now that we are using the --format
option, clarify the comment to explain the multiple templating
jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019
This comment should have been adjusted following the code-review
updates in openshift#2249, but I missed it; now that we are using the --format
option, clarify the comment to explain the multiple templating
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
lgtm: Indicates that a PR is ready to be merged.
size/XS: Denotes a PR that changes 0-9 lines, ignoring generated files.