From a3bf88ff33417a22c768aba5b71cba20716ecbe0 Mon Sep 17 00:00:00 2001 From: Stephen Benjamin Date: Fri, 15 May 2020 12:14:45 -0400 Subject: [PATCH 1/6] baremetal: improve debuggability of ipi deployments The goal of this enhancement is to improve the day 1 install experience and reduce the perception of complexity in baremetal IPI deployments. --- ...buggability-of-baremetal-ipi-deployment.md | 246 ++++++++++++++++++ 1 file changed, 246 insertions(+) create mode 100644 enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md diff --git a/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md b/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md new file mode 100644 index 0000000000..8986275c1d --- /dev/null +++ b/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md @@ -0,0 +1,246 @@ +--- +title: debuggability-of-baremetal-ipi-deployment +authors: + - "@stbenjam" +reviewers: + - "@abhinavdahiya" + - "@dtantsur" + - "@enxebre" + - "@hardys" + - "@juliakreger" + - "@markmc" + - "@sadasu" +approvers: + - TBD +creation-date: 2020-05-15 +last-updated: 2020-05-15 +status: provisional +see-also: +- https://github.com/openshift/installer/pull/3535 +- https://github.com/openshift/enhancements/pull/212 +replaces: +superseded-by: +--- + +# Improve debuggability of baremetal IPI deployment failures + +## Release Signoff Checklist + +- [ ] Enhancement is `implementable` +- [ ] Design details are appropriately documented from clear requirements +- [ ] Test plan is defined +- [ ] Graduation criteria for dev preview, tech preview, GA +- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) + +## Open Questions + +1. Should the installer error or warn when compute replicas quantity is + not met, even if at lest 2 workers deployed (enough to get a + functional cluster)? +2. In order to bubble up information about worker failures to the + installer, the most likely solution to me seems to use operator + status as Degraded, with a relevant error message. Could we mark + machine-api-operator when workers fail to roll out? + +## Summary + +In OpenShift 4.5, we improved the existing installer validations for +baremetal IPI to identify early problems. Those include identifying +duplicate baremetal host records, insufficient hardware resources to +deploy the requested cluster size, reachability of RHCOS images, and +networking misconfiguration such as overlapping networks or DNS +misconfiguration. + +However, a variety of situations exist where deployments fail for +reasons that were not preventable during the pre-install validations. +These failures in baremetal IPI are hard to diagnose. Errors from +baremetal-operator and ironic are often not presented to the user, and +even when they are the installer doesn't provide context about what +action to take. + +This enhancement request is a broad attempt at categorizing the types of +deployment failures, and what information we could present to the user +to make identifying root causes easier. + +## Motivation + +The goal of this enhancement is to improve the day 1 install experience +and reduce the perception of complexity in baremetal IPI deployments. + +### Goals + +- Any deployment that ends in an unsuccessful install must provide the + user clear and actionable information to diagnose the problem. + +### Non-Goals + +- Addressing the underlying causes of the failures are not the goal of + this enhancement. + +## Proposal + +Broadly, deployments fail due to problems encountered during these +installation activities: + + - Pre-bootstrap (image downloading, manifest creation, etc) + - Infrastructure automation (Terraform) + - Bootstrap + - Bare Metal Host Provisioning (Control Plane and Workers) + - Operator Deployment (i.e., those rolled out by CVO) + +We believe that since 4.5, pre-bootstrap errors are usually detected, +and useful information is presented to the user about how to rectify the +problem, so this enhancement request will focus on failures that occur +from terraform onward. + +### Kinds of deployment failure + +#### Infrastructure Automation (Terraform) + +Baremetal IPI relies on terraform to provision a libvirt bootstrap +virtual machine, and bare metal control plane hosts. We use +terraform-provider-libvirt and terraform-provider-ironic to accomplish +those goals. + +terraform-provider-ironic reports failures when it cannot reach the +Ironic API, or a control plane host fails to provision. In both cases, +we do not provide useful information to the user about what to do. + +#### Bootstrap Failures + +The bootstrap runs a couple of baremetal-specific services, including +Ironic as well as a utility that populates introspection data for +the control plane. + +Bootstrap typically fails for baremetal when we can't download the +machine-os image into our local HTTP cache. Less common, but still +sometimes seen are that services such as dnsmasq, mariadb, ironic-api, +ironic-conductor, or ironic-inspector fail. + +Failures on bootstrap services rarely result in any indication to the +user that something went wrong other than that there was a timeout. + +The installer has a feature for log gathering on bootstrap failure that +does not work on baremetal. This should be the first priority, but even +in this case a user still needs to look into an archive containing many +logs to identify a failure. + +Ideally there would be some mechanism to identify and extract useful +information and display it to the user. + +#### Bare Metal Host Provisioning + +Whether the control plane or worker nodes, provisioning of bare metal +hosts can fail in the same ways, although the communication path to +provide feedback is different in each case. For the control plane, +information about failure is presented to the user via terraform. For +workers, it would be through information on the `BareMetalHost` +resource, and the baremetal-operator logs. + +Provisioning can fail in many ways. The most difficult to troubleshoot +are simply when we fail to hear back from a host. Buggy UEFI firmware +may prevent PXE, a kernel could panic, or even a network cable may be +unplugged. In these cases, we should inform the user what little +information Ironic was able to discern, but also provide a suggestion +that the most effective way to troubleshoot the problem is examination +of the console of the host. + +An infrequent, but possible outcome of deployment to bare metal hosts, +is that Ironic is successful in cleaning, inspecting, and deploying a +host. After Ironic lays down an image on disk and reboots, Ironic marks +the host ‘active’. However, when the host boots again it’s possible that +there’s a catastrophic problem such as a kernel panic or fail to +configure with ignition. From Ironic's perspective, it's done it's duty, +and is unaware the host failed to come up. The feedback to the user is +only that there was a timeout. + +#### Operator Deployment + +Operator deployment failures are rarely platform-specific, although +there is one case that should be addressed. When worker deployment +fails, possibly due to provisioning issues like those described above, a +variety of operators may report failures such as ingress, console, and +others that cannot run on the control plane. + +When this happens, the installer times out, reports to the user a large +number of operators failed to roll out, and no useful context about what +to do or why the operators failed. + +#### User Stories + +##### Show more information from terraform + + - As a user, I want terraform to report last_error and status from + ironic in case of deployment failure. + + - As a user, I want the installer to provide suggestions for causes + of failure. See the existing work for translating terraform error + messages that is being done in https://github.com/openshift/installer/pull/3535. + +#### Extract relevant logs from the bootstrap + + - As a user, I would like the installer to extract and display error + messages from bootstrap journal when relevant errors can be + identified. + +#### Implement bootstrap gather + + - As a user, I want the installer to automatically gather logs when + bootstrap fails on the baremetal IPI platform, like it does for other + platforms. + +See also: + - https://github.com/openshift/installer/issues/2009 + +#### Show errors from machine controllers + + - As a user, I want the installer logs to bubble information up from + either machine-api-operator or cluster-baremetal-operator about why + workers failed to deploy. These operators should be degraded when + machine provisioning fails. + +#### Callback to Metal3 + + - As a user, I want my host to callback to Metal3/Ironic from ignition + when RHCOS boots. + +### Implementation Details/Notes/Constraints + + + +### Risks and Mitigations + +Some stories may impact the design of software managed by teams other +than the baremetal IPI team. These including the installer and +machine-api-operator teams, for example. + +## Design Details + +### Test Plan + +**Note:** *Section not required until targeted at a release.* + +### Graduation Criteria + +**Note:** *Section not required until targeted at a release.* + +### Upgrade / Downgrade Strategy + +Upgrades/downgrades are not applicable, as these are day 1 +considerations only. There is no impact on upgrades or downgrades. + +### Version Skew Strategy + +As these are day 1 considerations for greenfield deployments, no version +skew strategy is needed. + +## Implementation History + + +## Drawbacks + +## Alternatives + +An alternative approach would be to provide troubleshooting +documentation and leave users to uncover the root causes of failures on +their own, which is largely what happens today. From 45ad3cdc3cdaf26662802b6a5f145f39b04fc5ab Mon Sep 17 00:00:00 2001 From: Stephen Benjamin Date: Tue, 19 May 2020 10:09:06 -0400 Subject: [PATCH 2/6] Update enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md Co-authored-by: Russell Bryant --- .../baremetal/debuggability-of-baremetal-ipi-deployment.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md b/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md index 8986275c1d..9f84f7f1b8 100644 --- a/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md +++ b/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md @@ -74,7 +74,7 @@ and reduce the perception of complexity in baremetal IPI deployments. ### Non-Goals -- Addressing the underlying causes of the failures are not the goal of +- Addressing the underlying causes of the failures is not the goal of this enhancement. ## Proposal From 0b73120ac7d1386440c6b58c610f98512f61af8f Mon Sep 17 00:00:00 2001 From: Stephen Benjamin Date: Tue, 19 May 2020 10:10:27 -0400 Subject: [PATCH 3/6] Apply suggestions from code review Co-authored-by: Russell Bryant --- .../debuggability-of-baremetal-ipi-deployment.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md b/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md index 9f84f7f1b8..f2964f4fc1 100644 --- a/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md +++ b/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md @@ -93,13 +93,13 @@ and useful information is presented to the user about how to rectify the problem, so this enhancement request will focus on failures that occur from terraform onward. -### Kinds of deployment failure +### Kinds of deployment failures #### Infrastructure Automation (Terraform) -Baremetal IPI relies on terraform to provision a libvirt bootstrap +Bare metal IPI relies on terraform to provision a libvirt bootstrap virtual machine, and bare metal control plane hosts. We use -terraform-provider-libvirt and terraform-provider-ironic to accomplish +`terraform-provider-libvirt` and `terraform-provider-ironic` to accomplish those goals. terraform-provider-ironic reports failures when it cannot reach the @@ -109,8 +109,8 @@ we do not provide useful information to the user about what to do. #### Bootstrap Failures The bootstrap runs a couple of baremetal-specific services, including -Ironic as well as a utility that populates introspection data for -the control plane. +Ironic and a utility that populates hardware information for +the control plane hosts. Bootstrap typically fails for baremetal when we can't download the machine-os image into our local HTTP cache. Less common, but still From 8b05817c09797e5c571ffd106e91ce0004f588aa Mon Sep 17 00:00:00 2001 From: Stephen Benjamin Date: Tue, 19 May 2020 10:24:34 -0400 Subject: [PATCH 4/6] Add more see-also references, and other clean-ups --- ...buggability-of-baremetal-ipi-deployment.md | 28 +++++++++++-------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md b/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md index f2964f4fc1..7728d433fa 100644 --- a/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md +++ b/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md @@ -16,8 +16,11 @@ creation-date: 2020-05-15 last-updated: 2020-05-15 status: provisional see-also: -- https://github.com/openshift/installer/pull/3535 - https://github.com/openshift/enhancements/pull/212 +- https://github.com/openshift/installer/issues/2009 +- https://github.com/openshift/installer/issues/2569 +- https://github.com/openshift/installer/pull/3535 +- https://storyboard.openstack.org/#!/story/2007664 replaces: superseded-by: --- @@ -98,19 +101,21 @@ from terraform onward. #### Infrastructure Automation (Terraform) Bare metal IPI relies on terraform to provision a libvirt bootstrap -virtual machine, and bare metal control plane hosts. We use -`terraform-provider-libvirt` and `terraform-provider-ironic` to accomplish -those goals. +virtual machine and the bare metal control plane hosts. We use +`terraform-provider-libvirt` and `terraform-provider-ironic` to +accomplish those goals. -terraform-provider-ironic reports failures when it cannot reach the -Ironic API, or a control plane host fails to provision. In both cases, -we do not provide useful information to the user about what to do. +Both providers report failures when encountered, but there's usually +little information provided to the user about what to do in the +OpenShift context. Given a specific terraform error, the installer +should provide specific information to the user about how to continue +troubleshooting. #### Bootstrap Failures The bootstrap runs a couple of baremetal-specific services, including -Ironic and a utility that populates hardware information for -the control plane hosts. +Ironic and a utility that populates hardware details for the control +plane hosts. Bootstrap typically fails for baremetal when we can't download the machine-os image into our local HTTP cache. Less common, but still @@ -204,9 +209,10 @@ See also: - As a user, I want my host to callback to Metal3/Ironic from ignition when RHCOS boots. -### Implementation Details/Notes/Constraints - +See also: + - https://storyboard.openstack.org/#!/story/2007664 +### Implementation Details/Notes/Constraints ### Risks and Mitigations From 1440b6f35d6d1bcf3d66afef6c9d460fbbf6fb28 Mon Sep 17 00:00:00 2001 From: Stephen Benjamin Date: Tue, 19 May 2020 10:33:33 -0400 Subject: [PATCH 5/6] Clarify worker deployments should make operator degraded --- .../debuggability-of-baremetal-ipi-deployment.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md b/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md index 7728d433fa..dda4a92b77 100644 --- a/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md +++ b/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md @@ -139,8 +139,10 @@ Whether the control plane or worker nodes, provisioning of bare metal hosts can fail in the same ways, although the communication path to provide feedback is different in each case. For the control plane, information about failure is presented to the user via terraform. For -workers, it would be through information on the `BareMetalHost` -resource, and the baremetal-operator logs. +workers, it is currently only shown on the `BareMetalHost` resource or +by examining baremetal-operator logs. Failure to deploy a worker should +be reflected by marking either the `machine-api-operator` or the future +`cluster-baremetal-operator` degraded. Provisioning can fail in many ways. The most difficult to troubleshoot are simply when we fail to hear back from a host. Buggy UEFI firmware @@ -232,8 +234,11 @@ machine-api-operator teams, for example. ### Upgrade / Downgrade Strategy -Upgrades/downgrades are not applicable, as these are day 1 -considerations only. There is no impact on upgrades or downgrades. +This enhancement largely consists of day 1 considerations, however we +are suggesting that worker deployment failures be reflected as a +Degraded operator status either in `machine-api-operator` or +`cluster-baremetal-operator`. This would prevent an upgrade and is an +intentional change in behavior. ### Version Skew Strategy @@ -242,7 +247,6 @@ skew strategy is needed. ## Implementation History - ## Drawbacks ## Alternatives From 545d73877e53c2045725645a7a958e4b9c80aaf0 Mon Sep 17 00:00:00 2001 From: Stephen Benjamin Date: Wed, 20 May 2020 09:50:47 -0400 Subject: [PATCH 6/6] Address comments - Open questions seem to be resolved - Rename bootstrap -> bootstrap host --- .../debuggability-of-baremetal-ipi-deployment.md | 14 ++------------ 1 file changed, 2 insertions(+), 12 deletions(-) diff --git a/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md b/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md index dda4a92b77..48ac083835 100644 --- a/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md +++ b/enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md @@ -35,16 +35,6 @@ superseded-by: - [ ] Graduation criteria for dev preview, tech preview, GA - [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) -## Open Questions - -1. Should the installer error or warn when compute replicas quantity is - not met, even if at lest 2 workers deployed (enough to get a - functional cluster)? -2. In order to bubble up information about worker failures to the - installer, the most likely solution to me seems to use operator - status as Degraded, with a relevant error message. Could we mark - machine-api-operator when workers fail to roll out? - ## Summary In OpenShift 4.5, we improved the existing installer validations for @@ -113,7 +103,7 @@ troubleshooting. #### Bootstrap Failures -The bootstrap runs a couple of baremetal-specific services, including +The bootstrap host runs some baremetal-specific services including Ironic and a utility that populates hardware details for the control plane hosts. @@ -123,7 +113,7 @@ sometimes seen are that services such as dnsmasq, mariadb, ironic-api, ironic-conductor, or ironic-inspector fail. Failures on bootstrap services rarely result in any indication to the -user that something went wrong other than that there was a timeout. +user that something went wrong other than a timeout. The installer has a feature for log gathering on bootstrap failure that does not work on baremetal. This should be the first priority, but even