-
Notifications
You must be signed in to change notification settings - Fork 486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add stage-based exit codes to the openshift-installer #1063
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,162 @@ | ||
--- | ||
title: coarse-grained-exit-codes | ||
authors: | ||
- "@deadsk2" | ||
reviewers: | ||
- "@staebler" | ||
approvers: | ||
- "@pdillon" | ||
- "@sdodson" | ||
api-approvers: # in case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers) | ||
- TBD | ||
creation-date: 2022-03-16 | ||
last-updated: 2022-03-16 | ||
tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement | ||
- TBD | ||
see-also: | ||
replaces: | ||
superseded-by: | ||
--- | ||
|
||
# Coarse Grained Exit Codes from the Installer | ||
|
||
## Summary | ||
|
||
`openshift-installer` callers will now get more granular exit codes. | ||
Callers of `openshift-installer` can expect non-zero exit codes *that may change over time* as | ||
the `openshift-installer` evolves over time. | ||
`openshift-installer` provides *no guarantee* that exit codes will not change as result of changes like | ||
greater granularity, lower granularity, re-organization, or other needs. | ||
`openshift-installer` *does guarantee* that a particular exit code will not be re-used to have a | ||
different meaning in the same y-level version. | ||
`openshift-installer` will make reasonable efforts to avoid re-using a particular exit code to have | ||
different meanings across y-level versions, but for sufficiently compelling reasons may do so. | ||
Comment on lines
+28
to
+33
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This addresses my main concern of encountering Hyrum's Law. I think it is better to call out that we're not offering a guarantee here, than to establish a de facto api. |
||
|
||
## Motivation | ||
|
||
Automated categorization of failures can assist in efforts like | ||
1. reliability determination of provision on various clouds | ||
2. directing effort to proper teams depending on failure modes | ||
3. tracking of which installation stages have problems on various platforms | ||
4. automated retry of infrastructure failures by CI step registry | ||
|
||
These are reasons why the data is useful and by providing it, other teams will be able to build these | ||
secondary system. | ||
Other teams *must properly handle previously unknown exit codes*. | ||
|
||
### Goals | ||
|
||
1. Coarse grained (order of 10 or less values), stage-based exit codes to assist in categorizing | ||
failures, but not debugging failures. | ||
|
||
### Non-Goals | ||
|
||
1. Detailed, machine-parseable, cause-based failure information is outside the scope of this enhancement. | ||
|
||
## Proposal | ||
|
||
The installer knows what stage of the installation has failed when it exits. | ||
The stages include | ||
1. install-config handling | ||
2. infrastructure creation | ||
3. bootstrapping | ||
4. wait-for-cluster-install | ||
5. destroy | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is destroy? Do you mean the step of destroying the bootstrap? Or do yo mean destroying the entire cluster? |
||
6. everything else | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for not sneaking in "cvo reports that the cluster installed but some cluster operators are now reporting degraded". 😄 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I considered this and decided against it for a couple reasons. I'll enumerate here and if you like, I'll add them to the KEP to help place further boundaries on what (in my opinion) is reasonable.
|
||
|
||
The installer will not introspect errors to provide deeper analysis, but these stages (and whatever future | ||
ones the install team identifies) may have unique exit codes when they fail. | ||
Categorizing failures in this way will allow machine categorization of the type of failure by other systems. | ||
The CI step registry is a good example of a consumer, but issuing a destroy after an infrastructure creation | ||
and retrying the `openshift-install` seems like a fairly common use-case since infrastructure creation | ||
failures often happen for environmental reasons and are reasonably likely to succeed on a retry. | ||
|
||
### Known Exit Codes | ||
|
||
Key things to remember: | ||
1. Callers of `openshift-installer` can expect non-zero exit codes *that may change over time* as | ||
the `openshift-installer` evolves over time. | ||
2. `openshift-installer` provides *no guarantee* that exit codes will not change as result of changes like | ||
greater granularity, lower granularity, re-organization, or other needs. | ||
3. `openshift-installer` *does guarantee* that a particular exit code will not be re-used to have a | ||
different meaning in the same y-level version. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How will a user know what the exit codes for a given version mean? Docs only? Sub-command help? Other? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Keeping this enhancement up to date seems reasonable. I wouldn't get carried away with documenting what amount to best effort codes. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe human readable stderr output like: |
||
4. `openshift-installer` will make reasonable efforts to avoid re-using a particular exit code to have | ||
different meanings across y-level versions, but for sufficiently compelling reasons may do so. | ||
5. Consumers must properly handle previously unknown exit codes. | ||
|
||
| Stage | Exit Code | | ||
| --- | --- | | ||
| Generic | whatever other value is produced | | ||
| infrastructure creation | 3 | | ||
| bootstrapping | 4 | | ||
| wait-for-cluster-install | 5 | | ||
| install-config verification | TBD | | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Whenever we turn this TBD into something else can we get rid of the word verification unless we envision tying this to things like quota verification, etc. I don't know if this is just limited to ensuring inputs conform to the install-config api or more than that. |
||
|
||
### User Stories | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If there is already a consumer in mind, describing that use case here would provide helpful context. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
TRT/openshift CI. Present and ready. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hive/OSD seem like prime candidates to use these codes, too. We should get some input from them about how the codes are defined. |
||
|
||
### API Extensions | ||
|
||
### Implementation Details/Notes/Constraints [optional] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As shown by #5702, this handling is done by There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As shown by #5702, this handling is done by |
||
|
||
Approval of the implementation lies with the installer team, but [openshift/installer#5702](https://github.com/openshift/installer/pull/5702) | ||
demonstrates using `errors.Unwrap` and logical `errors.Is` to keep all return code logic contained in the command package. | ||
This keeps exit code logic from leaking into the installer library itself. | ||
|
||
### Risks and Mitigations | ||
|
||
## Design Details | ||
|
||
### Open Questions [optional] | ||
|
||
### Test Plan | ||
|
||
### Graduation Criteria | ||
|
||
#### Dev Preview -> Tech Preview | ||
|
||
#### Tech Preview -> GA | ||
|
||
#### Removing a deprecated feature | ||
|
||
### Upgrade / Downgrade Strategy | ||
|
||
The `openshift-installer` is only excecuted once and does not leave any persisted data | ||
in the cluster itself, so there is not an upgrade/downgrade concern for changing | ||
existing non-zero return codes into other non-zero return codes. | ||
|
||
### Version Skew Strategy | ||
|
||
The `openshift-installer` is only excecuted once and does not leave any persisted data | ||
in the cluster itself, so there is no version skew concern for changing | ||
existing non-zero return codes into other non-zero return codes. | ||
|
||
Consumers that installer multiple levels of `openshift-installer` may see exit codes | ||
change from one non-zero value to another as the product evolves, but since they are all | ||
failing exit codes and the exit code is explicitly not determining cause, impacts are minimal. | ||
|
||
### Operational Aspects of API Extensions | ||
|
||
#### Failure Modes | ||
|
||
#### Support Procedures | ||
|
||
## Implementation History | ||
|
||
## Drawbacks | ||
|
||
Consumers may expect to "pin" an exit code forever. | ||
Since consumers know that our product evolves over time and since our exit codes are based on stages, not symptoms, | ||
consumers must always handle novel exit codes as proper failures. | ||
This ensures that any change in an exit code will fail in the safe direction. | ||
This is also the case today. | ||
In addition, if a consumer wishes to identify exact behavior, the `openshift-installer version` command | ||
can be used to determine version and react accordingly. | ||
|
||
## Alternatives | ||
|
||
A more detailed solution involved structured, machine readable output could be created. | ||
Such a solution would be more expressive, but additional expression isn't required for several use-cases today. | ||
Building these exit codes does not make it any more difficult to provide more detailed structured output in the future. | ||
|
||
## Infrastructure Needed [optional] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This addresses my main concern of encountering Hyrum's Law: it is better to call out that we're not offering a guarantee here than to establish a de facto api.