-
Notifications
You must be signed in to change notification settings - Fork 486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add stage-based exit codes to the openshift-installer #1063
add stage-based exit codes to the openshift-installer #1063
Conversation
Ken is @xueqzhan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Left a few notes.
Are there any particular clients we need to get feedback from?
reviewers: | ||
- "@staebler" | ||
approvers: | ||
- "@pdillion" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- "@pdillion" | |
- "@pdillon" |
`openshift-installer` provides *no guarantee* that exit codes will not change as result of changes like | ||
greater granularity, lower granularity, re-organization, or other needs. | ||
`openshift-installer` *does guarantee* that a particular exit code will not be re-used to have a | ||
different meaning in the same y-level version. | ||
`openshift-installer` will make reasonable efforts to avoid re-using a particular exit code to have | ||
different meanings across y-level versions, but for sufficiently compelling reasons may do so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This addresses my main concern of encountering Hyrum's Law: it is better to call out that we're not offering a guarantee here than to establish a de facto api.
`openshift-installer` provides *no guarantee* that exit codes will not change as result of changes like | ||
greater granularity, lower granularity, re-organization, or other needs. | ||
`openshift-installer` *does guarantee* that a particular exit code will not be re-used to have a | ||
different meaning in the same y-level version. | ||
`openshift-installer` will make reasonable efforts to avoid re-using a particular exit code to have | ||
different meanings across y-level versions, but for sufficiently compelling reasons may do so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This addresses my main concern of encountering Hyrum's Law. I think it is better to call out that we're not offering a guarantee here, than to establish a de facto api.
|
||
### API Extensions | ||
|
||
### Implementation Details/Notes/Constraints [optional] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As shown by #5702, this handling is done by logrus
. All relevant logrus
calls are contained in the cmd
package. Keeping the semantics of the error codes at this high-level does not require significant integration with other packages. Retrieving errors with more detail would require a much more sophisticated design.
|
||
### API Extensions | ||
|
||
### Implementation Details/Notes/Constraints [optional] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As shown by #5702, this handling is done by logrus
. All relevant logrus
calls are contained in the cmd
package. Keeping the semantics of the error codes at this high-level does not require significant integration with other packages. Retrieving errors with more detail would require a much more sophisticated design.
| infrastructure creation | 3 | | ||
| bootstrapping | 4 | | ||
| wait-for-cluster-install | 5 | | ||
| destroy | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding destroy was my suggestion, but on second thought a generic error for destroy is probably not worth the added complexity. These "stages"--except destroy--are all part of the create cluster
flow. Destroy is only run explicitly so it seems that the Generic error code would capture just as much information.
| infrastructure creation | 3 | | ||
| bootstrapping | 4 | | ||
| wait-for-cluster-install | 5 | | ||
| destroy | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding destroy was my suggestion, but on second thought a generic error for destroy is probably not worth the added complexity. These "stages"--except destroy--are all part of the create cluster
flow. Destroy is only run explicitly so it seems that the Generic error code would capture just as much infromation.
Updated for comments.
An explicit ack from @xueqzhan and/or @dgoodwin . Given flexibility on the consumer side (which they must logically have today), I think future changes can be addressed with minor tweaks (probably just table entries). |
2. `openshift-installer` provides *no guarantee* that exit codes will not change as result of changes like | ||
greater granularity, lower granularity, re-organization, or other needs. | ||
3. `openshift-installer` *does guarantee* that a particular exit code will not be re-used to have a | ||
different meaning in the same y-level version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will a user know what the exit codes for a given version mean? Docs only? Sub-command help? Other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will a user know what the exit codes for a given version mean? Docs only? Sub-command help? Other?
Keeping this enhancement up to date seems reasonable. I wouldn't get carried away with documenting what amount to best effort codes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe human readable stderr output like:
exit(1) -- Failed to handle install-config
| wait-for-cluster-install | 5 | | ||
| destroy | | | ||
|
||
### User Stories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is already a consumer in mind, describing that use case here would provide helpful context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is already a consumer in mind, describing that use case here would provide helpful context.
TRT/openshift CI. Present and ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hive/OSD seem like prime candidates to use these codes, too. We should get some input from them about how the codes are defined.
/cc @2uasimojo
This design sufficiently satisfies the requirement of TRT use case. Thanks @deads2k! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems reasonable to me.
2. infrastructure creation | ||
3. bootstrapping | ||
4. wait-for-cluster-install | ||
5. destroy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is destroy? Do you mean the step of destroying the bootstrap? Or do yo mean destroying the entire cluster?
| Stage | Exit Code | | ||
| --- | --- | | ||
| Generic | whatever other value is produced | | ||
| install-config verification | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was the intention to use the error code of 2 here?
| install-config verification | | | |
| install-config verification | 2 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was the intention to use the error code of 2 here?
No. This was a stage that @patrickdillon suggested, but I wasn't ready to assign it because it wasn't in the first cut of the PR. I'll reshuffle to add this to the end.
| wait-for-cluster-install | 5 | | ||
| destroy | | | ||
|
||
### User Stories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hive/OSD seem like prime candidates to use these codes, too. We should get some input from them about how the codes are defined.
/cc @2uasimojo
3. bootstrapping | ||
4. wait-for-cluster-install | ||
5. destroy | ||
6. everything else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for not sneaking in "cvo reports that the cluster installed but some cluster operators are now reporting degraded". 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for not sneaking in "cvo reports that the cluster installed but some cluster operators are now reporting degraded".
I considered this and decided against it for a couple reasons. I'll enumerate here and if you like, I'll add them to the KEP to help place further boundaries on what (in my opinion) is reasonable.
- having "operators are degraded" or "operators are progressing" would be seemingly handy for consumers, but those conditions are not mutually exclusive, making the actual return values, one, the other, or both.
- when the cluster fails and the installer can determine "operator state is bad" and communicates that as exit code 5, the caller can investigate the operator status themselves. The other failure modes cannot be reliably determined by contacting the kube-apiserver since it may not be running.
SGTM |
I'm not seeing a strong motivation for hive to make use of this:
Today we worry about two states: success and failure. On failure, we invoke a destroy and (maybe) try again from the beginning. We detect and categorize failures by scraping the logs, and expose some logic allowing the consumer to configure retries based on regex-matching those scrapings. That's brittle, but finer-grained than what these coarse exit codes would support, and can be adapted/adjusted by config (as opposed to code) in effectively realtime. If this KEP were a thing, there are a couple of hive use cases I can see being possible -- though again, perhaps not worth the trouble of implementing:
/cc @akhil-rane @abutcher @suhanime @newtonheath @gregsheremeta in case your opinions should differ significantly from mine :) |
Looks like they are inclined to pass, but the TRT/CI use-case remains, so there's still an interested and ready consumer with a PR and staffing to complete and maintain the work with no conflict with future hive desires. |
be943b7
to
645a0a2
Compare
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: patrickdillon The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@deads2k: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
| infrastructure creation | 3 | | ||
| bootstrapping | 4 | | ||
| wait-for-cluster-install | 5 | | ||
| install-config verification | TBD | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whenever we turn this TBD into something else can we get rid of the word verification unless we envision tying this to things like quota verification, etc. I don't know if this is just limited to ensuring inputs conform to the install-config api or more than that.
/lgtm |
Describing the limitations on exit-codes from openshift-installer, where they could be used, what they should not be used for, and how this doesn't preclude producing a different interaction later.
@patrickdillon @sdodson @staebler
@dgoodwin do you know Ken's github name?