
add stage-based exit codes to the openshift-installer #1063

Merged: 1 commit into openshift:master from coarse-exit-codes-2 on Mar 21, 2022

Conversation

@deads2k (Contributor) commented Mar 17, 2022

This describes the limitations on exit codes from openshift-installer: where they could be used, what they should not be used for, and how this doesn't preclude producing a different interaction later.

@patrickdillon @sdodson @staebler
@dgoodwin do you know Ken's github name?

@openshift-ci openshift-ci bot requested review from kbsingh and mrunalp March 17, 2022 21:29
@dgoodwin (Contributor)

Ken is @xueqzhan

@patrickdillon (Contributor) left a comment

Looks good to me. Left a few notes.

Are there any particular clients we need to get feedback from?

reviewers:
- "@staebler"
approvers:
- "@pdillion"
Contributor:

Suggested change
- "@pdillion"
- "@pdillon"

Comment on lines +28 to +33
`openshift-installer` provides *no guarantee* that exit codes will not change as result of changes like
greater granularity, lower granularity, re-organization, or other needs.
`openshift-installer` *does guarantee* that a particular exit code will not be re-used to have a
different meaning in the same y-level version.
`openshift-installer` will make reasonable efforts to avoid re-using a particular exit code to have
different meanings across y-level versions, but for sufficiently compelling reasons may do so.
Contributor:

This addresses my main concern of encountering Hyrum's Law: it is better to call out that we're not offering a guarantee here than to establish a de facto api.
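For illustration, a minimal consumer sketch of what "not a de facto API" means in practice, assuming the stage codes from the table in this enhancement (3, 4, 5) and an installer binary named `openshift-install`; the defensive default branch is the point:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Run the installer; the binary name here is an assumption.
	cmd := exec.Command("openshift-install", "create", "cluster")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	err := cmd.Run()

	code := 0
	if exitErr, ok := err.(*exec.ExitError); ok {
		code = exitErr.ExitCode()
	} else if err != nil {
		fmt.Fprintln(os.Stderr, "installer did not run:", err)
		os.Exit(1)
	}

	// Codes from the stage table in this enhancement; treat them as
	// hints only, since they may change across y-level versions.
	switch code {
	case 0:
		fmt.Println("install succeeded")
	case 3:
		fmt.Println("likely failed during: infrastructure creation")
	case 4:
		fmt.Println("likely failed during: bootstrapping")
	case 5:
		fmt.Println("likely failed during: wait-for-cluster-install")
	default:
		// Unknown or generic code: fall back to generic handling
		// rather than treating the code as a stable API.
		fmt.Printf("install failed (exit code %d)\n", code)
	}
}
```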



### API Extensions

### Implementation Details/Notes/Constraints [optional]
Contributor:

As shown by #5702, this handling is done by logrus. All relevant logrus calls are contained in the cmd package. Keeping the semantics of the error codes at this high-level does not require significant integration with other packages. Retrieving errors with more detail would require a much more sophisticated design.
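As a rough sketch of what keeping this at the command level could look like, assuming logrus's ExitFunc hook; the stage bookkeeping here (currentStage, runStage) is hypothetical, not the installer's actual code:

```go
// Package cmd sketch: stage-aware exit codes confined to the command
// level, assuming logrus is the exit path.
package cmd

import (
	"os"

	"github.com/sirupsen/logrus"
)

// currentStage and the code table are hypothetical bookkeeping.
var currentStage string

var stageExitCodes = map[string]int{
	"infrastructure creation":  3,
	"bootstrapping":            4,
	"wait-for-cluster-install": 5,
}

func init() {
	// logrus.Fatal* ends by calling the logger's ExitFunc with code 1;
	// overriding it lets the cmd package substitute a stage code.
	logrus.StandardLogger().ExitFunc = func(code int) {
		if stageCode, ok := stageExitCodes[currentStage]; ok {
			code = stageCode
		}
		os.Exit(code)
	}
}

// runStage records which stage is active so a logrus.Fatal anywhere
// below maps to that stage's exit code.
func runStage(name string, fn func() error) {
	currentStage = name
	if err := fn(); err != nil {
		logrus.Fatalf("stage %q failed: %v", name, err)
	}
}
```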



| infrastructure creation | 3 |
| bootstrapping | 4 |
| wait-for-cluster-install | 5 |
| destroy | |
Contributor:

Adding destroy was my suggestion, but on second thought a generic error for destroy is probably not worth the added complexity. These "stages" (except destroy) are all part of the create-cluster flow. Destroy is only run explicitly, so it seems the Generic error code would capture just as much information.


@deads2k (Contributor, Author) commented Mar 18, 2022

Updated for comments.

> Are there any particular clients we need to get feedback from?

An explicit ack from @xueqzhan and/or @dgoodwin. Given flexibility on the consumer side (which they must logically have today), I think future changes can be addressed with minor tweaks (probably just table entries).
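A hypothetical sketch of the "just table entries" idea from the consumer side, keying interpretation on the installer's y-level version (the "4.10" mapping below is illustrative, not an official assignment):

```go
package main

import "fmt"

// stageByCode keys interpretation on the installer's y-level version, so
// a future reshuffle of codes is absorbed as a data change.
var stageByCode = map[string]map[int]string{
	"4.10": {
		3: "infrastructure creation",
		4: "bootstrapping",
		5: "wait-for-cluster-install",
	},
}

func stageFor(version string, code int) string {
	if table, ok := stageByCode[version]; ok {
		if stage, ok := table[code]; ok {
			return stage
		}
	}
	return "generic" // unknown version or code: make no stage claim
}

func main() {
	fmt.Println(stageFor("4.10", 4)) // bootstrapping
}
```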

2. `openshift-installer` provides *no guarantee* that exit codes will not change as result of changes like
greater granularity, lower granularity, re-organization, or other needs.
3. `openshift-installer` *does guarantee* that a particular exit code will not be re-used to have a
different meaning in the same y-level version.
Contributor:

How will a user know what the exit codes for a given version mean? Docs only? Sub-command help? Other?

@deads2k (Author):

> How will a user know what the exit codes for a given version mean? Docs only? Sub-command help? Other?

Keeping this enhancement up to date seems reasonable. I wouldn't get carried away with documenting what amount to best effort codes.

@sdodson (Member) commented Mar 21, 2022

Maybe human readable stderr output like:
exit(1) -- Failed to handle install-config
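A minimal sketch of that suggestion; the exitWithReason helper and its message are illustrative, not existing installer code:

```go
package main

import (
	"fmt"
	"os"
)

// exitWithReason pairs every non-zero exit with a human-readable
// stderr line in the suggested "exit(N) -- reason" shape.
func exitWithReason(code int, reason string) {
	fmt.Fprintf(os.Stderr, "exit(%d) -- %s\n", code, reason)
	os.Exit(code)
}

func main() {
	exitWithReason(1, "Failed to handle install-config")
}
```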

| wait-for-cluster-install | 5 |
| destroy | |

### User Stories
Contributor:

If there is already a consumer in mind, describing that use case here would provide helpful context.

@deads2k (Author) commented Mar 18, 2022

> If there is already a consumer in mind, describing that use case here would provide helpful context.

TRT/openshift CI. Present and ready.

https://github.com/openshift/release/blob/master/ci-operator/step-registry/gather/must-gather/gather-must-gather-commands.sh#L26

Contributor:

Hive/OSD seem like prime candidates to use these codes, too. We should get some input from them about how the codes are defined.
/cc @2uasimojo

@xueqzhan (Contributor) commented Mar 18, 2022

> Updated for comments.
>
> > Are there any particular clients we need to get feedback from?
>
> An explicit ack from @xueqzhan and/or @dgoodwin. Given flexibility on the consumer side (which they must logically have today), I think future changes can be addressed with minor tweaks (probably just table entries).

This design sufficiently satisfies the requirement of TRT use case. Thanks @deads2k!

@staebler (Contributor) left a comment

This seems reasonable to me.

2. infrastructure creation
3. bootstrapping
4. wait-for-cluster-install
5. destroy
Contributor:

What is destroy? Do you mean the step of destroying the bootstrap? Or do you mean destroying the entire cluster?

| Stage | Exit Code |
| --- | --- |
| Generic | whatever other value is produced |
| install-config verification | |
Contributor:

Was the intention to use the error code of 2 here?

Suggested change
| install-config verification | |
| install-config verification | 2 |

@deads2k (Author):

> Was the intention to use the error code of 2 here?

No. This was a stage that @patrickdillon suggested, but I wasn't ready to assign it because it wasn't in the first cut of the PR. I'll reshuffle to add this to the end.


3. bootstrapping
4. wait-for-cluster-install
5. destroy
6. everything else
Contributor:

Thanks for not sneaking in "cvo reports that the cluster installed but some cluster operators are now reporting degraded". 😄

@deads2k (Author) commented Mar 18, 2022

> Thanks for not sneaking in "cvo reports that the cluster installed but some cluster operators are now reporting degraded".

I considered this and decided against it for a couple reasons. I'll enumerate here and if you like, I'll add them to the KEP to help place further boundaries on what (in my opinion) is reasonable.

  1. Having "operators are degraded" or "operators are progressing" would seem handy for consumers, but those conditions are not mutually exclusive, making the actual return value one, the other, or both.
  2. When the cluster fails and the installer determines "operator state is bad" and communicates that as exit code 5, the caller can investigate the operator status themselves. The other failure modes cannot be reliably determined by contacting the kube-apiserver, since it may not be running.
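A hypothetical wrapper illustrating point 2: only the wait-for-cluster-install code (5) implies the kube-apiserver is likely reachable, so only then does the caller inspect operator status itself (binary names openshift-install and oc assumed):

```go
package main

import (
	"errors"
	"os"
	"os/exec"
)

func main() {
	install := exec.Command("openshift-install", "create", "cluster")
	install.Stdout, install.Stderr = os.Stdout, os.Stderr
	err := install.Run()

	var exitErr *exec.ExitError
	// For other failure codes the cluster API may never have come up,
	// so querying it would be unreliable; skip the inspection.
	if errors.As(err, &exitErr) && exitErr.ExitCode() == 5 {
		inspect := exec.Command("oc", "get", "clusteroperators")
		inspect.Stdout, inspect.Stderr = os.Stdout, os.Stderr
		_ = inspect.Run()
	}
}
```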

@openshift-ci openshift-ci bot requested a review from 2uasimojo March 18, 2022 15:53
@dgoodwin (Contributor)
SGTM

@2uasimojo (Member)

> Hive/OSD seem like prime candidates to use these codes, too. We should get some input from them about how the codes are defined.
> /cc @2uasimojo

I'm not seeing a strong motivation for hive to make use of this:

  • The cross-version guarantees are explicitly limited (you said it twice; I'm thinking you might be serious).
  • Hive doesn't predetermine the installer version it invokes.
  • We could discover the version and map to the appropriate set of error codes as suggested... but it would have to be worth the trouble.

Today we worry about two states: success and failure. On failure, we invoke a destroy and (maybe) try again from the beginning. We detect and categorize failures by scraping the logs, and expose some logic allowing the consumer to configure retries based on regex-matching those scrapings. That's brittle, but finer-grained than what these coarse exit codes would support, and can be adapted/adjusted by config (as opposed to code) in effectively realtime.
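Roughly the shape of that mechanism, as a toy sketch rather than hive's actual code: retry categories defined as configurable regexes matched against scraped installer logs:

```go
package main

import (
	"fmt"
	"regexp"
)

// retryRules stands in for operator-configurable patterns; in practice
// these would come from config so they can change without a code change.
var retryRules = map[string]*regexp.Regexp{
	"quota-exceeded":  regexp.MustCompile(`(?i)quota .*exceeded`),
	"dns-propagation": regexp.MustCompile(`(?i)no such host`),
}

// categorize scrapes an installer log and returns the first matching
// failure category, or "unknown" if nothing matches.
func categorize(installerLog string) string {
	for name, re := range retryRules {
		if re.MatchString(installerLog) {
			return name
		}
	}
	return "unknown"
}

func main() {
	fmt.Println(categorize(`error: Quota 'CPUS' exceeded in region us-east1`))
}
```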

If this KEP were a thing, there are a couple of hive use cases I can see being possible -- though again, perhaps not worth the trouble of implementing:

  • Today when we invoke that destroy on failure, there are cases where the destroy fails repeatably and we're stuck. We've been told to file those as bugs against the installer (destroy should be idempotent). However, if we got the "install config processing" exit code, we could be "sure" there's nothing to clean up, and could simply bail without bothering to invoke the destroy.
  • If I get an error code in one of the later stages, can I reinvoke the installer and have it idempotently pick up where it left off? That would be a time-saver for sure. Though again, having to understand which exit codes that applies to, for which versions of the installer, might be more trouble than it's worth to save that time.

/cc @akhil-rane @abutcher @suhanime @newtonheath @gregsheremeta in case your opinions should differ significantly from mine :)

@deads2k (Author) commented Mar 19, 2022

> Hive/OSD seem like prime candidates to use these codes, too. We should get some input from them about how the codes are defined.
> /cc @2uasimojo

Looks like they are inclined to pass, but the TRT/CI use case remains, so there's still an interested and ready consumer, with a PR and staffing to complete and maintain the work, and no conflict with future hive desires.

@deads2k force-pushed the coarse-exit-codes-2 branch from be943b7 to 645a0a2 on March 21, 2022 at 19:34
@patrickdillon (Contributor)

/approve

openshift-ci bot commented Mar 21, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: patrickdillon

The full list of commands accepted by this bot can be found here.

The pull request process is described here


Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 21, 2022
openshift-ci bot commented Mar 21, 2022

@deads2k: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

| infrastructure creation | 3 |
| bootstrapping | 4 |
| wait-for-cluster-install | 5 |
| install-config verification | TBD |
Member:

Whenever we turn this TBD into something else, can we get rid of the word "verification", unless we envision tying this to things like quota verification, etc.? I don't know if this is just limited to ensuring inputs conform to the install-config API or more than that.

@sdodson (Member) commented Mar 21, 2022

/lgtm
I don't see anything that should inhibit implementation.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 21, 2022
@openshift-merge-robot openshift-merge-robot merged commit 0355c50 into openshift:master Mar 21, 2022