
add stage-based exit codes to the openshift-installer #1063

Merged: 1 commit into openshift:master from coarse-exit-codes-2 on Mar 21, 2022

Conversation

@deads2k (Contributor) commented Mar 17, 2022

This describes the limitations on exit codes from openshift-installer: where they could be used, what they should not be used for, and how this doesn't preclude producing a different interaction later.

@patrickdillon @sdodson @staebler
@dgoodwin do you know Ken's github name?

@openshift-ci openshift-ci bot requested review from kbsingh and mrunalp March 17, 2022 21:29
@dgoodwin (Contributor)

Ken is @xueqzhan

@patrickdillon (Contributor) left a comment

Looks good to me. Left a few notes.

Are there any particular clients we need to get feedback from?

reviewers:
- "@staebler"
approvers:
- "@pdillion"
Contributor:

Suggested change
- "@pdillion"
- "@pdillon"

Comment on lines +28 to +33
`openshift-installer` provides *no guarantee* that exit codes will not change as result of changes like
greater granularity, lower granularity, re-organization, or other needs.
`openshift-installer` *does guarantee* that a particular exit code will not be re-used to have a
different meaning in the same y-level version.
`openshift-installer` will make reasonable efforts to avoid re-using a particular exit code to have
different meanings across y-level versions, but for sufficiently compelling reasons may do so.
Contributor:

This addresses my main concern of encountering Hyrum's Law: it is better to call out that we're not offering a guarantee here than to establish a de facto api.
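For illustration, a minimal consumer sketch of what "not a de facto API" means in practice, assuming the stage codes from the table in this enhancement (3, 4, 5) and an installer binary named `openshift-install`; the defensive default branch is the point:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Run the installer; the binary name here is an assumption.
	cmd := exec.Command("openshift-install", "create", "cluster")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	err := cmd.Run()

	code := 0
	if exitErr, ok := err.(*exec.ExitError); ok {
		code = exitErr.ExitCode()
	} else if err != nil {
		fmt.Fprintln(os.Stderr, "installer did not run:", err)
		os.Exit(1)
	}

	// Codes from the stage table in this enhancement; treat them as
	// hints only, since they may change across y-level versions.
	switch code {
	case 0:
		fmt.Println("install succeeded")
	case 3:
		fmt.Println("likely failed during: infrastructure creation")
	case 4:
		fmt.Println("likely failed during: bootstrapping")
	case 5:
		fmt.Println("likely failed during: wait-for-cluster-install")
	default:
		// Unknown or generic code: fall back to generic handling
		// rather than treating the code as a stable API.
		fmt.Printf("install failed (exit code %d)\n", code)
	}
}
```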



### API Extensions

### Implementation Details/Notes/Constraints [optional]
Contributor:

As shown by #5702, this handling is done by logrus. All relevant logrus calls are contained in the cmd package. Keeping the semantics of the error codes at this high-level does not require significant integration with other packages. Retrieving errors with more detail would require a much more sophisticated design.
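As a rough sketch of what keeping this at the command level could look like, assuming logrus's ExitFunc hook; the stage bookkeeping here (currentStage, runStage) is hypothetical, not the installer's actual code:

```go
// Package cmd sketch: stage-aware exit codes confined to the command
// level, assuming logrus is the exit path.
package cmd

import (
	"os"

	"github.com/sirupsen/logrus"
)

// currentStage and the code table are hypothetical bookkeeping.
var currentStage string

var stageExitCodes = map[string]int{
	"infrastructure creation":  3,
	"bootstrapping":            4,
	"wait-for-cluster-install": 5,
}

func init() {
	// logrus.Fatal* ends by calling the logger's ExitFunc with code 1;
	// overriding it lets the cmd package substitute a stage code.
	logrus.StandardLogger().ExitFunc = func(code int) {
		if stageCode, ok := stageExitCodes[currentStage]; ok {
			code = stageCode
		}
		os.Exit(code)
	}
}

// runStage records which stage is active so a logrus.Fatal anywhere
// below maps to that stage's exit code.
func runStage(name string, fn func() error) {
	currentStage = name
	if err := fn(); err != nil {
		logrus.Fatalf("stage %q failed: %v", name, err)
	}
}
```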



| infrastructure creation | 3 |
| bootstrapping | 4 |
| wait-for-cluster-install | 5 |
| destroy | |
Contributor:

Adding destroy was my suggestion, but on second thought a generic error for destroy is probably not worth the added complexity. These "stages" (except destroy) are all part of the create-cluster flow. Destroy is only run explicitly, so it seems the Generic error code would capture just as much information.


@deads2k (Contributor, Author) commented Mar 18, 2022

Updated for comments.

> Are there any particular clients we need to get feedback from?

An explicit ack from @xueqzhan and/or @dgoodwin. Given flexibility on the consumer side (which they must logically have today), I think future changes can be addressed with minor tweaks (probably just table entries).
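A hypothetical sketch of the "just table entries" idea from the consumer side, keying interpretation on the installer's y-level version (the "4.10" mapping below is illustrative, not an official assignment):

```go
package main

import "fmt"

// stageByCode keys interpretation on the installer's y-level version, so
// a future reshuffle of codes is absorbed as a data change.
var stageByCode = map[string]map[int]string{
	"4.10": {
		3: "infrastructure creation",
		4: "bootstrapping",
		5: "wait-for-cluster-install",
	},
}

func stageFor(version string, code int) string {
	if table, ok := stageByCode[version]; ok {
		if stage, ok := table[code]; ok {
			return stage
		}
	}
	return "generic" // unknown version or code: make no stage claim
}

func main() {
	fmt.Println(stageFor("4.10", 4)) // bootstrapping
}
```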

2. `openshift-installer` provides *no guarantee* that exit codes will not change as result of changes like
greater granularity, lower granularity, re-organization, or other needs.
3. `openshift-installer` *does guarantee* that a particular exit code will not be re-used to have a
different meaning in the same y-level version.
Contributor:

How will a user know what the exit codes for a given version mean? Docs only? Sub-command help? Other?

@deads2k (Author):

> How will a user know what the exit codes for a given version mean? Docs only? Sub-command help? Other?

Keeping this enhancement up to date seems reasonable. I wouldn't get carried away with documenting what amount to best effort codes.

@sdodson (Member) commented Mar 21, 2022

Maybe human readable stderr output like:
exit(1) -- Failed to handle install-config
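A minimal sketch of that suggestion; the exitWithReason helper and its message are illustrative, not existing installer code:

```go
package main

import (
	"fmt"
	"os"
)

// exitWithReason pairs every non-zero exit with a human-readable
// stderr line in the suggested "exit(N) -- reason" shape.
func exitWithReason(code int, reason string) {
	fmt.Fprintf(os.Stderr, "exit(%d) -- %s\n", code, reason)
	os.Exit(code)
}

func main() {
	exitWithReason(1, "Failed to handle install-config")
}
```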

| wait-for-cluster-install | 5 |
| destroy | |

### User Stories
Contributor:

If there is already a consumer in mind, describing that use case here would provide helpful context.

@deads2k (Author) commented Mar 18, 2022

> If there is already a consumer in mind, describing that use case here would provide helpful context.

TRT/openshift CI. Present and ready.

https://github.com/openshift/release/blob/master/ci-operator/step-registry/gather/must-gather/gather-must-gather-commands.sh#L26

Contributor:

Hive/OSD seem like prime candidates to use these codes, too. We should get some input from them about how the codes are defined.
/cc @2uasimojo

@xueqzhan (Contributor) commented Mar 18, 2022

> Updated for comments.
>
> > Are there any particular clients we need to get feedback from?
>
> An explicit ack from @xueqzhan and/or @dgoodwin. Given flexibility on the consumer side (which they must logically have today), I think future changes can be addressed with minor tweaks (probably just table entries).

This design sufficiently satisfies the requirement of TRT use case. Thanks @deads2k!

@staebler (Contributor) left a comment

This seems reasonable to me.

2. infrastructure creation
3. bootstrapping
4. wait-for-cluster-install
5. destroy
Contributor:

What is destroy? Do you mean the step of destroying the bootstrap? Or do you mean destroying the entire cluster?

| Stage | Exit Code |
| --- | --- |
| Generic | whatever other value is produced |
| install-config verification | |
Contributor:

Was the intention to use the error code of 2 here?

Suggested change
| install-config verification | |
| install-config verification | 2 |

@deads2k (Author):

> Was the intention to use the error code of 2 here?

No. This was a stage that @patrickdillon suggested, but I wasn't ready to assign it because it wasn't in the first cut of the PR. I'll reshuffle to add this to the end.


3. bootstrapping
4. wait-for-cluster-install
5. destroy
6. everything else
Contributor:

Thanks for not sneaking in "cvo reports that the cluster installed but some cluster operators are now reporting degraded". 😄

@deads2k (Author) commented Mar 18, 2022

> Thanks for not sneaking in "cvo reports that the cluster installed but some cluster operators are now reporting degraded".

I considered this and decided against it for a couple reasons. I'll enumerate here and if you like, I'll add them to the KEP to help place further boundaries on what (in my opinion) is reasonable.

  1. Having "operators are degraded" or "operators are progressing" would seem handy for consumers, but those conditions are not mutually exclusive, making the actual return value one, the other, or both.
  2. When the cluster fails and the installer determines "operator state is bad" and communicates that as exit code 5, the caller can investigate the operator status themselves. The other failure modes cannot be reliably determined by contacting the kube-apiserver, since it may not be running.
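A hypothetical wrapper illustrating point 2: only the wait-for-cluster-install code (5) implies the kube-apiserver is likely reachable, so only then does the caller inspect operator status itself (binary names openshift-install and oc assumed):

```go
package main

import (
	"errors"
	"os"
	"os/exec"
)

func main() {
	install := exec.Command("openshift-install", "create", "cluster")
	install.Stdout, install.Stderr = os.Stdout, os.Stderr
	err := install.Run()

	var exitErr *exec.ExitError
	// For other failure codes the cluster API may never have come up,
	// so querying it would be unreliable; skip the inspection.
	if errors.As(err, &exitErr) && exitErr.ExitCode() == 5 {
		inspect := exec.Command("oc", "get", "clusteroperators")
		inspect.Stdout, inspect.Stderr = os.Stdout, os.Stderr
		_ = inspect.Run()
	}
}
```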

@openshift-ci openshift-ci bot requested a review from 2uasimojo March 18, 2022 15:53
@dgoodwin (Contributor)
SGTM

@2uasimojo (Member)

> Hive/OSD seem like prime candidates to use these codes, too. We should get some input from them about how the codes are defined.
> /cc @2uasimojo

I'm not seeing a strong motivation for hive to make use of this:

  • The cross-version guarantees are explicitly limited (you said it twice; I'm thinking you might be serious).
  • Hive doesn't predetermine the installer version it invokes.
  • We could discover the version and map to the appropriate set of error codes as suggested... but it would have to be worth the trouble.

Today we worry about two states: success and failure. On failure, we invoke a destroy and (maybe) try again from the beginning. We detect and categorize failures by scraping the logs, and expose some logic allowing the consumer to configure retries based on regex-matching those scrapings. That's brittle, but finer-grained than what these coarse exit codes would support, and can be adapted/adjusted by config (as opposed to code) in effectively realtime.
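Roughly the shape of that mechanism, as a toy sketch rather than hive's actual code: retry categories defined as configurable regexes matched against scraped installer logs:

```go
package main

import (
	"fmt"
	"regexp"
)

// retryRules stands in for operator-configurable patterns; in practice
// these would come from config so they can change without a code change.
var retryRules = map[string]*regexp.Regexp{
	"quota-exceeded":  regexp.MustCompile(`(?i)quota .*exceeded`),
	"dns-propagation": regexp.MustCompile(`(?i)no such host`),
}

// categorize scrapes an installer log and returns the first matching
// failure category, or "unknown" if nothing matches.
func categorize(installerLog string) string {
	for name, re := range retryRules {
		if re.MatchString(installerLog) {
			return name
		}
	}
	return "unknown"
}

func main() {
	fmt.Println(categorize(`error: Quota 'CPUS' exceeded in region us-east1`))
}
```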

If this KEP were a thing, there are a couple of hive use cases I can see being possible -- though again, perhaps not worth the trouble of implementing:

  • Today when we invoke that destroy on failure, there are cases where the destroy fails repeatably and we're stuck. We've been told to file those as bugs against the installer (destroy should be idempotent). However, if we got the "install config processing" exit code, we could be "sure" there's nothing to clean up, and could simply bail without bothering to invoke the destroy.
  • If I get an error code in one of the later stages, can I reinvoke the installer and have it idempotently pick up where it left off? That would be a time-saver for sure. Though again, having to understand which exit codes that applies to, for which versions of the installer, might be more trouble than it's worth to save that time.

/cc @akhil-rane @abutcher @suhanime @newtonheath @gregsheremeta in case your opinions should differ significantly from mine :)

@deads2k (Author) commented Mar 19, 2022

> Hive/OSD seem like prime candidates to use these codes, too. We should get some input from them about how the codes are defined.
> /cc @2uasimojo

Looks like they are inclined to pass, but the TRT/CI use case remains, so there's still an interested and ready consumer, with a PR and staffing to complete and maintain the work, and no conflict with future hive desires.

@deads2k force-pushed the coarse-exit-codes-2 branch from be943b7 to 645a0a2 on March 21, 2022 at 19:34
@patrickdillon (Contributor)

/approve

openshift-ci bot commented Mar 21, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: patrickdillon

The full list of commands accepted by this bot can be found here.

The pull request process is described here


Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 21, 2022
openshift-ci bot commented Mar 21, 2022

@deads2k: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

| infrastructure creation | 3 |
| bootstrapping | 4 |
| wait-for-cluster-install | 5 |
| install-config verification | TBD |
Member:

Whenever we turn this TBD into something else, can we get rid of the word "verification", unless we envision tying this to things like quota verification, etc.? I don't know if this is just limited to ensuring inputs conform to the install-config API or more than that.

@sdodson (Member) commented Mar 21, 2022

/lgtm
I don't see anything that should inhibit implementation.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 21, 2022
@openshift-merge-robot openshift-merge-robot merged commit 0355c50 into openshift:master Mar 21, 2022