Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 openstackmachine: do not set transient error message and reason #1301

Conversation

seanschneeweiss
Copy link
Contributor

@seanschneeweiss seanschneeweiss commented Jul 18, 2022

What this PR does / why we need it:

With this PR we set failureReason & failureMessage of the OpenStackMachine for the following three cases:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1116

Special notes for your reviewer:

  • Introduced separate functions for setting the failureReason& failureMessage to have a choice when setting the failureReason compared to just setting UpdateMachineError for everything.
  • openstackcluster will pe part of a separate PR
  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • if necessary:
    • includes documentation
    • adds unit tests

/hold
Sean Schneeweiss sean.schneeweiss@mercedes-benz.com, Mercedes-Benz Tech Innovation GmbH, Provider Information

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jul 18, 2022
@netlify
Copy link

netlify bot commented Jul 18, 2022

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Name Link
🔨 Latest commit b9689c6
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/636a0f0ad771bd000923fea3
😎 Deploy Preview https://deploy-preview-1301--kubernetes-sigs-cluster-api-openstack.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 18, 2022
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 18, 2022
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2022
@seanschneeweiss seanschneeweiss force-pushed the seanschneeweiss/transient-errors branch from 9285559 to cb34674 Compare September 19, 2022 12:14
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 19, 2022
@seanschneeweiss seanschneeweiss force-pushed the seanschneeweiss/transient-errors branch 2 times, most recently from f1cab55 to 59e7bc5 Compare September 25, 2022 17:44
@seanschneeweiss seanschneeweiss changed the title [WIP] 🐛 openstackmachine: do not set transient error message and reason 🐛 openstackmachine: do not set transient error message and reason Sep 25, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 25, 2022
@seanschneeweiss seanschneeweiss force-pushed the seanschneeweiss/transient-errors branch from 59e7bc5 to 109b017 Compare September 25, 2022 17:56
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 10, 2022
@seanschneeweiss seanschneeweiss force-pushed the seanschneeweiss/transient-errors branch from 109b017 to 81c9126 Compare October 19, 2022 08:13
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 19, 2022
Copy link
Contributor

@bavarianbidi bavarianbidi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also often get authentication failures in our osm.status.failureMessages, e.g.:

status:
  addresses:
  - address: 10.30.3.242
    type: InternalIP
  failureMessage: 'OpenStack instance cannot be created: get server list: Successfully
    re-authenticated, but got error executing request: Authentication failed'
  failureReason: UpdateError
  instanceState: ACTIVE
  ready: true

WDYT about catching authentication errors for all cases and reconcile them again?

controllers/openstackmachine_controller.go Outdated Show resolved Hide resolved
@seanschneeweiss seanschneeweiss added this to the v0.6.4 milestone Oct 21, 2022
@seanschneeweiss
Copy link
Contributor Author

WDYT about catching authentication errors for all cases and reconcile them again?
@bavarianbidi

Authentication failed relates to error code 401.
https://github.com/gophercloud/gophercloud/blob/0226ea5963656bcd435ab17af475e2345e523fcb/errors.go#L181-L183

It seems that only error codes >= 500 are considered retryable errors (as of now).

func IsRetryable(err error) bool {
var errUnexpectedResponseCode gophercloud.ErrUnexpectedResponseCode
if errors.As(err, &errUnexpectedResponseCode) {
statusCode := errUnexpectedResponseCode.GetStatusCode()
return statusCode >= 500 && statusCode != http.StatusNotImplemented
}
return false
}

I'd say that IsRetryable should be extended to also include some of the errors from https://github.com/gophercloud/gophercloud/blob/master/errors.go.

E.g.

  • 408: The server timed out waiting for the request
  • 429: Too many requests have been sent in a given amount of time. Pause requests, wait up to one minute, and try again.
  • and potentially 401: Authentication failed (as per your recommendation).

I think it is better to make the OpenStack calls via gophercloud retryable then filtering when setting the failureMessage/failureReason.

I'm no expert about OpenStack error codes and how gophercloud handles them - hopefully someone else can support me in this.

@bavarianbidi
Copy link
Contributor

I'd say that IsRetryable should be extended to also include some of the errors from gophercloud/gophercloud@master/errors.go.

E.g.

  • 408: The server timed out waiting for the request
  • 429: Too many requests have been sent in a given amount of time. Pause requests, wait up to one minute, and try again.
  • and potentially 401: Authentication failed (as per your recommendation).

I think it is better to make the OpenStack calls via gophercloud retryable then filtering when setting the failureMessage/failureReason.

I'm no expert about OpenStack error codes and how gophercloud handles them - hopefully someone else can support me in this.

sorry @mdbooth / @jichenjc for direct mentioning.

I'm fine with the proposed solution from @seanschneeweiss regarding extending IsRetryable. But i'm not sure if we oversee something. WDYT?

(FYI: this issue is currently the only one which is marked for milestone 0.6.4)

@seanschneeweiss
Copy link
Contributor Author

seanschneeweiss commented Nov 4, 2022

@bavarianbidi Let's discuss the modification of IsRetryable in a separate issue/PR. Would you agree?

Copy link
Contributor

@mdbooth mdbooth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great cleanup, thanks. Comments inline. Push at your discretion.

controllers/openstackmachine_controller.go Outdated Show resolved Hide resolved
api/v1alpha6/openstackmachine_types.go Outdated Show resolved Hide resolved
@mdbooth
Copy link
Contributor

mdbooth commented Nov 4, 2022

/lgtm
/hold

The /hold is in case you want to address the review comments. Feel free to remove it.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 4, 2022
@oscr
Copy link
Contributor

oscr commented Nov 5, 2022

I just want to say thank you for working on this @seanschneeweiss and reviewers! We are really looking forward to this feature.

@seanschneeweiss seanschneeweiss force-pushed the seanschneeweiss/transient-errors branch from 81c9126 to d28b391 Compare November 6, 2022 09:39
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 6, 2022
@seanschneeweiss
Copy link
Contributor Author

@mdbooth thank you for the great review. All suggestions were added. Would be great if you could check again.

@bavarianbidi
Copy link
Contributor

@bavarianbidi Let's discuss the modification of IsRetryable in a separate issue/PR. Would you agree?

@seanschneeweiss sure, i'm totally fine with that.

Copy link
Contributor

@mdbooth mdbooth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Agree with the nit, but it's not user facing so really not very important. No need for re-review if you decide to fix it!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2022
Signed-off-by: Sean Schneeweiss <sean.schneeweiss@mercedes-benz.com>
@seanschneeweiss seanschneeweiss force-pushed the seanschneeweiss/transient-errors branch from d28b391 to b9689c6 Compare November 8, 2022 08:10
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 8, 2022
@seanschneeweiss
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 8, 2022
@tobiasgiese
Copy link
Member

tobiasgiese commented Nov 8, 2022

/retest

Edit: I think there is an issue with Prow currently

image

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lentzi90, mdbooth, seanschneeweiss

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [mdbooth,seanschneeweiss]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@seanschneeweiss
Copy link
Contributor Author

/retest

1 similar comment
@seanschneeweiss
Copy link
Contributor Author

/retest

@tobiasgiese
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2022
@k8s-ci-robot k8s-ci-robot merged commit c9cca1d into kubernetes-sigs:main Nov 9, 2022
@lentzi90
Copy link
Contributor

lentzi90 commented Nov 9, 2022

I think we should cherry pick this since we have it in the v0.6.4 milestone.
/cherry-pick release-0.6

@k8s-infra-cherrypick-robot

@lentzi90: #1301 failed to apply on top of branch "release-0.6":

Applying: openstackmachine: do not set transient error message and reason
Using index info to reconstruct a base tree...
A	api/v1alpha6/openstackmachine_types.go
M	controllers/openstackmachine_controller.go
M	controllers/openstackmachine_controller_test.go
Falling back to patching base and 3-way merge...
Auto-merging controllers/openstackmachine_controller_test.go
Auto-merging controllers/openstackmachine_controller.go
CONFLICT (content): Merge conflict in controllers/openstackmachine_controller.go
Auto-merging api/v1alpha5/openstackmachine_types.go
CONFLICT (content): Merge conflict in api/v1alpha5/openstackmachine_types.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 openstackmachine: do not set transient error message and reason
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

I think we should cherry pick this since we have it in the v0.6.4 milestone.
/cherry-pick release-0.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lentzi90
Copy link
Contributor

lentzi90 commented Nov 9, 2022

Oh that was not so straight forward

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle transient OpenStack API errors for OpenStackMachines
8 participants