Skip to content
This repository has been archived by the owner on May 6, 2022. It is now read-only.

Handle polling errors and update status appropriately #2368

Merged
merged 2 commits into from
Oct 1, 2018
Merged

Handle polling errors and update status appropriately #2368

merged 2 commits into from
Oct 1, 2018

Conversation

nilebox
Copy link
Contributor

@nilebox nilebox commented Sep 28, 2018

This PR is a

  • Feature
  • Bug Fix
  • Documentation

What this PR does / why we need it:
When broker returns any error in the last_operation response, we just retry without updating the status. As a result, in case of provisioning we have instance reporting that it's still provisioning, as if nothing is wrong:

  conditions:
  - lastTransitionTime: 2018-09-20T07:10:16Z
    message: The instance is being provisioned asynchronously
    reason: Provisioning
    status: "False"
    type: Ready

Which issue(s) this PR fixes

Fixes #2369

Please leave this checklist in the PR comment so that maintainers can ensure a good PR.

Merge Checklist:

  • New feature
    • Tests
    • Documentation
  • SVCat CLI flag
  • Server Flag for config
    • Chart changes
    • removing a flag by marking deprecated and hiding to avoid
      breaking the chart release and existing clients who provide a
      flag that will get an error when they try to update

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 28, 2018
return c.processServiceInstancePollingFailureRetryTimeout(instance, readyCond)
}

return c.continuePollingServiceInstance(instance)
if httpErr, ok := osb.IsHTTPError(err); ok {
if isRetriableHTTPStatus(httpErr.StatusCode) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was on the fence whether to keep retrying on getting 400 BadRequest.
It doesn't make sense to retry polling until the issue is fixed on the broker side though.

It seems okay to do svcat touch instance ... after fixing the issue on the broker side?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that a 400 is a terminal error that should stop polling so the user can fix and then use touch to retry try it. We are working on a new command svcat set instance --param foo=bar that will let you fix an instance and touch it in a single command, but it's not merged yet.

).msg("Status: 403; ErrorMessage: <nil>; Description: <nil>; ResponseError: <nil>")
if err := checkEvents(events, expectedEvent.stringArr()); err != nil {
).msg("Status: 400; ErrorMessage: <nil>; Description: <nil>; ResponseError: <nil>")
// Event is sent twice: one for Ready condition and one for Failed
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we see 2 events is that we send separate events for Ready and Failed condition in https://github.com/kubernetes-incubator/service-catalog/blob/621d4e22f5fd6ad137d350e8122b59e01d0c3845/pkg/controller/controller_instance.go#L2710-L2716

It may be better to send only the Failed one, but this is out of scope of this PR.

@@ -1103,13 +1110,18 @@ func isServiceInstanceProcessedAlready(instance *v1beta1.ServiceInstance) bool {
// processServiceInstancePollingFailureRetryTimeout marks the instance as having
// failed polling due to its reconciliation retry duration expiring
func (c *controller) processServiceInstancePollingFailureRetryTimeout(instance *v1beta1.ServiceInstance, readyCond *v1beta1.ServiceInstanceCondition) error {
msg := "Stopping reconciliation retries because too much time has elapsed"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

send the event here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The event is being sent inside processServiceInstancePollingTerminalFailure method invoked 2 lines below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or rather in functions invoked by processServiceInstancePollingTerminalFailure... :)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it all looks very spaghetti :-\

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@artkoshelev You're welcome to submit a PR to refactor this for readability! 😀 Nile did the right thing by sticking with fixing just the bug in this PR.

return c.processServiceInstancePollingFailureRetryTimeout(instance, readyCond)
}

return c.continuePollingServiceInstance(instance)
if httpErr, ok := osb.IsHTTPError(err); ok {
if isRetriableHTTPStatus(httpErr.StatusCode) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that a 400 is a terminal error that should stop polling so the user can fix and then use touch to retry try it. We are working on a new command svcat set instance --param foo=bar that will let you fix an instance and touch it in a single command, but it's not merged yet.

reason := errorPollingLastOperationReason
message := fmt.Sprintf("Error polling last operation: %v", err)
glog.V(4).Info(pcb.Message(message))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log message was removed, but any polling error messages should continue to be logged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will bring it back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -1103,13 +1110,18 @@ func isServiceInstanceProcessedAlready(instance *v1beta1.ServiceInstance) bool {
// processServiceInstancePollingFailureRetryTimeout marks the instance as having
// failed polling due to its reconciliation retry duration expiring
func (c *controller) processServiceInstancePollingFailureRetryTimeout(instance *v1beta1.ServiceInstance, readyCond *v1beta1.ServiceInstanceCondition) error {
msg := "Stopping reconciliation retries because too much time has elapsed"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@artkoshelev You're welcome to submit a PR to refactor this for readability! 😀 Nile did the right thing by sticking with fixing just the bug in this PR.

Copy link
Contributor

@carolynvs carolynvs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: carolynvs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 29, 2018
@jboyd01
Copy link
Contributor

jboyd01 commented Oct 1, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 1, 2018
@k8s-ci-robot k8s-ci-robot merged commit b1a1c45 into kubernetes-retired:master Oct 1, 2018
jboyd01 pushed a commit to jboyd01/service-catalog that referenced this pull request Oct 31, 2018
…ired#2368)

* Handle polling errors and update status appropriately

* Added back the log message about polling error
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Instance status doesn't get updated with polling errors
5 participants