Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for error codes #590

Labels
area/ops-productivity Operator productivity related (how to improve operations) effort/1w Effort for issue is around 1 week kind/enhancement Enhancement, improvement, extension needs/planning Needs (more) planning with other MCM maintainers priority/3 Priority (lower number equals higher priority) status/closed Issue is closed (either delivered or triaged)

Comments

@ialidzhikov
Copy link
Member

ialidzhikov commented Jan 25, 2021

What would you like to be added:
Under the gardener org various CRs and API resources support error codes - Shoot, Etcd, extension CRs.
It will be very helpful if the Machine status also supports error codes.

Currently a Machine that fails to be created for example because of invalid credentials has the following example status:

status:
  lastOperation:
    description: "Cloud provider message - AuthFailure: AWS was not able to validate
      the provided access credentials\n\tstatus code: 401, request id: <omitted>"
    lastUpdateTime: "2021-01-25T17:49:06Z"
    state: Failed
    type: Delete

There is nothing wrong with the above status but it is not very helpful when another component needs to interpret the error and to mark it properly (whether it is ERR_INFRA_UNAUTHORIZED, ERR_INFRA_QUOTA_EXCEEDED or any other error code).

We could add lastError to the Machine status that can hold the error code related to the last failed operation:

status:
  lastError:
    description: "Cloud provider message - AuthFailure: AWS was not able to validate
      the provided access credentials\n\tstatus code: 401, request id: <omitted>"
    code: ERR_INFRA_UNAUTHORIZED
  lastOperation:
    description: "Cloud provider message - AuthFailure: AWS was not able to validate
      the provided access credentials\n\tstatus code: 401, request id: <omitted>"
    lastUpdateTime: "2021-01-25T17:49:06Z"
    state: Failed
    type: Delete

Initially this enhancement is requested in gardener/gardener#3020 where the concrete requirement for machine-controller-manager is to properly detect and flag misconfigured PodDisruptionBudgets that require zero voluntary Pod evictions and prevent graceful Node drain.

Why is this needed:

  • To establish a proper contract between machine-controller-manager and component that heavily rely on Machine status (extension controllers, gardenlet)
  • To enrich the Machine status with valuable information

/area ops-productivity

@ialidzhikov ialidzhikov added the kind/enhancement Enhancement, improvement, extension label Jan 25, 2021
@gardener-robot gardener-robot added the area/ops-productivity Operator productivity related (how to improve operations) label Jan 25, 2021
@rfranzke
Copy link
Member

There is nothing wrong with the above status but it is not very helpful when another component needs to interpret the error and to mark it properly (whether it is ERR_INFRA_UNAUTHORIZED, ERR_INFRA_QUOTA_EXCEEDED or any other error code).

Hm, how would MCM do it? It cannot do it better compared to how Gardener/the extension do it, or?

Initially this enhancement is requested in gardener/gardener#3020 where the concrete requirement for machine-controller-manager is to properly detect and flag misconfigured PodDisruptionBudgets that require zero voluntary Pod evictions and prevent graceful Node drain.

As discussed earlier, a more streamlined approach could be a condition on the Machine (instead of an error code).

Generally, I'm not sure if this complexity in MCM would be helpful. The error codes are a Gardener thing only and I don't see a strong reason to introduce them in other components as well.

@ialidzhikov
Copy link
Member Author

ialidzhikov commented Jan 25, 2021

Hm, how would MCM do it? It cannot do it better compared to how Gardener/the extension do it, or?

MCM interacts with the cloud provider sdk and it can check the type of the returned error. Based on the returned error type from the cloud provider sdk, it can evaluate whether it is authentication/authorization/quota error and properly set the error code in the status. I believe this approach is more stable than the pattern matching from provider extension controller side over the Machine status.lastOperation.description. IMO one drawback of the current handling is that the error message under Machine status.lastOperation.description can be changed depending on the cloud provider sdk and this change then needs to be maintained in g/g - ref https://github.com/gardener/gardener/blob/v1.15.5/pkg/apis/core/v1beta1/helper/errors.go#L52-L59. I still don't have any draft locally but my assumption is that the error codes will be handled better in machine-controller-manager than Gardener/the extension. WDYT?

As discussed earlier, a more streamlined approach could be a condition on the Machine (instead of an error code).

The PDB case is different from authentication/authorization/quota errors and will check how to properly handle and indicate these type of errors.

@rfranzke
Copy link
Member

I agree that the error code mapping can be improved because it's somewhat fragile at the moment. We have an open issue about moving the error code mapping out of g/g and to the provider extensions, but I doubt that this would change the way how the mapping is done (string-based).
In the overall picture, assuming that MCM would support the error codes, the problem would still be that the provider extensions most commonly use Terraform and have to map the output string to error codes. Also, we have other places like the Event reading for failing Services where we parse the CCM's message. MCM is only one piece of the puzzle, and I'm not sure whether we would gain that much if it would support the codes.

@prashanth26 prashanth26 added priority/3 Priority (lower number equals higher priority) effort/1w Effort for issue is around 1 week labels Mar 30, 2021
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Sep 27, 2021
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Mar 26, 2022
@himanshu-kun himanshu-kun added needs/planning Needs (more) planning with other MCM maintainers and removed lifecycle/rotten Nobody worked on this for 12 months (final aging stage) labels Feb 22, 2023
@himanshu-kun
Copy link
Contributor

We discussed internally and propose the following machineStatus:

status:
  lastOperation:
    description: "Cloud provider message - AuthFailure: AWS was not able to validate
      the provided access credentials\n\tstatus code: 401, request id: <omitted>"
    lastUpdateTime: "2021-01-25T17:49:06Z"
    reasonCode: ResourceExhausted
    state: Failed
    type: Delete

MCM already has a well maintained collection of different kinds of error codes.

We prefer having a reasonCode than lastErr struct as the description would get duplicated in the latter.

This issue is requirement for solving a more urgent issue on CA
@ialidzhikov @rfranzke Pls let us know if you feel otherwise
cc @unmarshall

@himanshu-kun
Copy link
Contributor

himanshu-kun commented Aug 17, 2023

While introducing this mapping introduced in AWS we should also try to see if changes in this PR gardener/machine-controller-manager-provider-aws#59 can be reverted.

cc @rishabh-11

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ops-productivity Operator productivity related (how to improve operations) effort/1w Effort for issue is around 1 week kind/enhancement Enhancement, improvement, extension needs/planning Needs (more) planning with other MCM maintainers priority/3 Priority (lower number equals higher priority) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
5 participants