Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: fix node decommissioning itself #61356

Merged
merged 3 commits into from
Mar 4, 2021

Commits on Mar 3, 2021

  1. grpcutil: rename IsAuthError and handle PermissionDenied

    `IsAuthenticationError` is often used to return RPC errors immediately
    instead of retrying them. Previously, this only considered
    `codes.Unauthenticated` but not `codes.PermissionDenied`. This could
    cause various operations to run indefinitely due to internal RPC
    retries, e.g. after a node has been decommissioned and lost access to
    the cluster.
    
    Since permission errors should also return immediately, this patch
    changes the function to handle `codes.PermissionDenied` as well.
    However, since a missing permission is an authorization failure rather
    than an authentication failure, the function was renamed to
    `IsAuthError` such that it can cover both authn and authz failures.
    
    Release justification: low risk, high benefit changes to existing functionality
    
    Release note (bug fix): Fix operation hangs when a node loses access to
    cluster RPC (e.g. after it has been decommissioned), and immediately
    return an error instead.
    erikgrinaker committed Mar 3, 2021
    Configuration menu
    Copy the full SHA
    d5f73c6 View commit details
    Browse the repository at this point in the history
  2. server: move TestDecommissionNodeStatus into server_test.go

    The test `TestDecommissionNodeStatus` was initially placed in a separate
    file `decommission_test.go`, even though it belonged in
    `server_test.go`. This was done to avoid import cycles between the
    `testcluster` package and the `server` package, as the
    `decommission_test.go` file could use the `server_test` package instead.
    
    This commit creates the cluster with `serverutils.StartNewTestCluster()`
    instead of `testcluster.StartTestCluster()`, which avoids the import
    cycle and allows it to be moved into `server_test.go`. This required
    exposing `Server.Decommission()` via `serverutils.TestServerInterface`
    as well.
    
    Release justification: non-production code changes
    Release note: None
    erikgrinaker committed Mar 3, 2021
    Configuration menu
    Copy the full SHA
    247f251 View commit details
    Browse the repository at this point in the history
  3. server: fix node decommissioning itself

    Previously, if a node was asked to decommission itself, the
    decommissioning process would sometimes hang or fail since the node
    would become decommissioned and lose RPC access to the rest of the
    cluster while it was building the response.
    
    This patch returns an empty response from the
    `adminServer.Decommission()` call when setting the final
    `DECOMMISSIONED` status, thereby avoiding further use of cluster RPC
    after decommissioning itself. It also defers self-decommissioning until
    the end if multiple nodes are being decommissioned.
    
    This change is backwards-compatible with old CLI versions, which will
    simply output the now-empty status result set before completing with
    "No more data reported on target nodes". The CLI has been updated to
    simply omit the empty response.
    
    Release justification: bug fixes and low-risk updates to new functionality
    
    Release note (bug fix): Fixed a bug from 21.1-alpha where a node
    decommissioning process could sometimes hang or fail when the
    decommission request was submitted via the node being decommissioned.
    erikgrinaker committed Mar 3, 2021
    Configuration menu
    Copy the full SHA
    badefde View commit details
    Browse the repository at this point in the history