MCM should not try to drain machines that have not joined the cluster #465

Closed
rfranzke opened this issue May 28, 2020 · 6 comments · Fixed by #480
Labels
area/quality Output qualification (tests, checks, scans, automation in general, etc.) related component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) effort/1d Effort for issue is around 1 day exp/beginner Issue that requires only basic skills kind/bug Bug priority/2 Priority (lower number equals higher priority) status/new Issue is new and unprocessed

Comments

@rfranzke
Member

What happened:
MCM is trying to delete some Machine objects and, as part of that, tries to drain the corresponding Node objects. However, because these Machines never joined the cluster, no such Node objects exist. The drain keeps failing with:

status:
  currentStatus:
    lastUpdateTime: "2020-05-28T15:30:18Z"
    phase: Terminating
  lastOperation:
    description: Drain failed - resource name may not be empty
    lastUpdateTime: "2020-05-28T15:45:52Z"
    state: Failed
    type: Delete

Logs of the machine-controller-manager:

I0528 15:45:42.432992       1 deployment.go:448] Processing the machinedeployment "shoot--foo--bar-cpu-worker" (with replicas 4)
W0528 15:45:42.639551       1 machine.go:658] Drain failed for machine "shoot--foo--bar-cpu-worker-7cdf986ff9-pzcgs".
Buf:
ErrBuf:
Err-Message:resource name may not be empty
W0528 15:45:42.785280       1 machine.go:658] Drain failed for machine "shoot--foo--bar-cpu-worker-7cdf986ff9-62mcz".
Buf:
ErrBuf:
Err-Message:resource name may not be empty
W0528 15:45:42.836888       1 machine.go:658] Drain failed for machine "shoot--foo--bar-cpu-worker-7cdf986ff9-k46s5".
Buf:
ErrBuf:
Err-Message:resource name may not be empty
I0528 15:45:42.883235       1 machine.go:551] Deleting Machine "shoot--foo--bar-cpu-worker-7cdf986ff9-pzcgs"
E0528 15:45:42.883357       1 drain.go:193] Error getting details for node: "". Err: resource name may not be empty
I0528 15:45:42.883371       1 drain.go:175] Machine drain ended on 2020-05-28 15:45:42.883368449 +0000 UTC m=+47.035130779 and took 90.578µs for ""

What you expected to happen:
MCM should not try to drain these nodes.
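
The expected behavior can be sketched as a guard in the deletion path. This is only an illustration, not MCM's actual code: the `machine` type, `drainNode`, and `deleteMachine` are hypothetical names, and the assumption is that a machine that never joined the cluster has an empty node name (consistent with the `Error getting details for node: ""` log line above).

```go
package main

import (
	"errors"
	"fmt"
)

// machine is a hypothetical, stripped-down stand-in for MCM's Machine object.
// Only the field relevant to this issue is modelled.
type machine struct {
	Name     string
	NodeName string // assumed to be empty when the machine never joined the cluster
}

// errEmptyName mimics the error text seen in the logs above.
var errEmptyName = errors.New("resource name may not be empty")

// drainNode is a hypothetical placeholder for the real drain logic.
// Like the node lookup in the logs, it fails when given an empty node name.
func drainNode(nodeName string) error {
	if nodeName == "" {
		return errEmptyName
	}
	return nil
}

// deleteMachine drains the backing node only if one exists.
// The boolean result reports whether a drain was attempted.
func deleteMachine(m machine) (bool, error) {
	if m.NodeName == "" {
		// No Node object was ever registered, so there is nothing to drain.
		return false, nil
	}
	if err := drainNode(m.NodeName); err != nil {
		return true, fmt.Errorf("drain failed for machine %q: %w", m.Name, err)
	}
	return true, nil
}

func main() {
	neverJoined := machine{Name: "worker-a"}
	joined := machine{Name: "worker-b", NodeName: "node-b"}

	for _, m := range []machine{neverJoined, joined} {
		drained, err := deleteMachine(m)
		fmt.Printf("machine=%s drained=%t err=%v\n", m.Name, drained, err)
	}
}
```

With a check like this, the machine without a node would go straight to deletion instead of ending up in the `Drain failed` state.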

How to reproduce it (as minimally and precisely as possible):
Create machine objects that won't join the cluster and then try to delete them.

Environment:
MCM v0.29.0

@rfranzke rfranzke added the kind/bug Bug label May 28, 2020
@rfranzke
Member Author

/cc @tim-ebert

@vlerenc
Member

vlerenc commented May 29, 2020

/bark @rfranzke
The /cc command works like /ping and the others. Maybe you simply want no reaction to /cc? Or just no sweets? ;-)

@prashanth26 prashanth26 added area/quality Output qualification (tests, checks, scans, automation in general, etc.) related component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) exp/beginner Issue that requires only basic skills size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) status/new Issue is new and unprocessed labels Jun 1, 2020
@vpnachev
Member

vpnachev commented Jun 25, 2020

@gardener/mcm-maintainers any updates on this issue?

I have an Azure VM that failed to join, and now its deletion is stuck. However, the only resource I can see in the Azure portal is the network interface; the VM itself has been deleted.

PS: I am using the latest MCM version, v0.31.0.

@prashanth26
Contributor

prashanth26 commented Jun 26, 2020

Hi Guys,

We haven't made any progress on this issue yet. I think we need to pick it up with priority, as it is affecting multiple clusters.

Although these machines do get deleted after the drain timeout, it is still not a good idea to keep such machines lying around.

/priority critical

cc: @hardikdr @amshuman-kr

@gardener-robot gardener-robot added the priority/critical Needs to be resolved soon, because it impacts users negatively label Jun 26, 2020
@gardener-robot gardener-robot added priority/2 Priority (lower number equals higher priority) effort/1d Effort for issue is around 1 day and removed priority/critical Needs to be resolved soon, because it impacts users negatively size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Mar 8, 2021
5 participants