
Wait for complete disconnection with pod slaves #248

Conversation

yue9944882
Contributor

BUG Fix

Sometimes deleting a pod slave fails with the following trace.

Nov 13, 2017 5:51:01 AM org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave _terminate
SEVERE: Failed to terminate pod for slave *****
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: DELETE at: *****
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:315)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:268)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:237)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:230)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleDelete(OperationSupport.java:202)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.deleteThis(BaseOperation.java:579)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.delete(BaseOperation.java:525)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.delete(BaseOperation.java:62)
	at org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave._terminate(KubernetesSlave.java:139)
	at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:67)
	at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:59)
	at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:43)
	at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
	at hudson.model.Queue._withLock(Queue.java:1338)
	at hudson.model.Queue.withLock(Queue.java:1215)
	at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
	at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:51)
	at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

This is because disconnecting the pod is an asynchronous process.

If we explicitly block on the disconnection and only delete the pod after it has safely completed, this error will just disappear.
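
To make the intent concrete, here is a minimal sketch of the idea (not the exact merged change): block on the asynchronous disconnect, bounded by a timeout, before issuing the pod DELETE. Names such as DISCONNECT_TIMEOUT_SECONDS, client, namespace and LOGGER are placeholders standing in for the plugin's own members inside KubernetesSlave._terminate().

Computer computer = toComputer();
if (computer != null) {
    // Computer.disconnect(...) is asynchronous and returns a Future; block on it
    // (with a bound) so the DELETE below does not race with a still-attached agent.
    Future<?> disconnected = computer.disconnect(new OfflineCause.ByCLI("Deleting Kubernetes pod"));
    try {
        disconnected.get(DISCONNECT_TIMEOUT_SECONDS, TimeUnit.SECONDS);
    } catch (TimeoutException | InterruptedException | ExecutionException e) {
        LOGGER.log(Level.INFO, "Error waiting for agent disconnection " + name, e);
    }
}
// Only once the disconnection has settled do we ask Kubernetes to delete the pod.
client.pods().inNamespace(namespace).withName(name).delete();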

@carlossg
Contributor

you can push more commits to your branch, no need to create new PRs

@carlossg
Contributor

No, you can force push to your branch and it will update the PR

// Assuming the pod template has some error itself,
// simply returning might leave some Kubernetes pod in an ERROR state
return;
}

} catch (TimeoutException | InterruptedException | ExecutionException e) {
    String msg = String.format("Error waiting for agent disconnection %s: %s", name, e.getMessage());
    LOGGER.log(Level.INFO, msg, e);
}

@carlossg
Contributor

I think it should be simpler: we don't care whether the disconnection happens or not, we will continue with pod deletion either way

@yue9944882
Contributor Author

yue9944882 commented Nov 21, 2017

@carlossg
In my case, I often scale up ~500 Jenkins nodes via Jenkins and disconnect them after about 10 to 20 minutes. It can be very inconvenient to inspect the Jenkins master logs, because hundreds of lines of Java exception stack traces flood stdout/stderr due to the incorrect disconnection. Btw, my Jenkins master is also deployed in k8s, and all the logs are redirected into the Docker JSON file log, which makes it even more inconvenient.

At least, it will be helpful for large-scale production deployments.

@carlossg
Contributor

WDYM? My suggestion will print the same exceptions as yours

@yue9944882
Contributor Author

yue9944882 commented Nov 21, 2017

I think it will be fine if we simply continue to delete pods when an InterruptedException/ExecutionException happens.
But a TimeoutException may mean that the pod is in an offline/error/not-exist state, and it can still cause other exceptions when we continue deleting the pod. How about outputting an error log and just returning when a timeout happens? @carlossg

Allow configuring the timeout as a system property
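
For reference, a hedged sketch of how such a timeout could be exposed as a JVM system property; the property name and default value below are assumptions for illustration, not necessarily what the commit uses.

// Read the disconnection timeout (in seconds) from a system property,
// falling back to a default when it is not set. Property name is assumed.
private static final long DISCONNECTION_TIMEOUT = Long.getLong(
        KubernetesSlave.class.getName() + ".disconnectionTimeout", 5L);

With a property like this, the value could then be overridden at Jenkins startup, e.g. -Dorg.csanchez.jenkins.plugins.kubernetes.KubernetesSlave.disconnectionTimeout=30.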
carlossg previously approved these changes Nov 21, 2017
@carlossg carlossg merged commit bb91ab3 into jenkinsci:master Nov 21, 2017
@yue9944882 yue9944882 deleted the bugfix/complete-pod-disconnect-before-deletion branch April 10, 2019 17:19