
Wait for complete disconnection with pod slaves #248

Conversation

yue9944882
Contributor

BUG Fix

Sometimes deleting a pod slave fails with the following trace.

Nov 13, 2017 5:51:01 AM org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave _terminate
SEVERE: Failed to terminate pod for slave *****
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: DELETE at: *****
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:315)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:268)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:237)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:230)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleDelete(OperationSupport.java:202)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.deleteThis(BaseOperation.java:579)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.delete(BaseOperation.java:525)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.delete(BaseOperation.java:62)
	at org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave._terminate(KubernetesSlave.java:139)
	at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:67)
	at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:59)
	at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:43)
	at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
	at hudson.model.Queue._withLock(Queue.java:1338)
	at hudson.model.Queue.withLock(Queue.java:1215)
	at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
	at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:51)
	at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

This is because disconnecting the pod is an asynchronous process.

If we explicitly block on the disconnection and only delete the pod after it has safely completed, this error will just disappear.
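
To make the intent concrete, here is a minimal sketch of the idea (not the exact merged change): block on the asynchronous disconnect, bounded by a timeout, before issuing the pod DELETE. Names such as DISCONNECT_TIMEOUT_SECONDS, client, namespace and LOGGER are placeholders standing in for the plugin's own members inside KubernetesSlave._terminate().

Computer computer = toComputer();
if (computer != null) {
    // Computer.disconnect(...) is asynchronous and returns a Future; block on it
    // (with a bound) so the DELETE below does not race with a still-attached agent.
    Future<?> disconnected = computer.disconnect(new OfflineCause.ByCLI("Deleting Kubernetes pod"));
    try {
        disconnected.get(DISCONNECT_TIMEOUT_SECONDS, TimeUnit.SECONDS);
    } catch (TimeoutException | InterruptedException | ExecutionException e) {
        LOGGER.log(Level.INFO, "Error waiting for agent disconnection " + name, e);
    }
}
// Only once the disconnection has settled do we ask Kubernetes to delete the pod.
client.pods().inNamespace(namespace).withName(name).delete();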

@carlossg
Contributor

you can push more commits to your branch, no need to create new PRs

@carlossg
Contributor

No, you can force push to your branch and it will update the PR

// Assuming the pod template has some error itself,
// simply returning might leave some Kubernetes pod in an ERROR state
return;
}

} catch (TimeoutException | InterruptedException | ExecutionException e) {
    String msg = String.format("Error waiting for agent disconnection %s: %s", name, e.getMessage());
    LOGGER.log(Level.INFO, msg, e);
}

@carlossg
Contributor

I think it should be simpler: we don't care whether the disconnection happens or not, we will continue with pod deletion either way

@yue9944882
Contributor Author

yue9944882 commented Nov 21, 2017

@carlossg
In my case, I often scale up ~500 Jenkins nodes via Jenkins and disconnect them after about 10 to 20 minutes. It can be very inconvenient to inspect the Jenkins master logs, because hundreds of lines of Java exception stack traces flood stdout/stderr due to the incorrect disconnection. Btw, my Jenkins master is also deployed in k8s, and all the logs are redirected into the Docker JSON file log, which makes it even more inconvenient.

At least, it will be helpful for large-scale production deployments.

@carlossg
Contributor

WDYM? My suggestion will print the same exceptions as yours

@yue9944882
Contributor Author

yue9944882 commented Nov 21, 2017

I think it will be fine if we simply continue to delete pods when an InterruptedException/ExecutionException happens.
But a TimeoutException may mean that the pod is in an offline/error/not-exist state, and it can still cause other exceptions when we continue deleting the pod. How about outputting an error log and just returning when a timeout happens? @carlossg

Allow configuring the timeout as a system property
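
For reference, a hedged sketch of how such a timeout could be exposed as a JVM system property; the property name and default value below are assumptions for illustration, not necessarily what the commit uses.

// Read the disconnection timeout (in seconds) from a system property,
// falling back to a default when it is not set. Property name is assumed.
private static final long DISCONNECTION_TIMEOUT = Long.getLong(
        KubernetesSlave.class.getName() + ".disconnectionTimeout", 5L);

With a property like this, the value could then be overridden at Jenkins startup, e.g. -Dorg.csanchez.jenkins.plugins.kubernetes.KubernetesSlave.disconnectionTimeout=30.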
carlossg previously approved these changes Nov 21, 2017
@carlossg carlossg merged commit bb91ab3 into jenkinsci:master Nov 21, 2017
@yue9944882 yue9944882 deleted the bugfix/complete-pod-disconnect-before-deletion branch April 10, 2019 17:19