JENKINS-40825 close resource leak, fix broken pipe error #182

marvinthepa · 2017-08-03T15:15:32Z

resource leak

First of all, Launcher.launch is called not once, but again and again.
I.e. overwriting shared members of the enclosing class that a different
thread is still using (and only cleaning the ones that were written last).
That is a bad idea, and caused the thread and connection leak.

So, move them (watch and proc) to local variables, and close them
elsewhere:

watch was already cleaned up in ContainerExecProc
proc.kill() was not correctly executed anyway, and this is still broken. See discussion below.

pipe not connected

Second of all, waitQuietly on the started latch was what caused
the "pipe not connected" errors:

In case of InterruptedException, it caused the launcher to just continue
as if started had already been counted down (i.e. the websocket connection
established), when in reality the thread just interrupted itself because of a timeout
(probably from DurableTaskStep.check()).

still TODO: `proc.kill` when step is interrupted

If the sh step inside container is interrupted, DefaultStepContext will create a new Launcher and DurableTaskStep will call kill on it - not call kill on existing procs that have finished (or not). If that one does not kill the process (which DurableTaskStep checks), ContainerStepExecution.stop will also not be called.

Keeping a single reference of proc around and calling kill on it when ContainerStepExecution.stop is called (that is what the current master does) would not work anyway.

This is because both ContainerExecDecorator.decorate and Launcher.launch are called multiple times (at least to start the actual script, and then to run a number of ps executions every 10 seconds to check on the process) - so we are very likely to call kill on the wrong instance.

Even if we keep track of all the procs we created - how will we know which is the right one?

The "right" solution is one like in the Docker Pipeline Plugin:
I.e. open a new connection to the container, find the right process to kill (using the environment variable JENKINS_SERVER_COOKIE=durable*, and kill it.

I can implement that solution, but as it has little to do with JENKINS-40825, do you want it on a new pull request?

Sorry for writing such a "novel", I hope I can make myself clear.

iocanel

The only thing I am not sure about, is if we should remove the call to close() from stop and form the the BodyExecutionCallback.

iocanel · 2017-08-03T15:21:46Z

src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerExecDecorator.java

+ started.await();
+ } catch (InterruptedException e) {
+ closeWatch(watch);
+ throw new IOException("interrupted while waiting for websocket connection");


Maybe an InterruptedIOException would be better here.

Sounds good.

iocanel · 2017-08-03T15:37:09Z

src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerStepExecution.java

 .start();
 return false;
 }

 @Override
 public void stop(Throwable cause) throws Exception {
 LOGGER.log(Level.FINE, "Stopping container step.");
- closeQuietly(client, decorator);


This is likely to be called on interruption.

What happens to the proc in this case? Is it handled by Jenkins, or we should do the cleanup ourselves?

This also applies to the callback.

'stop' is not called on interruption, I checked (by manually aborting the build - what other interruptions exist?).

Unfortunately I don't know which callback you are referring to.

Sorry, it was obvious which callback you meant.

I didn't check that one. If it is actually called on interruption, we can collect all the procs and do an additional clean up there - although I don't think it will be necessary.

@iocanel:

The callback is actually called on interruption.
I will check if keeping a reference of all the procs makes sense.

@iocanel see c9143f8

michaelajr · 2017-08-03T15:44:26Z

So we just tried out the code for this PR. We had a single job, running concurrent builds, on exclusive agents with random labels. This ensured each running job was on its on label and agent. The podTemplate code simply switched to the JNLP container and did a sleep 300. We were running 17 agents. The first 7 jobs got this during its sleep:

[sleep] Running shell script
+ sleep 300
Cannot contact jenkins-agent-lm0wn-pk3k5: java.io.IOException: interrupted while waiting for websocket connection

The other jobs then immediately stopped printing output and appeared frozen at points before the sleep.

The jobs that had the websocket connection line after the sleep completed successfully with

[sleep] Running shell script
+ sleep 300
Cannot contact jenkins-agent-lm0wn-pk3k5: java.io.IOException: interrupted while waiting for websocket connection
[Pipeline] }
[Pipeline] // container
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // podTemplate
[Pipeline] End of Pipeline
Finished: SUCCESS

(actually 6 got this and a 7th succeeded without the log line)

All the other jobs that were frozen failed with:

java.io.IOException: interrupted while waiting for websocket connection
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:169)
	at hudson.Launcher$ProcStarter.start(Launcher.java:384)
	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:157)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:64)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:167)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:224)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:150)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:108)
	at sun.reflect.GeneratedMethodAccessor199.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1218)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1027)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:150)
	at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:123)
	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:148)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:152)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:122)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:122)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:122)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:122)
	at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:16)
	at WorkflowScript.run(WorkflowScript:63)
	at ___cps.transform___(Native Method)
	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
	at sun.reflect.GeneratedMethodAccessor151.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.dispatch(CollectionLiteralBlock.java:55)
	at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.item(CollectionLiteralBlock.java:45)
	at sun.reflect.GeneratedMethodAccessor148.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:103)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
	at sun.reflect.GeneratedMethodAccessor151.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:60)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
	at sun.reflect.GeneratedMethodAccessor151.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:173)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:162)
	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:122)
	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:261)
	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:162)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:19)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:35)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:32)
	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.runInSandbox(GroovySandbox.java:108)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:32)
	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:330)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$100(CpsThreadGroup.java:82)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:242)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:230)
	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Finished: FAILURE

marvinthepa · 2017-08-03T16:50:11Z

@michaelajr

Thanks for the test. The exception you are getting now would likely have been the "pipe not connected" error before this. This means that Jenkins takes more than 10 seconds to connect to kubernetes and start executing the command. Unfortunately, Jenkins will just interrupt us in this case...

Are you able to paste a minimal example that you are using? That might help, as I was not able to trigger this again just running the same build 10 times...

Probably we are hitting some other limit here, I can't imagine that it takes so long to start executing the command.

Maybe the thread pool is exausted, maybe we should try re-using connections instead of opening a new one for each 'ps' that Jenkins wants us to execute... will investigate tomorrow.

marvinthepa · 2017-08-04T09:17:54Z

Added the implementation of kill.

Will now investigate @michaelajr's issues.

Maybe it helps to add the suggested fix for waitUntilContainerIsReady in marvinthepa@996952c#commitcomment-23453964. I will try that first.

marvinthepa · 2017-08-04T12:49:35Z

@michaelajr I can reproduce your problems, however only if I run A LOT of jobs at exactly the same time - e.g. 4 of 30 jobs fail..

michaelajr · 2017-08-04T13:13:46Z

@marvinthepa We saw a greater failure ratio yesterday. We definitely need to run/start multiple jobs at or around the same time. We'll test again this morning with the new code.

marvinthepa · 2017-08-04T14:53:18Z

@iocanel:

I saw a few "Pumpers" hanging around without 1fafc92 (only about 2 for 100 builds or so - I am testing a lot), that's why I applied it. Seems to work so far.

austinmoore- · 2017-08-04T16:42:58Z

@marvinthepa Hey, I work with @michaelajr. I just tested the new commits you've just added.

I have a pipeline job that just sleeps for 300 seconds: https://gist.github.com/austinmoore-/b26b61d65072400b64ec205ba31628cb

First, I ran 15 instances of this job. 3/15 of the jobs failed with the IOException and stacktrace mentioned above.

After than I ran it again but 20 times. 7/20 jobs failed with the same IOException and stacktrace.

I can reproduce your problems, however only if I run A LOT of jobs at exactly the same time - e.g. 4 of 30 jobs fail..

I'm not sure, but the failures might only occur when running the same job. When I ran through our deployment jobs (8 simultaneous jobs), I did not encounter this error. Were you running the same job, or different jobs?

michaelajr · 2017-08-04T17:02:38Z

To add to @austinmoore- 's comment, under normal circumstances there are no jobs that run concurrent builds. It was just our test job where we executed concurrent builds, each with it's own EXCLUSIVE agent and label. The fact that we were able to run 8 of our REAL jobs at the same time without this issue is a huge step forward. So this is good news. Would be nice to understand why we are seeing such a high failure rate when we run concurrent builds of the same job.

elementalvoid · 2017-08-04T19:21:28Z

EDIT: The following is based on a local build from ed766ca.

To add some more experiences to this. It still appears the issue is related to running a job that uses more than one container in the Pod Template. We've found that when using a Pod Template that has greater than 1 container defined you're okay to run many parallel jobs (even the same job) so long as you don't use that extra container. As soon as you run commands in the additional containers we get the following exception and the job fails:

java.io.InterruptedIOException: interrupted while waiting for websocket connection
    at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:179)
    at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:108)

What we've been using for testing is the following Pipeline:

Integer delaySec = 60

timestamps {
    node {
          for(i in [1,2,3,4,5]) {
                println "starting sleep ${delaySec} iteration: ${i}"
                stage("sleep ${i}") {
                        container('mk-foo') {
                            sh("sleep ${delaySec} && pwd")
                        }
                }
                stage("ls ${i}") {
                        container('mk-foo') {
                            sh('ls')
                        }
                }
                println "done sleeping: iteration: ${i}"
          }
    }
}

mk-foo is our extra container -- it's just a node:6 container with no special config.

Then we click the build button ten or so times. In some cases the loop will work but the ls command will take approximately one minute to execute (I suspect a timeout somewhere). In other cases they'll fail as soon as the exception occurs.

marvinthepa · 2017-08-07T06:41:25Z

I think I found another (probably big) part of the problem.

OkHttpClient per default only runs 5 concurrent requests against the same hostname, which is probably to little when you are constantly opening websocket connections to that one host.

Seems like I will have to open another pull request to make this configurable in the kubernetes-client, but I will verify my assumption first.

iocanel · 2017-08-07T07:56:29Z

Given that each pod is meant to serve a single job, I can't see why we would possibly need more than 2 concurrent concurrent connection to a single pod (one for the sh part and one for the probe). If we don't have any leaking, we should be just fine.

marvinthepa · 2017-08-07T09:11:12Z

@iocanel: It is not about connections to the pod - it is about connections to KubernetesCloud.serverUrl, which is the same for all builds.

iocanel · 2017-08-07T09:42:03Z

You are right! I think I need more coffee !!!

marvinthepa · 2017-08-07T09:47:51Z

Here you go: ☕️ .

JeanMertz · 2017-08-07T12:52:34Z

Really excited to see the progress being made here, the investigation being done by @marvinthepa, and the help from @iocanel. Keep up the good work 👍

I'm a colleague of @soudy, who initially reported JENKINS-40825, so if there's anything that we can do to help out, let use now.

Also: ☕️

marvinthepa · 2017-08-07T14:21:32Z

@iocanel: I added two new commits:

Added the changed waiting behavior client.pods().inNamespace(namespace).withName(podName).waitUntilReady(CONTAINER_WAIT_TIMEOUT_SECONDS, TimeUnit.SECONDS)
Made the number of concurrent requests to Kubernetes configurable

Could I kindly ask you to review those changes?

I am especially unsure about the configuration in the second one - don't know if it makes sense to make it configurable from the GUI.

I think it should definitely be added to the Config object in the kubernetes-client, as reaching into the KubernetesClient feels dirty:

client.getHttpClient().dispatcher().setMaxRequestsPerHost(maxRequestsPerHost);

Then we could also set it via Java system property or environment variable (instead of or in addition to the GUI).
In any way I think we should provide a more sensible default than 5 - I set it to 32 for now as that worked reasonably well in my tests, and I guess it should not overwhelm the kubernetes master.

Please let me know what you think. Always happy to incorporate improvements.

austinmoore- · 2017-08-07T14:57:25Z

@marvinthepa @iocanel Hey, I just tried out the new changes. I ran 20 jobs, and they all succeeded with no issue! This seems to fix all of our issues. This really great! We'll continue to test this.

elementalvoid · 2017-08-07T15:13:42Z

We've also tested the latest changes with the pipeline I described before. 100 Jobs with no issues! We're getting ready to push this out to our main Jenkins instance. I'll let you know how it goes.

Thanks for the quick work on this @marvinthepa.

elementalvoid · 2017-08-07T15:18:05Z

src/main/resources/org/csanchez/jenkins/plugins/kubernetes/KubernetesCloud/config.jelly

@@ -47,6 +47,10 @@
 <f:textbox default="10"/>
 </f:entry>

+ <f:entry title="${%Max number of current connections}" field="maxRequestsPerHostStr">


It would be nice to have a more detailed description of what this does on the configuration page. Or maybe changing the title to Max current connections to Kubernetes API would be sufficient.

I changed the wording a little bit (tried to keep it as short as possible, sacrificing grammar and elegance) and also pushed the corresponding help file which I forgot to git add 🤦‍♀️ 🤦‍♂️ , which contains a longer description. Okay like that?

elementalvoid · 2017-08-07T16:30:51Z

I have good news: We've upgraded our main Jenkins master with a plugin built from this PR and all is looking well. Prior to this we would begin to experience the broken pipe issue with a single handful of Jobs. Since the upgrade we've been running approximately 15 Jobs at minimum and about 22 max -- all have either succeeded or failed for legitimate reasons. We're no longer experiencing the broken pipe issue.

Thanks again for the quick fix!

michaelajr · 2017-08-07T17:00:17Z

We too have our production Jenkins running this fix. Working great.

iocanel · 2017-08-07T18:51:54Z

That's excellent news.

I am cutting a kubernetes-client 2.6.0 release right now, that allows you to config max concurrent requests per host.

iocanel · 2017-08-08T07:34:23Z

2.6.1 was released so we should align this pr accordingly.

iocanel · 2017-08-08T17:50:39Z

Can you rebase, so that we can merge?

mikesplain · 2017-08-08T17:54:07Z

FYI we've tested this with the 2.6.1 changes as well and all our issues seem fixed! Great job!!

First of all, `Launcher.launch` is called not once, but again and again. I.e. overwriting shared members of the enclosing class that a different thread is still using (and only cleaning the ones that were written last). That is a bad idea, and caused the thread and connection leak. So, move them (`watch` and `proc`) to local variables, and close them elsewhere: * `proc.kill()` should be taken care of by the shell step (that needs to be verified) * `watch` was already cleaned up in `ContainerExecProc` anyway. Second of all, `waitQuietly` on the `started` latch was what caused the "pipe not connected" errors: In case of InterruptedException, it caused the launcher to just continue as if `started` had already been counted down (i.e. the websocket connection established), when in reality the thread just interrupted itself because of a timeout (probably from [`DurableTaskStep.check()`][1]). [1]: https://github.com/jenkinsci/workflow-durable-task-step-plugin/blob/workflow-durable-task-step-2.13/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java#L303

Strongly inspired by [Docker Pipeline Plugin][1], but sending just a single command to the container to be faster. The command only uses linux standard tools and works on debian and alpine, so it should be sufficiently portable. Gets rid of one TODO :-). [1]: https://github.com/jenkinsci/docker-workflow-plugin/blob/45f0c04/src/main/java/org/jenkinsci/plugins/docker/workflow/WithContainerStep.java#L249,

This solves the actual problem causing JENKINS-40825 - OkHttp limiting itself to 5 requests per hosts, which is not sufficient when running 20 builds at once. Each build uses a websocket connection for the command it runs, and additionally spawns more connections to check if the command is still running.

if log level is not set to FINEST..

Although this *should* not happen, just free up the watches when the ContainerStep finishes for good measure.

we should be fine if we get deserialized - how are we supposed to clean up those procs in that case?

marvinthepa · 2017-08-09T09:18:51Z

@iocanel: done, please review again.

JeanMertz · 2017-08-10T21:38:06Z

Unfortunately, with the latest Jenkins version, and the master branch version of this plugin, I am still seeing this issue occur:

ERROR: Execution failed
java.io.IOException: Pipe not connected
	at java.io.PipedOutputStream.write(PipedOutputStream.java:140)
	at java.io.OutputStream.write(OutputStream.java:75)
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$ClosableLauncher.launch(ContainerExecDecorator.java:179)
	at hudson.Launcher$ProcStarter.start(Launcher.java:450)
	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:157)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:64)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:167)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:224)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:150)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:108)
	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
	at com.cloudbees.groovy.cps.sandbox.DefaultInvoker.methodCall(DefaultInvoker.java:19)
	at h.instance(/var/jenkins_home/jobs/blendle/jobs/core-api/branches/nodeport/builds/8/libs/jenkins-pipeline-scripts/vars/h.groovy:197)
	at ___cps.transform___(Native Method)
	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
	at sun.reflect.GeneratedMethodAccessor285.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:173)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:162)
	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:122)
	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:261)
	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:162)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:19)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:35)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:32)
	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.runInSandbox(GroovySandbox.java:108)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:32)
	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:330)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$100(CpsThreadGroup.java:82)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:242)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:230)
	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

This stack trace was preceded by:

ERROR: Execution failed
java.io.IOException: Failed to execute shell script inside container [instance-1] of pod [jenkins-slave-zzgsj-vq16h]. Timed out waiting for container to become ready!
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$ClosableLauncher.launch(ContainerExecDecorator.java:113)
	at hudson.Launcher$ProcStarter.start(Launcher.java:450)
	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:157)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:64)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:167)

This happened, even though the pod was clearly up and running (using kubectl describe).

marvinthepa · 2017-08-11T08:37:30Z

@JeanMertz:
I don't think so.

at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$ClosableLauncher.launch(ContainerExecDecorator.java:179)

ClosableLauncher does not exist in this pull request. Are you sure you were building from jenkinsci/kubernetes-plugin/master, and not from some other branch? Looks like you are running a plugin built from #180, not this pull request.

JeanMertz · 2017-08-11T09:48:37Z

Hmm. Good find @marvinthepa. In fact, I used the new Jenkins integration in this repo to download the hpi from here:

https://ci.jenkins.io/blue/organizations/jenkins/Plugins%2Fkubernetes-plugin/detail/master/10/artifacts

But perhaps something didn't go right there. I'll check and get back to you.

JeanMertz · 2017-08-11T10:00:53Z

I just checked on our CI, but it does look like the HPI is correctly built from the latest master branch:

Kubernetes plugin 0.13-SNAPSHOT (private-ec238be6-win2012-6b7d90$)

(not quite sure what the win2012 implies, though). I'll try to rebuild manually from my local machine and see if that changes things.

marvinthepa · 2017-08-11T12:04:28Z

@JeanMertz stupid question, but did you actually restart your Jenkins?

JeanMertz · 2017-08-13T21:35:15Z

UPDATE actually, scratch that. below results where from yesterday. I just ran 3 batches of 10 concurrent jobs, and none of them failed. Not sure what happened before, but I can't seem to reproduce it anymore.

I'm not sure what caused this, but it seems to be working now 👍

However, starting 20 jobs, I do still get errors, not sure if they are related though:

java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal Server Error'
	at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


io.fabric8.kubernetes.client.KubernetesClientException: error dialing backend: ssh: rejected: connect failed (Connection refused)
	at io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener.onFailure(ExecWebSocketListener.java:215)
	at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:543)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:185)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

idogoren · 2017-08-14T06:06:23Z

Hi,

When do you expect to release a new version of the plugin?
I've tried using version 0.11 & 0.12 but I'm still expirencing lots of "pipe not connected" and other issues.
When I've tested using 0.13-SNAPSHOT, it didn't reproduce for several runs so it looks much more stable.

gm42 · 2017-08-15T10:26:38Z

I have a reproducible cascade of "Pipe not connected" errors when many build jobs (~20) are triggered, or simply when Jenkins is "unthawed" from some time being down.

With the 0.13-SNAPSHOT version I verify that the "Pipe not connected" problem is gone.

gm42 · 2017-08-18T08:19:06Z

I have to report that although the pipe not connected problem is gone, containers fail with "script returned exit code 2". It might be due to reasons unrelated to this though.

marvinthepa · 2017-08-18T08:31:17Z

@gm42 if it is -2 not 2, then this might be because Jenkins is not able to get the exit code of the script,
see https://github.com/jenkinsci/durable-task-plugin/blob/0e4656cbd2b8fbd44e42cf2a67decdb202255605/src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java#L208.

If you can reproduce this with a minimal example, it might be worthwhile to investigate this further, as it might point to another problem.

gm42 · 2017-08-18T08:33:49Z

@marvinthepa oh yes, it is actually script returned exit code -2. I also have cases of 2 but those are regular.

Unfortunately all the logs are gone as I have recently rebuilt the Jenkins deployment, I will not be able to reproduce this - sorry, we are moving away from the plugin for now.

gm42 · 2017-08-18T08:36:56Z

This is how the notifications of build failures look like (5 different projects, obviously some cascade effect):

FAILED: Job 'some/project1/master [24]' (https://example.com/job/org/job/project1/job/master/24/) with
```script returned exit code -2```


[12:23] 
FAILED: Job 'some/project2/master [21]' (https://example.com/job/org/job/project2/job/master/21/) with
```script returned exit code -2```


[12:23] 
FAILED: Job 'some/project3/master [16]' (https://example.com/job/org/job/project3/job/master/16/) with
```script returned exit code -2```


[12:23] 
FAILED: Job 'some/project4/master [23]' (https://example.com/job/org/job/project4/job/master/23/) with
```script returned exit code -2```


[12:23] 
FAILED: Job 'some/project5/master [43]' (https://example.com/job/org/job/project5/job/master/43/) with
```script returned exit code -2```

All failing very closely.

idogoren · 2017-08-23T09:40:10Z

Hi,

I've managed to reproduce the script returned exit code -2 issue with a very simple example.
I'm using the kubernetes plugin with a snaphot version of 0.13, Jenkins version is 2.60.2.
My Jenkinsfile is:

`def a = [
'test1': {
getPod('Pod1')
{
node('Pod1') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test2': {
getPod('Pod2')
{
node('Pod2') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test3': {
getPod('Pod3')
{
node('Pod3') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test4': {
getPod('Pod4')
{
node('Pod4') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test5': {
getPod('Pod5')
{
node('Pod5') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test6': {
getPod('Pod6')
{
node('Pod6') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test7': {
getPod('Pod7')
{
node('Pod7') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test8': {
getPod('Pod8')
{
node('Pod8') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test9': {
getPod('Pod9')
{
node('Pod9') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test10': {
getPod('Pod10')
{
node('Pod10') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test11': {
getPod('Pod11')
{
node('Pod11') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test12': {
getPod('Pod12')
{
node('Pod12') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test13': {
getPod('Pod13')
{
node('Pod13') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test14': {
getPod('Pod14')
{
node('Pod14') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test15': {
getPod('Pod15')
{
node('Pod15') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test16': {
getPod('Pod16')
{
node('Pod16') {
container('devx') {
sh "echo hello world"
}
}
}
},
'test17': {
getPod('Pod17')
{
node('Pod17') {
container('devx') {
sh "echo hello world"
}
}
}
}
]

parallel a
parallel a
parallel a

def getPod(podNum, body) {
node ('master'){
podTemplate(name: jaasOwner + '-jaas', label: podNum, containers: [
containerTemplate( name: 'jnlp',
image: 'jnlp:1.0.1',
args: '${computer.jnlpmac} ${computer.name}'
),
containerTemplate( name: 'devx',
image: 'ubuntu:16.04',
ttyEnabled: true,
privileged: true,
command: 'cat',
alwaysPullImage: true
)
], body)
}
}`

I'm using as internal jnlp container, which is part of my organization so you will need to change it to another one.
Other than that, you should be able to run it.

marvinthepa · 2017-08-23T12:05:27Z

@i059304

I took the freedom to simplify your example a little bit:

def a = (1..17).toList().collectEntries {
    def label = "test$it"
    [
            "test$it": {
                podTemplate(name: "pod_test_$it", label: label, containers: [
                        containerTemplate(name: 'devx',
                                image: 'ubuntu:16.04',
                                ttyEnabled: true,
                                command: 'cat',
                        )
                ]) {
                    node(label) {
                        container('devx') {
                            sh "echo hello world"
                        }
                    }
                }
            }
    ]
}

stage('a') {
    parallel a
}
stage('b') {
    parallel a
}
stage('c') {
    parallel a
}

However, this runs successfully for me.

lionelve · 2017-08-24T11:56:01Z

This does seem to fix the "pipe not connected" error. It was getting really bad so thanks!

jamiecoleman92 · 2017-08-31T11:10:26Z

Any ideas when an official 0.13 release that will include this bug fix will be made available?

mikesplain · 2017-09-06T20:36:25Z

@jamiecoleman92 Looks like it was released a few hours ago https://github.com/jenkinsci/kubernetes-plugin/releases/tag/kubernetes-1.0

jamiecoleman92 · 2017-09-07T10:49:04Z

@mikesplain Thanks for letting me know. I am currently testing the new plugin in our pipeline as I type!

marvinthepa mentioned this pull request Aug 3, 2017

[JENKINS-40825] move transients into DecoratedLauncher #180

Closed

iocanel requested changes Aug 3, 2017

View reviewed changes

elementalvoid reviewed Aug 7, 2017

View reviewed changes

marvinthepa force-pushed the bugfix/JENKINS-40825-broken-pipe-now-really branch from 294ea6d to d4ec16e Compare August 8, 2017 09:04

Martin Sander added 2 commits August 9, 2017 09:46

Martin Sander added 6 commits August 9, 2017 10:46

Use kubernetes-client Config instead of reaching into client

968d888

typo

15c6842

prevent excessive string creation

f0f9198

if log level is not set to FINEST..

Be more defensive about freeing resources

ef93e9a

Although this *should* not happen, just free up the watches when the ContainerStep finishes for good measure.

silence findbugs error

a11d177

we should be fine if we get deserialized - how are we supposed to clean up those procs in that case?

marvinthepa force-pushed the bugfix/JENKINS-40825-broken-pipe-now-really branch from 9fc0dea to a11d177 Compare August 9, 2017 09:02

iocanel approved these changes Aug 9, 2017

View reviewed changes

iocanel merged commit 1eb72f9 into jenkinsci:master Aug 9, 2017

marvinthepa deleted the bugfix/JENKINS-40825-broken-pipe-now-really branch August 21, 2017 15:55

rawlingsj mentioned this pull request Aug 22, 2017

error starting build pod java.io.IOException: Pipe not connected fabric8io/fabric8-pipeline-library#206

Open

JENKINS-40825 close resource leak, fix broken pipe error #182

JENKINS-40825 close resource leak, fix broken pipe error #182

Conversation

marvinthepa commented Aug 3, 2017

resource leak

pipe not connected

still TODO: proc.kill when step is interrupted

iocanel left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michaelajr commented Aug 3, 2017 • edited Loading

marvinthepa commented Aug 3, 2017

marvinthepa commented Aug 4, 2017

marvinthepa commented Aug 4, 2017

michaelajr commented Aug 4, 2017

marvinthepa commented Aug 4, 2017

austinmoore- commented Aug 4, 2017 • edited Loading

michaelajr commented Aug 4, 2017

elementalvoid commented Aug 4, 2017 • edited Loading

marvinthepa commented Aug 7, 2017

iocanel commented Aug 7, 2017

marvinthepa commented Aug 7, 2017

iocanel commented Aug 7, 2017

marvinthepa commented Aug 7, 2017

JeanMertz commented Aug 7, 2017 • edited Loading

marvinthepa commented Aug 7, 2017

austinmoore- commented Aug 7, 2017

elementalvoid commented Aug 7, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

elementalvoid commented Aug 7, 2017

michaelajr commented Aug 7, 2017

iocanel commented Aug 7, 2017

iocanel commented Aug 8, 2017

iocanel commented Aug 8, 2017

mikesplain commented Aug 8, 2017

marvinthepa commented Aug 9, 2017

JeanMertz commented Aug 10, 2017

marvinthepa commented Aug 11, 2017

JeanMertz commented Aug 11, 2017

JeanMertz commented Aug 11, 2017 • edited Loading

marvinthepa commented Aug 11, 2017

JeanMertz commented Aug 13, 2017 • edited Loading

idogoren commented Aug 14, 2017

gm42 commented Aug 15, 2017

gm42 commented Aug 18, 2017

marvinthepa commented Aug 18, 2017

gm42 commented Aug 18, 2017

gm42 commented Aug 18, 2017

idogoren commented Aug 23, 2017 • edited Loading

marvinthepa commented Aug 23, 2017 • edited Loading

lionelve commented Aug 24, 2017

jamiecoleman92 commented Aug 31, 2017

mikesplain commented Sep 6, 2017

jamiecoleman92 commented Sep 7, 2017

still TODO: `proc.kill` when step is interrupted

michaelajr commented Aug 3, 2017 •

edited

Loading

austinmoore- commented Aug 4, 2017 •

edited

Loading

elementalvoid commented Aug 4, 2017 •

edited

Loading

JeanMertz commented Aug 7, 2017 •

edited

Loading

JeanMertz commented Aug 11, 2017 •

edited

Loading

JeanMertz commented Aug 13, 2017 •

edited

Loading

idogoren commented Aug 23, 2017 •

edited

Loading

marvinthepa commented Aug 23, 2017 •

edited

Loading