
JENKINS-40825 close resource leak, fix broken pipe error #182

Conversation

marvinthepa
Contributor

resource leak

First of all, Launcher.launch is not called just once, but again and again, overwriting shared members of the enclosing class that a different thread is still using (and only cleaning up the ones that were written last). That is a bad idea, and it caused the thread and connection leak.

So, move them (watch and proc) to local variables, and close them elsewhere (a sketch of the problem follows the list below):

  • watch was already cleaned up in ContainerExecProc
  • proc.kill() was not correctly executed anyway, and is still broken. See the discussion below.
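To make the leak concrete, here is a minimal sketch (illustrative names only, not the plugin's actual classes) of why a shared field leaks under concurrent launches, and what keeping the state local changes:

import java.io.Closeable;
import java.io.IOException;

// Illustrative only: with a field shared by all launch() calls, a second launch
// overwrites the watch of one that is still running; cleanup later closes only
// whatever was written last, leaking the rest.
class SharedWatchDecorator {
    private Closeable watch; // shared across concurrent launch() calls -> leak

    void launch(Closeable newWatch) {
        watch = newWatch; // clobbers the previous launch's watch without closing it
    }

    void cleanup() throws IOException {
        if (watch != null) {
            watch.close(); // only the last-written watch is ever closed
        }
    }
}

// After the fix, each invocation owns its own reference and hands it to the object
// that outlives the call (ContainerExecProc in this PR), which closes it when done.
class LocalWatchDecorator {
    Closeable launch(Closeable newWatch) {
        Closeable watch = newWatch; // local: one per invocation
        return watch;               // the proc wrapper owns and eventually closes it
    }
}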

pipe not connected

Second of all, waitQuietly on the started latch was what caused the "pipe not connected" errors:

In case of an InterruptedException, it caused the launcher to just continue as if started had already been counted down (i.e. as if the websocket connection had been established), when in reality the thread had just been interrupted because of a timeout (probably from DurableTaskStep.check()).
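For illustration, a waitQuietly-style helper (hypothetical, shown only to make the failure mode concrete) swallows the interrupt, so the caller cannot tell "websocket connected" apart from "interrupted by a watchdog timeout" and falls through to write to a pipe that was never connected:

import java.util.concurrent.CountDownLatch;

final class Latches {
    // Hypothetical helper in the style described above: the InterruptedException is
    // swallowed, so the caller continues as if the latch had been counted down.
    static void waitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // flag restored, but execution still falls through
        }
    }
}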

still TODO: proc.kill when step is interrupted

If the sh step inside container is interrupted, DefaultStepContext will create a new Launcher and DurableTaskStep will call kill on that new launcher - it will not call kill on the existing procs (whether they have finished or not). If that launcher does not kill the process (which DurableTaskStep checks), ContainerStepExecution.stop will not be called either.

Keeping a single reference to proc around and calling kill on it when ContainerStepExecution.stop is called (which is what the current master does) would not work anyway.

This is because both ContainerExecDecorator.decorate and Launcher.launch are called multiple times (at least to start the actual script, and then to run a number of ps executions every 10 seconds to check on the process) - so we are very likely to call kill on the wrong instance.

Even if we keep track of all the procs we created - how will we know which is the right one?
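As a sketch of why a single shared reference cannot work (illustrative types, not the plugin's code): every periodic ps launch replaces the reference, so a later kill almost certainly hits a short-lived ps instead of the durable script.

// Illustrative only: the periodic "ps" launches overwrite the shared reference,
// so kill() on it is very likely to target the wrong process.
interface KillableProc {
    void kill();
}

class SingleProcTracker {
    private KillableProc current; // clobbered on every launch, including the ps checks

    KillableProc launch(KillableProc newProc) {
        current = newProc; // the script's proc... until the next ps check replaces it
        return newProc;
    }

    void stop() {
        if (current != null) {
            current.kill(); // probably kills a ps, not the script we actually want to stop
        }
    }
}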

The "right" solution is one like in the Docker Pipeline Plugin:
I.e. open a new connection to the container, find the right process to kill (using the environment variable JENKINS_SERVER_COOKIE=durable*, and kill it.
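A hedged sketch of what that could look like (the exact command may end up different; the cookie value and a GNU userland are assumptions): exec a fresh shell into the container, look up the PID whose environment contains the durable-task cookie, and kill it.

// Hypothetical command for a fresh exec into the container; not necessarily what
// this PR will end up sending. Assumes a GNU userland and a known cookie value.
String cookie = "durable-1234"; // placeholder
String[] killCommand = {
    "sh", "-c",
    "grep -l 'JENKINS_SERVER_COOKIE=" + cookie + "' /proc/*/environ"
        + " | cut -d/ -f3"
        + " | xargs -r kill"
};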

I can implement that solution, but as it has little to do with JENKINS-40825, do you want it in a new pull request?

Sorry for writing such a "novel", I hope I can make myself clear.

Contributor

@iocanel left a comment

The only thing I am not sure about is whether we should remove the call to close() from stop and from the BodyExecutionCallback.

started.await();
} catch (InterruptedException e) {
closeWatch(watch);
throw new IOException("interrupted while waiting for websocket connection");
Contributor

Maybe an InterruptedIOException would be better here.

Contributor Author

Sounds good.
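Presumably the hunk above then ends up roughly like this (a sketch based on the diff, not necessarily the exact merged code; started, watch and closeWatch come from the surrounding class, and InterruptedIOException is java.io.InterruptedIOException):

try {
    started.await();
} catch (InterruptedException e) {
    // close the per-launch watch and report the interruption as an I/O-level failure
    closeWatch(watch);
    throw new InterruptedIOException("interrupted while waiting for websocket connection");
}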

.start();
return false;
}

@Override
public void stop(Throwable cause) throws Exception {
LOGGER.log(Level.FINE, "Stopping container step.");
closeQuietly(client, decorator);
Contributor

This is likely to be called on interruption.

What happens to the proc in this case? Is it handled by Jenkins, or we should do the cleanup ourselves?

Contributor

This also applies to the callback.

Contributor Author

'stop' is not called on interruption, I checked (by manually aborting the build - what other interruptions exist?).

Unfortunately I don't know which callback you are referring to.

Contributor Author

Sorry, it was obvious which callback you meant.

I didn't check that one. If it is actually called on interruption, we can collect all the procs and do an additional clean up there - although I don't think it will be necessary.

Contributor Author

@iocanel:

The callback is actually called on interruption.
I will check if keeping a reference of all the procs makes sense.

Contributor Author

@michaelajr

michaelajr commented Aug 3, 2017

So we just tried out the code for this PR. We had a single job, running concurrent builds, on exclusive agents with random labels. This ensured each running build was on its own label and agent. The podTemplate code simply switched to the JNLP container and did a sleep 300. We were running 17 agents. The first 7 jobs got this during their sleep:

[sleep] Running shell script
+ sleep 300
Cannot contact jenkins-agent-lm0wn-pk3k5: java.io.IOException: interrupted while waiting for websocket connection

The other jobs then immediately stopped printing output and appeared frozen at points before the sleep.

The jobs that got the websocket-connection line completed successfully after the sleep, with:

[sleep] Running shell script
+ sleep 300
Cannot contact jenkins-agent-lm0wn-pk3k5: java.io.IOException: interrupted while waiting for websocket connection
[Pipeline] }
[Pipeline] // container
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // podTemplate
[Pipeline] End of Pipeline
Finished: SUCCESS

(actually 6 got this and a 7th succeeded without the log line)

All the other jobs that were frozen failed with:

java.io.IOException: interrupted while waiting for websocket connection
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:169)
	at hudson.Launcher$ProcStarter.start(Launcher.java:384)
	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:157)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:64)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:167)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:224)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:150)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:108)
	at sun.reflect.GeneratedMethodAccessor199.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1218)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1027)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:150)
	at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:123)
	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:148)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:152)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:122)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:122)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:122)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:122)
	at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:16)
	at WorkflowScript.run(WorkflowScript:63)
	at ___cps.transform___(Native Method)
	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
	at sun.reflect.GeneratedMethodAccessor151.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.dispatch(CollectionLiteralBlock.java:55)
	at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.item(CollectionLiteralBlock.java:45)
	at sun.reflect.GeneratedMethodAccessor148.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:103)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
	at sun.reflect.GeneratedMethodAccessor151.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:60)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
	at sun.reflect.GeneratedMethodAccessor151.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:173)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:162)
	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:122)
	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:261)
	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:162)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:19)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:35)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:32)
	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.runInSandbox(GroovySandbox.java:108)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:32)
	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:330)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$100(CpsThreadGroup.java:82)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:242)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:230)
	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Finished: FAILURE

@marvinthepa
Contributor Author

@michaelajr

Thanks for the test. The exception you are getting now would likely have been the "pipe not connected" error before this change. It means that Jenkins takes more than 10 seconds to connect to Kubernetes and start executing the command. Unfortunately, Jenkins will just interrupt us in this case...

Are you able to paste a minimal example that you are using? That might help, as I was not able to trigger this again just running the same build 10 times...

Probably we are hitting some other limit here, I can't imagine that it takes so long to start executing the command.

Maybe the thread pool is exhausted, or maybe we should try re-using connections instead of opening a new one for each 'ps' that Jenkins wants us to execute... will investigate tomorrow.

@marvinthepa
Contributor Author

Added the implementation of kill.

Will now investigate @michaelajr's issues.

Maybe it helps to add the suggested fix for waitUntilContainerIsReady in marvinthepa@996952c#commitcomment-23453964. I will try that first.

@marvinthepa
Contributor Author

@michaelajr I can reproduce your problems, however only if I run A LOT of jobs at exactly the same time - e.g. 4 of 30 jobs fail..

@michaelajr

@marvinthepa We saw a greater failure ratio yesterday. We definitely need to run/start multiple jobs at or around the same time. We'll test again this morning with the new code.

@marvinthepa
Contributor Author

@iocanel:

I saw a few "Pumpers" hanging around without 1fafc92 (only about 2 for 100 builds or so - I am testing a lot), that's why I applied it. Seems to work so far.

@austinmoore-
Contributor

austinmoore- commented Aug 4, 2017

@marvinthepa Hey, I work with @michaelajr. I just tested the commits you added.

I have a pipeline job that just sleeps for 300 seconds: https://gist.github.com/austinmoore-/b26b61d65072400b64ec205ba31628cb

First, I ran 15 instances of this job. 3/15 of the jobs failed with the IOException and stacktrace mentioned above.

After that I ran it again, but with 20 instances. 7/20 jobs failed with the same IOException and stacktrace.

I can reproduce your problems, however only if I run A LOT of jobs at exactly the same time - e.g. 4 of 30 jobs fail..

I'm not sure, but the failures might only occur when running the same job. When I ran through our deployment jobs (8 simultaneous jobs), I did not encounter this error. Were you running the same job, or different jobs?

@michaelajr

To add to @austinmoore-'s comment, under normal circumstances there are no jobs that run concurrent builds. It was just our test job where we executed concurrent builds, each with its own EXCLUSIVE agent and label. The fact that we were able to run 8 of our REAL jobs at the same time without this issue is a huge step forward. So this is good news. Would be nice to understand why we are seeing such a high failure rate when we run concurrent builds of the same job.

@elementalvoid

elementalvoid commented Aug 4, 2017

EDIT: The following is based on a local build from ed766ca.

To add some more experiences to this: it still appears the issue is related to running a job that uses more than one container in the Pod Template. We've found that when using a Pod Template that has more than one container defined, you're okay to run many parallel jobs (even the same job) as long as you don't use that extra container. As soon as you run commands in the additional containers, we get the following exception and the job fails:

java.io.InterruptedIOException: interrupted while waiting for websocket connection
    at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:179)
    at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:108)

What we've been using for testing is the following Pipeline:

Integer delaySec = 60

timestamps {
    node {
          for(i in [1,2,3,4,5]) {
                println "starting sleep ${delaySec} iteration: ${i}"
                stage("sleep ${i}") {
                        container('mk-foo') {
                            sh("sleep ${delaySec} && pwd")
                        }
                }
                stage("ls ${i}") {
                        container('mk-foo') {
                            sh('ls')
                        }
                }
                println "done sleeping: iteration: ${i}"
          }
    }
}

mk-foo is our extra container -- it's just a node:6 container with no special config.

Then we click the build button ten or so times. In some cases the loop will work but the ls command will take approximately one minute to execute (I suspect a timeout somewhere). In other cases they'll fail as soon as the exception occurs.

@marvinthepa
Contributor Author

I think I found another (probably big) part of the problem.

OkHttpClient by default only runs 5 concurrent requests against the same host, which is probably too few when you are constantly opening websocket connections to that one host.

Seems like I will have to open another pull request to make this configurable in the kubernetes-client, but I will verify my assumption first.
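For reference, the per-host cap lives on OkHttp's Dispatcher; raising it looks roughly like this (a sketch of the OkHttp 3 API; in the plugin it is reached through the kubernetes-client, as shown further down):

import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;

final class HttpClientFactory {
    // OkHttp defaults to 5 concurrent requests per host; every exec websocket upgrade
    // counts against the one Kubernetes API host, so that default throttles parallel builds.
    static OkHttpClient withMaxRequestsPerHost(int maxRequestsPerHost) {
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.setMaxRequestsPerHost(maxRequestsPerHost); // e.g. 32 instead of the default 5
        return new OkHttpClient.Builder().dispatcher(dispatcher).build();
    }
}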

@iocanel
Contributor

iocanel commented Aug 7, 2017

Given that each pod is meant to serve a single job, I can't see why we would possibly need more than 2 concurrent connections to a single pod (one for the sh part and one for the probe). If we don't have any leaks, we should be just fine.

@marvinthepa
Contributor Author

@iocanel: It is not about connections to the pod - it is about connections to KubernetesCloud.serverUrl, which is the same for all builds.

@iocanel
Contributor

iocanel commented Aug 7, 2017

You are right! I think I need more coffee !!!

@marvinthepa
Contributor Author

Here you go: ☕️ .

@JeanMertz
Contributor

JeanMertz commented Aug 7, 2017

Really excited to see the progress being made here, the investigation being done by @marvinthepa, and the help from @iocanel. Keep up the good work 👍

I'm a colleague of @soudy, who initially reported JENKINS-40825, so if there's anything we can do to help out, let us know.

Also: ☕️

@marvinthepa
Contributor Author

@iocanel: I added two new commits:

  • Added the changed waiting behavior: client.pods().inNamespace(namespace).withName(podName).waitUntilReady(CONTAINER_WAIT_TIMEOUT_SECONDS, TimeUnit.SECONDS)
  • Made the number of concurrent requests to Kubernetes configurable

Could I kindly ask you to review those changes?

I am especially unsure about the configuration in the second one - I don't know if it makes sense to make it configurable from the GUI.

I think it should definitely be added to the Config object in the kubernetes-client, as reaching into the KubernetesClient feels dirty:

client.getHttpClient().dispatcher().setMaxRequestsPerHost(maxRequestsPerHost);

Then we could also set it via Java system property or environment variable (instead of or in addition to the GUI).
In any case I think we should provide a more sensible default than 5 - I set it to 32 for now, as that worked reasonably well in my tests and I guess it should not overwhelm the Kubernetes master.
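If the system property route is preferred, a hedged sketch (the property name is made up for illustration; right now the value comes from the cloud configuration GUI):

// Hypothetical: fall back to a system property, defaulting to 32 when it is unset.
int maxRequestsPerHost = Integer.getInteger(
        "kubernetes.client.maxRequestsPerHost", 32); // property name is illustrative only
client.getHttpClient().dispatcher().setMaxRequestsPerHost(maxRequestsPerHost);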

Please let me know what you think. Always happy to incorporate improvements.

@austinmoore-
Contributor

@marvinthepa @iocanel Hey, I just tried out the new changes. I ran 20 jobs, and they all succeeded without issue! This seems to fix all of our issues. This is really great! We'll continue to test this.

@elementalvoid

We've also tested the latest changes with the pipeline I described before. 100 Jobs with no issues! We're getting ready to push this out to our main Jenkins instance. I'll let you know how it goes.

Thanks for the quick work on this @marvinthepa.

@@ -47,6 +47,10 @@
<f:textbox default="10"/>
</f:entry>

<f:entry title="${%Max number of current connections}" field="maxRequestsPerHostStr">


It would be nice to have a more detailed description of what this does on the configuration page. Or maybe changing the title to Max current connections to Kubernetes API would be sufficient.

Contributor Author

Agreed.

Contributor Author

I changed the wording a little bit (tried to keep it as short as possible, sacrificing grammar and elegance) and also pushed the corresponding help file which I forgot to git add 🤦‍♀️ 🤦‍♂️ , which contains a longer description. Okay like that?

@elementalvoid

I have good news: we've upgraded our main Jenkins master with a plugin built from this PR and all is looking well. Prior to this we would begin to experience the broken-pipe issue with just a handful of jobs. Since the upgrade we've been running approximately 15 jobs at minimum and about 22 at most -- all have either succeeded or failed for legitimate reasons. We're no longer experiencing the broken pipe issue.

Thanks again for the quick fix!

@michaelajr

We too have our production Jenkins running this fix. Working great.

@iocanel
Contributor

iocanel commented Aug 7, 2017

That's excellent news.

I am cutting a kubernetes-client 2.6.0 release right now that allows you to configure the max concurrent requests per host.

@iocanel
Contributor

iocanel commented Aug 8, 2017

2.6.1 was released, so we should align this PR accordingly.

@marvinthepa force-pushed the bugfix/JENKINS-40825-broken-pipe-now-really branch from 294ea6d to d4ec16e on August 8, 2017 09:04
@iocanel
Contributor

iocanel commented Aug 8, 2017

Can you rebase, so that we can merge?

@mikesplain

FYI we've tested this with the 2.6.1 changes as well and all our issues seem fixed! Great job!!

Martin Sander added 2 commits August 9, 2017 09:46
First of all, `Launcher.launch` is called not once, but again and again.
I.e. overwriting shared members of the enclosing class that a different
thread is still using (and only cleaning the ones that were written last).
That is a bad idea, and caused the thread and connection leak.

So, move them (`watch` and `proc`) to local variables, and close them
elsewhere:
* `proc.kill()` should be taken care of by the shell step (that needs to be verified)
* `watch` was already cleaned up in `ContainerExecProc` anyway.

Second of all, `waitQuietly` on the `started` latch was what caused
the "pipe not connected" errors:

In case of InterruptedException, it caused the launcher to just continue
as if `started` had already been counted down (i.e. the websocket connection
established), when in reality the thread  just interrupted itself because of a timeout
(probably from [`DurableTaskStep.check()`][1]).

[1]: https://github.com/jenkinsci/workflow-durable-task-step-plugin/blob/workflow-durable-task-step-2.13/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java#L303
Strongly inspired by [Docker Pipeline Plugin][1], but sending
just a single command to the container to be faster.

The command only uses linux standard tools and works on
debian and alpine, so it should be sufficiently portable.

Gets rid of one TODO :-).

[1]: https://github.com/jenkinsci/docker-workflow-plugin/blob/45f0c04/src/main/java/org/jenkinsci/plugins/docker/workflow/WithContainerStep.java#L249,
Martin Sander added 6 commits August 9, 2017 10:46
This solves the actual problem causing JENKINS-40825 - OkHttp limiting
itself to 5 requests per host, which is not sufficient when running
20 builds at once.
Each build uses a websocket connection for the command it runs,
and additionally spawns more connections to check if the command
is still running.
if log level is not set to FINEST..
Although this *should* not happen, just free up the watches when the
ContainerStep finishes for good measure.
we should be fine if we get deserialized - how are we supposed to clean
up those procs in that case?
@marvinthepa force-pushed the bugfix/JENKINS-40825-broken-pipe-now-really branch from 9fc0dea to a11d177 on August 9, 2017 09:02
@marvinthepa
Contributor Author

@iocanel: done, please review again.

@iocanel iocanel merged commit 1eb72f9 into jenkinsci:master Aug 9, 2017
@JeanMertz
Contributor

Unfortunately, with the latest Jenkins version, and the master branch version of this plugin, I am still seeing this issue occur:

ERROR: Execution failed
java.io.IOException: Pipe not connected
	at java.io.PipedOutputStream.write(PipedOutputStream.java:140)
	at java.io.OutputStream.write(OutputStream.java:75)
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$ClosableLauncher.launch(ContainerExecDecorator.java:179)
	at hudson.Launcher$ProcStarter.start(Launcher.java:450)
	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:157)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:64)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:167)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:224)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:150)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:108)
	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
	at com.cloudbees.groovy.cps.sandbox.DefaultInvoker.methodCall(DefaultInvoker.java:19)
	at h.instance(/var/jenkins_home/jobs/blendle/jobs/core-api/branches/nodeport/builds/8/libs/jenkins-pipeline-scripts/vars/h.groovy:197)
	at ___cps.transform___(Native Method)
	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
	at sun.reflect.GeneratedMethodAccessor285.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:173)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:162)
	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:122)
	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:261)
	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:162)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:19)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:35)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:32)
	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.runInSandbox(GroovySandbox.java:108)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:32)
	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:330)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$100(CpsThreadGroup.java:82)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:242)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:230)
	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

This stack trace was preceded by:

ERROR: Execution failed
java.io.IOException: Failed to execute shell script inside container [instance-1] of pod [jenkins-slave-zzgsj-vq16h]. Timed out waiting for container to become ready!
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$ClosableLauncher.launch(ContainerExecDecorator.java:113)
	at hudson.Launcher$ProcStarter.start(Launcher.java:450)
	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:157)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:64)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:167)

This happened, even though the pod was clearly up and running (using kubectl describe).

@marvinthepa
Contributor Author

@JeanMertz:
I don't think so.

at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$ClosableLauncher.launch(ContainerExecDecorator.java:179)

ClosableLauncher does not exist in this pull request. Are you sure you were building from jenkinsci/kubernetes-plugin/master, and not from some other branch? Looks like you are running a plugin built from #180, not this pull request.

@JeanMertz
Contributor

Hmm. Good find @marvinthepa. In fact, I used the new Jenkins integration in this repo to download the hpi from here:

https://ci.jenkins.io/blue/organizations/jenkins/Plugins%2Fkubernetes-plugin/detail/master/10/artifacts

But perhaps something didn't go right there. I'll check and get back to you.

@JeanMertz
Contributor

JeanMertz commented Aug 11, 2017

I just checked on our CI, but it does look like the HPI is correctly built from the latest master branch:

Kubernetes plugin 0.13-SNAPSHOT (private-ec238be6-win2012-6b7d90$)


(not quite sure what the win2012 implies, though). I'll try to rebuild manually from my local machine and see if that changes things.

@marvinthepa
Contributor Author

@JeanMertz stupid question, but did you actually restart your Jenkins?

@JeanMertz
Contributor

JeanMertz commented Aug 13, 2017

UPDATE: actually, scratch that - the results below were from yesterday. I just ran 3 batches of 10 concurrent jobs, and none of them failed. Not sure what happened before, but I can't seem to reproduce it anymore.


I'm not sure what caused this, but it seems to be working now 👍

However, starting 20 jobs, I do still get errors, not sure if they are related though:

java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal Server Error'
	at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


io.fabric8.kubernetes.client.KubernetesClientException: error dialing backend: ssh: rejected: connect failed (Connection refused)
	at io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener.onFailure(ExecWebSocketListener.java:215)
	at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:543)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:185)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@idogoren

Hi,

When do you expect to release a new version of the plugin?
I've tried using versions 0.11 & 0.12 but I'm still experiencing lots of "pipe not connected" and other issues.
When I've tested using 0.13-SNAPSHOT, it didn't reproduce for several runs so it looks much more stable.

@gm42

gm42 commented Aug 15, 2017

I have a reproducible cascade of "Pipe not connected" errors when many build jobs (~20) are triggered, or simply when Jenkins comes back up after having been down for some time.

With the 0.13-SNAPSHOT version I can verify that the "Pipe not connected" problem is gone.

@gm42

gm42 commented Aug 18, 2017

I have to report that although the pipe not connected problem is gone, containers fail with "script returned exit code 2". It might be due to reasons unrelated to this though.

@marvinthepa
Contributor Author

@gm42 if it is -2 not 2, then this might be because Jenkins is not able to get the exit code of the script,
see https://github.com/jenkinsci/durable-task-plugin/blob/0e4656cbd2b8fbd44e42cf2a67decdb202255605/src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java#L208.

If you can reproduce this with a minimal example, it might be worthwhile to investigate this further, as it might point to another problem.

@gm42

gm42 commented Aug 18, 2017

@marvinthepa oh yes, it is actually script returned exit code -2. I also have cases of 2 but those are regular.

Unfortunately all the logs are gone as I have recently rebuilt the Jenkins deployment, so I will not be able to reproduce this - sorry, we are moving away from the plugin for now.

@gm42

gm42 commented Aug 18, 2017

This is how the notifications of build failures look like (5 different projects, obviously some cascade effect):

FAILED: Job 'some/project1/master [24]' (https://example.com/job/org/job/project1/job/master/24/) with
```script returned exit code -2```


[12:23] 
FAILED: Job 'some/project2/master [21]' (https://example.com/job/org/job/project2/job/master/21/) with
```script returned exit code -2```


[12:23] 
FAILED: Job 'some/project3/master [16]' (https://example.com/job/org/job/project3/job/master/16/) with
```script returned exit code -2```


[12:23] 
FAILED: Job 'some/project4/master [23]' (https://example.com/job/org/job/project4/job/master/23/) with
```script returned exit code -2```


[12:23] 
FAILED: Job 'some/project5/master [43]' (https://example.com/job/org/job/project5/job/master/43/) with
```script returned exit code -2```

All failing very closely.

@idogoren

idogoren commented Aug 23, 2017

Hi,

I've managed to reproduce the script returned exit code -2 issue with a very simple example.
I'm using the kubernetes plugin with a snapshot version of 0.13; the Jenkins version is 2.60.2.
My Jenkinsfile is:

def a = [
    'test1': {
        getPod('Pod1')
        {
            node('Pod1') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test2': {
        getPod('Pod2')
        {
            node('Pod2') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test3': {
        getPod('Pod3')
        {
            node('Pod3') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test4': {
        getPod('Pod4')
        {
            node('Pod4') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test5': {
        getPod('Pod5')
        {
            node('Pod5') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test6': {
        getPod('Pod6')
        {
            node('Pod6') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test7': {
        getPod('Pod7')
        {
            node('Pod7') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test8': {
        getPod('Pod8')
        {
            node('Pod8') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test9': {
        getPod('Pod9')
        {
            node('Pod9') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test10': {
        getPod('Pod10')
        {
            node('Pod10') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test11': {
        getPod('Pod11')
        {
            node('Pod11') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test12': {
        getPod('Pod12')
        {
            node('Pod12') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test13': {
        getPod('Pod13')
        {
            node('Pod13') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test14': {
        getPod('Pod14')
        {
            node('Pod14') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test15': {
        getPod('Pod15')
        {
            node('Pod15') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test16': {
        getPod('Pod16')
        {
            node('Pod16') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    },
    'test17': {
        getPod('Pod17')
        {
            node('Pod17') {
                container('devx') {
                    sh "echo hello world"
                }
            }
        }
    }
]

parallel a
parallel a
parallel a

def getPod(podNum, body) {
    node('master') {
        podTemplate(name: jaasOwner + '-jaas', label: podNum, containers: [
            containerTemplate(name: 'jnlp',
                image: 'jnlp:1.0.1',
                args: '${computer.jnlpmac} ${computer.name}'
            ),
            containerTemplate(name: 'devx',
                image: 'ubuntu:16.04',
                ttyEnabled: true,
                privileged: true,
                command: 'cat',
                alwaysPullImage: true
            )
        ], body)
    }
}

I'm using an internal jnlp container, which is part of my organization, so you will need to change it to another one.
Other than that, you should be able to run it.

@marvinthepa
Contributor Author

marvinthepa commented Aug 23, 2017

@i059304

I took the liberty of simplifying your example a little bit:

def a = (1..17).toList().collectEntries {
    def label = "test$it"
    [
            "test$it": {
                podTemplate(name: "pod_test_$it", label: label, containers: [
                        containerTemplate(name: 'devx',
                                image: 'ubuntu:16.04',
                                ttyEnabled: true,
                                command: 'cat',
                        )
                ]) {
                    node(label) {
                        container('devx') {
                            sh "echo hello world"
                        }
                    }
                }
            }
    ]
}

stage('a') {
    parallel a
}
stage('b') {
    parallel a
}
stage('c') {
    parallel a
}

However, this runs successfully for me.

@lionelve

This does seem to fix the "pipe not connected" error. It was getting really bad so thanks!

@jamiecoleman92

Any ideas when an official 0.13 release that will include this bug fix will be made available?

@mikesplain

@jamiecoleman92

@mikesplain Thanks for letting me know. I am currently testing the new plugin in our pipeline as I type!
