weird behavior during log wait condition during container start #344
i think this problem was caused by a corrupt … i do, however, have a different problem now related to stopping the container.
and this is what i get when it hangs:
i haven't figured out if this is a docker issue or a plugin issue. initially i thought it was a problem w/ it taking too long for docker to kill the container, so i added … thoughts?
Sorry, no idea here so far. Could you create a thread dump when it hangs, in order to see where the plugin currently is?
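In case attaching an external tool is awkward in the Jenkins environment, here is a minimal, hypothetical helper (not part of the plugin) for printing a thread dump from inside the JVM; the usual route would be running the JDK's jstack tool against the Maven process:

```java
import java.util.Map;

// Minimal sketch: print a thread dump from inside the running JVM.
// This is only an in-process alternative to `jstack <pid>`.
public class ThreadDumper {
    public static void dump() {
        Map<Thread, StackTraceElement[]> traces = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> entry : traces.entrySet()) {
            Thread t = entry.getKey();
            System.err.printf("\"%s\" state=%s%n", t.getName(), t.getState());
            for (StackTraceElement frame : entry.getValue()) {
                System.err.println("    at " + frame);
            }
            System.err.println();
        }
    }
}
```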
maybe - let me see if i can get it to hang outside of jenkins and i can hit it w/ …
it looks like …
i wonder if there's a way to force …
this is the output from a successful …
so my current thought is the container is taking too long to shut down and the … i'm still digging deeper.
I'm running into the same httpclient issue too. In my case, it's stuck on a GET to unix://127.0.0.1:1/v1.18/containers/23fb3c95d18c/json
This is running on an EC2 instance, Ubuntu 14.04. Please let me know if there is any other information which will help.
Here is the tail end of the maven debug logs I got from a different repro.
What's interesting to me is that the request is made in one thread, and received in another thread (in this case http-outgoing-2 vs. http-outgoing-1). I can see it is also the case with the OP's example of a failed POST, where the GET is made on http-outgoing-35 and the 200 OK is received on http-outgoing-34. In the OP's example of a successful POST, the same thread makes the request and receives the response. Not sure if this is pertinent information...
I think it is normal behaviour for Apache's HcClient to send the request on one thread and receive it on another. At least this is what I always see for every request.
What's interesting in your logs, though, is that the server responds with a content-length of 1739 but then doesn't seem to continue sending anything. This could have two reasons: either the client is not reading (but from the stacktrace it is blocking on this read, so I don't think that is the case) or the server is not sending (which I think is more likely). IMO this is an issue on the Docker daemon side; could you check the logs there, too?
What's strange here is that the response stream still contains some input from a log-reading connection before the HTTP response header is received. This could be the reason for the hang, since the http client is still waiting for a valid response, I guess. So the problem seems to be that connections are reused too early by HcClient (i.e. while still in the middle of processing log output)? You probably don't have the logs anymore, but when this happens the next time, could you check in the log where the blocking connection is also used (here it would be …)
@brian98point6 Sorry, I misunderstood the logs. What you see are not threads, but connection objects which are potentially reused because the plugin uses a pooling connection manager.
If someone has a full log with debug enabled for the blocking case, that would help me quite a bit. The sequence of connection leasing / using / releasing can best be seen with …
There seem to be issues with long-running HTTP requests, like fetching logs with chunked encoding, combined with pooled connections. It seems that connections get released too soon and then reused in a different context while there is still log data to read. Now logging uses a non-pooled, single-connection client for async log fetching, which hopefully resolves the issues reported in #344 and #259.
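As a rough illustration of that change (a sketch of the idea only, not the plugin's actual code, assuming Apache HttpClient 4.x), regular API calls keep using a pooled connection manager while the long-running, chunked log stream gets its own single-connection client, so its connection can never be handed out to another request mid-stream:

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.BasicHttpClientConnectionManager;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// Sketch of the idea behind the 0.15.1 change, not the plugin's actual code.
public class DockerClients {

    // Shared client for ordinary API requests: connections are pooled and reused.
    public static CloseableHttpClient apiClient() {
        PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
        pool.setMaxTotal(20);
        return HttpClients.custom().setConnectionManager(pool).build();
    }

    // Dedicated client for the async log fetch: exactly one connection, never pooled,
    // so a half-read log stream cannot leak into an unrelated request.
    public static CloseableHttpClient logStreamClient() {
        BasicHttpClientConnectionManager single = new BasicHttpClientConnectionManager();
        return HttpClients.custom().setConnectionManager(single).build();
    }
}
```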
Thanks for clarifying threads vs. connection objects. You can tell I'm just guessing around :) Output of grep " Connection" on the debug log I saved last week: … I noticed that there was a "route allocated: 0" and found that http-outgoing-1 was closed when my postgres container was ready, and then used again during the db-migrate container run. Is that odd?
@brian98point6 Thanks a lot ;-). BTW, could you try 0.15.1, released just today? I changed the log handling quite a bit to use a dedicated HttpClient, so it shouldn't interfere with the rest anymore. From my feeling, this should fix a lot of the issues we had. Would be happy if you could give it a try ;-)
i haven't seen this problem happen for a little while. @brian98point6, if you still see this problem happening, can you try upgrading your version of docker? unfortunately i didn't record the version i was using when i filed this issue so it's also possible some of this issue was on docker's end.
@rhuss I tried v0.15.1 with the older docker version. It's successful 5 for 5, looking good. Thanks!! @jgangemi I tried v0.15.0 after upgrading docker from 1.7.0 to 1.9.1. It's still failing about half the time. Are you on the latest version of docker now? Meanwhile, I'll use 0.15.1 in our CI pipeline and confirm that the issue is resolved by tomorrow.
I've run maybe another 20 times. No hanging. So I think the issue is resolved with v0.15.1. Thank you so much!
@brian98point6 @jgangemi I found some issues, probably within the Unix socket implementation. How do you connect to the Docker daemon, via TCP or via Unix socket?
both - on my mac it's tcp, on linux it's the unix socket.
Ditto, but we haven't been able to reproduce this issue on mac/tcp, only on linux/unix sockets. BTW, I'm on the same team as @thomasvandoren. Mentioning this because I assume this is related to #436 (comment)
i'm going to try and see if i can verify if this is a unix socket vs tcp thing myself right now b/c this issue has crept back up for us.
ok - tcp works fine, unix socket does not; however, where it decides to hang in our rather large reactor build is non-deterministic.
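For anyone who wants to poke at the unix socket path outside of maven/jenkins, here is a rough sketch (assuming the jnr-unixsocket library; the two jnr class names below come from that library, everything else is hypothetical) that issues a plain HTTP /_ping request directly against the daemon socket:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

import jnr.unixsocket.UnixSocketAddress;
import jnr.unixsocket.UnixSocketChannel;

// Rough sketch: talk HTTP to the Docker daemon's unix socket directly,
// bypassing the plugin, to see whether the socket itself behaves.
public class UnixSocketPing {
    public static void main(String[] args) throws Exception {
        UnixSocketAddress address = new UnixSocketAddress(new File("/var/run/docker.sock"));
        try (UnixSocketChannel channel = UnixSocketChannel.open(address)) {
            PrintWriter out = new PrintWriter(Channels.newOutputStream(channel), true);
            out.print("GET /_ping HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n");
            out.flush();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(Channels.newInputStream(channel), StandardCharsets.UTF_8));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // expect "HTTP/1.1 200 OK" and a body of "OK"
            }
        }
    }
}
```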
In my analysis it looked like the unix socket access sometimes gets confused when accessed from multiple threads. We use a second thread for checking the log wait condition asynchronously, so a proper solution would be to avoid this second thread and do the check on the main thread, which also prints the logging. Currently I don't see another solution except fixing the jnr-unixsocket library.
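Very roughly, that single-thread idea would look like the sketch below: the calling thread itself reads log lines until the wait pattern shows up, instead of handing the stream to a background reader. All names here are hypothetical; this is not the plugin's actual API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical sketch of doing the log wait on the main thread, so that only
// one thread ever touches the stream backed by the unix socket.
public class LogWait {

    /** Reads container log output on the calling thread until the pattern matches
     *  or the deadline passes. Returns true if the pattern was seen. */
    public static boolean waitForLog(InputStream logStream, Pattern pattern, long timeoutMillis)
            throws IOException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(logStream, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);         // still print the log as it arrives
            if (pattern.matcher(line).find()) {
                return true;                  // wait condition satisfied
            }
            if (System.currentTimeMillis() > deadline) {
                return false;                 // give up instead of hanging forever
            }
            // note: readLine() itself can still block between checks; a real
            // implementation would also need a read timeout on the socket.
        }
        return false;                         // stream ended without a match
    }
}
```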
is there another unix socket library we could try? |
not that I'm aware of. |
from a quick google search... https://github.com/mcfunley/juds
We are still seeing this problem with 0.15.9.
I tried junixsocket, but it had even more severe issues. I really think the single-thread solution is the way to go, but it's not a small task. Hopefully we can tackle that soon, but we also love contributions ;-)
the last reason of … in the meantime, using the …
so this is a fun one...
it seems like every so often, something is interrupting the stream returned from docker while waiting on a log condition. the annoying thing is, i have no real test case that can easily reproduce this. it seems to occur at random during a rather large monolithic build. each project that needs the container starts one, which in turn causes a bunch of sql scripts to run which take ~30s or so.
basically the read call is blocking on data it expects to be there, but isn't. i'm kind of at a loss as to what could be causing this b/c it just fails at random times on an operation that has been successful before. when running w/ the -X option, it's also random as to where in the logs the stream cuts out.
any thoughts on this? i'm gonna play around in the logging code tomorrow and turn up docker's logging to see if it will tell me anything useful.