Operation timed out (Read failed) during push to us.gcr.io #2014
Hi @praveen4463. This looks like a duplicate of #691.
Hi @briandealwis, do you really think it's a duplicate of #691? That issue relates to expired credentials and an 'unauthorized' exception, while I am getting 'operation timed out'. Also, I am using the default distroless OpenJDK image, not a large image like the one the other issue originally describes, so nothing should be taking long enough to cause expired credentials (and I am using 'gcloud' as the credential helper). How is it a duplicate? I am curious. I don't understand the cause of this issue, and if it keeps happening, how will I create images using Jib? Can you please explain what could have made my build fail when everything looks normal?
I apologize: I leapt to a conclusion because you posted that the docker push took 13 minutes, and the common failure case there is credential expiry. We know that Jib needs to be more robust in the face of network failures such as the timeouts you're reporting. We have an issue open to retry pushes in the face of timeouts (#1409). I'll add a note there that we should also investigate using chunked uploads.
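For reference, the chunked upload flow mentioned above is part of the Docker Registry HTTP API v2: POST opens an upload session, each PATCH sends one chunk, and a final PUT commits the blob by digest, so a timed-out chunk can be retried without re-sending the whole layer. A rough sketch follows; the registry and repository names, chunk size, and placeholder digest are illustrative, a real push also needs an Authorization header, and this is not Jib's actual implementation:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Arrays;

public class ChunkedUploadSketch {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    URI base = URI.create("https://us.gcr.io/v2/my-project/my-image/blobs/uploads/");
    byte[] blob = new byte[10_000]; // placeholder layer bytes
    String digest = "sha256:0000000000000000000000000000000000000000000000000000000000000000"; // placeholder

    // 1. Open an upload session; the registry answers 202 with a Location header.
    HttpResponse<Void> start = client.send(
        HttpRequest.newBuilder(base).POST(HttpRequest.BodyPublishers.noBody()).build(),
        HttpResponse.BodyHandlers.discarding());
    URI location = start.uri().resolve(start.headers().firstValue("Location").orElseThrow());

    // 2. PATCH the blob one chunk at a time. Each 202 response carries a
    //    fresh Location, so a failed chunk can be retried from that point
    //    instead of restarting the whole upload.
    int chunkSize = 4096;
    for (int off = 0; off < blob.length; off += chunkSize) {
      int end = Math.min(off + chunkSize, blob.length);
      byte[] chunk = Arrays.copyOfRange(blob, off, end);
      HttpResponse<Void> patch = client.send(
          HttpRequest.newBuilder(location)
              .header("Content-Type", "application/octet-stream")
              .header("Content-Range", off + "-" + (end - 1))
              .method("PATCH", HttpRequest.BodyPublishers.ofByteArray(chunk))
              .build(),
          HttpResponse.BodyHandlers.discarding());
      location = patch.uri().resolve(patch.headers().firstValue("Location").orElseThrow());
    }

    // 3. Commit the upload by digest; the registry replies 201 Created on success.
    String sep = location.getQuery() == null ? "?" : "&";
    HttpResponse<Void> commit = client.send(
        HttpRequest.newBuilder(URI.create(location + sep + "digest=" + digest))
            .PUT(HttpRequest.BodyPublishers.noBody())
            .build(),
        HttpResponse.BodyHandlers.discarding());
    System.out.println("commit status: " + commit.statusCode());
  }
}
```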
Just wanted to answer your question below, just in case you didn't realize:
When a Java application makes network connections and/or reads or writes data, it may set timeout values for such network operations at the JVM level. If such an operation does not progress within the set timeout, the JVM network stack cuts it off. In your case, a simple HTTP GET connection to GCR (not even uploading anything) timed out at the socket level after Jib was unable to read any data from the GCR server in a "timely" fashion, hence the failure in your log.
From your accounts and logs, it looks like your network is universally slow, not that Jib's network operations are unexpectedly slow. You can increase the Java network timeout values for Jib to whatever you want. That said, it is a bit surprising that a simple HTTP GET actually times out in your environment. If I were you, I would avoid using GCR, given the extreme degree of your network unreliability with respect to GCR.
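For concreteness, here is a minimal sketch of the JVM-level timeouts described above, using java.net.http. The URL is illustrative; with Jib specifically, the documented way to raise these is the jib.httpTimeout system property (in milliseconds), if I recall the docs correctly:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class TimeoutDemo {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(20)) // max time to establish the connection
        .build();
    HttpRequest request = HttpRequest.newBuilder(URI.create("https://us.gcr.io/v2/"))
        .timeout(Duration.ofSeconds(60)) // max time to wait for the response
        .GET()
        .build();
    // If the server sends nothing before the timeout elapses, the JVM
    // aborts the operation with java.net.http.HttpTimeoutException,
    // the same class of failure as the SocketTimeoutException in the log.
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode());
  }
}
```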
I think @chanseokoh specifically means network unreliability.
@chanseokoh Thanks for looking into it. I agree with some of your observations regarding the read-operation failure, but it's worth noting from the logs that Jib was able to push all layers except one, 40230baf97c5b3d6f026fc9f734a12b9b485c7337d15c5c35e8a3e3ba7eab22f, where an I/O error occurred, and because of that it termed the whole operation a failure. Let me explain in detail: there were 4 layers, and Jib tried to push them one by one to GCR. It first reads the registry to see if a layer exists; if it exists and is the same, it skips the push, otherwise it pushes. Here are those layers as per my log:

- 8b37076acdbeb0264388f03aefe4a023d47fc59c426df22c6a4268ae07b5d280 - already exists on the registry (pushed by the last failed build), size 39479382
- 40230baf97c5b3d6f026fc9f734a12b9b485c7337d15c5c35e8a3e3ba7eab22f - I/O error, Operation timed out after 43745.0 ms (over 43 seconds), size 4250
- c6c66cab87735c35fa5845823954ed52b01d1df7c7a064a655ae98eb1f2a11c6 - pushed successfully during the build, size 45543
- dff9b703c0343d28d384747d74159f1a57cd947bd0abebc09d138ab6f448602f - already exists on the registry (pushed by the last failed build), size 1736

So the layer starting with '402' was specifically the one that failed. Let's now work out why that layer failed. Please read the exception from the bottom up to see what was on the stack before the I/O exception occurred. I can read that Jib invoked GCR to check the layer and got a response, but probably that response was not what Jib expected, leading to a failure to read it within the allotted time.
So it seems the issue is not with the network but with the code that reads the response from the registry: probably some encoding issue, a wrong calculation of the input stream size, or something else causing the read failure after a response is received. There really doesn't seem to be anything wrong with the network to me. Also, it looks like there is no retry mechanism in Jib after such a failure, where Jib would try to read from the registry again in the hope of getting the expected response, just as Docker did. If you read my comments regarding docker push, you will note that I said Docker got stuck on a particular layer for several minutes, failed on it, and reattempted before it was able to push it. So my observation is that it is not really my network's fault, and there is no connectivity issue between me and GCR (I do a lot of work on GCR without any issue, for example starting tens of VMs together, SQL operations, pushing big files to buckets, etc.). It could be something within the 'layer read operation' in Jib that might need to be fixed. Regarding increasing the timeout, I had already raised it to 10 minutes that day, but the build failed at around the same point as when it was 1 minute, which tells me it's something within Jib's 'response read' operation and not related to the network. Please let me know what you think. Can this be reopened and looked into?
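For illustration, the retry mechanism tracked in #1409 that this comment asks about might look roughly like the following minimal sketch. The helper name, exception choice, and backoff numbers are hypothetical, not Jib's actual code:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public final class Retry {
  /** Runs the action, retrying up to maxAttempts times on IOException. */
  public static <T> T withBackoff(Callable<T> action, int maxAttempts) throws Exception {
    long delayMillis = 1_000;
    for (int attempt = 1; ; attempt++) {
      try {
        return action.call();
      } catch (IOException e) { // e.g. SocketTimeoutException: Operation timed out
        if (attempt >= maxAttempts) {
          throw e; // give up after the last attempt
        }
        Thread.sleep(delayMillis);
        delayMillis *= 2; // double the wait between attempts
      }
    }
  }
}
```

A layer push could then be wrapped as `Retry.withBackoff(() -> pushLayer(blobDigest), 3)`, where `pushLayer` stands in for whatever method performs the actual upload.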
@praveen4463 at this point, the evidence points to network issues.
If you can provide a reproducible test case, we can definitely look at it more. It might be better to try using Wireshark or some other networking tool to investigate the underlying network behavior.
@briandealwis I hear you. I'll try to investigate that way and report back if something is worth sharing.
I'll try my best to convince you that it was a network read timeout while attempting to read an HTTP response from the server, so please bear with me if what I'm going to explain sounds too pedantic.
It is unclear to me from which part of the log above, or by what kind of reasoning, you came to the conclusion that Jib got some kind of "response" (be it a complete HTTP response with all valid headers or just a partial segment of it) from the HTTP GET request Jib initiated to check a blob. Rather, what is evident from the log is that there was a read timeout (socket timeout) while the Apache HTTP client library was waiting to read data from the socket.
The blob-check HTTP GET is only for checking whether a layer exists on the registry. The server does not return any content; it just returns an HTTP status code. There isn't a layer or anything else to read; Jib just checks the return code. But don't get me wrong: I don't rule out that something really weird is going on on your side, if this is more or less consistently reproducible (e.g., you always time out with that particular HTTP GET).
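To illustrate, a blob-existence check against the Docker Registry HTTP API v2 looks roughly like this minimal sketch; the repository name is made up, registry authentication is omitted, and this is not Jib's actual code:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class BlobCheck {
  public static void main(String[] args) throws Exception {
    String digest = "sha256:40230baf97c5b3d6f026fc9f734a12b9b485c7337d15c5c35e8a3e3ba7eab22f";
    URI uri = URI.create("https://us.gcr.io/v2/my-project/my-image/blobs/" + digest);
    HttpRequest request = HttpRequest.newBuilder(uri)
        .timeout(Duration.ofSeconds(20)) // the read timeout discussed in this thread
        .GET()
        .build();
    int code = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.discarding())
        .statusCode();
    // 200 -> the layer already exists and the push can be skipped;
    // 404 -> the layer must be uploaded. Only the status code is
    // consulted here (a HEAD request would avoid transferring a body).
    System.out.println(digest + " -> HTTP " + code);
  }
}
```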
But given your accounts, I think the chances of that are slimmer than a common read timeout, so I'm leaning heavily toward a network issue. But as @briandealwis said, if this is consistently reproducible, there are ways we can dig into the low-level HTTP protocol to better understand what is really happening.
Your observation is right. This is a known issue (#1409, mostly), as @briandealwis mentioned. I can think of a few workarounds for now, just in case you are blocked; the first is simply to invoke Jib in a loop, retrying on failure.
As I said, often you also need to lift the timeouts imposed by the kernel. On my machine, for example, I found the connection timeout is set to 30 seconds, so setting a timeout longer than 30 seconds at the JVM level is meaningless.
Sorry, I think it's clear now that it failed before a chunk was read into the buffer. I saw readLine/fillBuffer in the stack trace and concluded that a response had been received, but it seems those were just the method calls on the stack during the socket read operation when the exception occurred.
Oh, I got you; I didn't know it checks just the return code. I wasn't specifically expecting it to read a returned layer either; I thought it matched the hash returned by the server to see whether the layer is stale or current. Thanks for those insights. I'll try some more builds and let you guys know if anything is reproducible, should I hit a similar issue. Thanks for the workarounds as well; I'll implement the first one and invoke Jib in a loop.
Yeah, I remember you said that in your last post; that's a good piece of information. I'll bump that too. Thanks again!
Environment:
Description of the issue:
I tried building an image for the first time for a new project. It failed three times in a row with `Operation timed out (Read failed)`. I increased the timeout to 60 seconds, but it failed again.
I then used `dockerBuild`, which built to the Docker daemon, and tried pushing with `docker push`. It took a very long time (~13 minutes) and retried after a failure in the middle, but it was eventually able to push. The image size shown by Docker is 169MB; GCR shows a virtual size of 86.5MB.

After Docker successfully pushed to GCR, I executed `jib:build` again with the default timeout, and surprisingly it succeeded and pushed in no time (32 seconds, as there was no change to the project), whereas it had been timing out just before Docker pushed the image.

Expected behavior:
I believe Jib should've pushed it without any exception.
Steps to reproduce:
Run `mvn -X -Djib.serialize=true -Djib.console=plain compile jib:build` with the given configuration.

jib-maven-plugin Configuration:

Log output: