Images created by Jib randomly cause 500 Internal Server Error in Red Hat Quay #1986
Hi @hbjastad. Does this happen with quay.io? Or is there a container image with Quay that we can use to reproduce?
Probably because jib creates schema 2 images? Has quay rolled out schema 2 support fully yet?
I just tried pushing to quay.io, and the push is rejected:
@jonjohnsonjr Quay still seems to default to Schema 1, as @briandealwis and I observed today. People have been individually requesting whitelisting their namespaces, even until recently.
I'm surprised it's causing a 500 instead of the 415 like you're seeing.
@hbjastad An internal server error usually indicates a bug in the server implementation. Jib has revealed implementation bugs in other registries (#534). Perhaps Red Hat support could reach out to cloud-tools-for-java-build-team-external@googlegroups.com with a way to test against a hosted Quay, as the instructions are somewhat involved.
Yes, I agree it's surprising to get a 500, and yes, that usually indicates a bug in the server implementation. So while I firmly put the blame on Red Hat in this case, they have not been able to find the problem. And since the problem has only occurred with images generated by Jib, I was hoping that they could get some assistance... Unfortunately, quay.io has not been upgraded to version 3 (which is required for schema 2 support), so that instance cannot be used for reproducing the problem.
Is there any useful information in the Red Hat Case #: 02456747? We can't access it (nor do we know where to find it). If they hit a 500, they must have logged something in their server logs and should be able to relate that to where in their code the error is coming from. If there is a Quay repository against which we can directly test and reproduce this, we could also try various things. At the least, you can provide us detailed Jib logs so that we know at which stage this error happens; follow these instructions to capture low-level HTTP traffic. Also pass … And what's the frequency of this happening? It's weird, because what Jib generates and pushes will always be identical.
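For reference, a minimal sketch of how that kind of detailed, serialized log is typically captured with the Maven plugin (the `logging.properties` contents and its location below follow the common java.util.logging recipe for the Google HTTP client and are an assumption, not quoted from the linked instructions):

```sh
# Assumed java.util.logging config: turns on request/response tracing in the
# Google HTTP client that Jib uses for registry calls.
cat > /tmp/logging.properties <<'EOF'
handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = ALL
com.google.api.client.http.level = CONFIG
EOF

# Run the build with the logging config and with requests serialized,
# so the HTTP exchanges appear in order in the output.
mvn -X compile jib:build \
    -Djava.util.logging.config.file=/tmp/logging.properties \
    -Djib.serialize=true
```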
The only interesting thing in the RH case (which is internal and only available to RH) is that … Here is the last part of the log output, with the suggested options enabled:
Sorry for the lack of formatting, I'm on an old version of IE that is a struggle to use.
If you run with …
@hbjastad sorry, it looks like the requests were not serialized; the instructions say to use … But I can already infer from the above log that it is one of these:
These are the final HTTP requests of the last push stage, which completes an image push by uploading the image manifests (for each tag). Now, I don't immediately find anything particularly wrong with the manifest being uploaded (note that you upload the exact same manifest JSON for the two tags):
{
"schemaVersion": 2,
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
"config": {
"mediaType": "application/vnd.docker.container.image.v1+json",
"digest": "sha256:716c75ad66f56d3d4d9bc8c60c9e98e672e86604bf74b41002ef0b748cc10363",
"size": 6394
},
"layers": [
{
"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
"digest": "sha256:dc4e07da33bca48edefbcba8dadefa8e7ffc6fe3e8ee4db140600a62862a16ac",
"size": 75838348
},
{
"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
"digest": "sha256:5c2598df456066d90ae3ed81592c54247800071c7ee8845985a93db7e95e936f",
"size": 1318
},
{
"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
"digest": "sha256:50408d34b7db1a7cac449215c8bf82b020a4e61bd542f66a8766b4804f3882fe",
"size": 107047961
},
{
"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
"digest": "sha256:4ba3169fe2368e12f3c389fcbb4a50b5de415158498523d4de8fc9a26b222c86",
"size": 73145
},
{
"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
"digest": "sha256:c0c6d1b7edbf7547f7097c8f5526df037f79ac8ec4d19bc1837111002f691a13",
"size": 201442152
},
{
"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
"digest": "sha256:39695cdcbd9269bf29fd1c1cfc51059b8015f157efeaabfc461b82f069663a4e",
"size": 68161562
},
{
"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
"digest": "sha256:68c6c0f7af33cdc5751c57a2395c557471fe8228946456b91ce8e46bcb66a113",
"size": 2057357
},
{
"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
"digest": "sha256:1fc1bfa6e7965b4df5526e5b61e308223301cf1d336994c6bb3dcecc3303c4ac",
"size": 23282
},
{
"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
"digest": "sha256:046e7a6c5df301fd2c1a1615b4712bd733ea3d7a38813c09d550aa48d7c9137a",
"size": 124861
},
{
"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
"digest": "sha256:539a556737849eab550afdd62d11bbe72faa16642c0f2419d1ed25ec36e04243",
"size": 1483
}
]
}
I wonder if it is the SNAPSHOT version in the tag.
Yes, I don't think it's the SNAPSHOT version, because we have the same problem with non-SNAPSHOT versions as well. I have now upgraded to 1.6.1 (we're trying to keep up with new releases). Funnily, on the first run I got another error: … But also, now with 1.6.1, I made 50 runs with the helloworld project without a single error. So I will do some more testing and get back to you.
Was that with …?
After working a full day with 1.6.1, we have not seen the 500 error occur a single time. Let's give it some more time before reaching conclusions, but if 1.6.1 has not removed the error, it is at least happening far less frequently.
We also observe this random 500 Internal Server Error with Quay 3.0.4. Version 1.6.1 still fails, similarly to 1.4 or 1.5. The error rate remains the same: 25% (over 100 sequential runs), and the errors are evenly distributed. Usage of …
Hi @sergue1, it's good to hear someone else is also experiencing this problem. Today we are also experiencing the problem again with 1.6.1, after removing -Djib.serialize. Can we please get it exposed as a proper configuration option?
I believe you can use something like https://stackoverflow.com/questions/3231797/specify-system-property-to-maven-project/13159023#13159023 to set the system property. We don't really have plans to expose it as a direct config option. As for the root cause of the issue, one possible theory is that too many simultaneous connections to a local Quay server are causing issues. It's just a guess; it would be interesting if someone could confirm this.
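For example, a sketch of the two usual ways to pass the property (the `.mvn/maven.config` approach assumes Maven 3.3.1 or newer):

```sh
# One-off: pass the system property on the command line.
mvn compile jib:build -Djib.serialize=true

# Per-project default: record the flag in .mvn/maven.config so every build picks it up.
mkdir -p .mvn
echo "-Djib.serialize=true" >> .mvn/maven.config
```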
Yeah, maybe the server is a bit slow to finalize all the individual layer blobs uploaded concurrently, and by the time the server is asked to accept the manifest JSON that references these blobs by digest to stitch them all together, the server fails somehow because its internal state has not stabilized. (I assume the error occurs only when uploading a manifest JSON as the final operation, as evident in #1986 (comment).) But this is pure speculation on my part as well. Just remember that using …
@sergue1, are you using an on-prem registry too? Regardless, is it possible to obtain and look into the server log and see if there's anything useful? (You might need to increase the logging level, but I think servers will usually log at least something for errors like 500.) You can do the same analysis (without …).
Also, …
Quay developer here; I just wanted to take a moment to chime in.
Great catch! The https://quay.io service does not yet support Schema 2 images. To the best of my knowledge, some on-premise installations may have this enabled depending on their configuration. Our initial struggle was being able to reproduce this issue locally. There's a ton of really great information in here that will get us one step closer. I, personally, had never heard of Jib until this issue popped up, so it's all new ground for me. The HTTP 500 response sounds like an issue on our end. I can't promise when it will be fixed, but regardless of the payload received by Quay, it should respond with an appropriate status. I will do my best to provide updates if anyone is interested. Please don't hesitate to …
@kurtismullins I can't seem to find a quay github repo, so just posting this here as motivation for quay.io to move to support v2.2 soon: https://docs.docker.com/engine/deprecated/#pushing-and-pulling-with-image-manifest-v2-schema-1
@kurtismullins, it's great to see you picking up this thread; now I can sit back and relax until the problem is solved :-) I believe you have all the information you need in this thread, and you can use the helloworld example from this repo to reproduce it. In the meantime we will use …
Please also have a look at #2013, which is not the same as this issue but very much related.
Update: I've been able to create the "Hello World" example using jib and push it to my local (development) Quay environment. Thanks to all for the reproduction steps! Here are some initial observations:
It's still pretty premature, but I believe that exception is not caught and handled correctly (hence the 5xx), and I'm not sure which chunk (if any?) is malformed. On my local system and example, the request which generates this error is:
Moving forward, I'm digging through the raw requests using Wireshark to see if I can find a malformed chunk. My goal is to produce a testable scenario that triggers this exception so that we can at least get it handled properly on Quay's end. If I find a malformed request being generated by jib and am able to reproduce it, I'll share that information as well. For reference purposes, here's essentially my current test-case:
Note: After encountering the HTTP 500 error initially for a given image, a subsequent retry succeeds. Hence the purpose of my changes to …
I am far from a Java, Maven, or jib expert, so please don't hesitate to share suggestions if you have them. I had hoped I could use …
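For comparison, a minimal sketch of such a repeated-push loop (the registry address, repository name, and run count below are made-up placeholders, not the actual test setup):

```sh
# Build and push the hello-world sample with Jib over and over, recording which
# runs fail; roughly 25% of runs were reported to fail with a 500 in this thread.
REGISTRY=quay.example.local   # placeholder for a local Quay instance
for i in $(seq 1 50); do
  if mvn -q compile jib:build -Dimage="${REGISTRY}/test/hello-jib:run-${i}"; then
    echo "run ${i}: OK"
  else
    echo "run ${i}: FAILED"
  fi
done
```

(If the local registry is plain HTTP, Jib's `allowInsecureRegistries` option would also need to be enabled; that detail is omitted here.)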
@kurtismullins thanks for taking a look.
Default base layer cache locations: …
Default application layer cache: …
The proxy should work. For example, if I do …
and do …
And what do you mean exactly by a "chunk"? If you are talking about the "Docker chunked upload", Jib doesn't do that; it always does "monolithic" uploads.
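A sketch of what that proxy setup typically looks like (the port is arbitrary; Jib honors the standard Java proxy system properties, and clearing `http.nonProxyHosts` is an assumption needed only when the proxy and registry both sit on localhost):

```sh
# Start mitmproxy on an arbitrary local port.
mitmproxy --listen-port 8888

# In another shell, route the build's registry traffic through the proxy.
mvn compile jib:build \
    -Dhttps.proxyHost=localhost -Dhttps.proxyPort=8888 \
    -Dhttp.proxyHost=localhost  -Dhttp.proxyPort=8888 \
    -Dhttp.nonProxyHosts=       # empty, so that localhost traffic is proxied too
```

(For HTTPS registries, mitmproxy's CA certificate may also need to be added to the JVM truststore; that step is omitted here.)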
@kurtismullins if your mitmproxy is on the localhost, I think you'll also need to specify …
Thank you! That seemed to work during the …
Great question. I actually don't have more context on that Exception at the moment. My assumption was Chunked Encoding, but you just mentioned that it's always "monolithic", so maybe that's not the case. I'm going to continue digging in. Edit: …
@kurtismullins thanks for the update. To clarify, Jib always does monolithic uploads instead of chunked uploads (which consist of multiple HTTP PATCH requests); "chunked" in this context refers to the "monolithic" vs. "chunked" distinction in the Docker API spec for pushing an image. As for the chunked HTTP transfer encoding, Jib doesn't explicitly craft chunked bodies. If the message is being chunked, it may be that the HTTP client library we use automatically does the chunking for some reason.
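For illustration, a rough sketch of the two upload styles from the registry API's point of view (endpoints as in the Docker Registry HTTP API V2 spec; the host, repository, digest, and files are placeholders, and the upload-URL handling is simplified):

```sh
REG=registry.example.com; REPO=test/hello; DIGEST=sha256:...   # placeholders

# Start an upload session; the Location response header is the upload URL.
UPLOAD_URL=$(curl -s -o /dev/null -D - -X POST "https://${REG}/v2/${REPO}/blobs/uploads/" \
             | awk 'tolower($1)=="location:" {print $2}' | tr -d '\r')

# Monolithic upload (what Jib does): a single PUT with the whole blob and its digest.
curl -X PUT --data-binary @layer.tar.gz "${UPLOAD_URL}&digest=${DIGEST}"

# Chunked upload (what Jib does NOT do): one or more PATCH requests carrying
# Content-Range headers, followed by a final PUT with the digest.
curl -X PATCH -H "Content-Range: 0-1048575" --data-binary @chunk-0 "${UPLOAD_URL}"
curl -X PUT "${UPLOAD_URL}&digest=${DIGEST}"
```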
Sorry, I'll take that back. I've monitored HTTP headers using …
@kurtismullins I think our HTTP library (Google HTTP Client for Java) is working fine and the chunked body is well-formed. The library is being used extensively by the public too. Do you see a specific body that is missing a newline termination?
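For reference, a well-formed chunked HTTP body terminates every chunk with CRLF and ends with a zero-length chunk; a tiny illustrative example (nothing here is taken from the actual traffic):

```sh
# Chunked encoding of the 11-byte payload "hello world":
# "b" is the chunk size in hex, each part ends with CRLF, and "0" plus a
# blank line terminates the body.
printf 'b\r\nhello world\r\n0\r\n\r\n'
```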
Thank you for checking @chanseokoh! From a quick glance, it looks good to me. I would also assume that Google HTTP Client for Java would work well and is pretty heavily tested. It does seem to work most of the time with Quay, as well. Quay also relies on a mainstream (Python) HTTP library/framework. I also imagine it's been pretty well tested by the public. I'm going to try to put together a minimal example to see if I can reproduce the issue while eliminating jib. I don't believe that Quay is doing anything that would affect this "low-level" portion of the stack but I could be wrong. I am curious why configuring jib to run in serial would eliminate the issue. It would make it a lot easier to isolate the problem!
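One way to attempt such a jib-free reproduction, as a sketch (placeholder host, repository, and token; it assumes the referenced blobs already exist in the repository and simply replays a schema 2 manifest PUT like the one shown earlier):

```sh
REG=quay.example.local; REPO=test/hello-jib; TOKEN=...   # placeholders

# PUT the captured manifest JSON (saved as manifest.json) repeatedly under
# different tags and watch for intermittent 500s, with Jib out of the picture.
for i in $(seq 1 50); do
  curl -s -o /dev/null -w "run ${i}: HTTP %{http_code}\n" \
       -X PUT "https://${REG}/v2/${REPO}/manifests/run-${i}" \
       -H "Authorization: Bearer ${TOKEN}" \
       -H "Content-Type: application/vnd.docker.distribution.manifest.v2+json" \
       --data-binary @manifest.json
done
```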
No idea here either. But in case you missed it, our theory is that too many simultaneous connections somehow cause the issue. Relatedly, you can increase the number of concurrent uploads with …
I just wanted to chime back in. I have set up a simple environment using Docker to run three services. My intention is to replicate a sample user environment. These containers are running on a public cloud instance, and I am running …
Containers:
Unfortunately, I am unable to replicate the issue under these circumstances. I can tell there is a significant difference in performance when making these requests in parallel, quickly. I am not sure why I was running into the chunk encoding issue previously; it could be a number of factors in my environment, so I don't want to point a firm finger at that issue. My development environment is significantly different from any user-facing environment. If someone else can help me set up a test scenario to reliably reproduce this issue, that would help tremendously. My ideal goal would be to have verbose logs on both Quay's and Jib's end. On Quay, I am providing the environment variable …
Seeing as the performance is significantly different in this scenario, I am also curious whether your theory about the connections is related, @chanseokoh. As soon as I have time, I'll try to run Jib on the same instance that's running Quay to see if I can reproduce with the lower latency. I did quickly test pushing a large image with "high concurrency" using Docker to the same Quay instance. It appears that Docker v18.06.3 actually retries pushing a layer upon failure. Modifying the file …
@hbjastad Would you be able to attempt a similar test, running a highly parallelized Docker push to your Quay environment? That might shed some light on whether this is related to Jib or not, as suggested by @chanseokoh.
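A sketch of what such a test could look like (`max-concurrent-uploads` is a standard Docker daemon option, default 5; the value, registry, and image names below are arbitrary):

```sh
# Raise the daemon's upload parallelism by adding this key to
# /etc/docker/daemon.json, then restart the daemon:
#   { "max-concurrent-uploads": 20 }
sudo systemctl restart docker

# Push a multi-layer image repeatedly and watch for intermittent 5xx errors.
IMG=quay.example.local/test/large-image   # placeholder registry/repository
for i in $(seq 1 20); do
  docker tag large-image:latest "${IMG}:run-${i}"
  docker push "${IMG}:run-${i}" || echo "run ${i} failed"
done
```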
Using … I got this error once, which is strange:
@kurtismullins, any idea why this would happen randomly? The 500 error did not show up at all. I updated the version to 1a and tried another 10 runs, with the exact same result. I then tried another 10 runs; this time I got the 401 error twice, and still no 500. Is it possible that v3.1.0 of Quay has fixed the 500 error and instead introduced the 401 error?
@hbjastad That's interesting. I'm glad the 500s have cleared up for you! I am not sure about the 401. There were some auth-related changes between those two versions, so it is possible, although our test coverage is pretty extensive, so I wouldn't expect to see any authentication errors unless Jib is first making an unauthenticated request. It may be helpful to add the … Until we can pinpoint that Jib, specifically, has an issue, I'd like to invite anyone interested over to discuss this in our Google Group. I have created a thread related to this GitHub issue. My hope is that this will get more visibility from other members of the Quay team/community and avoid producing more notifications than necessary for the folks behind Jib. In the meantime, I'll try to reproduce the HTTP 401 response in Jib to see if I can track down what's going on there.
Re: the random 401 errors (#1986 (comment)), I've analyzed the log and shown here that it is most likely Quay that sometimes misbehaves.
There has been a development in the thread re: the 401 errors, and it looks like the cause has been identified. The 401 situation is a bit similar to #1914. In #1914, the …
Still in a deadlock: https://groups.google.com/forum/#!topic/quay-sig/2y7tMP0h0g0
In #2106 (comment), I've identified that it is …
So in #2106, escaping some characters confuses the OpenShift registry. However, this issue #1986 is the opposite. The query string value is being unescaped:
Worse, I realized this decoding happens on a different execution path than the path that encodes query strings.
#2201 only resolves the random 401 errors. But it was reported at some point that 500 errors are no longer returned (maybe Quay fixed it?). I'm closing this now, but if you still need assistance in having the Quay folks fix internal server errors, feel free to re-open.
Environment:
Description of the issue:
Images created by Jib randomly cause a 500 Internal Server Error in Red Hat Quay. This has been reproduced with the HelloWorld example from Jib, but it does not happen with images created in any way other than with Jib.
Expected behavior:
Images created by Jib should consistently be accepted by Quay.
Steps to reproduce:
Change the target image from gcr.io/REPLACE-WITH-YOUR-GCP-PROJECT/image-built-with-jib to QUAY-REGISTRY/image-built-with-jib, then run mvn compile jib:build.
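For example (a sketch; `-Dimage` is the standard way to override the target image for jib:build without editing the pom):

```sh
mvn compile jib:build -Dimage=QUAY-REGISTRY/image-built-with-jib
```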
jib-maven-plugin Configuration:
Log output:
Additional Information:
Red Hat's response is that this only happens with Jib-created images and therefore points the finger at Jib.
Is it possible to get Google and Red Hat to cooperate in finding a solution to this?
Red Hat Case #: 02456747