
Images created by Jib randomly cause 500 Internal Server Error in Red Hat Quay #1986

Closed
hbjastad opened this issue Sep 13, 2019 · 42 comments · Fixed by #2201

@hbjastad

hbjastad commented Sep 13, 2019

Environment:

  • Jib version: 1.5.1
  • Build tool: Maven
  • OS: Linux and Windows

Description of the issue:
Images created by Jib randomly cause 500 Internal Server Error in Red Hat Quay.
This has been reproduced with the HelloWorld example from Jib, but it does not happen
with images created in any other way.

Expected behavior:
Images created by Jib should consistently be accepted by Quay.

Steps to reproduce:

  1. Download https://github.com/GoogleContainerTools/jib/tree/master/examples/helloworld
  2. Change
    gcr.io/REPLACE-WITH-YOUR-GCP-PROJECT/image-built-with-jib
    to
    QUAY-REGISTRY/image-built-with-jib
  3. Run mvn compile jib:build

jib-maven-plugin Configuration:

<configuration>
    <to>
        <image>QUAY-REGISTRY/image-built-with-jib</image>
    </to>
</configuration>

Log output:

> [2019-09-13T07:50:47.986Z] [ERROR] Failed to execute goal com.google.cloud.tools:jib-maven-plugin:1.5.1:build (build_docker_image) on project xx: 500 Internal Server Error
> [2019-09-13T07:50:47.986Z] [ERROR] <html>
> [2019-09-13T07:50:47.986Z] [ERROR]   <head>
> [2019-09-13T07:50:47.986Z] [ERROR]     <title>Internal Server Error</title>
> [2019-09-13T07:50:47.986Z] [ERROR]   </head>
> [2019-09-13T07:50:47.986Z] [ERROR]   <body>
> [2019-09-13T07:50:47.986Z] [ERROR]     <h1>Internal Server Error</h1>
> [2019-09-13T07:50:47.986Z] [ERROR]
> [2019-09-13T07:50:47.986Z] [ERROR]   </body>
> [2019-09-13T07:50:47.986Z] [ERROR] </html>

Additional Information:
Red Hat's response is that this only happens with Jib-created images and therefore points the finger at Jib.

Is it possible to get Google and Red Hat to cooperate in finding a solution to this?
Red Hat Case #: 02456747

@briandealwis
Member

Hi @hbjastad. Does this happen with quay.io? Or is there a container image with Quay that we can use to reproduce?

@jonjohnsonjr

Probably because jib creates schema 2 images? Has quay rolled out schema 2 support fully yet?

@briandealwis
Member

I just tried pushing to quay.io, and the push is rejected:

[ERROR] Failed to execute goal com.google.cloud.tools:jib-maven-plugin:1.5.1:build (default-cli) on project first: Tried to push image manifest for quay.io/briandealwis/jibtest:latest but failed because: Registry may not support pushing OCI Manifest or Docker Image Manifest Version 2, Schema 2 | If this is a bug, please file an issue at https://github.com/GoogleContainerTools/jib/issues/new: 415 UNSUPPORTED MEDIA TYPE
[ERROR] {"errors":[{"code":"MANIFEST_INVALID","detail":{"message":"manifest schema version not supported"},"message":"manifest invalid"}]}

@chanseokoh
Member

chanseokoh commented Sep 13, 2019

@jonjohnsonjr Quay still seems to default to Schema 1, as @briandealwis and I observed today. People have been individually requesting that their namespaces be whitelisted, even quite recently.

@jonjohnsonjr

I'm surprised it's causing a 500 instead of the 415 like you're seeing.

@briandealwis
Member

briandealwis commented Sep 13, 2019

@hbjastad An internal server error usually indicates a bug in the server implementation. Jib has revealed implementation bugs in other registries (#534).

Perhaps RedHat support could reach out to cloud-tools-for-java-build-team-external@googlegroups.com with a way to test against a hosted Quay as the instructions are somewhat involved.

@hbjastad
Author

Yes, I agree it's surprising to get a 500, and yes, that usually indicates a bug in the server implementation. So while I firmly put the blame on Red Hat in this case, they have not been able to find the problem. And since the problem has only occurred with images generated by Jib, I was hoping that they could get some assistance...

Unfortunately, quay.io has not been upgraded to version 3 (which is required for schema 2 support), so that instance can not be used for reproducing the problem.

@chanseokoh
Member

chanseokoh commented Sep 17, 2019

Is there any useful information in Red Hat Case #02456747? We can't access it (or don't know where to find it). If they hit a 500, they must have logged something in their server logs and should be able to relate that to where in their code the error is coming from.

If there is a Quay repository against which we can directly test and reproduce this, we could also try various things. At the least, you can provide us with detailed Jib logs so that we know at which stage this error happens; follow these instructions to capture low-level HTTP traffic. Also pass -X to Maven to log debug output. Then strip out any sensitive info and upload the file here. (You can also easily inspect the log on your own to pinpoint which HTTP request endpoint URL the server returns 500 for.)
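For example, something roughly like this (a sketch, not the exact instructions; it assumes a logging.properties file in the project directory that turns on CONFIG-level logging for the Google HTTP client):

handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = CONFIG
com.google.api.client.http.level = CONFIG

mvn -X -Djava.util.logging.config.file=./logging.properties compile jib:build > jib.log 2>&1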

And what's the frequency of this happening? It's weird because what Jib generates and pushes will always be identical.

@hbjastad
Author

hbjastad commented Sep 17, 2019

The only interesting thing in the RH case (which is internal and only available to RH) is that
they say they have tried 100 runs without being able to reproduce the problem.
That makes me think it could be a problem with the registry installation we use (internally hosted),
but we need help from RH to identify why it returns 500 on us...
It fails in roughly 20% of the runs.

Here is the last part of the log output, with the suggested options enabled:

18:28:40  CONFIG: -------------- REQUEST  --------------
18:28:40  PUT https://QUAY-REGISTRY/v2/group/app/blobs/uploads/a5fbdf2f-1744-45ca-b31c-97f9da10806e?digest=sha256:716c75ad66f56d3d4d9bc8c60c9e98e672e86604bf74b41002ef0b748cc10363
18:28:40  Accept: 
18:28:40  Accept-Encoding: gzip
18:28:40  Authorization: <Not Logged>
18:28:40  User-Agent: jib 1.5.0 jib-maven-plugin Google-HTTP-Java-Client/1.30.0 (gzip)
18:28:40  
18:28:40  Sep 17, 2019 6:28:40 PM com.google.api.client.http.HttpRequest execute
18:28:40  CONFIG: curl -v --compressed -X PUT -H 'Accept: ' -H 'Accept-Encoding: gzip' -H 'Authorization: <Not Logged>' -H 'User-Agent: jib 1.5.0 jib-maven-plugin Google-HTTP-Java-Client/1.30.0 (gzip)' -- 'https://QUAY-REGISTRY/v2/group/app/blobs/uploads/a5fbdf2f-1744-45ca-b31c-97f9da10806e?digest=sha256:716c75ad66f56d3d4d9bc8c60c9e98e672e86604bf74b41002ef0b748cc10363'
18:28:40  Sep 17, 2019 6:28:40 PM com.google.api.client.http.HttpResponse <init>
18:28:40  CONFIG: -------------- RESPONSE --------------
18:28:40  HTTP/1.1 201 CREATED
18:28:40  Server: nginx/1.12.1
18:28:40  Date: Tue, 17 Sep 2019 16:28:40 GMT
18:28:40  Content-Type: text/html; charset=utf-8
18:28:40  Content-Length: 0
18:28:40  Connection: close
18:28:40  Docker-Content-Digest: sha256:716c75ad66f56d3d4d9bc8c60c9e98e672e86604bf74b41002ef0b748cc10363
18:28:40  Location: https://QUAY-REGISTRY/v2/group/app/blobs/sha256:716c75ad66f56d3d4d9bc8c60c9e98e672e86604bf74b41002ef0b748cc10363
18:28:40  X-Frame-Options: DENY
18:28:40  Strict-Transport-Security: max-age=63072000; preload
18:28:40  
18:28:40  Sep 17, 2019 6:28:40 PM com.google.api.client.http.HttpRequest execute
18:28:40  CONFIG: -------------- REQUEST  --------------
18:28:40  PUT https://QUAY-REGISTRY/v2/group/app/manifests/0.2.7-SNAPSHOT
18:28:40  Accept:
18:28:40  Accept-Encoding: gzip
18:28:40  Authorization: <Not Logged>
18:28:40  User-Agent: jib 1.5.0 jib-maven-plugin Google-HTTP-Java-Client/1.30.0 (gzip)
18:28:40  Content-Type: application/vnd.docker.distribution.manifest.v2+json
18:28:40
18:28:40  Sep 17, 2019 6:28:40 PM com.google.api.client.http.HttpRequest execute
18:28:40  CONFIG: -------------- REQUEST  --------------
18:28:40  PUT https://QUAY-REGISTRY/v2/group/app/manifests/latest
18:28:40  Accept:
18:28:40  Accept-Encoding: gzip
18:28:40  Authorization: <Not Logged>
18:28:40  User-Agent: jib 1.5.0 jib-maven-plugin Google-HTTP-Java-Client/1.30.0 (gzip)
18:28:40  Content-Type: application/vnd.docker.distribution.manifest.v2+json
18:28:40
18:28:40  Sep 17, 2019 6:28:40 PM com.google.api.client.http.HttpRequest execute
18:28:40  CONFIG: curl -v --compressed -X PUT -H 'Accept: ' -H 'Accept-Encoding: gzip' -H 'Authorization: <Not Logged>' -H 'User-Agent: jib 1.5.0 jib-maven-plugin Google-HTTP-Java-Client/1.30.0 (gzip)' -H 'Content-Type: application/vnd.docker.distribution.manifest.v2+json' -d '@-' -- 'https://QUAY-REGISTRY/v2/group/app/manifests/latest' << $$$
18:28:40  Sep 17, 2019 6:28:40 PM com.google.api.client.http.HttpRequest execute
18:28:40  CONFIG: curl -v --compressed -X PUT -H 'Accept: ' -H 'Accept-Encoding: gzip' -H 'Authorization: <Not Logged>' -H 'User-Agent: jib 1.5.0 jib-maven-plugin Google-HTTP-Java-Client/1.30.0 (gzip)' -H 'Content-Type: application/vnd.docker.distribution.manifest.v2+json' -d '@-' -- 'https://QUAY-REGISTRY/v2/group/app/manifests/0.2.7-SNAPSHOT' << $$$
18:28:40  Sep 17, 2019 6:28:40 PM com.google.api.client.util.LoggingByteArrayOutputStream close
18:28:40  CONFIG: Total: 1,899 bytes
18:28:40  Sep 17, 2019 6:28:40 PM com.google.api.client.util.LoggingByteArrayOutputStream close
18:28:40  CONFIG: Total: 1,899 bytes
18:28:40  Sep 17, 2019 6:28:40 PM com.google.api.client.util.LoggingByteArrayOutputStream close
18:28:40  CONFIG: {"schemaVersion":2,"mediaType":"application/vnd.docker.distribution.manifest.v2+json","config":{"mediaType":"application/vnd.docker.container.image.v1+json","digest":"sha256:716c75ad66f56d3d4d9bc8c60c9e98e672e86604bf74b41002ef0b748cc10363","size":6394},"layers":[{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:dc4e07da33bca48edefbcba8dadefa8e7ffc6fe3e8ee4db140600a62862a16ac","size":75838348},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:5c2598df456066d90ae3ed81592c54247800071c7ee8845985a93db7e95e936f","size":1318},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:50408d34b7db1a7cac449215c8bf82b020a4e61bd542f66a8766b4804f3882fe","size":107047961},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:4ba3169fe2368e12f3c389fcbb4a50b5de415158498523d4de8fc9a26b222c86","size":73145},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:c0c6d1b7edbf7547f7097c8f5526df037f79ac8ec4d19bc1837111002f691a13","size":201442152},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:39695cdcbd9269bf29fd1c1cfc51059b8015f157efeaabfc461b82f069663a4e","size":68161562},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:68c6c0f7af33cdc5751c57a2395c557471fe8228946456b91ce8e46bcb66a113","size":2057357},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:1fc1bfa6e7965b4df5526e5b61e308223301cf1d336994c6bb3dcecc3303c4ac","size":23282},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:046e7a6c5df301fd2c1a1615b4712bd733ea3d7a38813c09d550aa48d7c9137a","size":124861},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:539a556737849eab550afdd62d11bbe72faa16642c0f2419d1ed25ec36e04243","size":1483}]}
18:28:40  Sep 17, 2019 6:28:40 PM com.google.api.client.util.LoggingByteArrayOutputStream close
18:28:40  CONFIG: {"schemaVersion":2,"mediaType":"application/vnd.docker.distribution.manifest.v2+json","config":{"mediaType":"application/vnd.docker.container.image.v1+json","digest":"sha256:716c75ad66f56d3d4d9bc8c60c9e98e672e86604bf74b41002ef0b748cc10363","size":6394},"layers":[{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:dc4e07da33bca48edefbcba8dadefa8e7ffc6fe3e8ee4db140600a62862a16ac","size":75838348},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:5c2598df456066d90ae3ed81592c54247800071c7ee8845985a93db7e95e936f","size":1318},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:50408d34b7db1a7cac449215c8bf82b020a4e61bd542f66a8766b4804f3882fe","size":107047961},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:4ba3169fe2368e12f3c389fcbb4a50b5de415158498523d4de8fc9a26b222c86","size":73145},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:c0c6d1b7edbf7547f7097c8f5526df037f79ac8ec4d19bc1837111002f691a13","size":201442152},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:39695cdcbd9269bf29fd1c1cfc51059b8015f157efeaabfc461b82f069663a4e","size":68161562},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:68c6c0f7af33cdc5751c57a2395c557471fe8228946456b91ce8e46bcb66a113","size":2057357},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:1fc1bfa6e7965b4df5526e5b61e308223301cf1d336994c6bb3dcecc3303c4ac","size":23282},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:046e7a6c5df301fd2c1a1615b4712bd733ea3d7a38813c09d550aa48d7c9137a","size":124861},{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","digest":"sha256:539a556737849eab550afdd62d11bbe72faa16642c0f2419d1ed25ec36e04243","size":1483}]}
18:28:41  Sep 17, 2019 6:28:40 PM com.google.api.client.http.HttpResponse <init>
18:28:41  CONFIG: -------------- RESPONSE --------------
18:28:41  HTTP/1.1 500 Internal Server Error
18:28:41  Server: nginx/1.12.1
18:28:41  Date: Tue, 17 Sep 2019 16:28:40 GMT
18:28:41  Content-Type: text/html
18:28:41  Content-Length: 134
18:28:41  Connection: close
18:28:41
18:28:41  Sep 17, 2019 6:28:40 PM com.google.api.client.util.LoggingByteArrayOutputStream close
18:28:41  CONFIG: Total: 134 bytes
18:28:41  Sep 17, 2019 6:28:40 PM com.google.api.client.util.LoggingByteArrayOutputStream close
18:28:41  CONFIG: <html>
18:28:41    <head>
18:28:41      <title>Internal Server Error</title>
18:28:41    </head>
18:28:41    <body>
18:28:41      <h1>Internal Server Error</h1>
18:28:41
18:28:41    </body>
18:28:41  </html>

Sorry for the lack of formatting; I'm on an old version of IE that is a struggle to use.
I can see that it makes two manifest PUT requests, but I'm not sure we can conclude which of them fails.
I'm assuming they're made asynchronously, since both requests are logged before there is a response.

@briandealwis
Member

briandealwis commented Sep 17, 2019

If you run with -Djib.serialize=true for Jib >= 1.6.0 (or -DjibSerialize=true for Jib < 1.6.0) then the requests should be serialized. But my bet is on lack of schema v2-2 support.
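For reference, that is just a matter of adding the flag to the same build command (a sketch; nothing else about the build needs to change):

mvn -Djib.serialize=true compile jib:build     (Jib >= 1.6.0)
mvn -DjibSerialize=true compile jib:build      (Jib < 1.6.0)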

@chanseokoh
Member

chanseokoh commented Sep 17, 2019

@hbjastad sorry, looks like the requests were not serialized. The instructions say to use -Djib.serialize=true (for Jib >= 1.6.0) but it should have been -DjibSerialize=true for Jib < 1.6.0. My fault. (I strongly recommend upgrading to 1.6.1, since 1.5.x has a significant network performance issue.) So, at this point, it isn't certain which HTTP request made the server respond with 500.

But I can already infer from the above log that it is either one of these:

  • PUT https://QUAY-REGISTRY/v2/group/app/manifests/0.2.7-SNAPSHOT
  • PUT https://QUAY-REGISTRY/v2/group/app/manifests/latest

These are the final HTTP requests of the last push stage, which complete an image push by uploading image manifests (for the tags latest and 0.2.7-SNAPSHOT). These manifest pushes can start only after all other HTTP requests/responses (e.g., uploading layer blobs) have completed, so I'm fairly certain the error is happening when the server is supposed to accept schema v2-2 manifests. You will be able to verify this with -DjibSerialize=true for < 1.6.0 or -Djib.serialize=true for >= 1.6.0.

Now, I don't immediately find anything particularly wrong with the manifest being uploaded (note you upload the exact same manifest JSON for the two tags):

Manifest being uploaded:
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "digest": "sha256:716c75ad66f56d3d4d9bc8c60c9e98e672e86604bf74b41002ef0b748cc10363",
    "size": 6394
  },
  "layers": [
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "digest": "sha256:dc4e07da33bca48edefbcba8dadefa8e7ffc6fe3e8ee4db140600a62862a16ac",
      "size": 75838348
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "digest": "sha256:5c2598df456066d90ae3ed81592c54247800071c7ee8845985a93db7e95e936f",
      "size": 1318
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "digest": "sha256:50408d34b7db1a7cac449215c8bf82b020a4e61bd542f66a8766b4804f3882fe",
      "size": 107047961
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "digest": "sha256:4ba3169fe2368e12f3c389fcbb4a50b5de415158498523d4de8fc9a26b222c86",
      "size": 73145
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "digest": "sha256:c0c6d1b7edbf7547f7097c8f5526df037f79ac8ec4d19bc1837111002f691a13",
      "size": 201442152
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "digest": "sha256:39695cdcbd9269bf29fd1c1cfc51059b8015f157efeaabfc461b82f069663a4e",
      "size": 68161562
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "digest": "sha256:68c6c0f7af33cdc5751c57a2395c557471fe8228946456b91ce8e46bcb66a113",
      "size": 2057357
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "digest": "sha256:1fc1bfa6e7965b4df5526e5b61e308223301cf1d336994c6bb3dcecc3303c4ac",
      "size": 23282
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "digest": "sha256:046e7a6c5df301fd2c1a1615b4712bd733ea3d7a38813c09d550aa48d7c9137a",
      "size": 124861
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "digest": "sha256:539a556737849eab550afdd62d11bbe72faa16642c0f2419d1ed25ec36e04243",
      "size": 1483
    }
  ]
}

I wonder if it is the -SNAPSHOT string that's causing an issue, but I think that's somewhat unlikely, honestly. You already mentioned you can reproduce this issue with the helloworld example. Hopefully -Djib.serialize gives you more consistency in reproducing the problem.

@hbjastad
Author

Yes, I don't think it's the SNAPSHOT version, because we have the same problem with non-SNAPSHOT versions as well.

I have now upgraded to 1.6.1 (we're trying to keep up with new releases).

Funnily enough, on the first run I got another error:
[2019-09-18T05:42:43.863Z] [ERROR] Failed to execute goal com.google.cloud.tools:jib-maven-plugin:1.6.1:build (build_docker_image) on project app-backend: Tried to push BLOB for QUAY-REGISTRY/group/app with digest sha256:2a4b9f7f0b931a97e88abd05a50deae12969986e6aa66c9975c9305997a7e141 but failed because: other: blob upload invalid | If this is a bug, please file an issue at https://github.com/GoogleContainerTools/jib/issues/new: 400 BAD REQUEST
[2019-09-18T05:42:43.863Z] [ERROR] {"errors":[{"code":"BLOB_UPLOAD_INVALID","detail":{},"message":"blob upload invalid"}]}

But also, now with 1.6.1, I made 50 runs with the helloworld project without a single error. So I will do some more testing and get back to you.

@briandealwis
Member

Was that with jib.serialize? Does the original error re-occur if jib.serialize is removed?

@hbjastad
Author

After working a full day with 1.6.1, we have not seen the 500 error occur a single time. Let's give it some more time before reaching conclusions, but even if 1.6.1 has not removed the error, it is at least happening far less frequently.
The 400 error above has also not happened again, so I guess that was a one-off.
We have been running both with and without jib.serialize; I will remove it and see if that makes a difference.

@sergue1

sergue1 commented Sep 19, 2019

We also observe this random 500 Internal Server Error with Quay 3.0.4. Version 1.6.1 still fails similarly to 1.4 or 1.5. The error rate remains the same: 25% (100 sequential runs), and the errors are evenly distributed. Using -Djib.serialize fixes the issue (no error in 100 runs). The original error re-occurs after -Djib.serialize is removed.

@hbjastad
Author

Hi @sergue1, it's good to hear someone else is also experiencing this problem. Today we are experiencing the problem again with 1.6.1 as well, after removing -Djib.serialize.
So it seems like -Djib.serialize is a workaround for this apparent bug in Quay.

Can we please get serialize as a config parameter in the plugin configuration in the pom.xml, so that we can avoid including it on the command line everywhere?

@loosebazooka
Member

I believe you can use something like: https://stackoverflow.com/questions/3231797/specify-system-property-to-maven-project/13159023#13159023 to set the system property. We don't really have plans to expose it as a direct config option.
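For example, a sketch along the lines of that Stack Overflow answer, using the org.codehaus.mojo properties-maven-plugin (the version and phase here are just for illustration; adjust to taste):

<plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>properties-maven-plugin</artifactId>
    <version>1.0.0</version>
    <executions>
        <execution>
            <phase>initialize</phase>
            <goals>
                <goal>set-system-properties</goal>
            </goals>
            <configuration>
                <properties>
                    <property>
                        <name>jib.serialize</name>
                        <value>true</value>
                    </property>
                </properties>
            </configuration>
        </execution>
    </executions>
</plugin>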

As for the root cause of the issue, one possible theory is that too many simultaneous connections to a local Quay server are causing issues. It's just a guess; it would be interesting if someone could confirm this.

@chanseokoh
Member

chanseokoh commented Sep 19, 2019

As for the root cause of the issue, one possible theory is too many simultaneous connections to a local quay server are causing issues.

Yeah, maybe the server is a bit slow to finalize all the individual layer blobs uploaded concurrently, and by the time it is asked to accept the manifest JSON that references these blobs by digest to stitch them all together, it fails somehow because its internal state has not stabilized. (I assume the error occurs only when uploading a manifest JSON as the final operation, as evident in #1986 (comment).) But this is pure speculation on my part as well.

Just remember that using jib.serialize will make the build super slow. No concurrency at all. Basically, no one should be using it in general (there's no reason to).

@sergue1 are you using an on-prem registry too? Regardless, is it possible to obtain and look into the server log to see if there's anything useful? (You might need to increase the logging level, but I think servers will usually log at least something for errors like 500.) You can do the same analysis as in #1986 (comment) (without jib.serialize, of course) to roughly narrow down which endpoint URLs this error happens for. Then I suppose you could work with RH support, presenting what you've got so far, and press them for help. The same applies to you, @hbjastad.

@chanseokoh
Member

chanseokoh commented Sep 19, 2019

Also, docker has --max-concurrent-uploads (an int, set to 5 by default), so you can also try increasing that number and pushing an image with many layers (preferably of similar sizes) to see if the error is reproducible with docker push too. But note that Docker retries pushes if there are failures, so this may be harder to notice with docker push.
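If you want to try that, the setting goes in the Docker daemon configuration at /etc/docker/daemon.json (a sketch with an arbitrary value; restart the daemon afterwards for it to take effect):

{
  "max-concurrent-uploads": 20
}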

@kurtismullins

Quay developer here; I just wanted to take a moment to chime in.

@jonjohnsonjr

Probably because jib creates schema 2 images? Has quay rolled out schema 2 support fully yet?

Great catch! The https://quay.io service does not (yet) support Schema 2 images. To the best of my knowledge, some on-premise installations may have this enabled depending on their configuration.

Our initial struggle was being able to reproduce this issue locally. There's a ton of really great information in here that will get us one step closer. I, personally, had never heard of Jib until this issue popped up, so it's all new ground for me.

The HTTP 500 response sounds like an issue on our end. I can't promise when it will be fixed, but regardless of the payload received by Quay, it should respond with an appropriate status. I will do my best to provide updates if anyone is interested.

Please don't hesitate to @ me if anyone has questions I might be able to answer.

@loosebazooka
Member

@kurtismullins I can't seem to find a Quay GitHub repo, so I'm just posting this here as motivation for quay.io to move to support v2.2 soon: https://docs.docker.com/engine/deprecated/#pushing-and-pulling-with-image-manifest-v2-schema-1

@hbjastad
Author

@kurtismullins, it's great to see you picking up this thread, now I can sit back and relax until the problem is solved :-) I believe you have all the information you need in this thread, and you can use the helloworld example from this repo to reproduce it.

In the meantime we will use jib.serialize so that we can get back to focusing on our real job of delivering functionality to our users.

@hbjastad
Author

Please have a look at #2013 also, which is not the same as this issue, but very much related.

@kurtismullins

Update:

I've been able to create the "Hello World" example using jib and push it to my local (development) Quay environment. Thanks to all for the reproduction steps!

Here are some initial observations:

  • If Schema 2_2 is not enabled, I always get the (expected) "Manifest Invalid" error
  • On a new/fresh image build+push using jib, I am seeing an exception related to a chunk missing a newline terminator.

It's still pretty premature but I believe that exception is not caught and handled correctly (hence the 5xx) and I'm not sure which chunk (if any?) is malformed. On my local system and example, the request which generates this error is:

PATCH /v2/devtable/image-built-with-jib-test/blobs/uploads/bcab81cf-b0d9-4972-b4a6-63ba87d13da5 HTTP/1.1

Moving forward, I'm digging through the raw requests using Wireshark to see if I can find a malformed chunk. My goal is to produce a testable scenario to trigger this exception so that we can at least get it handled properly on Quay's end. If I find a malformed request being generated by jib and am able to reproduce it, I'll share that information as well.

For reference purposes, here's essentially my current test-case:

  • Clone jib (working directly out of master)
  • Point the helloWorld example to my Quay instance
  • Execute mvn -Djib.console=plain -Djava.util.logging.config.file=./logging.properties -DsendCredentialsOverHttp=true compile jib:build
  • Modify HelloWorld.java and the repository's name in pom.xml for each fresh run.

Note: After encountering the HTTP 500 error initially for a given image, a subsequent retry succeeds. Hence the purpose of my changes to pom.xml and HelloWorld.java in my test-case.

I am far from a Java, Maven, or Jib expert, so please don't hesitate to share suggestions if you have them. I had hoped I could use mitmproxy, but neither the command-line arguments (e.g. -Dhttp.proxyHost [...]) nor the Maven settings file seems to have worked. It is probably related to my local configuration and to not using TLS/HTTPS.

@loosebazooka
Member

loosebazooka commented Sep 23, 2019

@kurtismullins thanks for taking a look.
All the layers Jib generates are cached, so you can take a look at them if you want to inspect them.

Default base layer cache locations:
windows: %LOCALAPPDATA%/google-cloud-tools-java/jib
linux: <user-home>/.cache/google-cloud-tools-java/jib
macos: <user-home>/Library/Application Support/google-cloud-tools-java/jib

Default application layer cache:
maven: <project>/target/jib-cache
gradle: <project>/build/jib-cache

@chanseokoh
Member

chanseokoh commented Sep 23, 2019

The proxy should work. For example, if I do

mvn \
    -Dhttp.proxyHost=localhost -Dhttp.proxyPort=8085 \
    -Dhttps.proxyHost=localhost -Dhttps.proxyPort=8085 \
    -Dhttp.nonProxyHosts="" \
    jib:build

and do nc -l 8085 on a different terminal, I see Jib trying to connect to my GCR repo:

$ nc -l 8085
CONNECT gcr.io:443 HTTP/1.1
Host: gcr.io:443
Proxy-Connection: Keep-Alive

And what do you mean exactly by a "chunk"? If you are talking about the "Docker chunked upload", Jib doesn't do it but always does "monolithic" uploads.

@briandealwis
Member

@kurtismullins if your mitmproxy is on the localhost, I think you'll also need to specify -Dhttp.nonProxyHosts="" to force proxy access for the local machine too.

@kurtismullins

kurtismullins commented Sep 23, 2019

@chanseokoh & @briandealwis

Thank you! That seemed to work during the pull from gcr.io but doesn't seem to affect the push to my localhost:5000. It's as if it's ignoring the proxy altogether for localhost. I'll see if I can make it work later on -- perhaps by using a fake hostname in my /etc/hosts (Linux).

@chanseokoh

And what do you mean exactly by a "chunk"? If you are talking about the "Docker chunked upload", Jib doesn't do it but always does "monolithic" uploads.

Great question. I actually don't have more context on that Exception at the moment. My assumption was Chunked Encoding but you just mentioned that it's always "monolithic" so maybe that's not the case. I'm going to continue digging in.

Edit:
It appears that jib is using Chunked Encoding. From one of my PATCH requests:

    PATCH /v2/devtable/image-built-with-jib-test-2/blobs/uploads/617dfb2f-d310-48c4-a8d0-be976b662a17 HTTP/1.1\r\n
        [Expert Info (Chat/Sequence): PATCH /v2/devtable/image-built-with-jib-test-2/blobs/uploads/617dfb2f-d310-48c4-a8d0-be976b662a17 HTTP/1.1\r\n]
        Request Method: PATCH
        Request URI: /v2/devtable/image-built-with-jib-test-2/blobs/uploads/617dfb2f-d310-48c4-a8d0-be976b662a17
        Request Version: HTTP/1.1
    Accept: \r\n
    Accept-Encoding: gzip\r\n
     [truncated]Authorization: Bearer [snipped]
    User-Agent: jib 1.6.1 jib-maven-plugin Google-HTTP-Java-Client/1.27.0 (gzip)\r\n
    Transfer-Encoding: chunked\r\n
    Content-Type: application/octet-stream\r\n
    Host: localhost:5000\r\n
    Connection: Keep-Alive\r\n
    \r\n

@chanseokoh
Member

chanseokoh commented Sep 23, 2019

@kurtismullins thanks for the update. To clarify, Jib always does monolithic uploads instead of chunked uploads (which consist of multiple HTTP PATCH requests). Here, "chunked" refers to the "monolithic" vs. "chunked" distinction in the Docker API spec for pushing an image.

Now, as for the chunked HTTP transfer encoding, Jib doesn't explicitly craft chunked bodies. If the message is being chunked, it may be the HTTP client library we use that automatically does the chunking for some reason. But I've not really seen this happening in our logs. From your comment, it looks like you already know how to log detailed HTTP requests/responses from Jib. Do you actually see Transfer-Encoding: chunked in the Jib log? I wonder if your server mistakenly assumes Transfer-Encoding: chunked when Content-Length is missing. (Jib does not set Content-Length for the blob push PATCH.)
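To be clear about the terminology, the monolithic flow Jib follows looks roughly like this (a simplified sketch of the registry API requests, matching the PATCH and PUT visible in the logs above; paths abbreviated):

POST  /v2/<name>/blobs/uploads/                               -> 202 Accepted, Location: .../uploads/<uuid>
PATCH /v2/<name>/blobs/uploads/<uuid>                         (entire blob sent as a single request body)
PUT   /v2/<name>/blobs/uploads/<uuid>?digest=sha256:<digest>  -> 201 Created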

@chanseokoh
Member

chanseokoh commented Sep 23, 2019

But I've not really seen this happening from our logs.

Sorry, I'll take that back. I've monitored HTTP headers using mitmdump, and we do end up setting Transfer-Encoding: chunked. I think it is our HTTP library that does this. (Not sure yet whether it always does chunking or only occasionally.) I'll look into the content body to see if it is well-formed.

@chanseokoh
Member

@kurtismullins I think our HTTP library (Google HTTP Client for Java) is working fine and the chunked body is well-formed. The library is used extensively by the public, too. Do you see a specific body that is missing a newline terminator?

PATCH / HTTP/1.1
Accept:
Accept-Encoding: gzip
Authorization: Bearer ...
User-Agent: jib 1.6.2-SNAPSHOT jib-maven-plugin Google-HTTP-Java-Client/1.27.0 (gzip)
Transfer-Encoding: chunked
Content-Type: application/octet-stream
Host: localhost:8085
Connection: Keep-Alive

70d
{"created":"1970-01-01T00:00:00Z","architecture":"amd64","os":"linux", ... }
0

@kurtismullins

Thank you for checking @chanseokoh! From a quick glance, it looks good to me. I would also assume that Google HTTP Client for Java would work well and is pretty heavily tested. It does seem to work most of the time with Quay, as well.

Quay also relies on a mainstream (Python) HTTP library/framework. I also imagine it's been pretty well tested by the public.

I'm going to try to put together a minimal example to see if I can reproduce the issue while eliminating jib. I don't believe that Quay is doing anything that would affect this "low-level" portion of the stack but I could be wrong.

I am curious why configuring jib to run in serial would eliminate the issue. It would make it a lot easier to isolate the problem!

@chanseokoh
Member

chanseokoh commented Sep 26, 2019

I am curious why configuring jib to run in serial would eliminate the issue. It would make it a lot easier to isolate the problem!

No idea here either. But in case you missed it, our theory is that too many simultaneous connections cause the issue somehow. Relatedly, you can increase the number of concurrent uploads with docker push, and I am also curious whether this is reproducible with docker push after increasing it. But be aware that Docker retries pushes if there are failures, so this may be harder to notice with docker push.

@kurtismullins

I just wanted to chime back in. I have set up a simple environment using Docker to run three services. My intention is to replicate a sample user environment. These containers are running on a public cloud instance and I am running jib locally.

Containers:

  • Quay v3.0.5
  • Postgres 9.6
  • Redis (latest)

Unfortunately I am unable to replicate the issue under these circumstances. I can tell there is a significant difference in performance and in how quickly these requests are made in parallel.

I am not sure why I was running into the chunk encoding issue previously, but it could be down to a number of factors in my environment, so I don't want to point a firm finger at that issue. My development environment is significantly different from any user-facing environment.

If someone else can help me set up a test scenario to reliably reproduce this issue, that would help tremendously. My ideal goal would be to have verbose logs on both Quay's and Jib's ends. On Quay, I am providing the environment variable DEBUGLOG=true, and in Jib I am providing multiple options:

mvn -Djib.console=plain -Djava.util.logging.config.file=./logging.properties compile jib:build 2>&1 | tee jib-log.txt

Seeing as the performance is significantly different in this scenario, I am also curious if your theory about the connections is related @chanseokoh. As soon as I have time, I'll try to run Jib on the same instance that's running Quay to see if I can reproduce with the lower latency.

I did quickly test pushing a large image with "high concurrency" using Docker to the same Quay instance. It appears that Docker v18.06.3 actually retries pushing a layer upon failure. Modifying the file /etc/docker/daemon.json on Fedora 30, I have set the max upload count to 1,000 although it looks to be using only ~13 in parallel. In this quick test, I was able to reproduce the retry scenario a few times but have not gathered any data which indicates why there are failures.

@hbjastad Would you be able to attempt a similar test -- running a highly parallelized Docker push to your Quay environment? That might shed some light on whether this is related to Jib or not, as suggested by @chanseokoh.

@hbjastad
Author

hbjastad commented Oct 1, 2019

Using helloworld, I tried running mvn compile jib:build 10 times
against our local Quay installation, which is now on v3.1.0.

I got this error once, which is strange:

[ERROR] Failed to execute goal com.google.cloud.tools:jib-maven-plugin:1.6.1:build (default-cli) on project helloworld: com.google.cloud.tools.jib.api.RegistryUnauthorizedException: Unauthorized for QUAY-REGISTRY/r2d2/distroless_java: 401 Unauthorized
[ERROR] <html>
[ERROR] <head><title>401 Authorization Required</title></head>
[ERROR] <body bgcolor="white">
[ERROR] <center><h1>401 Authorization Required</h1></center>
[ERROR] <hr><center>nginx/1.12.1</center>
[ERROR] </body>
[ERROR] </html>

@kurtismullins, any idea why this would happen randomly?

The 500 error did not show up at all.

I updated the version to 1a and tried another 10 runs, with the exact same result
(401 error once, no other errors).

I then tried another 10 runs, this time I got the 401 error twice, and still no
500 error.

Is it possible that v3.1.0 of Quay has fixed the 500 error and instead introduced
this new, infrequent, 401 error?

@kurtismullins

@hbjastad That's interesting. I'm glad the 500s have cleared up for you! I am not sure about the 401. There were some auth-related changes between those two versions, so it is possible, although our test coverage is pretty extensive, so I wouldn't expect to see any authentication errors unless Jib is first making an unauthenticated request.

It may be helpful to add the -X flag to your mvn command to see which request is unauthenticated.
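For example, something like the following should capture the full exchange, including the unauthenticated request (a sketch reusing the plain build command from the reproduction steps):

mvn -X compile jib:build 2>&1 | tee jib-debug.log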

Until we can pinpoint that Jib, specifically, has an issue, I'd like to invite anyone interested over to discuss this in our Google Group. I have created a thread related to this GitHub issue. My hope is that this will get more visibility from other members of the Quay team/community and avoid producing more notifications than necessary for the folks behind Jib.

In the meantime, I'll try to reproduce the HTTP 401 response in Jib to see if I can track down what's going on there.

@chanseokoh
Member

Re: the random 401 errors (#1986 (comment)), I've analyzed the log and shown here that it is most likely Quay that sometimes misbehaves.

@chanseokoh
Member

chanseokoh commented Oct 2, 2019

There has been a development in the thread re: the 401 errors, and it looks like the cause has been identified.

The 401 situation is a bit similar to #1914. In #1914, the Location: value returned by the Microsoft server had two consecutive slashes (//) in the path component and various non-alphabet characters encoded in the query string part, and the Apache library was normalizing and decoding them. Here with Quay, the Location: had %2F in a query string value, which was being normalized to /. We disabled normalizing double slashes, which Microsoft seemed to have trouble with, but that did not disable decoding query string values. Apparently, Microsoft has no problem dealing with decoded query string values, but Quay doesn't seem to, which is probably why this is still problematic with Quay. It is probably best that Quay fix its behavior to conform to the standard when handling query string values. We could also try to find a way to disable decoding query string values, but this query string handling seems so basic that I'm not sure there's an API for that.

@chanseokoh
Member

chanseokoh commented Oct 29, 2019

In #2106 (comment), I've identified that it is GenericUrl that escapes query strings. (But it is still possible that Apache HttpClient does a similar escaping.)

from: exp=1572285389~hmac=f0a387f...     --> OpenShift says OK
to  : exp%3D1572285389~hmac%3Df0a387f... --> confuses OpenShift

So in #2106, escaping some characters confuses the OpenShift registry. However, this issue #1986 is the opposite. The query string value is being unescaped:

from: Signature=2wYOD0a%2BDAkK%2F9lQ... --> Quay says OK
to  : Signature=2wYOD0a%2BDAkK/9lQ...   --> confuses Quay

Worse, I realized this decoding happens on a different execution path than the path that encodes query strings.

@chanseokoh
Member

@sergue1 @hbjastad

#2201 only resolves the random 401 errors. But it was reported at some point that 500 errors are no longer returned (maybe Quay fixed it?). I'm closing this now, but if you still need assistance in having the Quay folks fix the internal server errors, feel free to re-open.

@chanseokoh
Member

chanseokoh commented Jan 29, 2020

@sergue1 @hbjastad we've released 2.0.0, and I expect you will no longer see the occasional 401 errors on Quay and OpenShift.
