[Flake] kaniko flake (5xx) fetching image from production.cloudflare.docker.com #6438

Closed
devjgm opened this issue May 3, 2021 · 35 comments · Fixed by #13165

@devjgm
Contributor

devjgm commented May 3, 2021

Log: https://pantheon.corp.google.com/cloud-build/builds;region=global/a7edfd7e-c43e-43c2-a3d6-3ddba4565084;step=0?project=cloud-cpp-testing-resources

Digest: sha256:a12a027e1d0afbeb6cc31bb07e89d94dc47fa768265416350d442d878bdb6064
Status: Downloaded newer image for gcr.io/kaniko-project/executor:edge
gcr.io/kaniko-project/executor:edge
INFO[0000] Resolved base name centos:7 to devtools      
INFO[0000] Using dockerignore file: /workspace/ci/.dockerignore 
INFO[0000] Retrieving image manifest centos:7           
INFO[0000] Retrieving image manifest centos:7           
error building image: GET https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/86/8652b9f0cb4c0599575e5a003f5906876e10c1ceb2ab9fe1786712dac14a50cf/data?verify=REDACTED: unsupported status code 503; body: <!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<title>Temporarily unavailable | production.cloudflare.docker.com | Cloudflare</title></title>
...

It looks like our kaniko build step that was creating our image got a 503 when fetching one of the layers from docker.com. I'm not sure if there's anything we can do to fix this. I think not.

But I'm filing this issue anyway so we can track if it's a common issue.

@devjgm devjgm added the type: cleanup An internal cleanup or hygiene concern. label May 3, 2021
@coryan
Contributor

coryan commented May 3, 2021

@coryan
Contributor

coryan commented May 4, 2021

I tried adding --registry-mirror=mirror.gcr.io and it did not work. From what I could gather from the error messages, mirror.gcr.io does not host a number of the images we need (fedora:33 or ubuntu:bionic).

There is a way to create our own mirror and host it, but that seems very involved.

@coryan coryan changed the title [Flake] kaniko flake (503) fetching image from production.cloudflare.docker.com [Flake] kaniko flake (5xx) fetching image from production.cloudflare.docker.com May 8, 2021
@coryan
Contributor

coryan commented May 8, 2021

This one is similar enough that I think we should consolidate them:

https://pantheon.corp.google.com/cloud-build/builds;region=global/9ef522b9-59b8-4ca9-9a71-0d5789a497ee;step=0?project=cloud-cpp-testing-resources

error building image: GET https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/ae/aefd7f02ae24739b95f77c488de70465c54653f394097b9859acede976c80e03/data?verify=REDACTED: unsupported status code 502; body: <html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>cloudflare</center>
</body>
</html>

@coryan
Contributor

coryan commented May 10, 2021

I figured out why --registry-mirror=mirror.gcr.io does not work: it only holds the "latest" tag of each popular image, bummer.

@coryan
Contributor

coryan commented May 10, 2021

A different solution may involve using the "warmer" program:

https://github.com/GoogleContainerTools/kaniko/tree/master/cmd/warmer

This can download the base image to /cache, which could be a volume shared between the warmer and the kaniko steps.

There are a couple of additional twists:

  • Naming a tag such as fedora:33 or ubuntu:bionic requires a roundtrip to the Docker Hub registry, because the image behind a tag may (and does) change.
  • These roundtrips to the registry are what fail with 5xx errors.
  • We can avoid them (to some degree) if we pin a SHA and say FROM fedora:33@sha256:ab9c680acef5a053cf2a6bddcebfa9674576d5104927180ef27a35d2dbab15fc
  • Note that using the SHA saves a roundtrip to the registry even if we do not use the warmer. In other words, one less chance to get snake eyes when rolling the network dice.
  • Both the warmer and the kaniko steps would need to download the same SHA. The warmer takes a docker image name as input, not a Dockerfile, which suggests we need a script to extract the version from the Dockerfile.
  • We can have the renovate bot update the SHA. That in itself seems interesting, because maybe we do not want updates to the OS unless we run the tests first.
  • Note that the warmer step would still need to download the base image, so we are saving roundtrips to the registry, but still have a download.
  • Maybe we can have that /cache directory really cached as a GCS tarball (sure would be nice if kaniko did that instead).
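
For illustration only, here is a rough sketch of what the "script to extract the version from the Dockerfile" might look like. Nothing here exists in the repo: the function names (base_images, is_pinned, warmer_args) are hypothetical, the parsing is deliberately naive (it does not expand ARG variables such as FROM ${BASE}), and the --cache-dir / --image flags are as I understand them from the kaniko README, so they should be verified before use.

```python
import re

# Matches a Dockerfile FROM instruction, for example:
#   FROM fedora:33 AS devtools
#   FROM --platform=linux/amd64 fedora:33@sha256:<digest>
_FROM_RE = re.compile(
    r"^\s*FROM\s+(?:--\S+\s+)*(?P<image>\S+)(?:\s+AS\s+(?P<alias>\S+))?\s*$",
    re.IGNORECASE,
)


def base_images(dockerfile_text):
    """Return the external base images named by FROM lines, in order."""
    images = []
    aliases = set()
    for line in dockerfile_text.splitlines():
        match = _FROM_RE.match(line)
        if not match:
            continue
        image = match.group("image")
        # Skip the empty base image, earlier build stages, and repeats.
        if (
            image.lower() != "scratch"
            and image not in aliases
            and image not in images
        ):
            images.append(image)
        if match.group("alias"):
            aliases.add(match.group("alias"))
    return images


def is_pinned(image):
    """True if the reference carries a digest (name[:tag]@sha256:...)."""
    return "@sha256:" in image


def warmer_args(images, cache_dir="/cache"):
    """Build an argument list for the warmer (flag names are an assumption)."""
    return [f"--cache-dir={cache_dir}"] + [f"--image={i}" for i in images]
```

Feeding the output of base_images() into both the warmer step and a check that every image is_pinned() would keep the two steps on the same SHA.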

@coryan
Contributor

coryan commented Sep 23, 2021

No repeats in 90d, closing. I suspect we will need to reopen though.

@coryan coryan closed this as completed Sep 23, 2021
@coryan coryan added the cpp: flake A test with false positives (failures that are not interesting) label Jan 27, 2022
@devbww
Contributor

devbww commented Feb 19, 2022

https://console.cloud.google.com/cloud-build/builds;region=us-east1/99910c10-0808-429c-9f10-4262c19416ba?project=cloud-cpp-testing-resources

Step #0: error building image: GET https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/dc/dcf4d4bef137f695d11ed187ba6a135362dca3de36955c4da0905d596ce521bc/data?verify=REDACTED: unexpected status code 502 Bad Gateway: <html>
Step #0: <head><title>502 Bad Gateway</title></head>
Step #0: <body>
Step #0: <center><h1>502 Bad Gateway</h1></center>
Step #0: <hr><center>cloudflare</center>
Step #0: </body>
Step #0: </html>

@devbww devbww reopened this Feb 19, 2022
@devjgm
Contributor Author

devjgm commented Mar 16, 2022

https://pantheon.corp.google.com/cloud-build/builds;region=us-east1/472c831b-dd84-4d4d-98a5-7274a880d8b3;step=0?project=cloud-cpp-testing-resources

error building image: error building stage: failed to get filesystem from image: Get "https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/7c/7c3b88808835aa80f1ef7f03083c5ae781d0f44e644537cd72de4ce6c5e62e00/data?verify=1647437859-y6QANhynFYEw1DOP0fkes9J%2F4eY%3D": read tcp 192.168.10.3:54192->104.18.124.25:443: read: connection reset by peer

@coryan
Contributor

coryan commented Mar 16, 2022

devjgm added a commit to devjgm/google-cloud-cpp that referenced this issue Mar 16, 2022
Related to googleapis#6438

This might (or might not) fix the rare kaniko flakes we see. But I
imagine it almost certainly cannot hurt.
@devjgm
Contributor Author

devjgm commented Mar 16, 2022

Nice find. Can't hurt. Let's give it a shot: #8558

devjgm added a commit that referenced this issue Mar 17, 2022
Related to #6438

This might (or might not) fix the rare kaniko flakes we see. But I
imagine it almost certainly cannot hurt.
devjgm added a commit to devjgm/google-cloud-cpp that referenced this issue Mar 17, 2022
Related to googleapis#6438

This might (or might not) fix the rare kaniko flakes we see. But I
imagine it almost certainly cannot hurt.
@dbolduc
Member

dbolduc commented May 7, 2022

@coryan
Contributor

coryan commented Aug 6, 2022

90d without a repeat, closing.

@coryan coryan closed this as completed Aug 6, 2022
@dbolduc dbolduc reopened this Jun 12, 2023
@coryan
Contributor

coryan commented Sep 6, 2023

Slightly different error message, but I think the same root cause. I am changing the title to be more generic.

https://pantheon.corp.google.com/cloud-build/builds;region=us-east1/542ff150-ac6a-4420-8692-b084b5c5e189?project=cloud-cpp-testing-resources&mods=logs_tg_prod

Step #0: error building image: error building stage: failed to execute command: extracting fs from image: read tcp 192.168.10.2:52280->142.251.162.207:443: read: connection reset by peer

@devbww
Contributor

devbww commented Oct 20, 2023

https://console.cloud.google.com/cloud-build/builds;region=us-east1/226f040c-17ef-4b65-b68d-d410e818372a?project=936212892354

error building image: error building stage: failed to execute command: extracting fs from image: read tcp 192.168.10.2:57314->74.125.26.207:443: read: connection reset by peer
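
The failure signatures reported in this thread so far (503, 502, and "connection reset by peer") can be told apart mechanically, which may help when tracking how often each one recurs. The sketch below is only an illustration: classify and tally are hypothetical names, and the patterns are taken from the log excerpts quoted above, not from any documented kaniko error format.

```python
import re
from collections import Counter

# The logs above use both "unsupported status code 503" and
# "unexpected status code 502 Bad Gateway", so accept either wording.
_STATUS_RE = re.compile(r"(?:unsupported|unexpected) status code (\d{3})")


def classify(line):
    """Return '5xx', 'connection-reset', or None for one build log line."""
    if "error building image" not in line:
        return None
    match = _STATUS_RE.search(line)
    if match and match.group(1).startswith("5"):
        return "5xx"
    if "connection reset by peer" in line:
        return "connection-reset"
    return None


def tally(log_lines):
    """Count the known flake signatures in an iterable of log lines."""
    return Counter(kind for kind in map(classify, log_lines) if kind)
```

Running tally() over a downloaded build log gives a per-signature count that could be pasted here in place of the raw log excerpt.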

@coryan
Contributor

coryan commented Oct 20, 2023

FWIW: this seems to be:

GoogleContainerTools/kaniko#1717

This may also be of use, but requires a lot more configuration:

https://cloud.google.com/artifact-registry/docs/repositories/remote-repo

@dbolduc
Member

dbolduc commented Nov 2, 2023

@alevenberg alevenberg self-assigned this Nov 9, 2023
@alevenberg
Member

Attempting to fix upstream (GoogleContainerTools/kaniko#2837)

@dbolduc
Member

dbolduc commented Dec 12, 2023

@dbolduc dbolduc reopened this Dec 12, 2023
@alevenberg alevenberg removed their assignment Dec 13, 2023
@alevenberg
Member

I tried :(

@alevenberg
Member

alevenberg commented Feb 13, 2024

Step #0: error building image: error building stage: failed to execute command: extracting fs from image: read tcp 192.168.10.2:45946->74.125.196.207:443: read: connection reset by peer

Build FAILURE: libcxx-ci https://console.cloud.google.com/cloud-build/builds;region=us-east1/aea5600b-a3a6-4159-a043-3ae3d52d8dac?project=936212892354

https://ff61d7d2b82917f3c17eaeff0c86b71d3239f0c72a9c7b4acb459bf-apidata.googleusercontent.com/download/storage/v1/b/cloud-cpp-community-publiclogs/o/logs%2Fgoogle-cloud-cpp%2Fmain%2Fa988f3e39ca5134b6966578c3db5da07e1147156%2Ffedora-latest-bazel-libcxx-__default__%2Flog-aea5600b-a3a6-4159-a043-3ae3d52d8dac.txt?jk=ATxpoHeeaWtiteuzk_Hv4q9-IM7XMf4UI-FwcW3nzSl-qCF_T09zhJSr_WlG3yIzztYXmDjGnzEOuwJO3645otW7Tk0IFqhjohyC6X70Ww9uuVpgOE_C9AiMvxoM-12lPeUJmAWVgzFYLUJ_sbEGjmRnksixk2RozK04efDNH5GQiHCvwkfpXwj61Ex_YIuqmFt5ji7gB0HbVfDMzcBD8DZZpWzhoAPZzytpxVDoJj1UcwOvTUaApiNrREyaAzHnH5R5ilGLZn6x6Be4kW_LqLMQWxDOMcWVAFVntzW3E9WRjA3XuxNB8B7lUDvBC1POWwCH52Lfs_DhnrKqTL-O9GBQvKaicZ_ubfd7U2SSK3UOQIf4pxmBmSVbTiSUqCQebhTAwqBZio5MM6wjiytRtr9GkPu3Ld52JZvOPHcnwPkA7BFzw351pZfgh9SrQo2qAv2z58X8OdZua32RPvTBb3jXN3MZJ4ZL55eACl6ctTbNn3aXV0DiB6oiMKc7fvBGdX3RQhH6Qex0vFiCHEY9fq7xan_q1TVcRGuxRnYK322NsL33vqY2OIVxmpZBOpKvnMiZRMpaFp3zGlry-S-waYcXTaxIk_cfj8Hv857mA9aeG2GjMigIZxsXlBs6wGiuIxoY8Tg0r8cjN0QJ7q4Ub0K2B3m-vZM7dY3-6lAbk8lC9cpiAkKaiE9neOf8wOYVSPU1-gcXBqz6W7G2pa8nEuG5HTtSMOsWDhLGo3Prv0U3o7VMtkVyomZtf4ubtRjU3WPGwMEmDk9TdQJGkPKzs7UFqwzJwNeW5dbnU8EzgTN9s9S0TuxxqJaB2xg6_Q-73X8qZdXL7QwgiqI6OK7fDEgC2vD92tj47K_G-Ar5hsbFC3FSPHOXc7PQI1z92WjsuIWlzgaObFlAAslcLtk6q7Kpg36RHYJPEtRFFQcEQuC3PermC-EUQtsL5wOclZXTkh9VHc05qhcnI753DF5FOAbFOwytdoz2K_wYHyNc4TWUI9YZzuAhsDUseayDvlZEN_fm4n88kgFz-w2aGzPvH61SKblV_qttdslHvIHi881RKdFV7nkNghHjQaF97CPcipl6DhMqm6sW7Yl8bhguSIuktyAQ2xZosH-QpKo5HhGS8spa-k6FNvw7GOvG7EjEDYJvAKJL0PN4HQtKzKmTxjMYyq4OKcR1LgqpSpeREY_ne7vHBnhXIn1OBXOXoZ3cPuk9pR5M5ZAFvIfvzmmGDeTsnPx8V6nam8I1oBxyeeblL7PeTZzXLyFVx3STt0l3XCFvvU32eEkWahxrygI_FdAmg4URqy06igc88ofMTqjZN6MPDoF3IYNNzYOSEGdUvF2kGCV60r7Hm-zQ4V6jg3hGy-7gLUeNKsmzraqqAaIQ2iBoN9Rv7av6&isca=1

@scotthart
Member

Haven't seen in 90d; closing.
