Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

w/ GitLab Registry: "invalid status code from registry 404" #17999

Open
vrothberg opened this issue Mar 31, 2023 Discussed in #16842 · 19 comments
Open

w/ GitLab Registry: "invalid status code from registry 404" #17999

vrothberg opened this issue Mar 31, 2023 Discussed in #16842 · 19 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@vrothberg
Copy link
Member

Discussed in #16842

Originally posted by 2xB October 6, 2022

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

When pushing to a GitLab Container Registry, I sometimes randomly get Error: trying to reuse blob sha256:... at destination: Requesting bearer token: invalid status code from registry 404 (Not Found). Weirdly enough, when I upload multiple images from the same base image, I get this error message once for every image I try to push, the second push for each image works again.

Steps to reproduce the issue:

  1. Have a GitLab project with a GitLab Runner that can build and push Podman images (e.g. via Shell executor or custom executor)

  2. Use it very often

  3. Sometimes (rarely) get this error during podman push ...

Describe the results you received:

Error: trying to reuse blob sha256:... at destination: Requesting bearer token: invalid status code from registry 404 (Not Found)

Describe the results you expected:

Successfully pushing the image to the GitLab Container Registry

Additional information you deem important (e.g. issue happens only occasionally):

I have a full log of podman --log-level=debug push ... the time it fails. I probably can't post the full log, but if there's something to check in that log, please tell!

Output of podman version:

Client:       Podman Engine
Version:      4.2.1
API Version:  4.2.1
Go Version:   go1.18.5
Built:        Wed Sep  7 19:58:19 2022
OS/Arch:      linux/amd64

Output of podman info:

host:
  arch: amd64
  buildahVersion: 1.27.0
  cgroupControllers:
  - cpuset
  - cpu
  - cpuacct
  - blkio
  - memory
  - devices
  - freezer
  - net_cls
  - perf_event
  - net_prio
  - hugetlb
  - pids
  - rdma
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: conmon-2.1.4-2.fc36.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.4, commit: '
  cpuUtilization:
    idlePercent: 94.36
    systemPercent: 0.85
    userPercent: 4.78
  cpus: 6
  distribution:
    distribution: fedora
    variant: container
    version: "36"
  eventLogger: file
  hostname: 04e8b959c867
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.4.0-126-generic
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 2666459136
  memTotal: 4914487296
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.6-2.fc36.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.6
      commit: 18cf2efbb8feb2b2f20e316520e0fd0b6c41ef4d
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-0.2.beta.0.fc36.x86_64
    version: |-
      slirp4netns version 1.2.0-beta.0
      commit: 477db14a24ff1a3de3a705e51ca2c4c1fe3dda64
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 4243320832
  swapTotal: 4294963200
  uptime: 26h 8m 44.00s (Approximately 1.08 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.imagestore: /var/lib/shared
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.9-1.fc36.x86_64
      Version: |-
        fusermount3 version: 3.10.5
        fuse-overlayfs: version 1.9
        FUSE library version 3.10.5
        using FUSE kernel interface version 7.31
    overlay.mountopt: nodev,fsync=0
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 40501673984
  graphRootUsed: 16835969024
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /tmp/custom-executor1517200462
  imageStore:
    number: 57
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.2.1
  Built: 1662580699
  BuiltTime: Wed Sep  7 19:58:19 2022
  GitCommit: ""
  GoVersion: go1.18.5
  Os: linux
  OsArch: linux/amd64
  Version: 4.2.1

Package info (e.g. output of rpm -q podman or apt list podman or brew info podman):

podman/stable from quay.io, run inside another Podman inside a VirtualBox

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

podman/stable from quay.io, run inside another Podman inside a VirtualBox
```</div>
@vrothberg vrothberg added the kind/bug Categorizes issue or PR as related to a bug. label Mar 31, 2023
@vrothberg
Copy link
Member Author

Relevant quote from @mtrmac in the discussion:

404 when trying to authenticate? Ugh. The ActiveRecord log above is an explanation for how that 404 happens, but it’s still a server-side bug if it reports “this does not exist” and not “server unavailable”, or whatever the cause is…

Right now this error case does not return a Go-detectable error. That can be fixed, of course.

Automatically retrying on a 404… really seems like a bad papering over of a clear problem; if we were building a list of things to never retry on, a 404 would definitely be on it.

I guess accepting the idea of “cosmic-rays-induced failures” and retrying on pretty much everything (except where the case is known to be local and unfixable, like out of disk space) in c/common/pkg/retry might make sense — as long as we only retry once?

@mtrmac
Copy link
Collaborator

mtrmac commented Apr 1, 2023

I guess accepting the idea of “cosmic-rays-induced failures” and retrying on pretty much everything

I think it would also be useful to be quite loud about it (warning-level logs at least).

@andrewleech
Copy link

I'm planning on raising a matching issue of Gitlab's issue tracker when I'm back in the office next week to get their input.

It's worth noting though that site-wide on gitlab, any time there's an auth / permission restriction (eg trying to view a private project you don't have access to) the server responds with 404. I think this is a valid response choice from a security perspective, in that the server can't "leak" the existence of a resource you don't have permission to view.

That might be why we get a 404 here, if it's some kind of rate limiting on the auth endpoint?

@andrewleech
Copy link

I've raised https://gitlab.com/gitlab-org/gitlab/-/issues/404326 to request some feedback from the gitlab side.

@mtrmac
Copy link
Collaborator

mtrmac commented Apr 3, 2023

It's worth noting though that site-wide on gitlab, any time there's an auth / permission restriction (eg trying to view a private project you don't have access to) the server responds with 404.

That concept of pretending something doesn’t exist is fine; but the server still needs to follow other constraints of the protocol.

E.g. that other ticket shows that the server returns a 404 on GET /jwt/auth?…. I read that as claiming not that the parameters refer to an inconsistent repo, but that /jwt/auth itself doesn’t exist, and that doesn’t really make sense.

What is the client to do with that information? We could, reasonably, and consistently with that concept, set up this operation to treat 404 (“does not exist”) the same as 403 (unauthorized) — but 403 is explicitly one of the results where we don’t retry because it wouldn’t help and it could hurt (by locking the user out due to repeated authentication failures).

If the server is encountering some kind of outage / downtime / overload, I think it should indicate an outage / downtime / overload; that reveals nothing about the authentication rules to the client, but it allows the client to make a desirable retry decision.

@2xB
Copy link

2xB commented Apr 5, 2023

I don't think I have seen this error again after reporting it, although it did occur multiple times before. The only thing I can imagine is that since then, I've regularly updated both GitLab and Podman to the latest version. Potentially it makes sense to track with which Podman and GitLab versions people see this issue.

@mtrmac Regarding I think it would also be useful to be quite loud about it (warning-level logs at least).: For the original bug report I have such a log. As mentioned in the initial bug report, I probably can't post the full log, but if there's something to check in that log, please tell!

@mtrmac
Copy link
Collaborator

mtrmac commented Apr 5, 2023

Historically, there was https://gitlab.com/gitlab-org/gitlab/-/issues/215715 , where a repo is expected to be auto-created on a first upload, but if it is accessed while it is being auto-created (in this case, with parallel uploads of multiple layers), some of the attempts may fail. I’m not at all sure this is related in any way, but it is a bit suggestive in that maybe failures encountered while setting up a process might not show up afterwards.


The “quite loud warning-level logs” idea was suggested as something to include in the future auto-retry behavior of Podman, so that there is a trace of a thing failing. It wouldn’t directly help with this situation, especially if the failures no longer occur.

In this case, I think the activerecord::RecordNotFound logs in the GitLab ticket are most important data to allow a code fix.

@2xB
Copy link

2xB commented Apr 5, 2023

@mtrmac This is an amazing find. I never considered it but yes, when I reported this issue, that was the first push to that image repository and after deleting an image repository and pushing to it again, I again see this issue. A reason why I never considered it is that this is not happening at the beginning of the push transaction, but rather at the end. In any case I can therefore confirm that also for me this happens when repositories are auto-created.

@bmaupin
Copy link

bmaupin commented Apr 25, 2023

This happens to me almost every time when pushing to a new image registry. The first push fails, but GitLab creates the registry successfully, and the second push works fine.

I can't reproduce it with docker at all, only buildah. Maybe docker has some kind of built-in retry logic?

@bmaupin
Copy link

bmaupin commented Apr 25, 2023

My workaround is to just try the push twice. So in my .gitlab-ci-yml, I replaced

buildah push --format=v2s2 "$CI_APPLICATION_REPOSITORY:$CI_APPLICATION_TAG"

with

buildah push --format=v2s2 "$CI_APPLICATION_REPOSITORY:$CI_APPLICATION_TAG" || sleep 10; buildah push --format=v2s2 "$CI_APPLICATION_REPOSITORY:$CI_APPLICATION_TAG"

@mhio
Copy link

mhio commented Apr 26, 2023

Maybe docker has some kind of built-in retry logic?

I haven't seen Docker attempt 6 simultaneous auth requests for the JWT bearer token prior to the image push.
If the root cause is a race condition in gitlabs active record, it's hard to trigger that when there is no race.

A buildah login first might work, instead of the push attempting to populate the bearer token .

@bmaupin
Copy link

bmaupin commented Apr 26, 2023

A buildah login first might work, instead of the push attempting to populate the bearer token .

I'm pushing to an authenticated registry so I'm already doing a buildah login. That doesn't seem to fix the issue.

I assumed the 404 was because the registry wasn't found (because it wasn't created yet), not because the auth actually failed.

@mtrmac
Copy link
Collaborator

mtrmac commented Apr 26, 2023

Requesting bearer token: invalid status code from registry 404 (Not Found)

Technically, the authentication/authorization step is failing. But that doesn’t rule out the possibility that a single authentication request would have succeeded, while a series of concurrent ones triggers a failure.

The code could, possibly, track authentication requests in flight, and wait for an existing one to succeed instead of starting a parallel one. I don’t immediately know whether it would avoid this problem; it’s anyway the polite thing to do, against servers that could be rate-limiting authentication attempts (especially if the user provided incorrect credentials). It might even be a slight performance improvement.

@mhio
Copy link

mhio commented Apr 27, 2023

I'm pushing to an authenticated registry so I'm already doing a buildah login. That doesn't seem to fix the issue.

I assumed the 404 was because the registry wasn't found (because it wasn't created yet), not because the auth actually failed.

Interesting, I had only seen a 404 in the batch of JWT auth requests to gitlab, not in the requests to the registry.

@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@andrewleech
Copy link

Fyi the ticket I raised at gitlab for this issue has been triaged as severity: major so hopefully we'll get some investigating from their end soon

There was also a potential workaround posted that involves building to a different tag name, then retagging to desired tag, though it doesn't make much sense to me as I wouldn't think it'd result in any change to registry api calls:
https://gitlab.com/gitlab-org/gitlab/-/issues/404326#note_1401274540

@mtrmac
Copy link
Collaborator

mtrmac commented May 29, 2023

Technically, the authentication/authorization step is failing. But that doesn’t rule out the possibility that a single authentication request would have succeeded, while a series of concurrent ones triggers a failure.

The code could, possibly, track authentication requests in flight, and wait for an existing one to succeed instead of starting a parallel one. I don’t immediately know whether it would avoid this problem; it’s anyway the polite thing to do, against servers that could be rate-limiting authentication attempts (especially if the user provided incorrect credentials). It might even be a slight performance improvement.

I have a draft implementation in containers/image#1968 . Could someone who can reliably reproduce this GitLab failure test a Podman build with that change?

@bcg62
Copy link

bcg62 commented Jul 7, 2023

@mtrmac I can reproduce this, if you or someone has a build available i'll give it a try.

@andrewleech
Copy link

If anyone has ideas / information regarding reproducing the issue (even just describing your CI setup that does reproduce it) gitlab are trying to investigate it here: https://gitlab.com/gitlab-org/gitlab/-/issues/404326#note_1587264776

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests

8 participants