
[registry-facade] Re-deploying Gitpod breaks current image pulls #2512

Closed
csweichel opened this issue Dec 11, 2020 · 5 comments
@csweichel (Contributor)

Problem

registry-facade runs as a DaemonSet on each node, and the container runtime (e.g. containerd) pulls images through this service. The service is exposed directly via a hostPort on the node, precisely because it needs to be reachable from outside of Kubernetes.

When the registry-facade service restarts, e.g. during a deployment, there is a service interruption: image pulls that are currently running break because the service goes down, and new image pulls fail because no service is available at that moment.

Possible Solutions

Graceful socket handover: when a new registry-facade starts, it checks whether an instance is already running. If so, it requests a handover from the old instance: the new facade takes over the listening socket and places the old one in a "draining mode". We'd need to allow for a generous termination grace period.

@csweichel (Contributor, Author)

Currently Kubernetes does not yet support surge rollouts of DaemonSets (the KEP is through, the API is in the works), which limits the handover functionality we can implement, as we probably need both the new and the old process running at the same time to implement a graceful handover.

@csweichel (Contributor, Author)

We can achieve registry-facade deployment with a zero-downtime handover by deploying a "handover daemonset" first. The process would be as follows:

  1. the handover daemonset requests the HTTP server socket from the old registry facade (v0)
  2. v0 continues to serve requests in progress (i.e. finish them), but can't accept new ones. It doesn't have the server socket anymore. v0 shuts down once all requests have been served. We need to allow for a generous enough termination grace period.
  3. The handover process accepts incoming requests on the server socket, but does not answer them yet.
  4. Once the new registry-facade (v1) is deployed, it asks the handover process for the server socket over a Unix domain socket. The handover process passes the server socket along using SCM_RIGHTS.
  5. The new registry facade v1 serves new requests from now on.
  6. The handover process proxies previously accepted requests to v1, thereby answering them properly now. Once all pending requests have been proxied and answered, the handover process can terminate. We have to allow for a generous enough termination grace period to facilitate that.

The following sequence diagram illustrates the process. The yellow request is a proxied one.
[image: handover sequence diagram]

There are a few benefits of this concept:

  • the handover process is entirely optional. If no handover happens, the registry-facade would just terminate like it does now.
  • once Kubernetes supports surge rollouts of DaemonSets, we can easily migrate away from the handover process.
  • all code can remain in registry-facade and no additional service is required.

There are also downsides/questions to this idea:

  • it introduces an additional (albeit optional) step during the deployment/upgrade.
  • it's unclear if the server socket handover works across network namespaces, or if registry-facade would need to run with hostNetwork: true

@csweichel (Contributor, Author)

Working example for passing sockets/FDs through unix sockets: https://github.com/ftrvxmtrx/fd

@geropl (Member) commented Dec 14, 2020

Hm. It feels like the above is like shooting birds with cannons. I guess you already checked, but just to make sure: is it an option to create the socket using SO_REUSEPORT? The procedure would stay the same, but we would not need to "move" the port.

@csweichel (Contributor, Author)

PR is merged, follow up in #3049
