Retry client requests to Director if we detect the Director is rebooting #1890

Merged Jan 27, 2025 (21 commits)

Conversation

jhiemstrawisc
Member

This PR tries to improve our clients' behavior in the event that the Director is being rebooted, or was recently rebooted and has not yet repopulated its cache of server ads.

The primary issue we're trying to work around is that our production OSDF Director lives behind a Traefik ingress proxy. When the Director is rebooting, the ingress proxy still responds to Director requests, but almost always with a 404, 500, or 502. If the client gets an error response but isn't actually talking to the Director, it should follow a backoff retry strategy to give the Director time to finish rebooting.

On the other hand, the Director may have recently rebooted, but hasn't had time to receive all the server advertisements it needs to issue the correct redirect. In this case, the Director detects that it recently rebooted and sends an HTTP 429 (too many requests) to the client. The client detects this and starts retrying while it waits for the Director to receive the needed server ads.
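To make that second scenario concrete, here's a minimal Go sketch of the Director-side behavior described above. This is illustrative, not Pelican's actual handler code; `startupGracePeriod`, `adExists`, and the grace-period length are assumptions.

```go
package director // hypothetical package name, not Pelican's actual layout

import (
	"net/http"
	"time"
)

var startTime = time.Now() // when this Director process booted

const startupGracePeriod = 5 * time.Minute // assumed re-advertisement window

// redirectHandler sketches the 429 behavior described above. If the Director
// restarted recently and has no ad for the requested object, it answers 429
// so clients retry instead of treating a transient 404 as fatal.
func redirectHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Server", "pelican/7.13.0") // lets clients identify a real Director

	if !adExists(r.URL.Path) {
		if time.Since(startTime) < startupGracePeriod {
			w.Header().Set("Retry-After", "10") // hint: retry in ~10 seconds
			http.Error(w, "director recently restarted; retry shortly", http.StatusTooManyRequests)
			return
		}
		http.NotFound(w, r)
		return
	}
	// ... otherwise issue the usual redirect to a cache or origin ...
}

// adExists stands in for the Director's server-ad cache lookup.
func adExists(objectPath string) bool { return false }
```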

The core change here is that the Director now sends a Server header to clients, which lets clients determine who they're talking to and what they should do in the event they encounter certain errors.
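As a rough sketch of how a client might act on that header (this helper is illustrative, not the PR's actual code):

```go
package client // hypothetical package name

import (
	"net/http"
	"strings"
)

// shouldRetryDirectorResponse sketches the client-side check described above.
// A response lacking the "Server: pelican/..." header but carrying an error
// status likely came from the ingress proxy while the Director reboots; a
// 429 from a real Director means it is still waiting on server ads.
func shouldRetryDirectorResponse(resp *http.Response) bool {
	fromPelican := strings.HasPrefix(resp.Header.Get("Server"), "pelican/")
	if !fromPelican && resp.StatusCode >= 400 {
		return true // proxy answering on the Director's behalf
	}
	if fromPelican && resp.StatusCode == http.StatusTooManyRequests {
		return true // Director up, but ads not yet repopulated
	}
	return false
}
```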

This is mostly built on top of Brian's draft PR #1565, and this PR should replace that one.

bbockelm and others added 3 commits January 10, 2025 21:48
When the director restarts,
- Detect whether a response is coming from Pelican or an SSL terminator application.
  In the latter case, retry the request a few times to allow the director to restart.
- If the director has recently restarted, instead of sending a 404, return a 429 (Too
  Many Requests) indicating the client should retry again soon.
…s` param

This sets a hidden parameter called `Client.IsPlugin` whenever the binary is
named "pelican_xfer_plugin" or "pelican_plugin", or any time the `stashPluginMain`
function is executed. Some actions in our client may soon behave differently if
they detect they're running in plugin mode because we assume plugin failures are
generally more expensive.
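A minimal sketch of the binary-name half of that detection (the function name `isPluginInvocation` is hypothetical; `stashPluginMain` and the binary names come from the commit message above):

```go
package client // hypothetical package name

import (
	"os"
	"path/filepath"
)

// isPluginInvocation sketches the binary-name check described above; the
// real code also flips Client.IsPlugin whenever stashPluginMain runs.
func isPluginInvocation() bool {
	switch filepath.Base(os.Args[0]) {
	case "pelican_xfer_plugin", "pelican_plugin":
		return true
	}
	return false
}
```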
Now the client will try to detect two situations that may call for retries:
1) The "director's" response is missing the `Server: pelican/` header AND
there's a status code indicating some error. This means we're likely hitting
a proxy and should retry until the Server header reappears
2) The Server header is correct, but the Director responds with an HTTP 429
(too many requests) indicating it just rebooted and can't issue the requested
redirect, perhaps because it's waiting for a service to re-advertise

In both cases, the retry logic uses a configurable number of retries (where plugin
detection doubles the number of retries), and the retry frequency is backed off
with each unsuccessful attempt.
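Putting both cases together, the described behavior amounts to a loop along these lines. This is a sketch under assumed defaults (base delay, function names), not the PR's implementation:

```go
package client // hypothetical package name

import (
	"errors"
	"net/http"
	"strings"
	"time"
)

// queryDirectorWithRetry sketches the loop described above. It retries when
// the response looks like it came from a proxy (no "Server: pelican/" header
// plus an error status) or when the Director itself answers 429, doubling
// the wait between attempts.
func queryDirectorWithRetry(c *http.Client, req *http.Request, maxRetries int, isPlugin bool) (*http.Response, error) {
	if isPlugin {
		maxRetries *= 2 // plugin failures are assumed to be more expensive
	}
	delay := 500 * time.Millisecond // assumed base delay
	for attempt := 0; attempt <= maxRetries; attempt++ {
		resp, err := c.Do(req)
		if err == nil {
			fromPelican := strings.HasPrefix(resp.Header.Get("Server"), "pelican/")
			proxyError := !fromPelican && resp.StatusCode >= 400
			rebooting := fromPelican && resp.StatusCode == http.StatusTooManyRequests
			if !proxyError && !rebooting {
				return resp, nil // a genuine answer from the Director
			}
			resp.Body.Close() // discard the transient error response
		}
		time.Sleep(delay)
		delay *= 2 // back off with each unsuccessful attempt
	}
	return nil, errors.New("director did not recover within the retry budget")
}
```

A real implementation would also need to re-create any request body between attempts and cap the total wait; the sketch assumes a body-less GET.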
@jhiemstrawisc added this to the v7.13.0 milestone on Jan 13, 2025
@jhiemstrawisc added the enhancement (New feature or request), client (Issue affecting the OSDF client), and director (Issue relating to the director component) labels on Jan 13, 2025
@jhiemstrawisc linked an issue (Improve client handling around director outages) on Jan 13, 2025 that may be closed by this pull request
jhiemstrawisc and others added 13 commits January 15, 2025 21:14
Prior to the introduction of the pelican-server binary, pelican was
always statically linked. However, pelican-server needs to be dynamically
linked so it can access the lotman shared library.

This caused an issue in our production cache containers because the
pelican-server binary was built in an alpine container (inherited from
the old goreleaser container) which uses libmusl for linking, but then
copied into an Alma container in the final stage, where libmusl is
unavailable.

With this fix, the base container used to build pelican binaries is switched
from the goreleaser container to a raw alma9 container. A side effect of this
is that we now have to install a few things we previously didn't. Another
thing I can't readily explain is that making this change appears to have
altered the path these binaries are built under from
`/pelican/dist/pelican_linux_amd64_v1/pelican`
to
`/pelican/dist/linux_amd64/pelican_linux_amd64_v1/pelican`

This doesn't appear to be an issue and was accounted for by changing a few
paths in the Dockerfile. Famous last words?
@jhiemstrawisc
Member Author

Finally, tests all passing. This one should be ready for a review!

Collaborator

@turetske left a comment


Mostly approved, with some requested changes noted in the comments.

@jhiemstrawisc merged commit bc0b7bf into PelicanPlatform:main on Jan 27, 2025
22 checks passed