-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry client requests to Director if we detect the Director is rebooting #1890
Merged
jhiemstrawisc
merged 21 commits into
PelicanPlatform:main
from
jhiemstrawisc:issue-1389
Jan 27, 2025
Merged
Retry client requests to Director if we detect the Director is rebooting #1890
jhiemstrawisc
merged 21 commits into
PelicanPlatform:main
from
jhiemstrawisc:issue-1389
Jan 27, 2025
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
When the director restarts, - Detect whether a response is coming from Pelican or a SSL terminator application. In the latter case, retry the request a few times to allow the director to restart. - If the director has recently restarted, instead of sending a 404, return a 429 (too many requests) indicating the client should retry again soon.
…s` param This sets a hidden parameter called `Client.IsPlugin` whenever the binary is named "pelican_xfer_plugin" or "pelican_plugin", or any time the `stashPluginMain` function is executed. Some actions in our client may soon behave differently if they detect they're running in plugin mode because we assume plugin failures are generally more expensive.
Now the client will try to detect two situations that may call for retries: 1) The "director's" response is missing the `Server: pelican/` header AND there's a status code indicating some error. This means we're likely hitting a proxy and should retry until the Server header reappears 2) The Server header is correct, but the Director responds with an HTTP 429 (too many requests) indicating it just rebooted and can't issue the requested redirect, perhaps because it's waiting for a service to re-advertise In both cases, the retry logic uses a configurable number of retries (where plugin detection doubles the number of retries), and the retry frequency is backed off with each unsuccessful attempt.
jhiemstrawisc
added
enhancement
New feature or request
client
Issue affecting the OSDF client
director
Issue relating to the director component
labels
Jan 13, 2025
Prior to the introduction of the pelican-server binary, pelican was always statically-linked. However, pelican-sever needs to be dynamic so it can access the lotman shared library. This caused an issue in our production cache containers because the pelican-server binary was built in an alpine container (inherited from the old goreleaser container) which uses libmusl for linking, but then copied into an Alma container in the final stage, where libmusl is unavailable. This this fix, the base container used to build pelican binaries is switched from the goreleaser container to a raw alma9 container. A side effect of this is that we now have to install a few things we previously didn't. Another thing I can't readily explain is that making this change appears to have altered the path these binaries are built under from `/pelican/dist/pelican_linux_amd64_v1/pelican` to `/pelican/dist/linux_amd64/pelican_linux_amd64_v1/pelican` This doesn't appear to be an issue and was accounted for by changing a few paths in the Dockerfile. Famous last words?
jhiemstrawisc
force-pushed
the
issue-1389
branch
from
January 15, 2025 21:18
e4f1573
to
0a39bd4
Compare
Finally, tests all passing. This one should be ready for a review! |
turetske
reviewed
Jan 22, 2025
turetske
reviewed
Jan 22, 2025
turetske
approved these changes
Jan 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly approved with some comment request changes.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
client
Issue affecting the OSDF client
director
Issue relating to the director component
enhancement
New feature or request
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR tries to improve our clients' behavior in the event that the Director is being rebooted, or was recently rebooted and has not yet repopulated its cache of server ads.
The primary issue we're trying to work around is that our production OSDF Director lives behind a Traefik ingress proxy. When the Director is rebooting, the ingress proxy still responds to Director requests, but almost always with a 404, 500, or 502. If the client gets an error response and but isn't actually talking to the Director, it should follow a backoff retry strategy to give the Director time to finish rebooting.
On the other hand, the Director may have recently rebooted, but hasn't had time to receive all the server advertisements it needs to issue the correct redirect. In this case, the Director detects that it recently rebooted and sends an HTTP 429 (too many requests) to the client. The client detects this and starts retrying while it waits for the Director to receive the needed server ads.
The core change here is that the Director now sends a
Server
header to clients, which lets clients determine who they're talking to and what they should do in the event they encounter certain errors.This is mostly built on top of Brian's draft PR #1565, and this PR should replace that one.