Retry client requests to Director if we detect the Director is rebooting #1890

Merged Jan 27, 2025 (21 commits)

Conversation

jhiemstrawisc
Member

This PR tries to improve our clients' behavior in the event that the Director is being rebooted, or was recently rebooted and has not yet repopulated its cache of server ads.

The primary issue we're trying to work around is that our production OSDF Director lives behind a Traefik ingress proxy. When the Director is rebooting, the ingress proxy still responds to Director requests, but almost always with a 404, 500, or 502. If the client gets an error response but isn't actually talking to the Director, it should follow a backoff retry strategy to give the Director time to finish rebooting.

On the other hand, the Director may have recently rebooted, but hasn't had time to receive all the server advertisements it needs to issue the correct redirect. In this case, the Director detects that it recently rebooted and sends an HTTP 429 (too many requests) to the client. The client detects this and starts retrying while it waits for the Director to receive the needed server ads.
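To make that second scenario concrete, here's a minimal Go sketch of the Director-side behavior described above. This is illustrative, not Pelican's actual handler code; `startupGracePeriod`, `adExists`, and the grace-period length are assumptions.

```go
package director // hypothetical package name, not Pelican's actual layout

import (
	"net/http"
	"time"
)

var startTime = time.Now() // when this Director process booted

const startupGracePeriod = 5 * time.Minute // assumed re-advertisement window

// redirectHandler sketches the 429 behavior described above. If the Director
// restarted recently and has no ad for the requested object, it answers 429
// so clients retry instead of treating a transient 404 as fatal.
func redirectHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Server", "pelican/7.13.0") // lets clients identify a real Director

	if !adExists(r.URL.Path) {
		if time.Since(startTime) < startupGracePeriod {
			w.Header().Set("Retry-After", "10") // hint: retry in ~10 seconds
			http.Error(w, "director recently restarted; retry shortly", http.StatusTooManyRequests)
			return
		}
		http.NotFound(w, r)
		return
	}
	// ... otherwise issue the usual redirect to a cache or origin ...
}

// adExists stands in for the Director's server-ad cache lookup.
func adExists(objectPath string) bool { return false }
```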

The core change here is that the Director now sends a Server header to clients, which lets clients determine who they're talking to and what they should do in the event they encounter certain errors.
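As a rough sketch of how a client might act on that header (this helper is illustrative, not the PR's actual code):

```go
package client // hypothetical package name

import (
	"net/http"
	"strings"
)

// shouldRetryDirectorResponse sketches the client-side check described above.
// A response lacking the "Server: pelican/..." header but carrying an error
// status likely came from the ingress proxy while the Director reboots; a
// 429 from a real Director means it is still waiting on server ads.
func shouldRetryDirectorResponse(resp *http.Response) bool {
	fromPelican := strings.HasPrefix(resp.Header.Get("Server"), "pelican/")
	if !fromPelican && resp.StatusCode >= 400 {
		return true // proxy answering on the Director's behalf
	}
	if fromPelican && resp.StatusCode == http.StatusTooManyRequests {
		return true // Director up, but ads not yet repopulated
	}
	return false
}
```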

This is mostly built on top of Brian's draft PR #1565, and this PR should replace that one.

bbockelm and others added 3 commits January 10, 2025 21:48
When the director restarts,
- Detect whether a response is coming from Pelican or an SSL terminator application.
  In the latter case, retry the request a few times to allow the director to restart.
- If the director has recently restarted, instead of sending a 404, return a 429 (Too
  Many Requests) indicating the client should retry again soon.
…s` param

This sets a hidden parameter called `Client.IsPlugin` whenever the binary is
named "pelican_xfer_plugin" or "pelican_plugin", or any time the `stashPluginMain`
function is executed. Some actions in our client may soon behave differently if
they detect they're running in plugin mode because we assume plugin failures are
generally more expensive.
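A minimal sketch of the binary-name half of that detection (the function name `isPluginInvocation` is hypothetical; `stashPluginMain` and the binary names come from the commit message above):

```go
package client // hypothetical package name

import (
	"os"
	"path/filepath"
)

// isPluginInvocation sketches the binary-name check described above; the
// real code also flips Client.IsPlugin whenever stashPluginMain runs.
func isPluginInvocation() bool {
	switch filepath.Base(os.Args[0]) {
	case "pelican_xfer_plugin", "pelican_plugin":
		return true
	}
	return false
}
```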
Now the client will try to detect two situations that may call for retries:
1) The "director's" response is missing the `Server: pelican/` header AND
there's a status code indicating some error. This means we're likely hitting
a proxy and should retry until the Server header reappears
2) The Server header is correct, but the Director responds with an HTTP 429
(too many requests) indicating it just rebooted and can't issue the requested
redirect, perhaps because it's waiting for a service to re-advertise

In both cases, the retry logic uses a configurable number of retries (where plugin
detection doubles the number of retries), and the retry frequency is backed off
with each unsuccessful attempt.
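Putting both cases together, the described behavior amounts to a loop along these lines. This is a sketch under assumed defaults (base delay, function names), not the PR's implementation:

```go
package client // hypothetical package name

import (
	"errors"
	"net/http"
	"strings"
	"time"
)

// queryDirectorWithRetry sketches the loop described above. It retries when
// the response looks like it came from a proxy (no "Server: pelican/" header
// plus an error status) or when the Director itself answers 429, doubling
// the wait between attempts.
func queryDirectorWithRetry(c *http.Client, req *http.Request, maxRetries int, isPlugin bool) (*http.Response, error) {
	if isPlugin {
		maxRetries *= 2 // plugin failures are assumed to be more expensive
	}
	delay := 500 * time.Millisecond // assumed base delay
	for attempt := 0; attempt <= maxRetries; attempt++ {
		resp, err := c.Do(req)
		if err == nil {
			fromPelican := strings.HasPrefix(resp.Header.Get("Server"), "pelican/")
			proxyError := !fromPelican && resp.StatusCode >= 400
			rebooting := fromPelican && resp.StatusCode == http.StatusTooManyRequests
			if !proxyError && !rebooting {
				return resp, nil // a genuine answer from the Director
			}
			resp.Body.Close() // discard the transient error response
		}
		time.Sleep(delay)
		delay *= 2 // back off with each unsuccessful attempt
	}
	return nil, errors.New("director did not recover within the retry budget")
}
```

A real implementation would also need to re-create any request body between attempts and cap the total wait; the sketch assumes a body-less GET.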
@jhiemstrawisc added this to the v7.13.0 milestone on Jan 13, 2025
@jhiemstrawisc added the enhancement (New feature or request), client (Issue affecting the OSDF client), and director (Issue relating to the director component) labels on Jan 13, 2025
@jhiemstrawisc linked an issue (Improve client handling around director outages) on Jan 13, 2025 that may be closed by this pull request
jhiemstrawisc and others added 13 commits January 15, 2025 21:14
Prior to the introduction of the pelican-server binary, pelican was
always statically linked. However, pelican-server needs to be dynamically
linked so it can access the lotman shared library.

This caused an issue in our production cache containers because the
pelican-server binary was built in an alpine container (inherited from
the old goreleaser container) which uses libmusl for linking, but then
copied into an Alma container in the final stage, where libmusl is
unavailable.

With this fix, the base container used to build pelican binaries is switched
from the goreleaser container to a raw alma9 container. A side effect of this
is that we now have to install a few things we previously didn't. Another
thing I can't readily explain is that making this change appears to have
altered the path these binaries are built under from
`/pelican/dist/pelican_linux_amd64_v1/pelican`
to
`/pelican/dist/linux_amd64/pelican_linux_amd64_v1/pelican`

This doesn't appear to be an issue and was accounted for by changing a few
paths in the Dockerfile. Famous last words?
@jhiemstrawisc
Member Author

Finally, tests all passing. This one should be ready for a review!

Collaborator

@turetske left a comment


Mostly approved, with some requested changes noted in the comments.

@jhiemstrawisc merged commit bc0b7bf into PelicanPlatform:main on Jan 27, 2025
22 checks passed