Retry connection if CSRF crumb retrieval results in HTTP 5xx error #571

jimklimov · 2023-06-30T22:46:36Z

As reported in https://issues.jenkins.io/browse/JENKINS-70501, while the Jenkins controller is restarting it responds with the butler page instead of a real reply until initialization completes. If the Swarm Client tries to reconnect during that window, it sometimes fails and stalls.

Quoting from that ticket:

When the controller was last restarting, some swarm agents never re-appeared. Some hours after that I logged in to check on them, and found the last logged interaction was the HTML response about "Starting Jenkins"; a retry after that never happened:

# Tail of journal:
Jan 26 14:10:27 nutci-debian-11-amd64 swarm-client-nutci.sh[153]: INFO: Retrying in 10 seconds
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]: Jan 26, 2023 2:10:37 PM hudson.plugins.swarm.Client run
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]: INFO: Attempting to connect to https://ci.networkupstools.org/
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]: Jan 26, 2023 2:10:37 PM hudson.plugins.swarm.SwarmClient getCsrfCrumb
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]: SEVERE: Could not obtain CSRF crumb. Response code: 503
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]:
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]:
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]:     <!DOCTYPE html><html lang="en"><head resURL="/static/6728fa46" data-rooturl="" data-resurl="/static/6728fa46" data-imagesurl="/static/6728fa46/images"><title>Starting Jenkins</title><meta name="ROBOTS" content="NOFOLLOW"><meta name="viewport" content="width=device-width, initial-scale=1"><link rel="stylesheet" href="/static/6728fa46/jsbundles/simple-page.css" type="text/css"><link rel="stylesheet" href="/static/6728fa46/css/loading.css" type="text/css"></head><body><div class="simple-page" role="main"><div class="modal signup"><div class="signupIntroDefault"><div class="logo"><img src="/static/6728fa46/images/svgs/logo.svg" alt="Jenkins logo"></div><h1 class="loading">
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]:                             Please wait while Jenkins is getting ready to work
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]:                             <span>.</span><span>.</span><span>.</span></h1><p class="restarting">Your browser will reload automatically when Jenkins is ready.</div></div></div><script src="/static/6728fa46/scripts/loading.js" type="text/javascript"></script></body></html>
^C

:; date
Thu Jan 26 17:36:21 UTC 2023

This is a large inconvenience: to recover I have to log into the workers or have a way to reboot them, and that is only once I notice they are AWOL at all. In the meantime, the CI farm is under-powered: the machines run, but they provide no executors.

I hope this simple fix causes the Swarm Client to keep retrying (if retries are enabled by the user) until the server starts responding with an actual CSRF crumb instead of the "Starting..." HTML page.
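A minimal sketch of the behavior described above, not the actual patch: a 5xx status on the crumb request is treated as temporary, and the connection attempt is repeated after a delay. All names here (`CrumbRetrySketch`, `RetryableException`, `checkCrumbResponse`, `connectWithRetry`) are hypothetical and chosen for illustration; the plugin has its own `RetryException` and client loop, and the "zero or negative cap means retry forever" convention is an assumption of this sketch.

```java
import java.util.function.IntSupplier;

/** Illustrative only: hypothetical names, not the Swarm plugin's real classes. */
final class CrumbRetrySketch {

    /** Hypothetical stand-in for an exception that tells the caller to retry. */
    static final class RetryableException extends Exception {
        RetryableException(String message) {
            super(message);
        }
    }

    /** Treats any 5xx status from the crumb request as a temporary condition. */
    static void checkCrumbResponse(int statusCode) throws RetryableException {
        if (statusCode >= 500 && statusCode < 600) {
            throw new RetryableException(
                    "Failed to obtain CSRF crumb. Response code: " + statusCode);
        }
    }

    /**
     * Polls crumbStatus until it yields a non-5xx code and returns the number
     * of attempts made. maxAttempts <= 0 means "retry forever" (an assumption
     * of this sketch, not a documented plugin option).
     */
    static int connectWithRetry(IntSupplier crumbStatus, int maxAttempts, long delayMillis) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                checkCrumbResponse(crumbStatus.getAsInt());
                return attempts;
            } catch (RetryableException e) {
                if (maxAttempts > 0 && attempts >= maxAttempts) {
                    throw new IllegalStateException(
                            "Giving up after " + attempts + " attempts", e);
                }
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    // Restore the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
            }
        }
    }
}
```

In this sketch the status supplier stands in for the HTTP request itself, so the loop can be exercised without a server, for example with a supplier that returns 503 twice and then 200.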

Testing done

At the time of posting this is a speculative fix. A build is running on a private CI farm (which is busy working, so it will not be restarted just for fun in the short term, possibly next week...) to see whether the original error is ever seen again when the server is recycled by auto-packaging updates or other evils.

I wonder, however, whether the server might pass through other interim states during init, characterized by different error codes and messages if queried at that moment. Maybe a more thorough fix would "remember" the timestamp when a server error was seen (possibly parsing the response content to be sure that this was a server start-up), so that for some time after that discovery any other errors would also lead to a retry.
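The "remember the timestamp" idea could be sketched roughly as follows. This is not part of the PR: `StartupGraceWindow`, `shouldRetry`, and the choice to treat any 4xx inside the window as retryable are assumptions made for illustration, and the response-content parsing mentioned above is left out.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative only: a hypothetical helper, not code from the Swarm plugin. */
final class StartupGraceWindow {
    private final Duration window;
    private Instant lastServerError; // null until the first 5xx is seen

    StartupGraceWindow(Duration window) {
        this.window = window;
    }

    /**
     * Decides whether a response with the given status should lead to a retry.
     * A 5xx always does and (re)starts the grace window; a 4xx does only while
     * the window opened by the most recent 5xx is still running.
     */
    boolean shouldRetry(int statusCode, Instant now) {
        if (statusCode >= 500 && statusCode < 600) {
            lastServerError = now;
            return true;
        }
        if (statusCode >= 400 && lastServerError != null) {
            Duration sinceError = Duration.between(lastServerError, now);
            return sinceError.compareTo(window) <= 0;
        }
        return false;
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic; a real client would more likely read a clock internally. How long the window should be is an open question that this sketch does not answer.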

So far I have no idea how to test this situation automatically and do not intend to spend time on that, but others are welcome to chime in with code :)

Submitter checklist

Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
Ensure that the pull request title represents the desired changelog entry
Please describe what you did
Link to relevant issues in GitHub or Jira
Link to relevant pull requests, esp. upstream and downstream changes
Ensure you have provided tests - that demonstrates feature works or fixes the issue
Reported on manual testing

jimklimov · 2023-07-01T19:08:01Z

Well... I can't exactly confirm that it was this change that made it work across a server restart (with custom builds of swarm-client.jar trying to connect), but it did work, in this form:

Jul 01, 2023 7:01:00 PM hudson.plugins.swarm.SwarmClient getCsrfCrumb
SEVERE: Could not obtain CSRF crumb. Response code: 503



    <!DOCTYPE html><html lang="en"><head resURL="/static/cfe2af1c" data-rooturl="" data-resurl="/static/cfe2af1c" data-imagesurl="/static/cfe2af1c/images"><title>Restarting Jenkins</title><meta name="ROBOTS" content="NOFOLLOW"><meta name="viewport" content="width=device-width, initial-scale=1"><link rel="stylesheet" href="/static/cfe2af1c/jsbundles/simple-page.css" type="text/css"><link rel="stylesheet" href="/static/cfe2af1c/css/loading.css" type="text/css"></head><body><div class="simple-page" role="main"><div class="modal signup"><div class="signupIntroDefault"><div class="logo"><img src="/static/cfe2af1c/images/svgs/logo.svg" alt="Jenkins logo"></div><h1 class="loading">
                            Please wait while Jenkins is restarting
                            <span>.</span><span>.</span><span>.</span></h1><p class="restarting">Your browser will reload automatically when Jenkins is ready.</div></div></div><script src="/static/cfe2af1c/scripts/loading.js" type="text/javascript"></script></body></html>
Jul 01, 2023 7:01:00 PM hudson.plugins.swarm.Client run
SEVERE: An error occurred
hudson.plugins.swarm.RetryException: Failed to create a Swarm agent on Jenkins. Response code: 403
...

...and looped until the server became responsive.

At the very least, all expected swarm clients appear online and usable without manual kicking of the tires.

jimklimov · 2023-07-02T13:08:49Z

Style note: most of the other code that inspects particular response code values uses named constants in the class for those codes. My PR just checks for the numeric range [500, 600) as a server-side error, which matches the HTTP definition but does not integrate neatly with Java abstractions. If desired, the check can be reduced to a comparison against a number of named constants instead of "magic numbers", but I'm not sure that would be truly beneficial :)
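To make the trade-off concrete, here is a small sketch of both styles side by side. It is not the PR's code: `ServerErrorCheckSketch` and its method names are hypothetical, and using the constants from `java.net.HttpURLConnection` is an assumption of this example rather than what the plugin's class defines.

```java
import java.net.HttpURLConnection;
import java.util.Set;

/** Illustrative only: compares a numeric range check with named constants. */
final class ServerErrorCheckSketch {

    /** Range check: every status in [500, 600) counts as a server error. */
    static boolean isServerErrorByRange(int code) {
        return code >= 500 && code < 600;
    }

    /** The 5xx codes that java.net.HttpURLConnection exposes as constants. */
    private static final Set<Integer> NAMED_SERVER_ERRORS = Set.of(
            HttpURLConnection.HTTP_INTERNAL_ERROR,  // 500
            HttpURLConnection.HTTP_NOT_IMPLEMENTED, // 501
            HttpURLConnection.HTTP_BAD_GATEWAY,     // 502
            HttpURLConnection.HTTP_UNAVAILABLE,     // 503
            HttpURLConnection.HTTP_GATEWAY_TIMEOUT, // 504
            HttpURLConnection.HTTP_VERSION);        // 505

    /** Named-constant check: only the listed codes count as a server error. */
    static boolean isServerErrorByName(int code) {
        return NAMED_SERVER_ERRORS.contains(code);
    }
}
```

The two checks agree on 503, but the named-constant version silently misses any 5xx code that has no constant in the chosen class (507, for example), which is one argument for keeping the range check.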

jimklimov · 2023-07-02T13:24:30Z

OK, now that the retry is no longer constrained to runs with a limited options.retry count, it is confirmed to work better: it shows the exception text I added :)

hudson.plugins.swarm.RetryException: Failed to obtain CSRF crumb due to an Internal Server Error or similar condition. Response code: 503

Full cycling report is:

....
Jul 02, 2023 1:22:05 PM hudson.plugins.swarm.SwarmClient getCsrfCrumb
SEVERE: Could not obtain CSRF crumb. Response code: 503



    <!DOCTYPE html><html lang="en"><head resURL="/static/0b0dddbc" data-rooturl="" data-resurl="/static/0b0dddbc" data-imagesurl="/static/0b0dddbc/images"><title>Starting Jenkins</title><meta name="ROBOTS" content="NOFOLLOW"><meta name="viewport" content="width=device-width, initial-scale=1"><link rel="stylesheet" href="/static/0b0dddbc/jsbundles/simple-page.css" type="text/css"><link rel="stylesheet" href="/static/0b0dddbc/css/loading.css" type="text/css"></head><body><div class="simple-page" role="main"><div class="modal signup"><div class="signupIntroDefault"><div class="logo"><img src="/static/0b0dddbc/images/svgs/logo.svg" alt="Jenkins logo"></div><h1 class="loading">
                            Please wait while Jenkins is getting ready to work
                            <span>.</span><span>.</span><span>.</span></h1><p class="restarting">Your browser will reload automatically when Jenkins is ready.</div></div></div><script src="/static/0b0dddbc/scripts/loading.js" type="text/javascript"></script></body></html>
Jul 02, 2023 1:22:05 PM hudson.plugins.swarm.Client run
SEVERE: An error occurred
hudson.plugins.swarm.RetryException: Failed to obtain CSRF crumb due to an Internal Server Error or similar condition. Response code: 503
        at hudson.plugins.swarm.SwarmClient.getCsrfCrumb(SwarmClient.java:304)
        at hudson.plugins.swarm.SwarmClient.createSwarmAgent(SwarmClient.java:362)
        at hudson.plugins.swarm.Client.run(Client.java:193)
        at hudson.plugins.swarm.Client.main(Client.java:68)

Jul 02, 2023 1:22:05 PM hudson.plugins.swarm.Client run
INFO: Retrying in 10 seconds
...

jimklimov · 2023-10-25T15:32:38Z

Bump: any chance to get this merged, so I don't have to roll custom JARs for each new-release upgrade? :)

basil

Thanks for the PR!

jimklimov requested a review from a team as a code owner June 30, 2023 22:46

jimklimov changed the title ~~SwarmClient.java: retry connection if CSRF Crumb was not received with HTTP-5xx error~~ SwarmClient.java: retry connection if CSRF Crumb was not-received with HTTP-5xx error Jun 30, 2023

SwarmClient.java: retry connection if CSRF Crumb was not received with HTTP-5xx error [JENKINS-70501]

9b6e659

jimklimov force-pushed the JENKINS-70501 branch from e4e7871 to 9b6e659 Compare June 30, 2023 23:41

SwarmClient.java: do not constrain CSRF crumb retry due to HTTP-5xx error to runs with limited retry counts [JENKINS-70501]

1362f3a

jimklimov mentioned this pull request Jul 11, 2023

Migrate from Apache HttpComponents Client to Java Platform HTTP client #493

Merged

jimklimov added 3 commits August 26, 2023 20:41

Merge remote-tracking branch 'upstream/master' into JENKINS-70501

fa03517

Merge remote-tracking branch 'upstream/master' into JENKINS-70501

1e536f4

Merge remote-tracking branch 'upstream/master' into JENKINS-70501

a834312

jimklimov added 2 commits November 23, 2023 10:43

Merge remote-tracking branch 'upstream/master' into JENKINS-70501

fd2127e

Merge remote-tracking branch 'upstream/master' into JENKINS-70501

d3837c2

basil approved these changes Jan 5, 2024

basil merged commit ad27563 into jenkinsci:master Jan 5, 2024
16 checks passed

basil changed the title ~~SwarmClient.java: retry connection if CSRF Crumb was not-received with HTTP-5xx error~~ Retry connection if CSRF crumb retrieval results in HTTP 5xx error Jan 5, 2024

basil added the bug label Jan 5, 2024

jimklimov deleted the JENKINS-70501 branch January 11, 2024 13:33
