Retry connection if CSRF crumb retrieval results in HTTP 5xx error #571

Merged on Jan 5, 2024 (7 commits)


jimklimov
Contributor

@jimklimov jimklimov commented Jun 30, 2023

As reported in https://issues.jenkins.io/browse/JENKINS-70501, when the Jenkins controller restarts it answers with the butler ("Starting Jenkins") page instead of real responses while it is initializing. If the Swarm Client tries to reconnect during that window, it sometimes fails and stalls.

Quoting from that ticket:

When the controller was last restarting, some swarm agents never re-appeared. Some hours after that I logged in to check on them, and found the last logged interaction was the HTML response about "Starting Jenkins"; a retry after that never happened:

# Tail of journal:
Jan 26 14:10:27 nutci-debian-11-amd64 swarm-client-nutci.sh[153]: INFO: Retrying in 10 seconds
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]: Jan 26, 2023 2:10:37 PM hudson.plugins.swarm.Client run
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]: INFO: Attempting to connect to https://ci.networkupstools.org/
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]: Jan 26, 2023 2:10:37 PM hudson.plugins.swarm.SwarmClient getCsrfCrumb
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]: SEVERE: Could not obtain CSRF crumb. Response code: 503
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]:
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]:
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]:     <!DOCTYPE html><html lang="en"><head resURL="/static/6728fa46" data-rooturl="" data-resurl="/static/6728fa46" data-imagesurl="/static/6728fa46/images"><title>Starting Jenkins</title><meta name="ROBOTS" content="NOFOLLOW"><meta name="viewport" content="width=device-width, initial-scale=1"><link rel="stylesheet" href="/static/6728fa46/jsbundles/simple-page.css" type="text/css"><link rel="stylesheet" href="/static/6728fa46/css/loading.css" type="text/css"></head><body><div class="simple-page" role="main"><div class="modal signup"><div class="signupIntroDefault"><div class="logo"><img src="/static/6728fa46/images/svgs/logo.svg" alt="Jenkins logo"></div><h1 class="loading">
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]:                             Please wait while Jenkins is getting ready to work
Jan 26 14:10:37 nutci-debian-11-amd64 swarm-client-nutci.sh[153]:                             <span>.</span><span>.</span><span>.</span></h1><p class="restarting">Your browser will reload automatically when Jenkins is ready.</div></div></div><script src="/static/6728fa46/scripts/loading.js" type="text/javascript"></script></body></html>
^C

:; date
Thu Jan 26 17:36:21 UTC 2023

This is a large inconvenience - to recover I have to log into the workers or have a way to reboot them, and that only happens once I notice they are AWOL at all. In the meantime, the CI farm is under-powered - the machines run but provide no executors.

I hope this simple fix will make the Swarm Client keep retrying (if the user enabled retries) until the server starts responding with an actual CSRF crumb instead of the "Starting..." HTML page.
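
A minimal sketch of the intended behaviour, not the actual `SwarmClient.getCsrfCrumb` code: a 5xx response to the crumb request is turned into a retryable exception, so the caller's retry loop keeps going instead of giving up. `CrumbResponseSketch` and `RetryLaterException` are hypothetical stand-ins for illustration, and treating other non-200 codes as "no crumb" is an assumption.

```java
import java.util.Optional;

/** Hypothetical stand-in for the plugin's retry signal; not the real RetryException. */
class RetryLaterException extends RuntimeException {
    RetryLaterException(String message) {
        super(message);
    }
}

class CrumbResponseSketch {
    /**
     * Interprets the HTTP status of a crumb request.
     * 5xx: throw, so the caller retries later (the server may still be starting).
     * 200: return the response body, which would carry the crumb.
     * Anything else: return empty, i.e. proceed without a crumb (assumption).
     */
    static Optional<String> handleCrumbResponse(int statusCode, String body) {
        if (statusCode >= 500 && statusCode < 600) {
            throw new RetryLaterException(
                    "Failed to obtain CSRF crumb due to a server-side error. Response code: "
                            + statusCode);
        }
        if (statusCode != 200) {
            return Optional.empty();
        }
        return Optional.of(body);
    }

    public static void main(String[] args) {
        try {
            handleCrumbResponse(503, "<html>Starting Jenkins</html>");
        } catch (RetryLaterException e) {
            System.out.println("Would retry: " + e.getMessage());
        }
    }
}
```

The point of throwing rather than returning "no crumb" is that a missing crumb lets the client carry on to the next request, which can then fail in a way that is not retried.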

Testing done

At the time of posting, this is a speculative fix. A build is running on a private CI farm to see whether the original error is ever seen again when the server is recycled by auto-packaging updates or other evils. The farm is busy working, so I will not restart it just for fun in the short term; possibly next week.

I wonder, however, whether the server might pass through other interim states during init, characterized by different error codes and messages if queried at that moment. A more thorough fix might "remember" the timestamp when a server error was seen (possibly parsing the response content to be sure that this was a server start-up), so that for some time after that discovery any other errors would also lead to a retry.
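
To make the "remember the timestamp" idea concrete, here is a rough sketch; it is not part of this PR. The class `StartupGraceWindow`, its method name, and the choice of a fixed grace duration are all hypothetical, and response content parsing is left out.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: remembers the last 5xx and widens retries for a while after it. */
class StartupGraceWindow {
    private final Duration grace;
    private Instant lastServerError; // null until a 5xx has been seen

    StartupGraceWindow(Duration grace) {
        this.grace = grace;
    }

    /** Records the status seen at time 'now' and reports whether the caller should retry. */
    boolean shouldRetry(int statusCode, Instant now) {
        if (statusCode >= 500 && statusCode < 600) {
            lastServerError = now; // server looks like it is (re)starting
            return true;
        }
        if (statusCode >= 200 && statusCode < 300) {
            return false; // success, nothing to retry
        }
        // Other errors (e.g. 403) are retried only shortly after a server error.
        return lastServerError != null
                && Duration.between(lastServerError, now).compareTo(grace) <= 0;
    }

    public static void main(String[] args) {
        StartupGraceWindow window = new StartupGraceWindow(Duration.ofMinutes(5));
        Instant start = Instant.now();
        System.out.println(window.shouldRetry(503, start));
        System.out.println(window.shouldRetry(403, start.plusSeconds(60)));
        System.out.println(window.shouldRetry(403, start.plusSeconds(3600)));
    }
}
```

Passing the clock value in as a parameter keeps the sketch testable without real waiting.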

So far I have no idea how to test this situation automatically and do not intend to spend time on that, but others are welcome to chime in with code :)


@jimklimov jimklimov requested a review from a team as a code owner June 30, 2023 22:46
@jimklimov jimklimov changed the title from "SwarmClient.java: retry connection if CSRF Crumb was not received with HTTP-5xx error" to "SwarmClient.java: retry connection if CSRF Crumb was not-received with HTTP-5xx error" Jun 30, 2023
@jimklimov
Contributor Author

jimklimov commented Jul 1, 2023

Well... I can't exactly confirm that this change is the reason it worked across a server restart (with custom builds of swarm-client.jar trying to connect), but it did work in this form:

Jul 01, 2023 7:01:00 PM hudson.plugins.swarm.SwarmClient getCsrfCrumb
SEVERE: Could not obtain CSRF crumb. Response code: 503



    <!DOCTYPE html><html lang="en"><head resURL="/static/cfe2af1c" data-rooturl="" data-resurl="/static/cfe2af1c" data-imagesurl="/static/cfe2af1c/images"><title>Restarting Jenkins</title><meta name="ROBOTS" content="NOFOLLOW"><meta name="viewport" content="width=device-width, initial-scale=1"><link rel="stylesheet" href="/static/cfe2af1c/jsbundles/simple-page.css" type="text/css"><link rel="stylesheet" href="/static/cfe2af1c/css/loading.css" type="text/css"></head><body><div class="simple-page" role="main"><div class="modal signup"><div class="signupIntroDefault"><div class="logo"><img src="/static/cfe2af1c/images/svgs/logo.svg" alt="Jenkins logo"></div><h1 class="loading">
                            Please wait while Jenkins is restarting
                            <span>.</span><span>.</span><span>.</span></h1><p class="restarting">Your browser will reload automatically when Jenkins is ready.</div></div></div><script src="/static/cfe2af1c/scripts/loading.js" type="text/javascript"></script></body></html>
Jul 01, 2023 7:01:00 PM hudson.plugins.swarm.Client run
SEVERE: An error occurred
hudson.plugins.swarm.RetryException: Failed to create a Swarm agent on Jenkins. Response code: 403
...

...and looped until the server became responsive.

At the very least, all expected swarm clients appear online and usable without manual kicking of the tires.
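
For readers unfamiliar with the behaviour being relied on, the "Retrying in 10 seconds" loop seen in the logs can be sketched roughly as follows. This is an illustration under assumptions, not the plugin's `Client.run` code: `RetryLoopSketch`, `connectWithRetries`, and the convention that a negative limit means "retry forever" are all hypothetical.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

class RetryLoopSketch {
    /**
     * Calls 'attempt' until it succeeds, pausing 'delaySeconds' between tries.
     * A negative 'maxRetries' means "retry forever" (an assumption of this sketch).
     * Returns the number of attempts made, or -1 if the retry limit was exhausted.
     */
    static int connectWithRetries(BooleanSupplier attempt, int maxRetries,
                                  long delaySeconds, LongConsumer sleeper) {
        int attempts = 0;
        while (true) {
            attempts++;
            if (attempt.getAsBoolean()) {
                return attempts;
            }
            int retriesUsed = attempts - 1;
            if (maxRetries >= 0 && retriesUsed >= maxRetries) {
                return -1;
            }
            sleeper.accept(delaySeconds);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated server: "starting" for two attempts, then ready.
        BooleanSupplier attempt = () -> ++calls[0] >= 3;
        // Log instead of sleeping so the demo finishes immediately.
        LongConsumer sleeper = s -> System.out.println("Retrying in " + s + " seconds");
        int attempts = connectWithRetries(attempt, -1, 10, sleeper);
        System.out.println("Connected after " + attempts + " attempts");
    }
}
```

The loop only helps if a failure during server start-up reaches it as a retryable condition, which is what this PR is about.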

…rror to runs with limited retry counts [JENKINS-70501]
@jimklimov
Contributor Author

Style note: most of the other code that inspects response codes compares them against named constants defined in the class. My PR just checks for the numeric range [500, 600) as a server-side error, which matches the HTTP definition but does not integrate neatly with Java abstractions. If desired, the check can be changed to compare against a set of named constants instead of "magic numbers", but I'm not sure that would be truly beneficial :)
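
The two styles can be compared side by side in a small sketch. It uses the JDK's `java.net.HttpURLConnection` constants purely for illustration (the plugin's own constants may come from a different class), and `ServerErrorCheck` is a hypothetical name.

```java
import java.net.HttpURLConnection;
import java.util.Set;

class ServerErrorCheck {
    /** Range check: any status in [500, 600) counts as a server-side error. */
    static boolean isServerErrorByRange(int code) {
        return code >= 500 && code < 600;
    }

    /** The 5xx codes for which the JDK provides named constants. */
    private static final Set<Integer> NAMED_SERVER_ERRORS = Set.of(
            HttpURLConnection.HTTP_INTERNAL_ERROR,   // 500
            HttpURLConnection.HTTP_NOT_IMPLEMENTED,  // 501
            HttpURLConnection.HTTP_BAD_GATEWAY,      // 502
            HttpURLConnection.HTTP_UNAVAILABLE,      // 503
            HttpURLConnection.HTTP_GATEWAY_TIMEOUT,  // 504
            HttpURLConnection.HTTP_VERSION);         // 505

    /** Constant-based check: only explicitly listed codes count. */
    static boolean isServerErrorByConstants(int code) {
        return NAMED_SERVER_ERRORS.contains(code);
    }

    public static void main(String[] args) {
        for (int code : new int[] {403, 503, 507}) {
            System.out.println(code + ": range=" + isServerErrorByRange(code)
                    + ", constants=" + isServerErrorByConstants(code));
        }
    }
}
```

The trade-off is that the range check also covers 5xx codes that have no named constant (507, for example), while the constant list is more explicit but has to be maintained.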

@jimklimov
Contributor Author

OK, now that I no longer constrain this to the options.retry count, it is confirmed to work better: it shows the exception text I added :)

hudson.plugins.swarm.RetryException: Failed to obtain CSRF crumb due to an Internal Server Error or similar condition. Response code: 503

Full cycling report is:

....
Jul 02, 2023 1:22:05 PM hudson.plugins.swarm.SwarmClient getCsrfCrumb
SEVERE: Could not obtain CSRF crumb. Response code: 503



    <!DOCTYPE html><html lang="en"><head resURL="/static/0b0dddbc" data-rooturl="" data-resurl="/static/0b0dddbc" data-imagesurl="/static/0b0dddbc/images"><title>Starting Jenkins</title><meta name="ROBOTS" content="NOFOLLOW"><meta name="viewport" content="width=device-width, initial-scale=1"><link rel="stylesheet" href="/static/0b0dddbc/jsbundles/simple-page.css" type="text/css"><link rel="stylesheet" href="/static/0b0dddbc/css/loading.css" type="text/css"></head><body><div class="simple-page" role="main"><div class="modal signup"><div class="signupIntroDefault"><div class="logo"><img src="/static/0b0dddbc/images/svgs/logo.svg" alt="Jenkins logo"></div><h1 class="loading">
                            Please wait while Jenkins is getting ready to work
                            <span>.</span><span>.</span><span>.</span></h1><p class="restarting">Your browser will reload automatically when Jenkins is ready.</div></div></div><script src="/static/0b0dddbc/scripts/loading.js" type="text/javascript"></script></body></html>
Jul 02, 2023 1:22:05 PM hudson.plugins.swarm.Client run
SEVERE: An error occurred
hudson.plugins.swarm.RetryException: Failed to obtain CSRF crumb due to an Internal Server Error or similar condition. Response code: 503
        at hudson.plugins.swarm.SwarmClient.getCsrfCrumb(SwarmClient.java:304)
        at hudson.plugins.swarm.SwarmClient.createSwarmAgent(SwarmClient.java:362)
        at hudson.plugins.swarm.Client.run(Client.java:193)
        at hudson.plugins.swarm.Client.main(Client.java:68)

Jul 02, 2023 1:22:05 PM hudson.plugins.swarm.Client run
INFO: Retrying in 10 seconds
...

@jimklimov
Contributor Author

Bump: any chance to get this merged, so I don't have to roll custom JARs for each new-release upgrade? :)

Member

@basil basil left a comment


Thanks for the PR!

@basil basil merged commit ad27563 into jenkinsci:master Jan 5, 2024
16 checks passed
@basil basil changed the title from "SwarmClient.java: retry connection if CSRF Crumb was not-received with HTTP-5xx error" to "Retry connection if CSRF crumb retrieval results in HTTP 5xx error" Jan 5, 2024
@basil basil added the bug label Jan 5, 2024
@jimklimov jimklimov deleted the JENKINS-70501 branch January 11, 2024 13:33