Retry connection if CSRF crumb retrieval results in HTTP 5xx error #571
…h HTTP-5xx error [JENKINS-70501]
Well... can't exactly confirm it worked thanks to this change across a server restart (with custom builds of
...and looped until the server became responsive. At the very least, all expected swarm clients appear online and usable without manual kicking of the tires.
…rror to runs with limited retry counts [JENKINS-70501]
Style note: most of the other code that inspects particular response code values compares them against constants defined in the class for specific codes. My PR just checks for a numeric range.
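To illustrate the difference in style, here is a minimal sketch and not the code from this PR. The class and method names are hypothetical; only `HttpURLConnection.HTTP_UNAVAILABLE` is a real JDK constant.

```java
import java.net.HttpURLConnection;

// Hypothetical illustration; these names do not come from the Swarm Client sources.
public class StatusRangeSketch {

    /** Range-based check: true for any status code from 500 to 599. */
    public static boolean isServerError(int statusCode) {
        return statusCode >= 500 && statusCode <= 599;
    }

    /** Constant-based check: matches exactly one specific code (503). */
    public static boolean isServiceUnavailable(int statusCode) {
        return statusCode == HttpURLConnection.HTTP_UNAVAILABLE;
    }

    public static void main(String[] args) {
        System.out.println(isServerError(502));        // covered by the range
        System.out.println(isServiceUnavailable(502)); // not covered by the single constant
    }
}
```

The range check covers codes that no single constant would catch, at the cost of being less explicit about which responses are expected.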
Ok, now that I do not constrain to
Full cycling report is:
Bump: any chance to get this merged, so I don't have to roll custom JARs for each new-release upgrade? :)
Thanks for the PR!
As reported in https://issues.jenkins.io/browse/JENKINS-70501, while the Jenkins controller is rebooting it responds with the butler page instead of any logical replies until initialization finishes. A Swarm Client that tries to reconnect during that window sometimes fails and stalls.
Quoting from that ticket:
I hope this simple fix will cause the Swarm Client to keep retrying (if retries are enabled by the user) until the server begins responding with the actual CSRF crumb response instead of the "Starting..." HTML page.
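The intended behavior can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual change: the class name, the method name, the `IntSupplier` standing in for the crumb request, and the attempt/delay parameters are all hypothetical.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch; the real Swarm Client retry logic and its options differ.
public class CrumbRetrySketch {

    /**
     * Calls the request until it returns a non-5xx status or the attempt
     * budget is used up, and returns the last status code seen.
     * 'request' is a stand-in for the crumb retrieval call.
     */
    public static int requestUntilNot5xx(IntSupplier request, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        int status = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            status = request.getAsInt();
            boolean serverError = status >= 500 && status <= 599;
            if (!serverError) {
                return status; // the server produced a real answer, stop retrying
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis); // wait before the next attempt
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return status;
                }
            }
        }
        return status; // still 5xx after all attempts
    }
}
```

A caller would treat a returned 5xx as "retries exhausted" and any other code as the server's real reply, to be handled by the existing logic.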
Testing done
At the moment of posting, this is a speculative fix. A build with it is running on a private CI farm to see whether the original error is ever seen again when the server is recycled by automatic package updates or other evils. The farm is busy working, so it will not be restarted for fun in the short term, possibly next week...
I wonder, however, whether the server might pass through other interim states during init, characterized by different error codes and messages if queried at that moment. Maybe a more thorough fix would "remember" the timestamp when a server error was seen, possibly parsing the response content to be sure that this was a server start-up, so that for some time after that discovery any other errors would also lead to a retry.
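A rough sketch of that "remember the timestamp" idea, purely as an illustration: nothing like this exists in the PR, the class and method names are hypothetical, the length of the grace period is an arbitrary assumption, and response content parsing is left out.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a start-up grace window; not part of this PR.
public class StartupGraceSketch {
    private final Duration grace;
    private Instant lastServerError; // null until a 5xx response has been seen

    public StartupGraceSketch(Duration grace) {
        this.grace = grace;
    }

    /** Decides whether a response with this status, observed at 'now', should be retried. */
    public boolean shouldRetry(int status, Instant now) {
        if (status >= 500 && status <= 599) {
            lastServerError = now; // remember when the server last looked like it was starting
            return true;
        }
        boolean otherError = status >= 400;
        // Other errors are retried only while the grace window after a 5xx is still open.
        return otherError
                && lastServerError != null
                && Duration.between(lastServerError, now).compareTo(grace) <= 0;
    }
}
```

With a 60-second window, a 404 seen 30 seconds after a 503 would be retried, while the same 404 two minutes later would be treated as a real error.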
So far I have no idea for automated testing for this situation and do not intend to spend time on that, but others are welcome to chime in with code :)
Submitter checklist