[JENKINS-72163] Retry on initial connection failure occurs in one entrypoint but not the other #675
Conversation
```diff
-        } catch (Exception e) {
+        } catch (IOException e) {
             if (Boolean.getBoolean(Engine.class.getName() + ".nonFatalJnlpAgentEndpointResolutionExceptions")) {
                 events.status("Could not resolve JNLP agent endpoint", e);
```
Minor cleanup while I was here. The original code was catching any exception type, including `InterruptedException` (which it made no attempt to handle). Following generic advice such as that given on this page, prefer specific exceptions in catch blocks.
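To illustrate the general point outside of Remoting, here is a minimal, self-contained sketch (a hypothetical example, not the actual Engine code) of narrowing a broad catch to the specific checked exception:

```java
import java.io.IOException;
import java.net.InetAddress;

// Minimal sketch (not Remoting code): prefer a specific exception type in the
// catch block so that unrelated failures such as InterruptedException are not
// silently swallowed by a broad catch (Exception e).
public class ResolveExample {
    static InetAddress resolve(String host) throws IOException {
        try {
            return InetAddress.getByName(host);   // throws UnknownHostException, an IOException
        } catch (IOException e) {                 // was effectively: catch (Exception e)
            System.err.println("Could not resolve endpoint: " + e);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(resolve(args.length > 0 ? args[0] : "localhost"));
    }
}
```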
The hardcoded 10-second sleep doesn't seem ideal, but all the pre-existing code does it, so 👍
And thanks for the follow-up PR at #676 that fixes the hardcoded sleep issue.
Thanks!
```java
            if (Boolean.getBoolean(Engine.class.getName() + ".nonFatalJnlpAgentEndpointResolutionExceptions")) {
                events.status("Could not resolve JNLP agent endpoint", e);
```
Note to readers excluding the author (not a review): this property derives from #449, where SwarmClient wraps `hudson.remoting.jnlp.Main` in-JVM and thus the call to `System.exit` would be “fatal”. For normal uses of Remoting, the distinction is merely between logging a stack trace at `SEVERE` and exiting with -1 vs. logging a stack trace at `INFO` and exiting with 0 (but still effectively treating the error as fatal).
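As a rough illustration of that distinction, here is a hedged sketch (a hypothetical helper, not the actual Engine code; it assumes `Engine.class.getName()` resolves to `hudson.remoting.Engine`) of how the property selects between the two outcomes described above:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch, not the actual hudson.remoting.Engine code: the system
// property only changes the log level and the exit code; the process ends in
// either case, so the resolution failure is still effectively fatal.
public class EndpointResolutionExit {
    private static final Logger LOGGER = Logger.getLogger(EndpointResolutionExit.class.getName());

    static void handleResolutionFailure(Exception cause) {
        boolean nonFatal = Boolean.getBoolean(
                "hudson.remoting.Engine.nonFatalJnlpAgentEndpointResolutionExceptions");
        if (nonFatal) {
            // Swarm-style use: log quietly and exit "non-fatally"
            LOGGER.log(Level.INFO, "Could not resolve JNLP agent endpoint", cause);
            System.exit(0);
        } else {
            // Normal Remoting use: log loudly and exit fatally
            LOGGER.log(Level.SEVERE, "Could not resolve JNLP agent endpoint", cause);
            System.exit(-1);
        }
    }

    public static void main(String[] args) {
        handleResolutionFailure(new java.io.IOException("controller not reachable"));
    }
}
```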
> thus the call to System.exit would be “fatal”

Not sure why scare quotes were used here, but yes, a call to `System.exit` is fatal to the life of the process.
Context
See JENKINS-72163.
Problem
I talked to a user running Kubernetes agents in a cluster where the controller was not immediately reachable over the network after spinning up the agent. Rather, it took 30 seconds or so for the controller to become reachable over the network. While admitting this networking setup was not ideal, the user expected Remoting to be resilient to this scenario, but it was not. Instead, Remoting printed the following exception and then terminated with a non-zero exit code, never trying again:
Evaluation
There are two `public static void main()` entrypoints into Remoting: `hudson.remoting.Launcher` (used by `java -jar remoting.jar -jnlpUrl <…>`) and `hudson.remoting.jnlp.Main` (used by `java -cp remoting.jar hudson.remoting.jnlp.Main <…> -url <…>`), which was the entrypoint being used by this user. The first entrypoint is a thin wrapper around the second when `-jnlpUrl` is passed in, and if the controller is not available it keeps retrying every 10 seconds (unless `-noReconnect` is specified) until the controller is available before vectoring into the second entrypoint. If the connection is interrupted after it is established and the controller is not immediately available for reconnection, we again retry every 10 seconds (unless `-noReconnect` is specified). But there is a gap in retry coverage: if the second entrypoint is invoked directly (rather than via the first entrypoint) and the controller is not available at the time the initial connection is made, as was the case with this user, no retries will occur.
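For orientation, here is a rough sketch of the retry shape the first entrypoint already has (hypothetical names and structure; the real `hudson.remoting.Launcher` logic is more involved):

```java
import java.util.concurrent.TimeUnit;

// Rough sketch of the pre-existing retry behavior in the first entrypoint
// (hypothetical names, not the actual hudson.remoting.Launcher code): keep
// polling the controller every 10 seconds until it answers, unless
// -noReconnect was passed, in which case fail immediately.
public class FirstEntrypointSketch {
    interface ControllerProbe {
        boolean isReachable();
    }

    static void waitForController(ControllerProbe probe, boolean noReconnect)
            throws InterruptedException {
        while (!probe.isReachable()) {
            if (noReconnect) {
                throw new IllegalStateException("Controller unavailable and -noReconnect was given");
            }
            System.err.println("Controller not reachable yet; retrying in 10 seconds");
            TimeUnit.SECONDS.sleep(10);
        }
        // ...vector into the second entrypoint (hudson.remoting.jnlp.Main) here...
    }
}
```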
Solution
When `-noReconnect` is not specified, sleep the usual interval and loop back around to retry rather than terminating. This is consistent with the behavior when running via the first entrypoint, which does this before vectoring into the second (at which point the controller should already be reachable). It is also consistent with the behavior that occurs after an existing connection is interrupted and the controller is not immediately available for reconnection. When `-noReconnect` is specified, preserve the existing behavior of either terminating fatally under normal scenarios or terminating non-fatally when running under Swarm (to allow Swarm to do its own exponential backoff implementation instead).
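The following sketch shows the shape of this behavior (hypothetical names, not the actual code in this PR):

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Sketch of the behavior described above for the second entrypoint
// (hypothetical names, not the actual hudson.remoting.jnlp.Main / Engine code):
// on an initial connection failure, sleep the usual 10 seconds and loop back
// to retry, unless -noReconnect was given, in which case keep the old
// terminating behavior.
public class SecondEntrypointSketch {
    interface Connector {
        void connect() throws IOException;
    }

    static void runConnectionLoop(Connector connector, boolean noReconnect)
            throws InterruptedException {
        while (true) {
            try {
                connector.connect();
                return; // connected; hand off to the established channel
            } catch (IOException e) {
                if (noReconnect) {
                    // pre-existing behavior: terminate (fatally, or "non-fatally" under Swarm)
                    throw new IllegalStateException("Initial connection failed and -noReconnect was given", e);
                }
                System.err.println("Initial connection failed; retrying in 10 seconds: " + e);
                TimeUnit.SECONDS.sleep(10);
            }
        }
    }
}
```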
Implementation
We have made no attempt to clean up the rather complex control flow and duplicate code in this module. We did, however, leave notes that might aid a future refactoring effort.
Notes to Reviewers
Please review with the Hide Whitespace feature enabled.
Testing Done
Reproduced the problem by shutting down the controller and attempting to connect via the second entrypoint directly. Before this PR, this produced the exception and a fatal termination; after this PR, Remoting slept in 10-second intervals until the controller was back online and the connection succeeded. Also verified that the existing behavior, including the fatal exit code, was preserved when running with `-noReconnect`.