[JENKINS-46680] Disconnect computer on ping timeout #3005

olivergondza · 2017-09-07T09:33:25Z

When the ping thread closes the channel because of response-time issues or because no responses arrive, it relies on the other side to close the channel fully and, most importantly, to remove the half-closed channel reference from the computer, which might never happen. Ironically, the computer otherwise appears perfectly healthy, and not even the ResponseTimeMonitor notices that the connection is broken, so the node remains online.
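The failure mode described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `FakeChannel`, `FakeComputer`, and both `onPingTimeout...` methods are hypothetical stand-ins invented for illustration, not the actual `Channel`, `SlaveComputer`, or `ChannelPinger` code touched by this PR.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical model of the bug and the fix; none of these types are Jenkins classes. */
public class PingTimeoutSketch {

    /** Stand-in for a remoting channel that has only been closed from one side. */
    static class FakeChannel {
        private boolean outboundClosed;

        void close() {
            outboundClosed = true;
        }

        boolean isOutboundClosed() {
            return outboundClosed;
        }
    }

    /** Stand-in for a computer that is considered online while it holds a channel reference. */
    static class FakeComputer {
        private final AtomicReference<FakeChannel> channel = new AtomicReference<>();
        private String offlineCause;

        void setChannel(FakeChannel c) {
            offlineCause = null;
            channel.set(c);
        }

        FakeChannel getChannel() {
            return channel.get();
        }

        boolean isOnline() {
            return channel.get() != null;
        }

        String getOfflineCause() {
            return offlineCause;
        }

        /** Drops the channel reference first, then closes whatever was there. */
        void disconnect(String cause) {
            offlineCause = cause;
            FakeChannel old = channel.getAndSet(null);
            if (old != null) {
                old.close();
            }
        }
    }

    /** Old behaviour in this model: close the channel and hope the other side cleans up. */
    static void onPingTimeoutCloseOnly(FakeComputer computer) {
        FakeChannel c = computer.getChannel();
        if (c != null) {
            c.close(); // the computer still holds the half-closed channel
        }
    }

    /** Proposed behaviour in this model: disconnect the computer, which resets the reference. */
    static void onPingTimeoutDisconnect(FakeComputer computer) {
        computer.disconnect("Ping timeout");
    }

    public static void main(String[] args) {
        FakeComputer stale = new FakeComputer();
        stale.setChannel(new FakeChannel());
        onPingTimeoutCloseOnly(stale);
        System.out.println("close only -> online: " + stale.isOnline());

        FakeComputer fixed = new FakeComputer();
        fixed.setChannel(new FakeChannel());
        onPingTimeoutDisconnect(fixed);
        System.out.println("disconnect -> online: " + fixed.isOnline()
                + ", cause: " + fixed.getOfflineCause());
    }
}
```

In this model the "close only" path leaves the computer reporting itself as online because the reference is never cleared, while the "disconnect" path clears the reference and records a cause.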

Proposed changelog entries

Disconnect computer on ping timeout.

Submitter checklist

JIRA issue is well described
Changelog entry appropriate for the audience affected by the change (users or developer, depending on the change). Examples
* Use the Internal: prefix if the change has no user-visible impact (API, test frameworks, etc.)
Appropriate autotests or explanation to why this change has no tests

TODO:

Reproduce
Provide fix
Reuse the offline cause from ResponseTimeMonitor.

olivergondza · 2017-09-07T09:34:26Z

core/src/main/java/hudson/slaves/ChannelPinger.java

+        install(channel, null);
+    }
+
+    @VisibleForTesting


Never used in jenkinsci group.

I confirm it's being used in one of the private-source plugins. I would not deprecate the old method since there is no replacement in the public API

OK. I am just wondering, is it used for SlaveComputer channel from master's side so it can be migrated to new API?

Nope, it's another Remoting channel type. In the new API it would certainly be possible to pass null (it would not make things worse), but our use case would probably require special callback support in the API. In the jenkinsci org there is a similar use case in the Maven Plugin, where we might also need custom Remoting connection termination logic, e.g. forced termination of the build (CC @aheritier)

We could do it in the API for sure, but IMHO it's out of the scope of this PR. I would rather keep it in the current state (just without the API deprecation) so we can easily backport it to 2.73.x.

Alright, so let's un-deprecate the old call and keep the new one private so we can give the API design proper thought later.

olivergondza · 2017-09-07T09:34:33Z

core/src/main/java/hudson/slaves/ChannelPinger.java

@@ -163,30 +179,35 @@ protected Object readResolve() {
        }
    }

-    static void setUpPingForChannel(final Channel channel, int timeoutSeconds, int intervalSeconds, final boolean analysis) {
+    @VisibleForTesting


Never used in jenkinsci group.

olivergondza · 2017-09-07T09:34:42Z

core/src/main/java/hudson/slaves/ChannelPinger.java

    }

-    static class SetUpRemotePing extends MasterToSlaveCallable<Void, IOException> {
+    @VisibleForTesting


Never used in jenkinsci group.

olivergondza · 2017-09-07T09:37:23Z

core/src/main/java/hudson/slaves/ChannelPinger.java

                    }
-                } catch (IOException e) {
-                    LOGGER.log(Level.SEVERE,"Failed to terminate the channel "+channel.getName(),e);


The close is now done in disconnect(), where failures are logged instead of thrown. The catch block was also reused for failures in the analysis step, which no longer throws and now goes through all the registered implementations.
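The "log instead of throw" pattern mentioned in this comment could look roughly like the sketch below. It is an illustration only: the `Analyzer` interface, `analyzeAll`, and `closeQuietly` are hypothetical names made up here and are not the methods changed in `ChannelPinger.java`.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch of logging failures so that every registered step still runs. */
public class DisconnectSketch {
    private static final Logger LOGGER = Logger.getLogger(DisconnectSketch.class.getName());

    /** Assumed callback shape for a connectivity analysis step. */
    interface Analyzer {
        void analyze(String channelName) throws IOException;
    }

    /** Runs every analyzer; a failing one is logged and does not stop the others. */
    static int analyzeAll(List<Analyzer> analyzers, String channelName) {
        int failures = 0;
        for (Analyzer analyzer : analyzers) {
            try {
                analyzer.analyze(channelName);
            } catch (IOException | RuntimeException e) {
                failures++;
                LOGGER.log(Level.WARNING, "Analysis failed for " + channelName, e);
            }
        }
        return failures;
    }

    /** Closes the channel; a failure is logged rather than propagated to the caller. */
    static boolean closeQuietly(Closeable channel, String channelName) {
        try {
            channel.close();
            return true;
        } catch (IOException e) {
            LOGGER.log(Level.SEVERE, "Failed to terminate the channel " + channelName, e);
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        List<Analyzer> analyzers = new ArrayList<>();
        analyzers.add(name -> { throw new IOException("simulated analysis failure"); });
        analyzers.add(seen::add);

        int failures = analyzeAll(analyzers, "agent-1");
        System.out.println("failures=" + failures + ", analyzers that completed=" + seen.size());
    }
}
```

Because nothing propagates out of either helper, the caller no longer needs a surrounding catch block, which matches the removal shown in the diff above.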

oleg-nenashev

The approach is fine for me. I would like to spend some time on double checking the impact on locks in the case of Computer#disconnect(), but it should work fine.

I request a minor change: Do not deprecate ChannelPinger#install() since there is no public replacement in API

oleg-nenashev · 2017-09-07T09:56:12Z

core/src/main/java/hudson/slaves/ChannelPinger.java

+        install(channel, null);
+    }
+
+    @VisibleForTesting


I confirm it's being used in one of the private-source plugins. I would not deprecate the old method since there is no replacement in the public API

olivergondza · 2017-09-07T11:40:48Z

@oleg-nenashev, the API is back. The only additional lock involved seems to be channelLock in Computer#closeChannel() but it seems safe to me.

oleg-nenashev

Will try to test it on my remoting test env if I manage to connect to it remotely, but looks good.

* [JENKINS-46680] Reproduce in unittest
* [FIX JENKINS-46680] Reset SlaveComputer channel before closing it on ping timeout
* [JENKINS-46680] Attach channel termination offline cause on ping timeouts

(cherry picked from commit dbb5e44)

jglick · 2019-04-11T01:37:03Z

test/src/test/java/hudson/slaves/PingThreadTest.java

+            assertNull(slave.getComputer().getChannel());
+            assertNull(computer.getChannel());
+        } finally {
+            assert new ProcessBuilder("kill", "-CONT", pid).start().waitFor() == 0;


@olivergondza this seems to be flaky: #3961 (comment)

This comment remained unacknowledged until #7149 fixed the flaky test 3 years, 5 months, and 18 days after the flakiness was first reported.

[JENKINS-46680] Reproduce in unittest

320174e

olivergondza added the work-in-progress label (The PR is under active development, not ready to the final review) Sep 7, 2017

olivergondza force-pushed the JENKINS-46680 branch from 5729d08 to e739886 on September 7, 2017 09:41

oleg-nenashev requested changes Sep 7, 2017

olivergondza added 2 commits September 7, 2017 12:43

[FIX JENKINS-46680] Reset SlaveComputer channel before closing it on ping timeout

eda35c8

[JENKINS-46680] Attach channel termination offline cause on ping timeouts

2aff715

olivergondza force-pushed the JENKINS-46680 branch from e739886 to 2aff715 on September 7, 2017 11:36

olivergondza added the needs-more-reviews label (Complex change, which would benefit from more eyes) and removed the work-in-progress label Sep 7, 2017

oleg-nenashev approved these changes Sep 7, 2017

oleg-nenashev added the ready-for-merge label (The PR is ready to go, and it will be merged soon if there is no negative feedback) and removed the needs-more-reviews label Sep 15, 2017

oleg-nenashev approved these changes Sep 15, 2017

oleg-nenashev merged commit dbb5e44 into jenkinsci:master Sep 15, 2017

jglick reviewed Apr 11, 2019

jglick mentioned this pull request Jan 11, 2023

[JENKINS-70414] Missing agent-side Channel.close from PingThread.onDead #7580

Merged

6 tasks

basil Jan 12, 2023