-
-
Notifications
You must be signed in to change notification settings - Fork 9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-46680] Disconnect computer on ping timeout #3005
Conversation
install(channel, null); | ||
} | ||
|
||
@VisibleForTesting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never used in jenkinsci group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirm it's being used in one of private-source plugins. I would not deprecate the old method since there is no replacement in public API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I am just wondering, is it used for SlaveComputer
channel from master's side so it can be migrated to new API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, it's another Remoting channel type. In new API it would be possible to pass null
for sure (would not make it worse), but our use-case would probably require a special callback support in API. In jenkinsci
org there is a similar use-case in Maven Plugin, where we could need a custom Remoting connection termination logic as well, e.g. force termination of the build (CC @aheritier)
We could do it API for sure, but IMHO it's out of the scope of this PR. I would rather prefer t keep it in the current state (just without API deprecation) so we could easily backport it to 2.73.x.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, so let's un-deprecate the old call and keep the new one private so we can give the API design a proper though later.
@@ -163,30 +179,35 @@ protected Object readResolve() { | |||
} | |||
} | |||
|
|||
static void setUpPingForChannel(final Channel channel, int timeoutSeconds, int intervalSeconds, final boolean analysis) { | |||
@VisibleForTesting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never used in jenkinsci group.
} | ||
|
||
static class SetUpRemotePing extends MasterToSlaveCallable<Void, IOException> { | ||
@VisibleForTesting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never used in jenkinsci group.
} | ||
} catch (IOException e) { | ||
LOGGER.log(Level.SEVERE,"Failed to terminate the channel "+channel.getName(),e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The close is now done in disconnect()
where failures are logged instead of thrown. The catch block was reused by failures to analyze
that now does not throw any longer and goes through all the implementations registered.
5729d08
to
e739886
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The approach is fine for me. I would like to spend some time on double checking the impact on locks in the case of Computer#disconnect()
, but it should work fine.
I request a minor change: Do not deprecate ChannelPinger#install()
since there is no public replacement in API
install(channel, null); | ||
} | ||
|
||
@VisibleForTesting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirm it's being used in one of private-source plugins. I would not deprecate the old method since there is no replacement in public API
e739886
to
2aff715
Compare
@oleg-nenashev, the API is back. The only additional lock involved seems to be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will try to test it on my remoting test env if I manage to connect to it remotely, but looks good.
* [JENKINS-46680] Reproduce in unittest * [FIX JENKINS-46680] Reset SlaveComputer channel before closing it on ping timeout * [JENKINS-46680] Attach channel termination offline cause on ping timeouts (cherry picked from commit dbb5e44)
assertNull(slave.getComputer().getChannel()); | ||
assertNull(computer.getChannel()); | ||
} finally { | ||
assert new ProcessBuilder("kill", "-CONT", pid).start().waitFor() == 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@olivergondza this seems to be flaky: #3961 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment remained unacknowledged until #7149 fixed the flaky test 3 years, 5 months, and 18 days after the flakiness was first reported.
See JENKINS-46680.
When ping thread closes the channel because of response time issues / no responses arriving, it rely on the other hand to close the channel fully and most importantly to remove the half-closed channel reference from computer - which might never happen. The ironic part is the computer appears perfectly healthy otherwise and not even the
ResponseTimeMonitor
will notice the connection is broken so the node remains online.Proposed changelog entries
Submitter checklist
* Use the
Internal:
prefix if the change has no user-visible impact (API, test frameworks, etc.)TODO:
ResponseTimeMonitor
.