Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-46680] Disconnect computer on ping timeout #3005

Merged
merged 3 commits into from
Sep 15, 2017

Conversation

olivergondza
Copy link
Member

@olivergondza olivergondza commented Sep 7, 2017

See JENKINS-46680.

When ping thread closes the channel because of response time issues / no responses arriving, it rely on the other hand to close the channel fully and most importantly to remove the half-closed channel reference from computer - which might never happen. The ironic part is the computer appears perfectly healthy otherwise and not even the ResponseTimeMonitor will notice the connection is broken so the node remains online.

Proposed changelog entries

  • Disconnect computer on ping timeout.

Submitter checklist

  • JIRA issue is well described
  • Changelog entry appropriate for the audience affected by the change (users or developer, depending on the change). Examples
    * Use the Internal: prefix if the change has no user-visible impact (API, test frameworks, etc.)
  • Appropriate autotests or explanation to why this change has no tests

TODO:

  • Reproduce
  • Provide fix
  • Reuse the offline cause from ResponseTimeMonitor.

@olivergondza olivergondza added the work-in-progress The PR is under active development, not ready to the final review label Sep 7, 2017
install(channel, null);
}

@VisibleForTesting
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never used in jenkinsci group.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirm it's being used in one of private-source plugins. I would not deprecate the old method since there is no replacement in public API

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I am just wondering, is it used for SlaveComputer channel from master's side so it can be migrated to new API?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, it's another Remoting channel type. In new API it would be possible to pass null for sure (would not make it worse), but our use-case would probably require a special callback support in API. In jenkinsci org there is a similar use-case in Maven Plugin, where we could need a custom Remoting connection termination logic as well, e.g. force termination of the build (CC @aheritier)

We could do it API for sure, but IMHO it's out of the scope of this PR. I would rather prefer t keep it in the current state (just without API deprecation) so we could easily backport it to 2.73.x.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, so let's un-deprecate the old call and keep the new one private so we can give the API design a proper though later.

@@ -163,30 +179,35 @@ protected Object readResolve() {
}
}

static void setUpPingForChannel(final Channel channel, int timeoutSeconds, int intervalSeconds, final boolean analysis) {
@VisibleForTesting
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never used in jenkinsci group.

}

static class SetUpRemotePing extends MasterToSlaveCallable<Void, IOException> {
@VisibleForTesting
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never used in jenkinsci group.

}
} catch (IOException e) {
LOGGER.log(Level.SEVERE,"Failed to terminate the channel "+channel.getName(),e);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The close is now done in disconnect() where failures are logged instead of thrown. The catch block was reused by failures to analyze that now does not throw any longer and goes through all the implementations registered.

Copy link
Member

@oleg-nenashev oleg-nenashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach is fine for me. I would like to spend some time on double checking the impact on locks in the case of Computer#disconnect(), but it should work fine.

I request a minor change: Do not deprecate ChannelPinger#install() since there is no public replacement in API

install(channel, null);
}

@VisibleForTesting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirm it's being used in one of private-source plugins. I would not deprecate the old method since there is no replacement in public API

@olivergondza olivergondza added needs-more-reviews Complex change, which would benefit from more eyes and removed work-in-progress The PR is under active development, not ready to the final review labels Sep 7, 2017
@olivergondza
Copy link
Member Author

@oleg-nenashev, the API is back. The only additional lock involved seems to be channelLock in Computer#closeChannel() but it seems safe to me.

Copy link
Member

@oleg-nenashev oleg-nenashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will try to test it on my remoting test env if I manage to connect to it remotely, but looks good.

@oleg-nenashev oleg-nenashev added ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback and removed needs-more-reviews Complex change, which would benefit from more eyes labels Sep 15, 2017
@oleg-nenashev oleg-nenashev merged commit dbb5e44 into jenkinsci:master Sep 15, 2017
olivergondza added a commit that referenced this pull request Oct 23, 2017
* [JENKINS-46680] Reproduce in unittest

* [FIX JENKINS-46680] Reset SlaveComputer channel before closing it on ping timeout

* [JENKINS-46680] Attach channel termination offline cause on ping timeouts

(cherry picked from commit dbb5e44)
assertNull(slave.getComputer().getChannel());
assertNull(computer.getChannel());
} finally {
assert new ProcessBuilder("kill", "-CONT", pid).start().waitFor() == 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@olivergondza this seems to be flaky: #3961 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment remained unacknowledged until #7149 fixed the flaky test 3 years, 5 months, and 18 days after the flakiness was first reported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants