Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

De-flake PingThreadTest #7149

Merged
merged 22 commits into from
Sep 28, 2022
Merged

De-flake PingThreadTest #7149

merged 22 commits into from
Sep 28, 2022

Conversation

basil
Copy link
Member

@basil basil commented Sep 22, 2022

Flake observed in this run. I observed that the code being called by the test was asynchronous and could reproduce a hang reliably within

diff --git a/core/src/main/java/hudson/slaves/SlaveComputer.java b/core/src/main/java/hudson/slaves/SlaveComputer.java
index 1b2cc7f562..efa9591510 100644
--- a/core/src/main/java/hudson/slaves/SlaveComputer.java
+++ b/core/src/main/java/hudson/slaves/SlaveComputer.java
@@ -797,6 +797,11 @@ public class SlaveComputer extends Computer {
         return Computer.threadPoolForRemoting.submit(new Runnable() {
             @Override
             public void run() {
+                try {
+                    Thread.sleep(15000);
+                } catch (InterruptedException e) {
+                    // ignore
+                }
                 // do this on another thread so that any lengthy disconnect operation
                 // (which could be typical) won't block UI thread.
                 launcher.beforeDisconnect(SlaveComputer.this, taskListener);

In any case the solution is to wait for the asynchronous activity to complete. I verified that the fix in this PR made the hang induced above disappear (modulo adjusting the timing for the artificially long sleep interval).

Proposed changelog entries

N/A

Proposed upgrade guidelines

N/A

Submitter checklist

  • (If applicable) Jira issue is well described
  • Changelog entries and upgrade guidelines are appropriate for the audience affected by the change (users or developer, depending on the change) and are in the imperative mood. Examples
    • Fill-in the Proposed changelog entries section only if there are breaking changes or other changes which may require extra steps from users during the upgrade
  • Appropriate autotests or explanation to why this change has no tests
  • New public classes, fields, and methods are annotated with @Restricted or have @since TODO Javadoc, as appropriate.
  • New deprecations are annotated with @Deprecated(since = "TODO") or @Deprecated(forRemoval = true, since = "TODO") if applicable.
  • New or substantially changed JavaScript is not defined inline and does not call eval to ease future introduction of Content-Security-Policy directives (see documentation on jenkins.io).
  • For dependency updates: links to external changelogs and, if possible, full diffs

Desired reviewers

@mention

Maintainer checklist

Before the changes are marked as ready-for-merge:

  • There are at least 2 approvals for the pull request and no outstanding requests for change
  • Conversations in the pull request are over OR it is explicit that a reviewer does not block the change
  • Changelog entries in the PR title and/or Proposed changelog entries are accurate, human-readable, and in the imperative mood
  • Proper changelog labels are set so that the changelog can be generated automatically
  • If the change needs additional upgrade steps from users, upgrade-guide-needed label is set and there is a Proposed upgrade guidelines section in the PR title. (example)
  • If it would make sense to backport the change to LTS, a Jira issue must exist, be a Bug or Improvement, and be labeled as lts-candidate to be considered (see query).

@basil basil added the skip-changelog Should not be shown in the changelog label Sep 22, 2022
@NotMyFault
Copy link
Member

This PR is now ready for merge. We will merge it after ~24 hours if there is no negative feedback.
Please see the merge process documentation for more information about the merge process.
Thanks!

@NotMyFault NotMyFault added the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Sep 23, 2022
@basil
Copy link
Member Author

basil commented Sep 23, 2022

Still seeing some flakiness in this test, so I added commit ea759cb to add debugging information. The latest run shows that the error is "kill: (16490): No such process" when sending the second signal. I can't reproduce this locally. So Jenkins must be killing the (stopped) agent process before we can resume it. No big deal, I can just wrap the resume in a check for the process still being alive.

@basil
Copy link
Member Author

basil commented Sep 23, 2022

No big deal, I can just wrap the resume in a check for the process still being alive.

Actually that is just a hack because the real question is why isn't the process still alive? I think it's because when suspending the process we don't wait for the signal to be delivered before running the test. So then when we simulate the failed ping the process might not have received the signal yet, and the close sequence actually gets delivered to the running process. But that isn't great because that is defeating the purpose of the test. Better to wait for the signal to be delivered before running the test, which should make the test more deterministic and avoid the need for a liveness check in the finally block. I implemented this in commit c059ead.

@basil basil removed the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Sep 25, 2022
@basil
Copy link
Member Author

basil commented Sep 25, 2022

On closer inspection there were not one but two async bugs in this test: the abovementioned asynchronous behavior when invoking onDead, and also the asynchronous nature of Unix signal delivery: while not waiting long enough for the TSTP signal was causing the test to test the wrong thing, not waiting long enough for the CONT signal to be delivered before trying to shut down Jenkins would result in the test timing out because the cleanup routines would hang trying to clean up an agent that was suspended.

@basil
Copy link
Member Author

basil commented Sep 26, 2022

Continuing to make slow progress debugging this test on CI. My latest discovery is that the TSTP signal appears to never have been delivered in CI (but not in my local builds) - so this test was never working as intended on CI. My current theory is that this is due to the lack of a controlling terminal, so I'll try switching the TSTP signal for a STOP signal. As usual it will take me hours to make even a tiny amount of progress, since I can't reproduce this failure locally.

@basil
Copy link
Member Author

basil commented Sep 26, 2022

Finally, a successful theory! The latest debug run shows the process finally suspending as expected in the container, so my controlling terminal theory was right. Now to rip the debug code out and see how a regular run goes…

@basil
Copy link
Member Author

basil commented Sep 27, 2022

At last I think this is done.

@basil basil added the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Sep 27, 2022
@basil basil merged commit 41feb0d into jenkinsci:master Sep 28, 2022
@basil basil deleted the PingThreadTest branch September 28, 2022 22:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback skip-changelog Should not be shown in the changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants