De-flake PingThreadTest
#7149
This PR is now ready for merge. We will merge it after ~24 hours if there is no negative feedback.
I am still seeing some flakiness in this test, so I added commit ea759cb to add debugging information. The latest run shows that the error is "kill: (16490): No such process" when sending the second signal. I can't reproduce this locally, so Jenkins must be killing the (stopped) agent process before we can resume it. No big deal: I can just wrap the resume in a check for the process still being alive.
Actually, that is just a hack, because the real question is why the process isn't still alive. I think it's because when suspending the process we don't wait for the signal to be delivered before running the test. So when we simulate the failed ping, the process might not have received the signal yet, and the close sequence actually gets delivered to the running process. That isn't great, because it defeats the purpose of the test. It is better to wait for the signal to be delivered before running the test, which should make the test more deterministic and avoid the need for a liveness check in the finally block. I implemented this in commit c059ead.
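To illustrate what "wait for the signal to be delivered" could look like, here is a minimal sketch and not the code from commit c059ead. It assumes a Linux host where `/proc/<pid>/stat` is readable and treats state `T` as "stopped by a signal". The class and method names (`SignalWaitSketch`, `waitUntil`, `isStoppedState`, `isStopped`) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Jenkins: polls until a process reports the stopped state.
final class SignalWaitSketch {

    /** Polls the condition every 50 ms until it holds or the timeout elapses. */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out before the condition became true
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    /**
     * Parses the contents of /proc/<pid>/stat ("pid (comm) state ...").
     * The command name may itself contain spaces or parentheses, so the
     * state character is located relative to the last ')'.
     */
    static boolean isStoppedState(String statLine) {
        int close = statLine.lastIndexOf(')');
        if (close < 0 || close + 2 >= statLine.length()) {
            return false;
        }
        return statLine.charAt(close + 2) == 'T';
    }

    /** Assumption: Linux only. Returns false if the process is gone or /proc is unavailable. */
    static boolean isStopped(long pid) {
        try {
            return isStoppedState(Files.readString(Path.of("/proc", Long.toString(pid), "stat")));
        } catch (IOException e) {
            return false;
        }
    }
}
```

A test could then send the stop signal and call `SignalWaitSketch.waitUntil(() -> SignalWaitSketch.isStopped(pid), Duration.ofSeconds(10))` before simulating the failed ping, failing fast if it returns `false`.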
On closer inspection, there were not one but two async bugs in this test: the above-mentioned asynchronous behavior when invoking
Continuing to make slow progress debugging this test on CI. My latest discovery is that the |
Finally, a successful theory! The latest debug run shows the process finally suspending as expected in the container, so my controlling terminal theory was right. Now to rip the debug code out and see how a regular run goes…
At last, I think this is done.
Flake observed in this run. I observed that the code being called by the test was asynchronous and could reproduce a hang reliably within
In any case the solution is to wait for the asynchronous activity to complete. I verified that the fix in this PR made the hang induced above disappear (modulo adjusting the timing for the artificially long sleep interval).
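The general pattern of waiting for asynchronous activity before asserting can be sketched as below. This is a generic illustration and not the change made in this PR. `AsyncWaitSketch` and `runAndAwait` are hypothetical names, and the background thread stands in for whatever asynchronous work the code under test performs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: block the test thread until background work signals completion.
final class AsyncWaitSketch {

    /** Runs the work on another thread and waits up to timeoutMillis for it to finish. */
    static boolean runAndAwait(Runnable asyncWork, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                asyncWork.run();
            } finally {
                done.countDown(); // signal completion even if the work throws
            }
        });
        worker.setDaemon(true);
        worker.start();
        try {
            // true only if the work finished within the timeout
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Using a bounded wait rather than an unbounded one means a regression shows up as a clear test failure instead of a hang.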
Proposed changelog entries
N/A
Proposed upgrade guidelines
N/A