Spurious CI failure: tests/run-make/jobserver-error #110321
This seems to have happened several times very recently.
I don't have even the remotest theory on how that could be happening. The jobserver was last updated on 2023-02-28 in #108582. It was a fairly significant update, but that was long enough ago that I doubt it was directly responsible. There has also been a series of recent changes to rustc_codegen_ssa just before 2023-04-07.
But none of those look related. cc @weihanglo Does this error make any sense to you? One way someone could help is to try to reproduce it locally; I'm thinking of running hundreds of copies of the test. IIRC, these builders do not use an outer make, but that would be good to double-check.
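A rough stress-runner sketch for that (the exact `./x.py` invocation and test path are assumptions, not taken from this thread): re-run the single run-make test in a loop and stop on the first failure.

```rust
// Hypothetical stress runner: spawn the test repeatedly and report
// the first failing iteration.
use std::process::Command;

fn main() {
    for i in 1..=1000 {
        let status = Command::new("./x.py")
            .args(["test", "tests/run-make/jobserver-error"])
            .status()
            .expect("failed to spawn ./x.py");
        if !status.success() {
            eprintln!("reproduced on iteration {i}");
            return;
        }
    }
    eprintln!("no failure in 1000 iterations");
}
```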
I don't really know the testing infra here. Is there any chance that it happened because of #109770?
The test itself is very new; it was added in #109694, so maybe it could always fail spuriously.
That made it more likely to fail, but only because it's being run more often now (previously it wouldn't be run for targets that were cross-compiled). I doubt it's the root cause.
@belovdv do you have time to follow up here?
Until then, we should just disable it. This test doesn't seem important enough to cause trouble.
I'm not sure why I thought this was an old test. Teaches me not to ignore the obvious. I was able to reproduce this once locally, but it took hundreds of thousands of attempts. I tried some instrumentation, but I wasn't able to reproduce it after adding the logging (even with the logging turned off). My hypothesis is that, due to scheduling oddities, the jobserver helper thread gets delayed. The coordinator thread then proceeds to do work using the implicit token. However, I'm surprised that it would get delayed for such a long time. There are 5 CGUs, and that would require processing them all serially, with several switches between the main and coordinator threads. I posted #110361 to disable it temporarily.
Temporarily disable the jobserver-error test: This test is failing randomly on CI. We don't have a handle on what might be causing it, so disable it for now to reduce disruption. cc rust-lang#110321
I was able to reproduce on CI with debug logging. I confirmed that the jobserver helper thread never starts until after the coordinator thread finishes. That means rustc just uses the implicit token until it tries to link, and the jobserver helper thread then exits since it is no longer needed. I'm not very familiar with Linux thread scheduling, so I'm not sure why the thread gets delayed for so long. I also don't have any particular ideas on how to make this test work reliably.
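To illustrate the shape of the race, here is a minimal sketch using the `jobserver` crate's public API (rustc's real coordinator is far more involved; the token counts and messages here are made up for illustration):

```rust
// The helper thread services token requests asynchronously; the main
// thread keeps going on its one implicit token and never blocks on
// the helper.
use jobserver::Client;

fn main() -> std::io::Result<()> {
    // rustc inherits its client from the parent make/cargo via
    // MAKEFLAGS; here we create one with 2 tokens for illustration.
    let client = Client::new(2)?;

    // The helper thread blocks acquiring tokens and hands each result
    // (a token, or an error such as a bad jobserver fd) to the callback.
    let helper = client.into_helper_thread(|token| match token {
        Ok(_t) => { /* would dispatch extra parallel codegen here */ }
        Err(e) => eprintln!("jobserver error: {e}"),
    })?;
    helper.request_token();

    // Meanwhile the "coordinator" processes all CGUs on the implicit
    // token. If the helper thread is never scheduled before this loop
    // finishes, any error it would report is never seen -- which is
    // what the CI logs showed.
    for cgu in 0..5 {
        println!("processing CGU {cgu}");
    }
    Ok(())
}
```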
I'll work on this, but I doubt my experience is enough. I wasn't able to reproduce it locally. Could you share your results?
The jobserver test is currently disabled: https://github.com/rust-lang/rust/pull/110361/files
I'm almost sure I didn't forget to re-enable it, and I've now run a slightly different test another 100 times. UPD: I'll run exactly the same test that was disabled again to check myself, but a bit later.
It took me hundreds of thousands of attempts to reproduce it once locally. That generally wasn't productive since it was so rare. I had much better luck reproducing in GitHub Actions. Here is a branch with some changes to run just the one test, and to instrument it. If you have Actions enabled on your fork, you can just push a new branch with similar changes.

The problem is that the jobserver helper thread launched here gets starved out and doesn't run, while the coordinator thread continually processes work using the implicit jobserver token, never yielding long enough for the jobserver helper to do its work (and process the error).

Honestly, I'm not sure this is really worth fixing. It would be nice to have a test to validate the jobserver behavior, but I'm not sure it is worth the expense of changing the coordinator's behavior in some way (since there could be performance losses, for example if it explicitly yielded or waited for the jobserver helper, or more aggressively tried to acquire tokens). I'm not familiar enough with Linux thread scheduling to know why the thread could get blocked for so long, or if there are different approaches to fix the scheduling. Another thought I just had would be to change the test to generate a binary with enough code to split across a very large number of CGUs (and use …
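For reference, the failure mode the test exercises can be simulated along these lines (a sketch; the exact MAKEFLAGS value and diagnostic wording are assumptions, not copied from the test):

```rust
// Point rustc at file descriptors that are not a real jobserver pipe
// and expect a jobserver diagnostic on stderr once the helper thread
// tries to use them.
use std::process::Command;

fn main() {
    let out = Command::new("rustc")
        // fds 1000/1000 are almost certainly not open, so token
        // acquisition should fail inside the helper thread.
        .env("MAKEFLAGS", "--jobserver-auth=1000,1000")
        .args(["--crate-type=lib", "--edition=2021", "-"])
        .output()
        .expect("failed to spawn rustc");
    let stderr = String::from_utf8_lossy(&out.stderr);
    assert!(
        stderr.contains("jobserver"),
        "expected a jobserver diagnostic, got:\n{stderr}"
    );
}
```

This also shows why the test is inherently racy: the diagnostic only appears if the helper thread gets scheduled and hits the bad fds before compilation finishes.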
#110177 (comment) seems to have run into some spurious CI failure.