packed::iter::performance on riscv64 has PermanentlyLocked failure running fixture #1605
Comments
Thanks a lot for reporting! My recommendation here is to try and increase the timeout for file-based locks. My guess here is that rebuilding something while the tests are running is so slow, while involving the same fixture, that the performance test starves before even being started - its fixture can't be run/validated as another test, probably the one building a binary, is capturing it. An alternative validation would be to limit the scope of the tests that are run. Lastly, while the test is running, using
Sounds good, I'm experimenting with that now.
I've opened #1606, which doubles the timeout and seems to fix this. I'm not totally sure what's going on, but it looks like there was a contribution from build processes that were triggered by tests. I didn't observe tracked lock files or anything from running
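To illustrate the failure mode discussed above - a waiter on a file-based lock giving up after a timeout while another process, such as one still building or running the same fixture, holds the lock - here is a minimal, std-only Rust sketch. It is not gitoxide's actual locking code; the names, durations, and backoff strategy are assumptions made purely for illustration:

```rust
use std::{fs, io, path::Path, thread, time::{Duration, Instant}};

// Sketch only: take an exclusive file-based lock by creating `<resource>.lock`,
// retrying with backoff until `timeout` elapses. A test whose fixture is
// "captured" by another process for longer than `timeout` fails here.
fn acquire_lock(resource: &Path, timeout: Duration) -> io::Result<fs::File> {
    let lock_path = resource.with_extension("lock");
    let start = Instant::now();
    let mut backoff = Duration::from_millis(10);
    loop {
        match fs::OpenOptions::new().write(true).create_new(true).open(&lock_path) {
            Ok(file) => return Ok(file), // we now own the lock file
            Err(err) if err.kind() == io::ErrorKind::AlreadyExists => {
                if start.elapsed() >= timeout {
                    // The analogue of the PermanentlyLocked failure: give up waiting.
                    return Err(io::Error::new(io::ErrorKind::TimedOut, "lock wait timed out"));
                }
                thread::sleep(backoff);
                backoff = (backoff * 2).min(Duration::from_millis(500));
            }
            Err(err) => return Err(err),
        }
    }
}
```

Doubling the timeout, as #1606 does, simply gives the current lock holder more time to finish before a waiter like this gives up.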
Current behavior 😯

The `gix-ref-tests::refs packed::iter::performance` test fails on a riscv64 machine when I run all tests and force archives to be regenerated by running:

The way it fails is:
With a bit of further context:
So this looks like a deadlock, though I don't know that it is one. This does not happen on an x64 machine, though the x64 machine I have been testing on is faster; it has the same number of cores as the riscv64 machine I've been testing on - they both have four cores. The riscv64 machine has this `cat /proc/cpuinfo` output, and more information on the details of its hardware is available in zlib-ng/zlib-ng#1705 (comment) and on this page.

Full details of all `max-pure` runs are in this gist, which supersedes the earlier gist linked in rust-lang/libz-sys#218 (comment) where I first noticed the problem. The significance of that older gist is that it shows the problem happens even with `max`. Other runs I've done to investigate this have been with `max-pure`, to examine the problem independently of rust-lang/libz-sys#200 (zlib-ng/zlib-ng#1705).

The strangest thing is that the problem only occurs if I rebuild the code. Specifically, cleaning with `git restore .` followed by `gix clean -xd -m '*generated*' -e` is insufficient to cause the next full run to have the failure, but cleaning with `git restore .` followed by `gix clean -xde` is sufficient to cause it. I have verified that content under `target/` is the only content reported by `gix clean -xdn` as eligible for deletion after running `gix clean -xd -m '*generated*' -e`, so it appears that, somehow, rebuilding is part of what is needed to produce the problem.

Although this is present in the readme for the new gist, the gist "table of contents" serves both as a summary of what kinds of runs produce what results and the order in which I did the runs, and as a collection of links to the nine runs in case one wishes to examine them in detail. Therefore, I quote it here:
All tests in the newer gist were run at the current tip of main, 612896d. They were all run on Ubuntu 24.04 LTS systems. The two x86 runs were on one system, and the seven riscv64 runs were, of course, on a different system (but the same system as each other). Because the older gist also shows the problem, it is not new, at least not newer than be2f093.
Expected behavior 🤔
All tests should pass.
Secondarily, it seems to me that when `git restore .` has been run and there are no intervening commands that modify the working tree, `gix clean -xd -m '*generated*' -e` should be as good as `gix clean -xde` at resetting state associated with fixtures that are forced to be rerun due to the use of `GIX_TEST_IGNORE_ARCHIVES=1`, at least when repeated re-running has not identified any nondeterminism in failures that occur related to fixtures. However, I am not certain that there is necessarily a specific, separate bug that corresponds to this second expectation.
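To make concrete what `GIX_TEST_IGNORE_ARCHIVES=1` forces, here is a rough sketch of how an "ignore archives" switch typically gates fixture generation. This is only a guess at the general shape - it is not gix-testtools' actual code, and the function and path names are hypothetical:

```rust
use std::{env, io, path::Path, process::Command};

// Hypothetical sketch: decide whether to reuse a pre-generated fixture
// archive or to re-run the fixture script.
fn prepare_fixture(script: &Path, archive: &Path, out_dir: &Path) -> io::Result<()> {
    let ignore_archives = env::var_os("GIX_TEST_IGNORE_ARCHIVES").is_some();
    if !ignore_archives && archive.exists() {
        // Fast path: unpack the committed archive into `out_dir` (omitted here).
        return Ok(());
    }
    // Slow path: regenerate the fixture by running the script, i.e. the
    // `git` commands it contains, inside `out_dir`.
    let status = Command::new("bash").arg(script).current_dir(out_dir).status()?;
    if !status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "fixture script failed"));
    }
    Ok(())
}
```

With the variable set, every run takes the slow path, so each fixture script's `git` commands run again and the corresponding fixture lock is held for longer - consistent with the starvation theory discussed in the comments above.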
Git behavior

Probably not applicable, unless the speed at which the `git` commands in the fixture script are run turns out to be a contributing factor (but I think that would still not be `git` behavior that corresponds to the code in `gitoxide` where the test fails).

Steps to reproduce 🕹
On a riscv64 machine in GNU/Linux (and specifically Ubuntu 24.04 LTS, if one wishes to reproduce the setup I used), either clone the `gitoxide` repo, or run `git restore .` and `gix clean -xde` in the already cloned `gitoxide` repo (after ensuring that one has no valuable modifications that could be lost by doing so). Then run:

To observe that it reliably happens when run this way, run `gix clean -xde` and then that test command again. This can be done as many times as desired.

To observe that rebuilding seems to be required to produce the problem, replace `gix clean -xde` with `gix clean -xd -m '*generated*' -e` in the above procedure and verify that (except on the first run, which is already clean) the problem does not occur. Going back to `gix clean -xde` verifies that the order is not the cause.

In case it turns out to be relevant, the `git` command on the system I used reports its version, when running `git version`, as `git version 2.43.0`. It is more specifically the downstream version `1:2.43.0-1ubuntu7.1` (packaged for Ubuntu 24.04 LTS), as revealed by running `apt list git`.

It occurs to me that the inability to produce the problem without having just recently rebuilt might potentially be due to the effect of rebuilding on dynamic clock speed. However, I would expect at least some nondeterminism in the failures to be observed if this were the case, since the failing test is not one of the earliest tests to run. I may be able to investigate that further if other approaches do not reveal what is going on here.