sporadic test failure in tst/testbugfix/2006-08-28-t00151.tst #4368
Comments
One potential issue with this test is that it depends on the state of the garbage collector: a lot of allocations are made in it, and at some point (or several) a GC is triggered, perhaps even a full (not partial) one. If it happens in the first loop vs. the second loop, that can substantially skew the timings. I'll open a PR for this.

Next, the CI tests run in VMs, and multiple VMs run on a single real host; other workloads running in parallel can therefore cause strong fluctuations in the performance of the VM. So it may happen that it slows down temporarily while one of the two loops being compared is run. This skew can be quite substantial -- looking at https://github.com/gap-system/gap/actions/workflows/CI.yml?query=branch%3Amaster I see some passing runs in 37 minutes and others in over an hour.

To combat that, one can employ slightly more sophisticated sampling: right now we run each of the two tests 20 times, and then compare the sums (so, in effect, the averages). It might be better to compare the medians; or to discard outliers and then compare the average; one could also compare the geometric means instead, etc.

As it is, benchmarking is simply hard, and the approach used in that test is perhaps a bit too naive when faced with such an erratic environment as presented by the VMs our CI runs on...
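To illustrate why these alternatives behave differently, here is a small Python sketch (the timing values are made up for illustration, not taken from the actual CI runs) comparing the mean, the median, a trimmed mean, and the geometric mean on a sample containing one large spike of the kind a noisy CI host can produce:

```python
import statistics

# Hypothetical timings (ms) with one large spike, as might be caused
# by a noisy neighbour VM on the same CI host.
timings = [10, 11, 10, 12, 11, 10, 95, 11, 10, 12]

mean = statistics.mean(timings)        # dragged up strongly by the spike
median = statistics.median(timings)    # essentially unaffected by it

# Trimmed mean: drop the smallest and largest sample, then average.
trimmed = statistics.mean(sorted(timings)[1:-1])

# Geometric mean: less sensitive to a single large outlier than the mean.
geo = statistics.geometric_mean(timings)

print(f"mean={mean}  median={median}  trimmed={trimmed:.3f}  geo={geo:.3f}")
```

Here the single spike nearly doubles the mean (19.2 vs. a "typical" value around 11), while the median and trimmed mean stay close to the bulk of the data.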
The failures persist (although perhaps less frequently now?)
Yes, I had mentioned this just an hour ago in a review comment for pull request #4421. (Why doesn't GitHub automatically show a link here?)
@ThomasBreuer that's indeed weird, normally GitHub shows "mentions". In any case, I think it's good that Wilf added the explicit links to failed build jobs here, as it makes it much easier to see and review them when one visits this issue.

Anyway, I decided to have a closer look in case the failures indicate a real issue. My conclusion is that the failures really are bogus and due to a too simplistic approach to measuring: the fluctuations on the CI server are simply too high, and averaging multiple results is not good enough. So I propose two changes. First, alternate measuring the two different computations (instead of first measuring one 20 times, then the other 20 times), to make it less likely that a big spike affects one of the two sets much more than the other. Second, track the separate timings and then compare the median, not the mean; this removes extreme outliers in both directions.

I tested this across several GAP versions from 4.4.12 to master. Note that the test in question was added in GAP 4.4.9, by @frankluebeck ; the relevant release notes item seems to be this:
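The proposed alternating-measurement scheme can be sketched as follows (in Python rather than GAP, and with placeholder workloads standing in for the two computations the real test compares):

```python
import statistics
import time

def bench_alternating(f, g, reps=20):
    """Time f and g in strict alternation, so that a temporary slowdown
    of the host is likely to hit both sample sets rather than just one."""
    times_f, times_g = [], []
    for _ in range(reps):
        t0 = time.perf_counter()
        f()
        times_f.append(time.perf_counter() - t0)
        t0 = time.perf_counter()
        g()
        times_g.append(time.perf_counter() - t0)
    # Compare medians, which discards extreme outliers in both directions.
    return statistics.median(times_f), statistics.median(times_g)

# Placeholder workloads; in the actual test these would be the two
# GAP computations whose relative speed is being checked.
fast = lambda: sum(range(10_000))
slow = lambda: sum(range(100_000))

m_fast, m_slow = bench_alternating(fast, slow)
print(m_fast < m_slow)
```

Because each repetition times both workloads back to back, a slow phase of the VM inflates samples in both lists at roughly the same positions, and the median comparison then discards whatever spikes remain.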
Doing so showed that the two operations (calling …
The speedup of the first loop with a mutable list between GAP 4.3 and 4.4.12 is about a factor of 5000. So, the original test file could have been written with …
The test from tst/testbugfix/2006-08-28-t00151.tst:14 has recently failed in several situations. This is mentioned in the discussion of pull request #4326, which contains links to other instances.
@wilfwilson stated: "Let's keep an eye out for this happening again."
Now I got the same failure again in a test run.
I still have no idea what is going on here.