Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sporadic test failure in tst/testbugfix/2006-08-28-t00151.tst #4368

Closed
ThomasBreuer opened this issue Apr 7, 2021 · 5 comments · Fixed by #4426
Closed

sporadic test failure in tst/testbugfix/2006-08-28-t00151.tst #4368

ThomasBreuer opened this issue Apr 7, 2021 · 5 comments · Fixed by #4426

Comments

@ThomasBreuer
Copy link
Contributor

The test from tst/testbugfix/2006-08-28-t00151.tst:14 has recently failed in several situations.
This is mentioned in the discussion of pull request #4326, which contains links to other instances.
@wilfwilson stated: "Let's keep an eye out for this happening again."
Now I got the same failure again in a test run.
I have still no idea what is going on here.

@fingolfin
Copy link
Member

One potential issue with this test is that it depends on the state of the garbage collector: a lot of collections are made in it, and at some point (or several) a GC is triggered; perhaps even a full (not partial) one. If it happens in the first loops vs. the second loop, that can substantially skew the timings. I'll open a PR for this.

Next, the CI tests run in VMs, and multiple VMs will run on a single real host; and so other workloads run in parallel can cause strong fluctuations in the performance of the VM. So it may happen that it slows down temporarily while one of the two loops being compared is run. This skew can be quite substantial -- looking at https://github.com/gap-system/gap/actions/workflows/CI.yml?query=branch%3Amaster I see some passing runs in 37 minutes and other in over an hour.

To combat that one can employ a bit more sophisticated sampling: right now we run each of the two tests 20 times, and then compare the sums (so, in effect, the averages). It might be better to compare the medians; or to discard outliers and then compare the average; and one cold also compare the geometric mean instead, etc. etc. As it is, benchmarking is simply hard, and the approach used in that test is perhaps a bit too naive when faced with such an erratic environment as presented by the VMs our CI runs on...

@wilfwilson
Copy link
Member

wilfwilson commented Apr 21, 2021

@ThomasBreuer
Copy link
Contributor Author

Yes, I had mentioned this just an hour ago in a review comment for pull request #4421. (Why doesn't github automatically show a link here?)

@fingolfin fingolfin changed the title sporadic test failure sporadic test failure in tst/testbugfix/2006-08-28-t00151.tst Apr 21, 2021
@fingolfin
Copy link
Member

@ThomasBreuer that's indeed weird, normally GitHub shows "mentions". In any case, I think it's good that Wilf added the explicit links to failed build jobs here, as it makes it much easier to see and review them when one visits this issue.

Anyway, I decided to have a closer look in case the failures indicate a real issue. My conclusion is that the failures really are bogus and due to a too simplistic approach to measuring: The fluctuations on the CI server simply are too high; averaging multiple results simply is not good enough. So I propose we instead alternate measuring the two different computations (instead of first measuring the one 20 times, then other 20 times) to make it less likely that a big spike affects one of the two sets much more than the other; secondly, to track the separate timings, and then compare the median, not the mean; this removes extreme outliers in both directions.

I tested this across several GAP versions from 4.4.12 to master. Note that the test in question was added in GAP 4.4.9, by @frankluebeck ; the relevant release notes item seems to be this:

  • Now PositionSorted is much faster on long mutable plain lists. (The former operation is substituted by a function and a new operation PositionSortedOp.) (Reported by Silviu Radu)

Doing so showed that the two operations (calling PositionSorted on mutable vs. immutable lists) indeed take about the same in all GAP versions I tested. But I did discover a substantial regression in performance between GAP 4.8 and 4.9. I reported this as issue #4422

@frankluebeck
Copy link
Member

The speedup of the first loop with a mutable list between GAP 4.3 and 4.4.12 is about a factor of 5000.

So, the original test file could have been written with if time1 >= 10*time2 then instead of if time1 >= 2*time2 then (or an even higher factor). Would that solve this problem?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants