sporadic test failure in tst/testbugfix/2006-08-28-t00151.tst #4368

ThomasBreuer · 2021-04-07T07:50:34Z

The test from tst/testbugfix/2006-08-28-t00151.tst:14 has recently failed in several situations.
This is mentioned in the discussion of pull request #4326, which contains links to other instances.
@wilfwilson stated: "Let's keep an eye out for this happening again."
Now I got the same failure again in a test run.
I have still no idea what is going on here.

The text was updated successfully, but these errors were encountered:

fingolfin · 2021-04-07T22:17:50Z

One potential issue with this test is that it depends on the state of the garbage collector: a lot of collections are made in it, and at some point (or several) a GC is triggered; perhaps even a full (not partial) one. If it happens in the first loops vs. the second loop, that can substantially skew the timings. I'll open a PR for this.

Next, the CI tests run in VMs, and multiple VMs will run on a single real host; and so other workloads run in parallel can cause strong fluctuations in the performance of the VM. So it may happen that it slows down temporarily while one of the two loops being compared is run. This skew can be quite substantial -- looking at https://github.com/gap-system/gap/actions/workflows/CI.yml?query=branch%3Amaster I see some passing runs in 37 minutes and other in over an hour.

To combat that one can employ a bit more sophisticated sampling: right now we run each of the two tests 20 times, and then compare the sums (so, in effect, the averages). It might be better to compare the medians; or to discard outliers and then compare the average; and one cold also compare the geometric mean instead, etc. etc. As it is, benchmarking is simply hard, and the approach used in that test is perhaps a bit too naive when faced with such an erratic environment as presented by the VMs our CI runs on...

wilfwilson · 2021-04-21T08:36:38Z

The failures persist (although perhaps less frequently now?)

ThomasBreuer · 2021-04-21T08:45:22Z

Yes, I had mentioned this just an hour ago in a review comment for pull request #4421. (Why doesn't github automatically show a link here?)

fingolfin · 2021-04-21T11:01:23Z

@ThomasBreuer that's indeed weird, normally GitHub shows "mentions". In any case, I think it's good that Wilf added the explicit links to failed build jobs here, as it makes it much easier to see and review them when one visits this issue.

Anyway, I decided to have a closer look in case the failures indicate a real issue. My conclusion is that the failures really are bogus and due to a too simplistic approach to measuring: The fluctuations on the CI server simply are too high; averaging multiple results simply is not good enough. So I propose we instead alternate measuring the two different computations (instead of first measuring the one 20 times, then other 20 times) to make it less likely that a big spike affects one of the two sets much more than the other; secondly, to track the separate timings, and then compare the median, not the mean; this removes extreme outliers in both directions.

I tested this across several GAP versions from 4.4.12 to master. Note that the test in question was added in GAP 4.4.9, by @frankluebeck ; the relevant release notes item seems to be this:

Now PositionSorted is much faster on long mutable plain lists. (The former operation is substituted by a function and a new operation PositionSortedOp.) (Reported by Silviu Radu)

Doing so showed that the two operations (calling PositionSorted on mutable vs. immutable lists) indeed take about the same in all GAP versions I tested. But I did discover a substantial regression in performance between GAP 4.8 and 4.9. I reported this as issue #4422

frankluebeck · 2021-04-21T12:42:36Z

The speedup of the first loop with a mutable list between GAP 4.3 and 4.4.12 is about a factor of 5000.

So, the original test file could have been written with if time1 >= 10*time2 then instead of if time1 >= 2*time2 then (or an even higher factor). Would that solve this problem?

fingolfin mentioned this issue Apr 7, 2021

Increase robustness of a test relying on timing #4371

Merged

fingolfin changed the title ~~sporadic test failure~~ sporadic test failure in tst/testbugfix/2006-08-28-t00151.tst Apr 21, 2021

fingolfin mentioned this issue Apr 21, 2021

Performance regression in LtPlist due to recursion check #4422

Open

fingolfin mentioned this issue Apr 21, 2021

CI: detect failure to run make docomp nicely again #4421

Merged

fingolfin mentioned this issue Apr 21, 2021

Remove an unstable test #4426

Merged

fingolfin closed this as completed in #4426 Apr 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sporadic test failure in tst/testbugfix/2006-08-28-t00151.tst #4368

sporadic test failure in tst/testbugfix/2006-08-28-t00151.tst #4368

ThomasBreuer commented Apr 7, 2021

fingolfin commented Apr 7, 2021

wilfwilson commented Apr 21, 2021 •

edited

Loading

ThomasBreuer commented Apr 21, 2021

fingolfin commented Apr 21, 2021

frankluebeck commented Apr 21, 2021

sporadic test failure in tst/testbugfix/2006-08-28-t00151.tst #4368

sporadic test failure in tst/testbugfix/2006-08-28-t00151.tst #4368

Comments

ThomasBreuer commented Apr 7, 2021

fingolfin commented Apr 7, 2021

wilfwilson commented Apr 21, 2021 • edited Loading

ThomasBreuer commented Apr 21, 2021

fingolfin commented Apr 21, 2021

frankluebeck commented Apr 21, 2021

wilfwilson commented Apr 21, 2021 •

edited

Loading