repeatable flaky test #3048

hjoliver · 2019-03-28T03:55:10Z

On current master, in my environment, tests/shutdown/18-client-on-dead-suite.t seems to always pass on its own:

$ cylc test-b -v ./tests/shutdown/18-client-on-dead-suite.t

ok 1 - 18-client-on-dead-suite-validate
ok 2 - 18-client-on-dead-suite-1
ok 3 - 18-client-on-dead-suite-1.stderr-contains-ok
ok 4 - 18-client-on-dead-suite-2
ok 5 - 18-client-on-dead-suite-2.stderr-contains-ok
ok
All tests successful.
Files=1, Tests=5, 12 wallclock secs ( 0.03 usr  0.00 sys +  3.75 cusr  0.49 csys =  4.27 CPU)
Result: PASS

But if I run it with another test, it seems to always fail, like this:

$ export CYLC_TEST_DEBUG=true
$ cylc test-b -v ./tests/special/04-clock-triggered.t \
    ./tests/shutdown/18-client-on-dead-suite.t
===(       4;6  2/5  2/4 )==============================================
18-client-on-dead-suite 18-client-on-dead-suite-1.stderr-contains-ok
Missing lines:
Request returned error: Suite "cylctb-20190328T035314Z/shutdown/18-client-on-dead-suite" already stopped

18-client-on-dead-suite 18-client-on-dead-suite-2.stderr-contains-ok
Missing lines:
Contact info not found for suite "cylctb-20190328T035314Z/shutdown/18-client-on-dead-suite", suite not running?

    stdout and stderr stored in: /tmp/oliverh/cylctb-20190328T035314Z/shutdown/18-client-on-dead-suite
Failed 2/5 subtests 
./tests/special/04-clock-triggered.t ........ 
ok 3 - 04-clock-triggered-run-past
ok 4 - 04-clock-triggered-run-later
ok

Test Summary Report
-------------------
./tests/shutdown/18-client-on-dead-suite.t (Wstat: 0 Tests: 5 Failed: 2)
  Failed tests:  3, 5
Files=2, Tests=9, 29 wallclock secs ( 0.02 usr  0.01 sys +  8.33 cusr  0.98 csys =  9.34 CPU)
Result: FAIL


hjoliver · 2019-03-28T03:57:26Z

At first glance (and maybe second glance) I can't see how this test could fail. Tests 3 and 5 simply cylc ping an already-killed suite, and the ping client should print out the expected lines.

hjoliver · 2019-03-28T04:02:17Z

(Occasionally 1/5 tests fail when run alone, instead of 0/5; and occasionally 1/5 fail when run with the other test, instead of 2/5 ... so it is "flaky").

hjoliver · 2019-03-28T04:13:07Z

Ah, in failing cases, cylc ping returns this (in the ping test stderr file):

Request returned error: Could not decrypt response. Has the passphrase changed?

hjoliver · 2019-03-28T04:16:54Z

Is the cylc ping client somehow connecting to the wrong suite?

kinow · 2019-03-28T06:19:13Z

Mentioning issue #2894 here so we have a reference on GitHub, in case it is helpful later 👍

kinow · 2019-03-28T06:23:15Z

I have confirmed the exact same behaviour in my environment with the master branch.

$ uname -a
Linux ranma 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.2 LTS"
$ python --version
Python 3.7.2

oliver-sanders · 2019-03-28T09:42:35Z

I know what's going on here...

> Is the cylc ping client somehow connecting to the wrong suite?

Yes, reliably every time!

In tests/shutdown/18-client-on-dead-suite.t the suite is killed, leaving behind the contact file. So when cylc ping attempts to connect to the suite later in the test, there is always a risk that a new suite will have started up on that port, causing the test to fail with:

Request returned error: Could not decrypt response. Has the passphrase changed?

So this test was, by design, always going to be flaky.
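To make the failure mode concrete, here is a minimal Python sketch of the three outcomes seen in this thread. It is not the cylc client code: the `ping` function and the `contact_files`, `listeners` and `passphrases` dictionaries are hypothetical stand-ins, and the assumption that the first failed ping removes the stale contact file is only inferred from the order of the expected messages in the test.

```python
def ping(suite, contact_files, listeners, passphrases):
    """Return the message a client might print (illustrative only)."""
    contact = contact_files.get(suite)
    if contact is None:
        # No contact file: the client cannot even find a port to try.
        return f'Contact info not found for suite "{suite}", suite not running?'
    owner = listeners.get(contact["port"])
    if owner is None:
        # Nothing is listening on the recorded port.
        # Assumption: the client then cleans up the stale contact file.
        del contact_files[suite]
        return f'Suite "{suite}" already stopped'
    if passphrases[owner] != passphrases[suite]:
        # A different suite has taken over the port.
        return "Could not decrypt response. Has the passphrase changed?"
    return "pong"


passphrases = {"dead-suite": "secret-a", "other-suite": "secret-b"}

# Expected case: the port stays free after the suite is killed.
contact_files = {"dead-suite": {"port": 43001}}
first = ping("dead-suite", contact_files, {}, passphrases)
second = ping("dead-suite", contact_files, {}, passphrases)

# Failing case: another suite has started on the same port.
contact_files = {"dead-suite": {"port": 43001}}
clash = ping("dead-suite", contact_files, {43001: "other-suite"}, passphrases)

print(first)
print(second)
print(clash)
```

In this sketch the first two calls give the two messages the test expects, while the clash case gives the decryption error found in the ping test's stderr file.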

In the new ZMQ implementation the port is not chosen at random: ZMQ picks the lowest available port in the range, so the test has gone from being slightly flaky to reliably flaky.

There is no real reason for picking the port this way; it was just slightly nicer during the debug phase. I think there is a TODO in there somewhere, and there might be a nice way of doing random selection in ZMQ itself.
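The difference between the two port selection strategies can be sketched with a small simulation. This is only an illustration under stated assumptions: the port range, the function names and the single-suite scenario are all hypothetical, and no real sockets or ZMQ calls are involved.

```python
import random

PORT_RANGE = range(43001, 43101)  # hypothetical range of 100 ports


def pick_lowest(in_use, rng=None):
    """Sequential strategy: take the lowest port that is not in use."""
    return next(port for port in PORT_RANGE if port not in in_use)


def pick_random(in_use, rng):
    """Random strategy: take any free port with equal probability."""
    return rng.choice([port for port in PORT_RANGE if port not in in_use])


def stale_port_reuse_rate(pick, trials=1000, seed=0):
    """Fraction of trials in which a new suite lands on a killed suite's port."""
    rng = random.Random(seed)
    reused = 0
    for _ in range(trials):
        in_use = set()
        dead_port = pick(in_use, rng)  # suite is killed, so its port is free again
        new_port = pick(in_use, rng)   # the next suite to start up picks a port
        reused += new_port == dead_port
    return reused / trials


lowest_rate = stale_port_reuse_rate(pick_lowest)
random_rate = stale_port_reuse_rate(pick_random)
print(f"lowest available: {lowest_rate:.3f}")
print(f"random:           {random_rate:.3f}")
```

With the sequential strategy the new suite takes over the stale port in every trial, whereas with random selection the rate should sit near one over the number of free ports, which matches the "slightly flaky" versus "reliably flaky" description above.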

oliver-sanders · 2019-03-28T09:48:48Z

I guess this is a case where auto-rerunning failed tests isn't always the most helpful thing to do.

hjoliver · 2019-03-28T10:29:34Z

Ah, brilliant, it all makes sense. That's a relief, thanks @oliver-sanders 🍺

hjoliver · 2019-03-28T22:13:49Z

(I had forgotten you'd switched to sequential port acquisition).

oliver-sanders · 2019-03-29T09:29:11Z

It was just a stopgap I never got rid of.

oliver-sanders · 2019-03-29T09:30:55Z

#3004 will reduce the flakiness of this test in proportion to the number of suites divided by the number of ports. Not good, but much better. Is this enough to close the issue for now?

hjoliver · 2019-03-31T20:08:36Z

I think that's good enough, with a comment in the test to indicate exactly why it might occasionally fail.

matthewrmshin added this to the cylc-8.0.0 milestone Mar 28, 2019

oliver-sanders mentioned this issue Mar 29, 2019

document suite runtime interface #3004

Merged

hjoliver closed this as completed in #3004 Apr 4, 2019

matthewrmshin assigned oliver-sanders Apr 9, 2019

matthewrmshin modified the milestones: cylc-8.0.0, cylc-8.0a1 Apr 9, 2019
