repeatable flaky test #3048

Closed
hjoliver opened this issue Mar 28, 2019 · 13 comments

@hjoliver
Member

hjoliver commented Mar 28, 2019

On current master, in my environment, tests/shutdown/18-client-on-dead-suite.t seems to always pass on its own:

$ cylc test-b -v ./tests/shutdown/18-client-on-dead-suite.t

ok 1 - 18-client-on-dead-suite-validate
ok 2 - 18-client-on-dead-suite-1
ok 3 - 18-client-on-dead-suite-1.stderr-contains-ok
ok 4 - 18-client-on-dead-suite-2
ok 5 - 18-client-on-dead-suite-2.stderr-contains-ok
ok
All tests successful.
Files=1, Tests=5, 12 wallclock secs ( 0.03 usr  0.00 sys +  3.75 cusr  0.49 csys =  4.27 CPU)
Result: PASS

But if I run it with another test, it seems to always fail, like this:

$ export CYLC_TEST_DEBUG=true
$ cylc test-b -v ./tests/special/04-clock-triggered.t \
   ./tests/shutdown/18-client-on-dead-suite.t
===(       4;6  2/5  2/4 )==============================================
18-client-on-dead-suite 18-client-on-dead-suite-1.stderr-contains-ok
Missing lines:
Request returned error: Suite "cylctb-20190328T035314Z/shutdown/18-client-on-dead-suite" already stopped

18-client-on-dead-suite 18-client-on-dead-suite-2.stderr-contains-ok
Missing lines:
Contact info not found for suite "cylctb-20190328T035314Z/shutdown/18-client-on-dead-suite", suite not running?

    stdout and stderr stored in: /tmp/oliverh/cylctb-20190328T035314Z/shutdown/18-client-on-dead-suite
Failed 2/5 subtests 
./tests/special/04-clock-triggered.t ........ 
ok 3 - 04-clock-triggered-run-past
ok 4 - 04-clock-triggered-run-later
ok

Test Summary Report
-------------------
./tests/shutdown/18-client-on-dead-suite.t (Wstat: 0 Tests: 5 Failed: 2)
  Failed tests:  3, 5
Files=2, Tests=9, 29 wallclock secs ( 0.02 usr  0.01 sys +  8.33 cusr  0.98 csys =  9.34 CPU)
Result: FAIL
@hjoliver
Member Author

At first glance (and maybe second glance) I can't see how this test could fail. Tests 3 and 5 simply cylc ping an already-killed suite, and the ping client should print out the expected lines.

@hjoliver
Member Author

(Occasionally 1/5 tests fail when run alone, instead of 0/5; and occasionally 1/5 fail when run with the other test, instead of 2/5 ... so it is "flaky".)

@hjoliver
Member Author

hjoliver commented Mar 28, 2019

Ah, in failing cases, cylc ping returns this (in the ping test stderr file):

Request returned error: Could not decrypt response. Has the passphrase changed?

@hjoliver
Member Author

Is the cylc ping client somehow connecting to the wrong suite?

@kinow
Member

kinow commented Mar 28, 2019

Mentioning issue #2894 here so we have a reference in GitHub, in case it is helpful later 👍

@kinow
Member

kinow commented Mar 28, 2019

I have confirmed the exact same behaviour in my environment with the master branch.

$ uname -a
Linux ranma 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.2 LTS"
$ python --version
Python 3.7.2

@matthewrmshin matthewrmshin added this to the cylc-8.0.0 milestone Mar 28, 2019
@oliver-sanders
Member

oliver-sanders commented Mar 28, 2019

I know what's going on here...

> Is the cylc ping client somehow connecting to the wrong suite?

Yes, reliably every time!

In tests/shutdown/18-client-on-dead-suite.t the suite is killed, leaving behind its contact file. So when cylc ping attempts to connect to the suite later in the test, there is always a risk that a new suite has started up on that port, causing the test to fail with:

Request returned error: Could not decrypt response. Has the passphrase changed?

So this test was, by design, always going to be flaky.

In the new ZMQ implementation the port is not chosen at random: ZMQ picks the lowest available port in the range, so the test has gone from being slightly flaky to reliably flaky.

There is no real reason for picking the port this way; it was just slightly nicer during the debug phase. I think there is a TODO in there somewhere, and there might be a nice way of doing random selection in ZMQ itself.
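
To make the difference between the two strategies concrete, here is a minimal simulation sketch. It does not use ZMQ or any Cylc code: `PORT_RANGE`, `pick_lowest`, `pick_random` and `stale_port_reused` are hypothetical names, and the port range is made up for illustration. Each trial models a suite that takes a port and is killed (leaving a contact file naming that port), followed by a new suite taking a port.

```python
import random

# Hypothetical port range, for illustration only.
PORT_RANGE = range(43001, 43101)


def pick_lowest(in_use):
    """Return the lowest port in PORT_RANGE that is not in use."""
    for port in PORT_RANGE:
        if port not in in_use:
            return port
    raise RuntimeError("no free ports")


def pick_random(in_use, rng):
    """Return a uniformly random port in PORT_RANGE that is not in use."""
    free = [port for port in PORT_RANGE if port not in in_use]
    if not free:
        raise RuntimeError("no free ports")
    return rng.choice(free)


def stale_port_reused(pick, trials):
    """Count trials in which a new suite lands on a killed suite's port."""
    hits = 0
    for _ in range(trials):
        in_use = set()
        # Suite A starts and is then killed: its port is released, but
        # its contact file still names that port.
        stale = pick(in_use)
        # Suite B starts afterwards and picks a port of its own.
        fresh = pick(in_use)
        if fresh == stale:
            hits += 1
    return hits


if __name__ == "__main__":
    rng = random.Random(0)
    print("lowest-available:", stale_port_reused(pick_lowest, 1000))
    print("random:", stale_port_reused(lambda used: pick_random(used, rng), 1000))
```

Under these assumptions the lowest-available strategy reuses the stale port in every trial, whereas random selection reuses it only in roughly one trial per hundred for a 100-port range.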

@oliver-sanders
Member

I guess this is a case where auto-rerunning failed tests isn't always the most helpful thing to do.

@hjoliver
Member Author

Ah, brilliant, it all makes sense. That's a relief, thanks @oliver-sanders 🍺

@hjoliver
Member Author

(I had forgotten you'd switched to sequential port acquisition).

@oliver-sanders
Member

It was just a stopgap I never got rid of.

@oliver-sanders
Member

#3004 will reduce the flakiness of this test to a level proportional to the number of suites divided by the number of ports. Not good, but much better. Is this enough to close the issue for now?
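
As a rough check of the "suites divided by ports" estimate, the sketch below assumes that ports are picked uniformly at random and that each new suite takes a distinct port; the thread implies this but does not spell out what #3004 does. The function names are hypothetical. The closed form is compared against a brute-force count over all possible port assignments.

```python
from fractions import Fraction
from itertools import combinations


def stale_port_hit_probability(new_suites, free_ports):
    """Chance that one specific stale port is taken by any new suite.

    Assumes each of `new_suites` suites takes a distinct port, chosen
    uniformly at random from `free_ports` candidate ports.
    """
    if not 0 <= new_suites <= free_ports:
        raise ValueError("need 0 <= new_suites <= free_ports")
    return Fraction(new_suites, free_ports)


def brute_force_probability(new_suites, free_ports):
    """Same quantity, by enumerating every possible set of taken ports."""
    stale = 0  # label the stale port 0; which label is used does not matter
    subsets = list(combinations(range(free_ports), new_suites))
    hits = sum(1 for taken in subsets if stale in taken)
    return Fraction(hits, len(subsets))


if __name__ == "__main__":
    print(stale_port_hit_probability(3, 10))
    print(brute_force_probability(3, 10))
```

So with, say, one other suite starting in a 100-port range, the test would be expected to hit the stale port about 1% of the time under these assumptions, rather than every time.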

@hjoliver
Member Author

I think that's good enough, with a comment in the test to indicate exactly why it might occasionally fail.
