spurious hydra test failures in travis #1169

Closed
grondo opened this issue Aug 29, 2017 · 5 comments

Comments

@grondo
Contributor

grondo commented Aug 29, 2017

not ok 5 - Flux libpmi-client wire protocol works with Hydra

Saw the failure above in a Travis test run on master.

Build log

I didn't see any clues in the log.

@garlick
Member

garlick commented Aug 31, 2017

This just failed in travis:

Hydra sets PMI_RANK to unique value
expecting success: 
	test `mpiexec.hydra -n 4 printenv PMI_SIZE | sort | uniq | wc -l` -eq 1
not ok 4 - Hydra sets PMI_SIZE to uniform value

Nothing much in the logs here either, although the config.log did contain a write error:

conftest.c:64:25: error: duplicate case value '0'
switch (0) case 0: case (sizeof (long long) == 4):;
                        ^
conftest.c:64:17: note: previous case defined here
switch (0) case 0: case (sizeof (long long) == 4):;
                ^
1 error generated.
configure:13591: $? = 1
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "flux-core"
| #define PACKAGE_TARNAME "flux-core"cat: write error: Resource temporarily unavailable

@garlick garlick changed the title not ok 5 - Flux libpmi-client wire protocol works with Hydra spurious hydra test failures in travis Aug 31, 2017
grondo added a commit to grondo/flux-core that referenced this issue Oct 20, 2017
Add more debug output to the hydra tests to attempt to capture data
useful for resolving flux-framework#1169.
@garlick
Member

garlick commented Oct 13, 2018

I hit the PMI_RANK test failure on my desktop and noted the output file was empty. This makes me wonder if mpiexec.hydra is exiting without flushing its output, and the very short test with very little output is managing to get cut off before it can produce anything?

Two mpiexec.hydra options might be interesting to play with:

-outfile-pattern                 direct stdout to file
-launcher                        launcher to use (ssh rsh fork slurm ll lsf sge manual persist)

If buffered I/O is used, redirecting to a file might result in an fclose(file) which would implicitly flush, whereas stdout might be left open?

The default launcher is ssh, clearly not necessary in the test case. I don't know if switching it to fork would be likely to help this problem, but it does seem like it would eliminate another location where I/O is buffered (in ssh or sshd).

I should mention I am hitting this on Ubuntu 18.04.1 LTS with mpich 3.3~a2-4.

garlick added a commit to garlick/flux-core that referenced this issue Oct 13, 2018
Problem: occasionally mpiexec output from spawned tasks
is lost, causing test to fail sporadically.

Try adding the "-launcher fork" option.  This overrides
the default launcher, which is "ssh".

Maybe this will fix flux-framework#1169
garlick added a commit to garlick/flux-core that referenced this issue Oct 13, 2018
Problem: occasionally mpiexec output from spawned tasks
is lost, causing tests to fail sporadically.

Instead of redirecting stdout, use the mpiexec -outfile
option to let mpiexec redirect the output internally.

Maybe this will fix flux-framework#1169
@garlick
Member

garlick commented Oct 16, 2018

I've seen that failure again even with the proposed mpiexec options, so I'll drop those suggested fixes from my PR.

@SteVwonder
Member

SteVwonder commented Nov 5, 2019

I am seeing this failure on "Hydra sets PMI_RANK to unique value" too. My out2 was empty as well. It seems to occur only when I run make -j check.

Running on Ubuntu 19.10 with mpich 3.3-3.

EDIT: Occasionally seeing an error on 1 - Hydra runs hello world

→ less t2004-hydra.log 
/usr/bin/mpiexec.hydra
not ok 1 - Hydra runs hello world
FAIL: t2004-hydra.t 1 - Hydra runs hello world
#       
#               mpiexec.hydra -n 4 echo "Hello World"
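
One way to put a number on how often the output goes missing is to rerun the command in a loop and count the runs that leave an empty file. A minimal sketch, assuming a POSIX shell; `count_empty_runs` is a hypothetical helper, not something in the flux-core test suite, and it mirrors the tests by redirecting stdout to a file:

```shell
# count_empty_runs N CMD [ARGS...]
# Hypothetical helper: run CMD N times with stdout redirected to a
# temporary file, and print how many runs left that file empty.
count_empty_runs() {
    runs=$1
    shift
    tmp=$(mktemp) || return 1
    empty=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # The exit status is ignored on purpose; only the output matters here.
        "$@" > "$tmp" 2>/dev/null || :
        # -s is true only when the file exists and is non-empty.
        [ -s "$tmp" ] || empty=$((empty + 1))
        i=$((i + 1))
    done
    rm -f "$tmp"
    echo "$empty"
}
```

For example, `count_empty_runs 100 mpiexec.hydra -n 4 printenv PMI_RANK`, run once on an idle machine and once alongside a parallel build, might show whether load makes the empty output more likely.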

@garlick
Member

garlick commented Nov 5, 2019

We should just nix these tests that most commonly fail:

diff --git a/t/t2004-hydra.t b/t/t2004-hydra.t
index 276bfa5e9..77eacb892 100755
--- a/t/t2004-hydra.t
+++ b/t/t2004-hydra.t
@@ -17,26 +17,6 @@ test_expect_success 'Hydra runs hello world' '
        mpiexec.hydra -n 4 echo "Hello World"
 '
 
-count_uniq_lines() { sort $1 | uniq | wc -l; }
-
-test_expect_success 'Hydra sets PMI_FD to unique value' '
-       mpiexec.hydra -n 4 printenv PMI_FD > out &&
-       test_debug "cat out" &&
-       test $(count_uniq_lines out) -eq 4
-'
-
-test_expect_success 'Hydra sets PMI_RANK to unique value' '
-       mpiexec.hydra -n 4 printenv PMI_RANK > out2 &&
-       test_debug "cat out2" &&
-       test $(count_uniq_lines out2) -eq 4
-'
-
-test_expect_success 'Hydra sets PMI_SIZE to uniform value' '
-       mpiexec.hydra -n 4 printenv PMI_SIZE > out3 &&
-       test_debug "cat out3" &&
-       test $(count_uniq_lines out3) -eq 1
-'
-
 test_expect_success 'Flux libpmi-client wire protocol works with Hydra' '
        mpiexec.hydra -n 4 ${PMI_INFO}
 '

They are not accomplishing much for us.
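
For reference, the three dropped tests all depend on `count_uniq_lines`, so an empty output file is enough to fail any of them. A small sketch with made-up sample files (the file names and contents here are illustrative assumptions, not captured Hydra output):

```shell
# Same helper as in t2004-hydra.t, with the argument quoted.
count_uniq_lines() { sort "$1" | uniq | wc -l; }

# Scratch directory for the illustrative sample files; removed on exit.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

printf '0\n1\n2\n3\n' > "$dir/ranks"  # four ranks, four distinct PMI_RANK values
printf '4\n4\n4\n4\n' > "$dir/size"   # PMI_SIZE is the same on every rank
: > "$dir/empty"                      # the empty output file seen in the failures

# The tests expect 4 unique lines for PMI_RANK and 1 for PMI_SIZE.
test "$(count_uniq_lines "$dir/ranks")" -eq 4 && echo "ranks: ok"
test "$(count_uniq_lines "$dir/size")" -eq 1 && echo "size: ok"

# An empty file has no lines at all, so neither expectation can hold.
test "$(count_uniq_lines "$dir/empty")" -eq 4 || echo "empty: fails the unique-value check"
test "$(count_uniq_lines "$dir/empty")" -eq 1 || echo "empty: fails the uniform-value check"
```

So a failure in these tests says nothing about the PMI values themselves when the captured file is empty; it only shows that the output never reached the file.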

garlick added a commit to garlick/flux-core that referenced this issue Dec 23, 2019
Problem: hydra tests fail occasionally

As noted in flux-framework#1169, some versions of hydra might have a problem
capturing stdio, which makes these tests unreliable.  Drop the
tests that are just verifying hydra's PMI behavior, and make the
ones that remain not dependent on stdio.

Fixes flux-framework#1169