Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sharness: update flux-sharness.sh script #316

Merged
merged 6 commits into from
Apr 19, 2018

Conversation

garlick
Copy link
Member

@garlick garlick commented Apr 18, 2018

This PR syncs t/sharness.d/flux-sharness.sh with the version in flux-core.

The flux-broker --quiet option was removed recently flux-core, but sched's version of flux-sharness.sh was still using it, resulting in sadness. Rather than making a small change to sched's verison, I copied over core's version including some other changes.

One such change is a test that any modules loaded within a sharness script are unloaded when the script exits. This necessitated trivial additions to some tests.

garlick added 2 commits April 18, 2018 15:18
Problem: newer flux-sharness.sh script fails a test
if it exits with extra modules loaded.

Amend tests to ensure that any modules loaded
in the test are unloaded at the end of the test.
Problem: test_under_flux() conditionally
starts the broker with the -q option,
but that option is no longer valid.

N.B. flux-broker -q,--quiet was dropped in
PR flux-framework/flux-core#1464.

flux-sharness.sh has lagged behind the
one in flux-core.  Copy over the
flux-core version.
@dongahn
Copy link
Member

dongahn commented Apr 18, 2018

LGTM. I will merge it when CI reports back. Thanks @garlick!

@coveralls
Copy link

coveralls commented Apr 18, 2018

Coverage Status

Coverage increased (+0.08%) to 75.753% when pulling e92958e on garlick:sharness_update into 38abd39 on flux-framework:master.

@dongahn
Copy link
Member

dongahn commented Apr 18, 2018

Not immediately obvious why these failed on Travis.

t1001-rs2rank-basic

expecting success: 
    adjust_session_info 4 &&
    flux module remove sched &&
    flux hwloc reload ${excl_4N4B_sierra} &&
    flux module load sched sched-once=true &&
    timed_wait_job 5 &&
    submit_1N_nproc_sleep_jobs ${excl_4N4B_nc_sierra} 0 &&
    timed_sync_wait_job 10 &&
    verify_1N_nproc_sleep_jobs ${excl_4N4B_nc_sierra}
flux-hwloc: flux_mrpc_get: Protocol error
flux-module: flux_open: Connection refused
not ok 7 - rs2rank: can handle sierra nodes with group type
expecting success: 
    flux module remove sched
flux-module: flux_open: Connection refused
not ok 8 - rs2rank: unloaded sched module


flux-module: flux_open: Connection refused
not ok 8 - rs2rank: unloaded sched module

@grondo
Copy link
Contributor

grondo commented Apr 18, 2018

Kind of looks like one or more brokers died while realoading the hwloc file. The last entry in the broker logfile is

2018-04-18T22:34:42.667446Z resource-hwloc.info[0]: loading hwloc from /home/travis/build/flux-framework/flux-sched/flux-sched-0.4.0-107-g2cc9b23/t/data/hwloc-data/004N/exclusive/04-brokers-sierra/0.xml

unfortunately there's no way to gather coredumps or other fatal information on Travis right now, so this is all we have to go on...

@grondo
Copy link
Contributor

grondo commented Apr 18, 2018

Another builder didn't make it that far (just as far as unloading sched module). I also noticed this:

ERROR: t1001-rs2rank-basic.t - exited with status 137 (terminated by signal 9?)

Maybe hit by the oom-killer?

@dongahn
Copy link
Member

dongahn commented Apr 18, 2018

Hmmm. That's a new test I added for Sierra but it was okay when it was committed. What are some of the recent changes in flux-core?

@garlick
Copy link
Member Author

garlick commented Apr 18, 2018

There is this one in resource-hwloc:
flux-framework/flux-core@ff5f4d0

@grondo
Copy link
Contributor

grondo commented Apr 18, 2018

There is this one in resource-hwloc:
flux-framework/flux-core@ff5f4d0

Yeah, flux-sched nightly build started failing about 6 days ago in the same way described here, and the commit you reference @garlick went into flux-core@master ... 6 days ago.

@garlick
Copy link
Member Author

garlick commented Apr 19, 2018

Travis has libhwloc5_1.8-1 which is listed as "ancient" on the hwloc site.

When I look at the changelog between there and say libhwloc5_1.11.2-3 which appears to work, waugh.... That software has a lot of churn!

Should we build a newer one in travis perhaps? I could tack that onto this PR...

@grondo
Copy link
Contributor

grondo commented Apr 19, 2018

Should we build a newer one in travis perhaps? I could tack that onto this PR...

Are we using a newer libhwloc in flux-core? It makes sense to use a relatively recent libhwloc, but wondering what changed as of 6 days ago that the test started failing.

@garlick
Copy link
Member Author

garlick commented Apr 19, 2018

Oh, we are building 1.11.0 in travis for flux-core.

We are just installing the default package in flux-sched.

My thought was maybe the flag changes in that commit somehow made the old hwloc sad in combination with the test inputs that are failing...

@grondo
Copy link
Contributor

grondo commented Apr 19, 2018

Yeah, we should match what we're doing in flux-core. That just makes sense, and your intuition seems good to me!

@garlick
Copy link
Member Author

garlick commented Apr 19, 2018

Oh hold on, sched's .travis.yml is grabbing flux-core's travis-dep-builder.sh and running it.

Maybe I just need to remove the hwloc package from packages:....

Problem: sched's .travis.yml is both installing
the hwloc package and building it from source
via flux-core's travis-dep-builder.sh script.

Drop the packaged hwloc from .travis.yml.
@codecov-io
Copy link

codecov-io commented Apr 19, 2018

Codecov Report

Merging #316 into master will increase coverage by 0.07%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #316      +/-   ##
==========================================
+ Coverage   73.98%   74.06%   +0.07%     
==========================================
  Files          49       49              
  Lines        9511     9511              
==========================================
+ Hits         7037     7044       +7     
+ Misses       2474     2467       -7
Impacted Files Coverage Δ
simulator/simsrv.c 77.66% <0%> (+1.01%) ⬆️
simulator/sim_execsrv.c 85.78% <0%> (+1.01%) ⬆️
simulator/submitsrv.c 79% <0%> (+1.5%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 38abd39...e92958e. Read the comment docs.

@grondo
Copy link
Contributor

grondo commented Apr 19, 2018

Hit this in c9.io, here's some extra output I saw on the console

flux-broker: topology.c:924: hwloc___insert_object_by_cpuset: Assertion `topology->ignored_types[HWLOC_OBJ_GROUP] == HWLOC_IGNORE_TYPE_KEEP_STRUCTURE' failed.

This with hwloc-1.11.0.

I updated to hwloc-1.11.1 and the problem goes away. shrug

Probably

+ Fix an overzealous assertion when inserting an intermediate Group object
    while Groups are totally ignored.

See hwloc 1.11.1 NEWS.

@dongahn
Copy link
Member

dongahn commented Apr 19, 2018

Yeah, HWLOC_OBJ_GROUP is a new type that Sierra hwloc xml brings to the table.

@grondo
Copy link
Contributor

grondo commented Apr 19, 2018 via email

@dongahn
Copy link
Member

dongahn commented Apr 19, 2018

That makes sense to me. (https://github.com/flux-framework/flux-sched/blob/master/configure.ac#L45)

Also somehow sched's README.md misses the hwloc dependency and this probably should be documented at here

garlick added a commit to garlick/flux-core that referenced this pull request Apr 19, 2018
Problem: hwloc-1.11.0 asserts when ingesting sierra
hwloc XML.

hwloc-1.11.1 contains a fix for this so bump
the minimum version in configure.ac.

This was discussed in flux-framework/flux-sched#316.
garlick added 2 commits April 19, 2018 09:36
Problem: hwloc-1.11.0 asserts when ingesting sierra
hwloc XML.

hwloc-1.11.1 contains a fix for this so set the
minimum version to this in configure.ac.
@garlick
Copy link
Member Author

garlick commented Apr 19, 2018

Nice work guys! I like it when I leave work with a mystery, and it's solved by the time I check back in again!

@grondo
Copy link
Contributor

grondo commented Apr 19, 2018

Restarted build after merging flux-framework/flux-core#1478

@grondo
Copy link
Contributor

grondo commented Apr 19, 2018

Looks like t0003-basic-install.t needs the @garlick module cleanup treatment.

Found ./t/t0003-basic-install.output
expecting success: 
	flux module remove sched &&
	flux module load sched
flux-module: cmb.rmmod[0] sched: No such file or directory
ok 1 - sched: module remove/load works with installed sched
sched: module remove/load works with installed sched
# passed all 1 test(s)
1..1
Error: manually loaded module(s) not unloaded: sched

@grondo
Copy link
Contributor

grondo commented Apr 19, 2018

@dongahn, one of the builders failed here

FAIL: t1007-exclude.t 5 - excluding a node with reservations works

It is difficult to tell exactly why it failed:

expecting success: 
    adjust_session_info 4 &&
    flux module remove sched &&
    flux module load sched sched-once=true node-excl=true plugin=sched.backfill &&
    timed_wait_job 5 submitted &&
    flux submit -N 4 sleep 0 &&
    flux submit -N 4 sleep 0 &&
    flux submit -N 4 sleep 0 &&
    flux submit -N 4 sleep 0 &&
    timed_sync_wait_job 10 &&
    flux wreck exclude -k cab1235 &&
    state=$(flux kvs get -j $(job_kvs_path 7).state) &&
    test ${state} = "complete"
submit: Submitted jobid 7
submit: Submitted jobid 8
submit: Submitted jobid 9
flux-waitjob: waitjob_cb: completion notified
submit: Submitted jobid 10
not ok 5 - excluding a node with reservations works

Is this just a race condition when jobs complete before the test script reaches timed_sync_wait_job?

@garlick
Copy link
Member Author

garlick commented Apr 19, 2018

Hmm, same failure after I tacked on that commit.

@grondo
Copy link
Contributor

grondo commented Apr 19, 2018

Hmm, same failure after I tacked on that commit.

I think there is a race in the t1007-exclude.t test. You did fix the t0003-basic-install.t test. Likely if we reran the one failed builder the test would (eventually) pass. Or I can open an issue and we can merge this PR since it does fix broken flux-sched against flux-core master (Sorry about that!)

@grondo
Copy link
Contributor

grondo commented Apr 19, 2018

I think there is a race in the t1007-exclude.t test.

Hm, maybe not a race because I couldn't trigger it with an introduced sleep. I'm not sure what is going on, but since this test didn't fail in all builders it isn't a reproducible failure. I think @dongahn will have to comment.

@garlick
Copy link
Member Author

garlick commented Apr 19, 2018

I opened #317 for the test failure and restarted that builder. Maybe we can merge this since the failure is (likely) unrelated to this PR?

@grondo grondo merged commit a6e03c2 into flux-framework:master Apr 19, 2018
@grondo
Copy link
Contributor

grondo commented Apr 19, 2018

Works for me. Thanks!

@dongahn
Copy link
Member

dongahn commented Apr 19, 2018

Hm, maybe not a race because I couldn't trigger it with an introduced sleep. I'm not sure what is going on, but since this test didn't fail in all builders it isn't a reproducible failure. I think @dongahn will have to comment.

Yes, I believe there is a race. I know how to fix this. Why don't you disable this case and merge this in and I will get to this either tonight or tomorrow.

@grondo
Copy link
Contributor

grondo commented Apr 19, 2018

We went ahead and merged since the test only fails intermittently.

@dongahn
Copy link
Member

dongahn commented Apr 19, 2018

Thanks.

@grondo grondo mentioned this pull request May 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants