wreck: small fix for jobs with more nodes in R_lite than tasks #1403

Merged: 8 commits from grondo:rcalc-fixes into flux-framework:master on Mar 31, 2018

Conversation

@grondo (Contributor) commented Mar 30, 2018

This adds an rcalc_total_nnodes_used() call to the wreck/rcalc class, and then uses it to set nnodes internally in wrexecd. This should prevent jobs with larger allocations than tasks from hanging (but this needs testing).
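(For illustration, a minimal sketch of the idea behind such an accessor, assuming a simple array of per-rank task counts produced by a distribution step; the struct and function names below are hypothetical, not the actual rcalc internals in src/modules/wreck/rcalc.c:)

#include <stdio.h>

struct rankinfo {
    int rank;    /* broker rank for this resource entry */
    int ntasks;  /* tasks assigned to this rank after distribution */
};

/* Count only ranks that actually received one or more tasks */
static int total_nodes_used (const struct rankinfo *ranks, int nranks)
{
    int used = 0;
    for (int i = 0; i < nranks; i++)
        if (ranks[i].ntasks > 0)
            used++;
    return used;
}

int main (void)
{
    /* 3 nodes in R_lite but only 2 tasks: one rank gets zero tasks,
     * so only 2 nodes are actually "used".
     */
    struct rankinfo ranks[] = { { 0, 1 }, { 1, 1 }, { 2, 0 } };
    printf ("%d\n", total_nodes_used (ranks, 3)); /* prints 2 */
    return 0;
}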

I also added some R-lite inputs for testing under t/wreck/input, and verification of output in a new t1999-wreck-rcalc.t test.

This isn't ready for merge yet, but is put up as a placeholder.

@coveralls commented Mar 30, 2018

Coverage increased (+0.04%) to 78.866% when pulling 0271344 on grondo:rcalc-fixes into 6c5e47d on flux-framework:master.

@codecov-io commented Mar 30, 2018

Codecov Report

Merging #1403 into master will increase coverage by 0.04%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1403      +/-   ##
==========================================
+ Coverage   78.51%   78.55%   +0.04%     
==========================================
  Files         163      163              
  Lines       29983    29995      +12     
==========================================
+ Hits        23542    23564      +22     
+ Misses       6441     6431      -10
Impacted Files                              Coverage Δ
src/modules/wreck/rcalc.c                   91.48% <100%> (+5.77%) ⬆️
src/modules/wreck/wrexecd.c                 75.77% <100%> (+1.21%) ⬆️
src/modules/connector-local/local.c         72.95% <0%> (-2.87%) ⬇️
src/common/libkvs/kvs_watch.c               90.55% <0%> (-0.86%) ⬇️
src/common/libkvs/kvs_txn.c                 74.71% <0%> (-0.57%) ⬇️
src/common/libflux/message.c                81.13% <0%> (-0.48%) ⬇️
src/bindings/lua/lua-hostlist/hostlist.c    62.95% <0%> (+0.21%) ⬆️
src/cmd/flux-module.c                       85.36% <0%> (+0.3%) ⬆️
src/common/libflux/future.c                 89.25% <0%> (+0.46%) ⬆️
... and 3 more

@grondo (Contributor, Author) commented Mar 30, 2018

Ok, it turns out it would be more difficult than initially anticipated to allow wreck jobs to run when more nodes are assigned to the job than tasks. There are multiple places where the number of local tasks per node is assumed to be non-zero across the parallel job, including nodeid assignment, PMI, etc. These could probably be worked through, but it doesn't seem useful at this time.

Instead, this PR now makes assignment of nnodes > ntasks a fatal error:

 $ flux wreckrun -n2 -P "for i=1,3 do lwj['rank.'..i..'.cores'] = 1 end; lwj.R_lite = nil" hostname
2018-03-30T23:37:58.787929Z lwj.1.emerg[1]: nnodes assigned to job (3) greater than ntasks (2)!
2018-03-30T23:37:58.787929Z lwj.1.emerg[1]: nnodes assigned to job (3) greater than ntasks (2)!
2018-03-30T23:37:58.792215Z lwj.1.emerg[2]: nnodes assigned to job (3) greater than ntasks (2)!
2018-03-30T23:37:58.792215Z lwj.1.emerg[2]: nnodes assigned to job (3) greater than ntasks (2)!
nnodes assigned to job (3) greater than ntasks (2)!
nnodes assigned to job (3) greater than ntasks (2)!
wreckrun: job 1 failed
$ flux wreck ls
    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND
     1      2 failed     2018-03-30T23:38:12       0.000s    [1-3] hostname
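(The guard itself is simple in spirit. Below is an illustrative, self-contained sketch of such a check, not the actual wrexecd patch; the real code reports the error through wrexecd's logging and job-state machinery:)

#include <stdio.h>
#include <stdlib.h>

/* Illustrative guard: refuse to run when more nodes were assigned
 * to the job than there are tasks to place on them.
 */
static void check_node_task_counts (int nnodes, int ntasks)
{
    if (nnodes > ntasks) {
        fprintf (stderr,
                 "nnodes assigned to job (%d) greater than ntasks (%d)!\n",
                 nnodes, ntasks);
        exit (1);
    }
}

int main (void)
{
    check_node_task_counts (3, 2); /* matches the failing example above */
    return 0;
}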

@garlick (Member) commented Mar 30, 2018

This sounds reasonable to me.

Should 1042efe reference #1400 (if not fix it)?

@grondo (Contributor, Author) commented Mar 30, 2018

Hm, #1400 should actually be renamed; I think it is actually a flux-sched bug. It could reference it, though. What's the suggested format for that?

@grondo (Contributor, Author) commented Mar 30, 2018

Actually it is 2d55622 that will make ntasks < nnodes a fatal error now.

grondo added 7 commits March 30, 2018 18:59

- Add a function to return the total number of nodes that have tasks assigned after rcalc_distribute().
- Use rcalc_total_nodes_use() in t/wreck/rcalc to return the total number of *used* nodes in output for testing purposes.
- Added a set of "R_lite" inputs and expected outputs for the rcalc utility, and a test to read in inputs, generate outputs, and check the results.
- In some cases a call to `wlog_fatal` would cause jobs to get stuck in `reserved` or `starting` state, especially if rank 0 wrexecd exited with a failure. Try harder in this function to update the job state to "failed" before rank 0 wrexecd exits on error (see the sketch after this list).
- Issue a fatal error in wrexecd if nnodes > ntasks. This case is not handled correctly in wrexecd, and it is deemed unimportant to fix now; terminating the job with a failure is a better solution than a hang or inconsistent state.
- Add a test to ensure that a job assigned more nodes than tasks fails, instead of hanging or running more tasks than requested.
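(A rough, self-contained sketch of the "try harder" idea in the wlog_fatal change; update_job_state() below is a hypothetical stand-in for the KVS update wrexecd actually performs:)

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for wrexecd's real KVS-based state update;
 * returns < 0 on failure.
 */
static int update_job_state (const char *state)
{
    fprintf (stderr, "job state -> %s\n", state);
    return 0;
}

/* Sketch of a fatal-error path that attempts to record the "failed"
 * state before exiting, so the job is not left stuck in reserved or
 * starting state when rank 0 exits with an error.
 */
static void wlog_fatal_sketch (const char *fmt, ...)
{
    va_list ap;
    va_start (ap, fmt);
    vfprintf (stderr, fmt, ap);
    va_end (ap);

    /* Best effort: a failed update must not block the exit path */
    if (update_job_state ("failed") < 0)
        fprintf (stderr, "unable to update job state\n");
    exit (1);
}

int main (void)
{
    wlog_fatal_sketch ("nnodes assigned to job (%d) greater than ntasks (%d)!\n",
                       3, 2);
    return 0;
}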
@dongahn (Member) commented Mar 31, 2018

Yeah, #1400 is definitely a sched bug. And making this condition a fatal error is reasonable semantics.

@grondo (Contributor, Author) commented Mar 31, 2018

The new t1999-wreck-rcalc.t is aborting (exit 1), but only in the gcc-4.9 builder, with no captured output as to why. I'll have to keep iterating to figure that one out (surely a dumb error in the script)

grondo added 1 commit:

- Apparently the rcalc tests called in a loop will not be compatible with chain-lint, so disable these tests for now under --chain-lint.
@garlick merged commit 1e7d621 into flux-framework:master on Mar 31, 2018
@grondo deleted the rcalc-fixes branch April 26, 2018 16:24