wreck: small fix for jobs with more nodes in R_lite than tasks #1403

Merged: 8 commits from grondo:rcalc-fixes into flux-framework:master on Mar 31, 2018

Conversation

@grondo (Contributor) commented Mar 30, 2018

This adds an rcalc_total_nnodes_used() call to the wreck/rcalc class, and then uses it to set nnodes internally in wrexecd. This should prevent jobs with larger allocations than tasks from hanging (but this needs testing).
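(For illustration, a minimal sketch of the idea behind such an accessor, assuming a simple array of per-rank task counts produced by a distribution step; the struct and function names below are hypothetical, not the actual rcalc internals in src/modules/wreck/rcalc.c:)

#include <stdio.h>

struct rankinfo {
    int rank;    /* broker rank for this resource entry */
    int ntasks;  /* tasks assigned to this rank after distribution */
};

/* Count only ranks that actually received one or more tasks */
static int total_nodes_used (const struct rankinfo *ranks, int nranks)
{
    int used = 0;
    for (int i = 0; i < nranks; i++)
        if (ranks[i].ntasks > 0)
            used++;
    return used;
}

int main (void)
{
    /* 3 nodes in R_lite but only 2 tasks: one rank gets zero tasks,
     * so only 2 nodes are actually "used".
     */
    struct rankinfo ranks[] = { { 0, 1 }, { 1, 1 }, { 2, 0 } };
    printf ("%d\n", total_nodes_used (ranks, 3)); /* prints 2 */
    return 0;
}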

I also added some R-lite inputs for testing under t/wreck/input, and verification of output in a new t1999-wreck-rcalc.t test.

This isn't ready for merge yet, but is put up as a placeholder.

@coveralls commented Mar 30, 2018

Coverage increased (+0.04%) to 78.866% when pulling 0271344 on grondo:rcalc-fixes into 6c5e47d on flux-framework:master.

@codecov-io commented Mar 30, 2018

Codecov Report

Merging #1403 into master will increase coverage by 0.04%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1403      +/-   ##
==========================================
+ Coverage   78.51%   78.55%   +0.04%     
==========================================
  Files         163      163              
  Lines       29983    29995      +12     
==========================================
+ Hits        23542    23564      +22     
+ Misses       6441     6431      -10
Impacted Files                              Coverage Δ
src/modules/wreck/rcalc.c                   91.48% <100%> (+5.77%) ⬆️
src/modules/wreck/wrexecd.c                 75.77% <100%> (+1.21%) ⬆️
src/modules/connector-local/local.c         72.95% <0%> (-2.87%) ⬇️
src/common/libkvs/kvs_watch.c               90.55% <0%> (-0.86%) ⬇️
src/common/libkvs/kvs_txn.c                 74.71% <0%> (-0.57%) ⬇️
src/common/libflux/message.c                81.13% <0%> (-0.48%) ⬇️
src/bindings/lua/lua-hostlist/hostlist.c    62.95% <0%> (+0.21%) ⬆️
src/cmd/flux-module.c                       85.36% <0%> (+0.3%) ⬆️
src/common/libflux/future.c                 89.25% <0%> (+0.46%) ⬆️
... and 3 more

@grondo (Contributor, Author) commented Mar 30, 2018

Ok, it turns out it would be more difficult than initially anticipated to allow wreck jobs to run when more nodes are assigned to the job than tasks. There are multiple places where the number of local tasks per node is assumed to be non-zero across the parallel job, including nodeid assignment, PMI, etc. These could probably be worked through, but it doesn't seem useful at this time.

Instead, this PR now makes assignment of nnodes > ntasks a fatal error:

 $ flux wreckrun -n2 -P "for i=1,3 do lwj['rank.'..i..'.cores'] = 1 end; lwj.R_lite = nil" hostname
2018-03-30T23:37:58.787929Z lwj.1.emerg[1]: nnodes assigned to job (3) greater than ntasks (2)!
2018-03-30T23:37:58.787929Z lwj.1.emerg[1]: nnodes assigned to job (3) greater than ntasks (2)!
2018-03-30T23:37:58.792215Z lwj.1.emerg[2]: nnodes assigned to job (3) greater than ntasks (2)!
2018-03-30T23:37:58.792215Z lwj.1.emerg[2]: nnodes assigned to job (3) greater than ntasks (2)!
nnodes assigned to job (3) greater than ntasks (2)!
nnodes assigned to job (3) greater than ntasks (2)!
wreckrun: job 1 failed
$ flux wreck ls
    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND
     1      2 failed     2018-03-30T23:38:12       0.000s    [1-3] hostname
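(The guard itself is simple in spirit. Below is an illustrative, self-contained sketch of such a check, not the actual wrexecd patch; the real code reports the error through wrexecd's logging and job-state machinery:)

#include <stdio.h>
#include <stdlib.h>

/* Illustrative guard: refuse to run when more nodes were assigned
 * to the job than there are tasks to place on them.
 */
static void check_node_task_counts (int nnodes, int ntasks)
{
    if (nnodes > ntasks) {
        fprintf (stderr,
                 "nnodes assigned to job (%d) greater than ntasks (%d)!\n",
                 nnodes, ntasks);
        exit (1);
    }
}

int main (void)
{
    check_node_task_counts (3, 2); /* matches the failing example above */
    return 0;
}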

@garlick (Member) commented Mar 30, 2018

This sounds reasonable to me.

Should 1042efe reference #1400 (if not fix it)?

@grondo (Contributor, Author) commented Mar 30, 2018

Hm, #1400 should actually be renamed; I think it is actually a flux-sched bug. It could reference it, though. What's the suggested format for that?

@grondo (Contributor, Author) commented Mar 30, 2018

Actually it is 2d55622 that will make ntasks < nnodes a fatal error now.

grondo added 7 commits March 30, 2018 18:59

- Add a function to return the total number of nodes that have tasks assigned after rcalc_distribute().
- Use rcalc_total_nodes_use() in t/wreck/rcalc to return the total number of *used* nodes in output for testing purposes.
- Added a set of "R_lite" inputs and expected outputs for the rcalc utility, and a test to read in inputs, generate outputs, and check the results.
- In some cases a call to `wlog_fatal` would cause jobs to get stuck in `reserved` or `starting` state, especially if rank 0 wrexecd exited with a failure. Try harder in this function to update the job state to "failed" before rank 0 wrexecd exits on error (see the sketch after this list).
- Issue a fatal error in wrexecd if nnodes > ntasks. This case is not handled correctly in wrexecd, and it is deemed unimportant to fix now; terminating the job with a failure is a better solution than a hang or inconsistent state.
- Add a test to ensure that a job assigned more nodes than tasks fails, instead of hanging or running more tasks than requested.
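(A rough, self-contained sketch of the "try harder" idea in the wlog_fatal change; update_job_state() below is a hypothetical stand-in for the KVS update wrexecd actually performs:)

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for wrexecd's real KVS-based state update;
 * returns < 0 on failure.
 */
static int update_job_state (const char *state)
{
    fprintf (stderr, "job state -> %s\n", state);
    return 0;
}

/* Sketch of a fatal-error path that attempts to record the "failed"
 * state before exiting, so the job is not left stuck in reserved or
 * starting state when rank 0 exits with an error.
 */
static void wlog_fatal_sketch (const char *fmt, ...)
{
    va_list ap;
    va_start (ap, fmt);
    vfprintf (stderr, fmt, ap);
    va_end (ap);

    /* Best effort: a failed update must not block the exit path */
    if (update_job_state ("failed") < 0)
        fprintf (stderr, "unable to update job state\n");
    exit (1);
}

int main (void)
{
    wlog_fatal_sketch ("nnodes assigned to job (%d) greater than ntasks (%d)!\n",
                       3, 2);
    return 0;
}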
@dongahn (Member) commented Mar 31, 2018

Yeah, #1400 is definitely a sched bug. And making this condition a fatal error is reasonable semantics.

@grondo (Contributor, Author) commented Mar 31, 2018

The new t1999-wreck-rcalc.t is aborting (exit 1), but only in the gcc-4.9 builder, with no captured output as to why. I'll have to keep iterating to figure that one out (surely a dumb error in the script)

grondo added 1 commit:

- Apparently the rcalc tests called in a loop will not be compatible with chain-lint, so disable these tests for now under --chain-lint.
@garlick merged commit 1e7d621 into flux-framework:master on Mar 31, 2018
@grondo deleted the rcalc-fixes branch April 26, 2018 16:24