disturbing but apparently innocuous bug in reporting of ranks working on a job #1400
Looking at the `flux wreck ls` output below, or the corresponding kvs entries, it looks like job 8 is running at least 11 processes, one on each of ranks 0-11, overlapping with job 7. In truth, it's running one process, only on rank 11. Somehow sched is populating the kvs with both the new resource request and all of the resources from the previous job, which is still running. This is even with the release requested resources call put back in. Somehow the result, in terms of executing the thing, still looks correct, but the output is really messed up.

Comments
I'm looking at this part of the code anyway so I can take a look at it. What's the easiest way to reproduce this?
It seems like the easiest way is to run a multi-node job that takes a little while, and submit another job that takes one before the first ends.
Currently `flux wreck ls` is just summarizing the integer keys in `lwj.x.y.ranks.`. A quick fix would be to ignore rank directories that don't have a cores= field. Another side effect of having extra `ranks` dirs is that the presence of those directories determines on which ranks `wrexecd` is launched, so we may have a lot of unnecessary fork/exec going on. This issue reminded me that on my PR branch (#1399) `flux wreck ls` is broken. I still have to add code to parse `R_lite` there to get the `RANKS` field...
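To make the "quick fix" above concrete, here is a minimal sketch of the filtering rule in plain C. It walks a toy in-memory list rather than the real flux KVS, and every name in it is illustrative only, not part of flux-core:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for an lwj.x.y.rank.N directory entry.  In the real tool the
 * listing would come from the KVS; this only illustrates the filter rule:
 * a rank counts only if its directory actually has a cores= key. */
struct rank_entry {
    int rank;
    bool has_cores;   /* true iff rank.N.cores exists */
};

/* Count only ranks that carry a cores= field, ignoring leftover rank
 * directories like the ones job 8 inherited from the previous job. */
static int count_scheduled_ranks (const struct rank_entry *e, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (e[i].has_cores)
            count++;
    return count;
}

int main (void)
{
    /* Job 8 from the report: stale directories for ranks 0 and 5 (among
     * others), real work only on rank 11. */
    struct rank_entry job8[] = { { 0, false }, { 5, false }, { 11, true } };
    printf ("ranks really scheduled: %d\n", count_scheduled_ranks (job8, 3));
    return 0;
}
```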
The disturbing part of this is that, looking at one of the ones that should be running only one process, and ends up running only one, all of the rank.N.cores files are there, and they all have the value “1”.
For example, job 8 above has this in the kvs: […]
erm, oof. That's not expected! 😢
Oh crap... it's not just running one process, it's actually overscheduling them. We only get output from one, somehow, but it runs them all. This is a really, really bad one.
Let me look into this.
Can you dump the whole […]? This should be fixed after merge of #1399, if […]
Unfortunately the instance is dead as of about a minute ago... The lwj.0.0.8.ntasks was 1 though.
Also ncores and nnodes were 1, I had that saved off in history.
Ok. I can use […]
The quick test I'm using is this: […]
Seems to work okay on 4 nodes on quartz. Let me try 10 nodes.
FYI -- user issue and then dental appointment. I will pick this up tonight.
Ok, thanks Dong, trying a vanilla flux to see if it's something in the perf tweaks we were making.
@SteVwonder @dongahn, any idea where an "error fetching job and event" error would be coming from using the most recent master core and sched?
Found it, current sched can't tolerate the lack of a null state yet.
@trws: would it be better if I just merged my flux-sched PR with temporary emulator breakage? @SteVwonder is busy with other stuff at the moment. At least for the 4-node case, that branch worked okay.
That PR should also speed up scheduling performance for high job submission rates.
If that's what you were testing on, that would be much appreciated. I'm looking at not being able to run correctly at all right now.
@dongahn, your PR needs to be rebased before we can hit the merge button.
For this bug to repro, node_exclusive has to be turned on in sched.
Also, @dongahn, the output you sent shows overscheduling a node...
@trws: probably exclusivity isn't turned on?
Actually, I should have seen your comment above.
This is now a complete blocker. It means I can't run anything without overscheduling.
It appears the sched is configured to do [only core-level scheduling](https://github.com/flux-framework/flux-sched/blob/master/sched/sched.c#L441)! I guess you changed this code and turned on node-exclusive scheduling and the error cropped up? If so, I will do the same and reproduce the misbehavior.
I did, if it’s turned on, or even if you just set the number of node resources to request to 1 rather than 0 in the request generation, it goes completely off the deep end. The only workaround I’ve thought of is to use hwloc reload to load a single-core resource description in for each node, so they all pretend to only have one core.
bad!
Using the latest master for both flux-core and sched, with a one-line change in this code setting node exclusive to true, I reproduced the over-scheduling problem. This would be the right behavior if you do core-level scheduling, but definitely incorrect scheduling for exclusive node-level scheduling. I will first see if I can fix this issue for this simple reproducer and then see if I can use the more complex case @trws posted at the beginning of this issue.
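For readers following along, the kind of one-line change being described would look roughly like the sketch below. The types and the `fill_resource_request()` name are stand-ins invented for illustration; they are not the actual contents of the sched.c code referenced above.

```c
#include <stdbool.h>

/* Minimal stand-in types so this compiles on its own; the real scheduler's
 * job and request structures are much richer than this. */
struct resource_request { int nnodes; int ncores; bool node_exclusive; };
struct job { int nnodes; int ncores; struct resource_request req; };

/* Hypothetical sketch of the reproduction step: when the resource request
 * is built, node_exclusive is flipped from false (core-level scheduling,
 * the default on master at the time) to true so whole nodes are granted. */
static void fill_resource_request (struct job *j)
{
    j->req.nnodes = j->nnodes;
    j->req.ncores = j->ncores;
    j->req.node_exclusive = true;   /* the "one-line change" */
}

int main (void)
{
    struct job j = { .nnodes = 1, .ncores = 1 };
    fill_resource_request (&j);
    return j.req.node_exclusive ? 0 : 1;
}
```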
An equally acceptable result for splash would be to fix it so that the ncores, or cores per task, or something, can be used to say […]
Well, I have to take it back. I ran the test again with that one-line change and it seems nodes are exclusively scheduled. The problem I see is that job 6 and job 8 are incorrectly marked as running. And if I do ps, I see […] I will do some more testing for scheduling, though.
Ah, actually […] Those two jobs seem wrong and that may be why these jobs are still marked as running.
No wonder: […]
Still looking. But I have to guess the bug is within […] I looked at the log file for job 6: […] When I dumped the […] Need to go a bit deeper to pinpoint the bug...
A bit difficult to diagnose because of all this recursion, but I think I got it. It seems it is [this code](https://github.com/flux-framework/flux-sched/blob/master/sched/sched_fcfs.c#L231) that is giving trouble for node-exclusive scheduling. When the node request is exclusive, this code shouldn't select the `node` type in this `else` branch. The node-level selection should only be done in the `if` branch.

For quick testing, I added the following conditional:

```diff
index 9e3764f..0dbda87 100644
--- a/sched/sched_fcfs.c
+++ b/sched/sched_fcfs.c
@@ -228,7 +228,8 @@ resrc_tree_t *select_resources (flux_t *h, resrc_api_ctx_t *rsapi,
          * defined. E.g., it might only stipulate a node with 4 cores
          * and omit the intervening socket.
          */
-        selected_tree = resrc_tree_new (selected_parent, resrc);
+        if (strcmp (resrc_type (resrc), "node") != 0)
+            selected_tree = resrc_tree_new (selected_parent, resrc);
         children = resrc_tree_children (found_tree);
         child_tree = resrc_tree_list_first (children);
         while (child_tree) {
```

W/ this, at least my reproducer behaves correctly:

```
quartz16{dahn}38: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux wreck ls
    ID NTASKS STATE               START   RUNTIME RANKS COMMAND
     1      8 exited 2018-03-29T22:53:00  1.002m  [0-7] sleep
     2      7 exited 2018-03-29T22:54:01 10.097s  [0-6] sleep
     3      6 exited 2018-03-29T22:54:11 10.077s  [0-5] sleep
     4      5 exited 2018-03-29T22:54:21 10.081s  [0-4] sleep
     5      4 exited 2018-03-29T22:54:31 10.087s  [0-3] sleep
     6      3 exited 2018-03-29T22:54:31 10.061s  [4-6] sleep
     7      2 exited 2018-03-29T22:54:41 10.067s  [4-5] sleep
     8      1 exited 2018-03-29T22:54:41 10.083s      6 sleep
```

```
quartz16{dahn}39: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.6.rank
lwj.0.0.6.rank.4.cores = 1
lwj.0.0.6.rank.5.cores = 1
lwj.0.0.6.rank.6.cores = 1
quartz16{dahn}40: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.8.rank
lwj.0.0.8.rank.6.cores = 1
```

@trws: I will need more validation, including its effect on core-level scheduling, and also make the code a bit more generic... But it seems worth a shot and maybe can be used for the production run over the weekend...
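One way to make the guard "a bit more generic", as mentioned above, would be to key the skip off whichever resource type the request claims exclusively instead of hard-coding the "node" string. The sketch below is self-contained with toy types; `is_exclusively_requested()` and its fields are assumptions for illustration, not the flux-sched resrc/resrc_reqst API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins: the real code operates on resrc_t and resrc_reqst_t. */
struct resrc { const char *type; };            /* e.g. "node", "core"    */
struct reqst { const char *exclusive_type; };  /* NULL => not exclusive  */

/* Generic form of the quick patch's condition: in the else branch, skip
 * attaching a resource whose type the request holds exclusively, rather
 * than special-casing strcmp (resrc_type (resrc), "node"). */
static bool is_exclusively_requested (const struct resrc *r,
                                      const struct reqst *req)
{
    return req->exclusive_type != NULL
        && strcmp (r->type, req->exclusive_type) == 0;
}

int main (void)
{
    struct resrc node = { "node" }, core = { "core" };
    struct reqst node_excl = { "node" }, core_level = { NULL };

    /* Node-exclusive request: only the node level is skipped.  Core-level
     * request: nothing is skipped, matching the existing behavior. */
    return (is_exclusively_requested (&node, &node_excl)
            && !is_exclusively_requested (&core, &node_excl)
            && !is_exclusively_requested (&node, &core_level)) ? 0 : 1;
}
```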
Well, it is kind of getting late, and thinking about this, the patch logic may not be quite complete. I will spend a bit more time tomorrow morning. But this IS the right bug site.
Wow, heroic effort @dongahn! Nice work! There is a bug in the wreck use of […]
Is that correct output? I can’t tell from when you submitted them, but it looks like jobs 1 and 2 are completely overlapped?
Yes, this is correct. Job 1 ran 1 min and exited. Then job 2 started right after and ran 10 sec. I still haven't had a chance to think about a complete solution though.
Gotcha. That’s definitely an improvement.
Yes, this is a bug cascaded from a bug in the scheduler. I sort of like the first semantics, with a big warning message... But I can see why you may want the second semantics, though.
I understand this problem better now. This is a deficiency within the scheduler for node-exclusive scheduling mode. I can see why this case wasn't covered, as the sched folks have focused on core-level scheduling for our testing coverage and the initial use cases. I think the better place to fix this problem is actually in […] The purpose of this […] But there is a deficiency in the code. When […] But unfortunately, the […] My current patch is: […]

I can't do a PR for this yet because I don't fully understand its impact on backfill schedulers. But for the FCFS scheduler that @trws uses, I think there is a good chance this will work. Tom, you are welcome to test this on your branch if you're interested. I need to work on various milestone reports this afternoon. I'll see if I can circle back and understand its impact on backfill.
I'm trying this out in the splash branch, will see what happens.
Verdict?
It seems to help. We're running it right now, but the predominant mode is actually using the new ncores functionality, which helps a lot. I'm actually getting coscheduling the way they want now.
Great! Please keep this open though. Need to double check this is safe with backfill scheduling before posting a PR.
It turns out this […] One can call […] When I circle around, I will try to patch this as much as possible. But it seems the future really should be the new […] For @trws' purpose, this should work fine though. That is, this commit, flux-framework/flux-sched@2317aaa in flux-sched PR #306.
FYI -- I think I patched this enough so […]
OK. This has been fixed in flux-framework/flux-sched#305.