disturbing but apparently innocuous bug in reporting of ranks working on a job #1400

Closed
trws opened this issue Mar 29, 2018 · 51 comments

@trws
Member

trws commented Mar 29, 2018

Looking at the flux wreck ls output below, or the corresponding kvs entries, it looks like job 8 is running at least 11 processes, one on each of ranks 1-11, overlapping with job 7. In truth, it's running one process, only on rank 11. Somehow sched is populating the kvs with both the new resource request and all of the resources from the previous job, which is still running. This happens even with the release-requested-resources call put back in. Somehow the result, in terms of actually executing the job, still looks correct, but the reported ranks are really messed up.

splash:test_hycop_20180329-130523$ flux wreck ls
    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND
     1      1 complete   2018-03-29T11:15:11       0.141s        1 hostname
     2     10 complete   2018-03-29T11:18:03       0.400s   [1-10] flux
     3      1 complete   2018-03-29T11:18:03       2.754s        1 run-ddcmd.flu
     4     10 complete   2018-03-29T11:37:25      54.472s   [1-10] flux
     5      1 complete   2018-03-29T11:37:25       3.045s   [1-11] run-ddcmd.flu
     6     10 complete   2018-03-29T11:49:28       0.152s   [1-10] hostname
     7     10 complete   2018-03-29T12:01:16      56.422s   [1-10] flux
     8      1 complete   2018-03-29T12:01:16       3.095s   [1-11] run-ddcmd.flu
     9     10 complete   2018-03-29T12:12:52      51.523s   [1-10] flux
    10      1 complete   2018-03-29T12:12:52       3.154s   [1-11] run-ddcmd.flu
    11     10 complete   2018-03-29T12:20:03      10.520m   [1-10] flux
    12      1 complete   2018-03-29T12:20:03       3.103s   [1-11] run-ddcmd.flu
    13     10 complete   2018-03-29T12:39:59      24.765m   [1-10] flux
    14      1 complete   2018-03-29T12:39:59      24.885m   [1-11] run-ddcmd.flu
    15      5 complete   2018-03-29T13:03:38       0.192s    [1-5] hostname
    16     15 complete   2018-03-29T13:03:43       0.296s   [1-15] hostname
    17     10 running    2018-03-29T13:05:25       8.380m   [1-10] flux
    18      1 running    2018-03-29T13:05:25       8.375m   [1-11] run-ddcmd.flu
    19     10 complete   2018-03-29T13:10:05       1.003m   [1-21] sleep
    20      1 complete   2018-03-29T13:10:12       0.258s   [1-22] hostname
@dongahn
Member

dongahn commented Mar 29, 2018

> Looking at the flux wreck ls output below, or the corresponding kvs entries, it looks like job 8 is running at least 11 processes, one on each of ranks 1-11, overlapping with job 7. In truth, it's running one process, only on rank 11. Somehow sched is populating the kvs with both the new resource request and all of the resources from the previous job, which is still running. This happens even with the release-requested-resources call put back in. Somehow the result, in terms of actually executing the job, still looks correct, but the reported ranks are really messed up.

I'm looking at this part of the code anyway so I can take a look at it. What's the easiest way to reproduce this?

@trws
Member Author

trws commented Mar 29, 2018 via email

@grondo
Contributor

grondo commented Mar 29, 2018

Currently flux wreck ls is just summarizing the integer keys in lwj.x.y.ranks..

A quick fix would be to ignore rank directories that don't have a cores= field. Another side effect of having extra rank dirs is that the presence of those directories determines on which ranks wrexecd is launched, so we may have a lot of unnecessary fork/execs going on.

This issue reminded me that on my PR branch (#1399) flux wreck ls is broken. I still have to add code to parse R_lite there to get the RANKS field...
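
A minimal, self-contained C sketch of the "ignore rank directories without a cores= field" idea above; the rank_entry struct and count_assigned_ranks helper are hypothetical stand-ins for what a walk of the lwj.<...>.rank.N directories would yield, not the actual flux wreck ls code:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for one lwj.<...>.rank.<N> directory entry. */
struct rank_entry {
    int rank;        /* N in rank.N                        */
    bool has_cores;  /* true if a rank.N.cores key exists  */
    int cores;       /* value of rank.N.cores, if present  */
};

/* Count only ranks that actually carry a cores= assignment; rank dirs
 * with no cores key would be ignored when summarizing the RANKS field. */
static int count_assigned_ranks (const struct rank_entry *e, int n)
{
    int assigned = 0;
    for (int i = 0; i < n; i++)
        if (e[i].has_cores && e[i].cores > 0)
            assigned++;
    return assigned;
}

int main (void)
{
    /* Example: three rank dirs, but only rank 11 has a core assigned. */
    struct rank_entry entries[] = {
        { 10, false, 0 },
        { 11, true,  1 },
        { 12, false, 0 },
    };
    printf ("ranks with cores=: %d\n", count_assigned_ranks (entries, 3));
    return 0;
}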

@trws
Member Author

trws commented Mar 29, 2018 via email

@trws
Member Author

trws commented Mar 29, 2018

For example, job 8 above has this in the kvs:

splash:test_hycop_20180329-142212$ flux kvs dir -R lwj.0.0.8.rank
lwj.0.0.8.rank.1.cores = 1
lwj.0.0.8.rank.10.cores = 1
lwj.0.0.8.rank.11.cores = 1
lwj.0.0.8.rank.2.cores = 1
lwj.0.0.8.rank.3.cores = 1
lwj.0.0.8.rank.4.cores = 1
lwj.0.0.8.rank.5.cores = 1
lwj.0.0.8.rank.6.cores = 1
lwj.0.0.8.rank.7.cores = 1
lwj.0.0.8.rank.8.cores = 1
lwj.0.0.8.rank.9.cores = 1

@grondo
Contributor

grondo commented Mar 29, 2018

> The disturbing part of this is that looking at one of the ones that
> should be running only one process, and end up running only one, all of
> the rank.N.cores files are there, and they all have the value “1”.

erm, oof. That's not expected! 😢

@trws
Member Author

trws commented Mar 29, 2018

Oh crap... it's not just running one process, it's actually overscheduling them. We only get output from one, somehow, but it runs them all. This is a really, really bad one.

@dongahn
Member

dongahn commented Mar 29, 2018

> This is a really, really bad one.

Let me look into this.

@grondo
Contributor

grondo commented Mar 29, 2018

Can you dump the whole lwj.0.0.8 directory?

This should be fixed after merge of #1399, if lwj.0.0.8.ntasks = 1, fyi. (One would hope anyway)

@trws
Member Author

trws commented Mar 29, 2018

Unfortunately the instance is dead as of about a minute ago... The lwj.0.0.8.ntasks was 1 though.

@trws
Member Author

trws commented Mar 29, 2018

Also, ncores and nnodes were 1; I had that saved off in my history.

@dongahn
Member

dongahn commented Mar 29, 2018

> It seems like the easiest way is to run a multi-node job that takes a
> little while, and submit another job that takes one before the first
> ends.

Ok. I can use sleep <k> to emulate this, of course. What submit options did you use? -N x -n y, or some other combination? That would be very helpful.

@trws
Member Author

trws commented Mar 29, 2018

The quick test I'm using is this:

flux submit -N 10 sleep 60
flux submit -N 1 -O out hostname

@dongahn
Member

dongahn commented Mar 29, 2018

Seems to work okay on 4 nodes on quartz. Let me try 10 nodes.

quartz20{dahn}22: srun --pty --mpi=none -N 4 /g/g0/dahn/workspace/flux-cancel/inst/bin/flux start
quartz20{dahn}21: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 4 sleep 260
submit: Submitted jobid 1
Try `flux --help' for more information.
quartz20{dahn}23: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 1 -O out hostname
submit: Submitted jobid 2
quartz20{dahn}24: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux wreck ls
    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND
     1      4 running    2018-03-29T14:58:03      24.266s    [0-3] sleep
     2      1 complete   2018-03-29T14:58:20       0.047s        0 hostname
quartz20{dahn}25: flux kvs dir -R lwj.0.0.1.rank
lwj.0.0.1.rank.0.cores = 1
lwj.0.0.1.rank.1.cores = 1
lwj.0.0.1.rank.2.cores = 1
lwj.0.0.1.rank.3.cores = 1
quartz20{dahn}26: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.2.rank
lwj.0.0.2.rank.0.cores = 1

@dongahn
Member

dongahn commented Mar 29, 2018

FYI -- user issue and then dental appointment. I will pick this up tonight.

@trws
Member Author

trws commented Mar 29, 2018

Ok, thanks Dong. I'm trying a vanilla flux to see if it's something in the perf tweaks we were making.

@trws
Member Author

trws commented Mar 29, 2018

@SteVwonder @dongahn, any idea where an "error fetching job and event" error would be coming from, using the most recent master core and sched?

@trws
Member Author

trws commented Mar 29, 2018

Found it, current sched can't tolerate the lack of a null state yet.

@dongahn
Member

dongahn commented Mar 29, 2018

@trws: would it be better if I just merged my flux-sched PR, with temporary emulator breakage? @SteVwonder is busy with other stuff at the moment.

At least for the 4 node case, that branch worked okay.

@dongahn
Member

dongahn commented Mar 29, 2018

That PR should also speed up scheduling performance for high job submission rates.

@trws
Member Author

trws commented Mar 29, 2018

If that's what you were testing on, that would be much appreciated. I'm looking at not being able to run correctly at all right now.

@grondo
Contributor

grondo commented Mar 29, 2018

@dongahn, your PR needs to be rebased before we can hit the merge button.

@trws
Member Author

trws commented Mar 29, 2018

For this bug to repro, node_exclusive has to be turned on in sched.

@trws
Member Author

trws commented Mar 29, 2018

Also, @dongahn, the output you sent shows overscheduling a node...

@dongahn
Member

dongahn commented Mar 30, 2018

@trws: probably exclusivity isn't turned on?

@dongahn
Member

dongahn commented Mar 30, 2018

Actually, I should have seen your comment above.

@trws
Member Author

trws commented Mar 30, 2018

This is now a complete blocker. It means I can't run anything without overscheduling.

@dongahn
Member

dongahn commented Mar 30, 2018

It appears the sched is configured to do only core-level scheduling!

I guess you changed this code and turned on node-exclusive scheduling, and the error cropped up? If so, I will do the same and reproduce the misbehavior.

@trws
Member Author

trws commented Mar 30, 2018 via email

@dongahn
Member

dongahn commented Mar 30, 2018

bad!

@dongahn
Member

dongahn commented Mar 30, 2018

Using the latest master for both flux-core and sched, with a one-line change in this code setting node exclusive to true, I reproduced the over-scheduling problem. This would be the right behavior for core-level scheduling, but it is definitely incorrect for exclusive node-level scheduling.

I will first see if I can fix this issue for this simple reproducer, and then see if I can handle the more complex case @trws posted at the beginning of this issue.

quartz1922{dahn}52: salloc -N 4 -ppdebug
salloc: Granted job allocation 535400
quartz10{dahn}21: srun --pty --mpi=none -N 4 /g/g0/dahn/workspace/flux-cancel/inst/bin/flux start
quartz10{dahn}21: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 4 sleep 60
submit: Submitted jobid 1
quartz10{dahn}22: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 1 -O out hostname
submit: Submitted jobid 2
quartz10{dahn}23: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux wreck ls
    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND
     1      4 running    2018-03-29T18:53:03      16.995s    [0-3] sleep
     2      1 exited     2018-03-29T18:53:13       0.059s        0 hostname
quartz10{dahn}24: flux kvs dir -R lwj.0.0.1.rank
lwj.0.0.1.rank.0.cores = 1
lwj.0.0.1.rank.1.cores = 1
lwj.0.0.1.rank.2.cores = 1
lwj.0.0.1.rank.3.cores = 1
quartz10{dahn}25: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.2.rank
lwj.0.0.2.rank.0.cores = 1

@trws
Member Author

trws commented Mar 30, 2018

An equally acceptable result for splash would be to fix it so that the ncores, or cores per task, or something can be used to say ntasks:1 ncores:<cores-per-node> to get exclusive behavior. For now, we're actually using the single-core hwloc xml solution in production to get results over the weekend. 😨

@dongahn
Member

dongahn commented Mar 30, 2018

Well, I have to take that back. I ran the test again with that one-line change, and it seems nodes are exclusively scheduled.

quartz10{dahn}22: srun --pty --mpi=none -N 8 /g/g0/dahn/workspace/flux-cancel/inst/bin/flux start
quartz10{dahn}21: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 8 sleep 60
submit: Submitted jobid 1
quartz10{dahn}22: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 7 sleep 10
submit: Submitted jobid 2
quartz10{dahn}23: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 6 sleep 10
submit: Submitted jobid 3
quartz10{dahn}24: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 5 sleep 10
submit: Submitted jobid 4
quartz10{dahn}25: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 4 sleep 10
submit: Submitted jobid 5
quartz10{dahn}26: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 3 sleep 10
submit: Submitted jobid 6
quartz10{dahn}27: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 2 sleep 10
submit: Submitted jobid 7
quartz10{dahn}28: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 1 sleep 10
submit: Submitted jobid 8

quartz10{dahn}42: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux wreck ls
    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND
     1      8 exited     2018-03-29T20:43:25       1.006m    [0-7] sleep
     2      7 exited     2018-03-29T20:44:26      10.100s    [0-6] sleep
     3      6 exited     2018-03-29T20:44:36      10.079s    [0-5] sleep
     4      5 exited     2018-03-29T20:44:46      10.079s    [0-4] sleep
     5      4 exited     2018-03-29T20:44:56      10.085s    [0-3] sleep
     6      3 running    2018-03-29T20:44:56       1.158m    [0-6] sleep
     7      2 exited     2018-03-29T20:45:06      10.079s    [0-1] sleep
     8      1 running    2018-03-29T20:45:06      59.338s    [0-2] sleep

The problem I see is that jobs 6 and 8 are incorrectly marked as running. And if I do ps:

quartz10{dahn}45: ps x
   PID TTY      STAT   TIME COMMAND
 14306 pts/0    Ss     0:00 -bin/tcsh
 16282 pts/0    Sl+    0:00 srun --pty --mpi=none -N 8 /g/g0/dahn/workspace/flux-cancel/inst/bin/flux start
 16283 pts/0    S+     0:00 srun --pty --mpi=none -N 8 /g/g0/dahn/workspace/flux-cancel/inst/bin/flux start
 16303 pts/1    Ssl    0:01 /g/g0/dahn/workspace/flux-cancel/inst/libexec/flux/cmd/flux-broker
 16406 pts/1    S      0:00 -bin/tcsh
 16486 ?        S      0:00 /g/g0/dahn/workspace/flux-cancel/inst/libexec/flux/wrexecd --lwj-id=6 --kvs-path=lwj.0.0.6
 16495 ?        S      0:00 /g/g0/dahn/workspace/flux-cancel/inst/libexec/flux/wrexecd --lwj-id=8 --kvs-path=lwj.0.0.8
 16509 pts/1    R+     0:00 ps x

I see the wrexecd processes are still running... probably they didn't get notified of the program exits?

I will do some more testing for scheduling, though.

@dongahn
Member

dongahn commented Mar 30, 2018

Ah, actually:

     6      3 running    2018-03-29T20:44:56       1.158m    [0-6] sleep
     8      1 running    2018-03-29T20:45:06      59.338s    [0-2] sleep

Those two jobs seem wrong, and that may be why these jobs are still marked as running.

@dongahn
Member

dongahn commented Mar 30, 2018

No wonder:

quartz10{dahn}48: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.6.rank
lwj.0.0.6.rank.0.cores = 1
lwj.0.0.6.rank.1.cores = 1
lwj.0.0.6.rank.2.cores = 1
lwj.0.0.6.rank.3.cores = 1
lwj.0.0.6.rank.4.cores = 1
lwj.0.0.6.rank.5.cores = 1
lwj.0.0.6.rank.6.cores = 1
quartz10{dahn}49: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.8.rank
lwj.0.0.8.rank.0.cores = 1
lwj.0.0.8.rank.1.cores = 1
lwj.0.0.8.rank.2.cores = 1

@dongahn
Member

dongahn commented Mar 30, 2018

Still looking, but I have to guess the bug is within select_resources of the scheduler plugin code called from here.

I looked at the log file for job 6 (flux submit -N 3 sleep 10). The log says:

sched.debug[0]: Found 4 node(s) for job 6, required: 3

But when I dumped the select_tree object, it contained 7 compute nodes selected! Then the logic that generates lwj...rank.N.cores emits core counts for those 7 nodes.

Need to go a bit deeper to pinpoint the bug...

@dongahn
Member

dongahn commented Mar 30, 2018

A bit difficult to diagnose because of all this recursion, but I think I got it. It seems to be this code that is giving trouble for node-exclusive scheduling.

When the node request is exclusive, this code shouldn't select the node type in this else branch. The node-level selection should only be done in the if branch.

For quick testing, I added the following conditional:

index 9e3764f..0dbda87 100644
--- a/sched/sched_fcfs.c
+++ b/sched/sched_fcfs.c
@@ -228,7 +228,8 @@ resrc_tree_t *select_resources (flux_t *h, resrc_api_ctx_t *rsapi,
          * defined.  E.g., it might only stipulate a node with 4 cores
          * and omit the intervening socket.
          */
-        selected_tree = resrc_tree_new (selected_parent, resrc);
+        if (strcmp (resrc_type (resrc), "node") != 0)
+            selected_tree = resrc_tree_new (selected_parent, resrc);
         children = resrc_tree_children (found_tree);
         child_tree = resrc_tree_list_first (children);
         while (child_tree) {

With this, at least my reproducer behaves correctly:

quartz16{dahn}38: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux wreck ls
    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND
     1      8 exited     2018-03-29T22:53:00       1.002m    [0-7] sleep
     2      7 exited     2018-03-29T22:54:01      10.097s    [0-6] sleep
     3      6 exited     2018-03-29T22:54:11      10.077s    [0-5] sleep
     4      5 exited     2018-03-29T22:54:21      10.081s    [0-4] sleep
     5      4 exited     2018-03-29T22:54:31      10.087s    [0-3] sleep
     6      3 exited     2018-03-29T22:54:31      10.061s    [4-6] sleep
     7      2 exited     2018-03-29T22:54:41      10.067s    [4-5] sleep
     8      1 exited     2018-03-29T22:54:41      10.083s        6 sleep
quartz16{dahn}39: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.6.rank
lwj.0.0.6.rank.4.cores = 1
lwj.0.0.6.rank.5.cores = 1
lwj.0.0.6.rank.6.cores = 1

quartz16{dahn}40: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.8.rank
lwj.0.0.8.rank.6.cores = 1

@trws: I will need to do more validation, including checking its effect on core-level scheduling, and also make the code a bit more generic... But it seems worth a shot, and maybe it can be used for the production run over the weekend...

@dongahn
Member

dongahn commented Mar 30, 2018

Well, it is getting kind of late, and thinking about this more, the patch logic may not be quite complete. I will spend a bit more time tomorrow morning. But this IS the right bug site.

@grondo
Contributor

grondo commented Mar 30, 2018

Wow, heroic effort @dongahn! Nice work!

> Those two jobs seem wrong, and that may be why these jobs are still marked as running.

There is a bug in the wreck use of R_lite (derived from rank.N.cores) here. It is setting the nnodes of the job to the total number of ranks assigned, not the number of ranks actually used. However, what should the correct behavior be? We could either run successfully on the number of nodes actually used, or generate a fatal error at job startup when ntasks < nnodes.
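
A minimal C sketch of the second option (a fatal error at job startup when ntasks < nnodes); the validate_allocation helper and its message are hypothetical, not the actual wreck startup code, and the real check would derive nnodes from R_lite / the rank.N.cores entries:

#include <stdio.h>

/* Hypothetical check: reject a job whose assigned node count exceeds its
 * task count instead of silently launching wrexecd on the extra nodes. */
static int validate_allocation (int ntasks, int nnodes)
{
    if (ntasks < nnodes) {
        fprintf (stderr,
                 "fatal: job has ntasks=%d but was assigned %d node(s)\n",
                 ntasks, nnodes);
        return -1;
    }
    return 0;
}

int main (void)
{
    /* Job 8 from the original report: ntasks = 1, 11 rank dirs populated. */
    return validate_allocation (1, 11) == 0 ? 0 : 1;
}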

@trws
Member Author

trws commented Mar 30, 2018 via email

@dongahn
Member

dongahn commented Mar 30, 2018

Yes, this is correct. Job 1 ran 1 min and exited. Then job 2 started right after and ran 10 sec. I still haven't had a chance to think about a complete solution, though.

@trws
Member Author

trws commented Mar 30, 2018 via email

@dongahn
Member

dongahn commented Mar 30, 2018

> However, what should the correct behavior be? We could either run successfully on the number of nodes actually used, or generate a fatal error at job startup when ntasks < nnodes.

Yes, this is a bug cascading from a bug in the scheduler. I sort of like the first semantics, with a big warning message... But I can see why you might want the second semantics, though.

@dongahn
Member

dongahn commented Mar 30, 2018

I understand this problem better now. This is a deficiency within the scheduler for node-exclusive scheduling mode. I can see why this case wasn't covered, as the sched folks have focused on core-level scheduling for our test coverage and the initial use cases.

I think the better place to fix this problem is actually in the resrc_tree_search function.

The purpose of this else branch is so that one can select the high-level resources leading to each exclusively allocated resource. This is needed when the job request is partially specified: when only core: 1 is requested, the logic should select the cluster, node, and socket that contain that particular core.

But there is a deficiency in the code. When resrc walks the resource tree and visits a node vertex that is already allocated to another job, the match test fails and this logic puts us in the else branch.

But unfortunately, the else branch doesn't check exclusivity, so the visited node can be selected even if it's exclusively allocated to another job. Overscheduling!

My current patch is:

diff --git a/resrc/resrc_reqst.c b/resrc/resrc_reqst.c
index 9a3244c..fff6748 100644
--- a/resrc/resrc_reqst.c
+++ b/resrc/resrc_reqst.c
@@ -551,7 +551,10 @@ int64_t resrc_tree_search (resrc_api_ctx_t *ctx,
             nfound = 1;
             resrc_reqst->nfound++;
         }
-    } else if (resrc_tree_num_children (resrc_phys_tree (resrc_in))) {
+    } else if (resrc_tree_num_children (resrc_phys_tree (resrc_in))
+               && !(resrc_reqst_exclusive (resrc_reqst)
+                   && (resrc_size_allocs (resrc_in) || resrc_size_reservtns (resrc_in)))) {
+
         /*
          * This clause visits the children of the current resource
          * searching for a match to the resource request.  The found
@@ -562,6 +565,12 @@ int64_t resrc_tree_search (resrc_api_ctx_t *ctx,
          * defined.  E.g., it might only stipulate a node with 4 cores
          * and omit the intervening socket.
          */

I can't do a PR for this yet because I don't fully understand its impact on backfill schedulers. But for the FCFS scheduler that @trws uses, I think there is a good chance this will work.

Tom, you are welcome to test this on your branch if you're interested.

I need to work on various milestone reports this afternoon. I'll see if I can circle back and understand its impact on backfill.

@trws
Member Author

trws commented Apr 2, 2018

I'm trying this out in the splash branch, will see what happens.

@dongahn
Member

dongahn commented Apr 2, 2018

Verdict?

@trws
Member Author

trws commented Apr 2, 2018

It seems to help. We're running it right now, but the predominant mode is actually using the new ncores functionality, which helps a lot. I'm actually getting coscheduling the way they want now.

@dongahn
Member

dongahn commented Apr 2, 2018

Great! Please keep this open though. I need to double-check that this is safe with backfill scheduling before posting a PR.

@dongahn
Member

dongahn commented Apr 4, 2018

> I need to double-check that this is safe with backfill scheduling before posting a PR.

It turns out this resrc logic will require a redesign. For backfill cases, in this else branch we even need to check the future reservation state when determining whether to select the visiting (high-level) resource or not.

One can call resrc_walltime_match once again in this branch after creating a "fake" resrc_reqst object. But patching things like this over and over will likely make the code pretty unreadable.
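
For illustration, a small self-contained C sketch of what that extra check amounts to: before the else branch selects a resource for an exclusive request, the request's time window would have to be tested against existing allocation and reservation windows. The window struct and helpers are made up; in flux-sched this would go through resrc_walltime_match and a throwaway resrc_reqst rather than code like this:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical time window standing in for an allocation or reservation. */
struct window { int64_t start, end; };

/* True if the half-open intervals [a.start, a.end) and [b.start, b.end)
 * overlap. */
static bool overlaps (struct window a, struct window b)
{
    return a.start < b.end && b.start < a.end;
}

/* An exclusively requested resource may only be selected if no existing
 * allocation or reservation window overlaps the request window. */
static bool exclusive_selectable (struct window request,
                                  const struct window *busy, int nbusy)
{
    for (int i = 0; i < nbusy; i++)
        if (overlaps (request, busy[i]))
            return false;
    return true;
}

int main (void)
{
    struct window request = { 100, 160 };      /* job wants t = 100..160 */
    struct window busy[]  = { { 120, 180 } };  /* future reservation     */
    printf ("exclusive selection allowed: %s\n",
            exclusive_selectable (request, busy, 1) ? "yes" : "no");
    return 0;
}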

When I circle back, I will try to patch this as much as possible. But it seems the future really should be the new resource layer.

For @trws' purposes, this should work fine though. That is, this commit: flux-framework/flux-sched@2317aaa in flux-sched PR #306.

@dongahn
Member

dongahn commented Apr 5, 2018

FYI -- I think I patched this enough that resrc should now work for FCFS AND backfill in the latest PR I just pushed: flux-framework/flux-sched#305

@dongahn
Member

dongahn commented Apr 10, 2018

OK. This has been fixed in flux-framework/flux-sched#305.
