disturbing but apparently innocuous bug in reporting of ranks working on a job #1400
Looking at the `flux wreck ls` output below, or the corresponding kvs entries, it looks like job 8 is running at least 11 processes, one on each of ranks 0-11, overlapping with job 7. In truth, it's running one process, only on rank 11. Somehow sched is populating the kvs with both the new resource request and all of the resources from the previous job, which is still running. This is even with the release requested resources call put back in. Somehow the result, in terms of executing the thing, still looks correct, but the output is really messed up.

Comments
I'm looking at this part of the code anyway so I can take a look at it. What's the easiest way to reproduce this?
It seems like the easiest way is to run a multi-node job that takes a little while, and submit another job that takes one before the first ends.
Currently `flux wreck ls` is just summarizing the integer keys in `lwj.x.y.ranks.`. A quick fix would be to ignore rank directories that don't have a cores= field. Another side effect of having extra `ranks` dirs is that the presence of those directories determines on which ranks `wrexecd` is launched, so we may have a lot of unnecessary fork/exec going on. This issue reminded me that on my PR branch (#1399) `flux wreck ls` is broken. I still have to add code to parse `R_lite` there to get the `RANKS` field...
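To make the "quick fix" above concrete, here is a minimal sketch of the filtering rule in plain C. It walks a toy in-memory list rather than the real flux KVS, and every name in it is illustrative only, not part of flux-core:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for an lwj.x.y.rank.N directory entry.  In the real tool the
 * listing would come from the KVS; this only illustrates the filter rule:
 * a rank counts only if its directory actually has a cores= key. */
struct rank_entry {
    int rank;
    bool has_cores;   /* true iff rank.N.cores exists */
};

/* Count only ranks that carry a cores= field, ignoring leftover rank
 * directories like the ones job 8 inherited from the previous job. */
static int count_scheduled_ranks (const struct rank_entry *e, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (e[i].has_cores)
            count++;
    return count;
}

int main (void)
{
    /* Job 8 from the report: stale directories for ranks 0 and 5 (among
     * others), real work only on rank 11. */
    struct rank_entry job8[] = { { 0, false }, { 5, false }, { 11, true } };
    printf ("ranks really scheduled: %d\n", count_scheduled_ranks (job8, 3));
    return 0;
}
```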
The disturbing part of this is that, looking at one of the ones that should be running only one process, and ends up running only one, all of the rank.N.cores files are there, and they all have the value “1”.
For example, job 8 above has this in the kvs: […]
erm, oof. That's not expected! 😢
Oh crap... it's not just running one process, it's actually overscheduling them. We only get output from one, somehow, but it runs them all. This is a really, really bad one.
Let me look into this.
Can you dump the whole […]? This should be fixed after merge of #1399, if […]
Unfortunately the instance is dead as of about a minute ago... The lwj.0.0.8.ntasks was 1 though.
Also ncores and nnodes were 1, I had that saved off in history.
Ok. I can use […]
The quick test I'm using is this: […]
Seems to work okay on 4 nodes on quartz. Let me try 10 nodes.
FYI -- user issue and then dental appointment. I will pick this up tonight.
Ok, thanks Dong, trying a vanilla flux to see if it's something in the perf tweaks we were making.
@SteVwonder @dongahn, any idea where an "error fetching job and event" error would be coming from using the most recent master core and sched?
Found it, current sched can't tolerate the lack of a null state yet.
@trws: would it be better if I just merged my flux-sched PR with temporary emulator breakage? @SteVwonder is busy with other stuff at the moment. At least for the 4-node case, that branch worked okay.
That PR should also speed up scheduling performance for high job submission rates.
If that's what you were testing on, that would be much appreciated. I'm looking at not being able to run correctly at all right now.
@dongahn, your PR needs to be rebased before we can hit the merge button.
For this bug to repro, node_exclusive has to be turned on in sched.
Also, @dongahn, the output you sent shows overscheduling a node...
@trws: probably exclusivity isn't turned on?
Actually, I should have seen your comment above.
This is now a complete blocker. It means I can't run anything without overscheduling.
It appears the sched is configured to do [only core-level scheduling](https://github.com/flux-framework/flux-sched/blob/master/sched/sched.c#L441)! I guess you changed this code and turned on node-exclusive scheduling and the error cropped up? If so, I will do the same and reproduce the misbehavior.
I did, if it’s turned on, or even if you just set the number of node resources to request to 1 rather than 0 in the request generation, it goes completely off the deep end. The only workaround I’ve thought of is to use hwloc reload to load a single-core resource description in for each node, so they all pretend to only have one core.
bad!
Using the latest master for both flux-core and sched, with a one-line change in this code setting node exclusive to true, I reproduced the over-scheduling problem. This would be the right behavior if you do core-level scheduling, but definitely incorrect scheduling for exclusive node-level scheduling. I will first see if I can fix this issue for this simple reproducer and then see if I can use the more complex case @trws posted at the beginning of this issue.
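For readers following along, the kind of one-line change being described would look roughly like the sketch below. The types and the `fill_resource_request()` name are stand-ins invented for illustration; they are not the actual contents of the sched.c code referenced above.

```c
#include <stdbool.h>

/* Minimal stand-in types so this compiles on its own; the real scheduler's
 * job and request structures are much richer than this. */
struct resource_request { int nnodes; int ncores; bool node_exclusive; };
struct job { int nnodes; int ncores; struct resource_request req; };

/* Hypothetical sketch of the reproduction step: when the resource request
 * is built, node_exclusive is flipped from false (core-level scheduling,
 * the default on master at the time) to true so whole nodes are granted. */
static void fill_resource_request (struct job *j)
{
    j->req.nnodes = j->nnodes;
    j->req.ncores = j->ncores;
    j->req.node_exclusive = true;   /* the "one-line change" */
}

int main (void)
{
    struct job j = { .nnodes = 1, .ncores = 1 };
    fill_resource_request (&j);
    return j.req.node_exclusive ? 0 : 1;
}
```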
An equally acceptable result for splash would be to fix it so that the ncores, or cores per task, or something, can be used to say […]
Well, I have to take it back. I ran the test again with that one-line change and it seems nodes are exclusively scheduled. The problem I see is that job 6 and job 8 are incorrectly marked as running. And if I do ps, I see […] I will do some more testing for scheduling, though.
Ah, actually […] Those two jobs seem wrong and that may be why these jobs are still marked as running.
No wonder: […]
Still looking. But I have to guess the bug is within […] I looked at the log file for job 6: […] When I dumped the […] Need to go a bit deeper to pinpoint the bug...
A bit difficult to diagnose because of all this recursion, but I think I got it. It seems it is [this code](https://github.com/flux-framework/flux-sched/blob/master/sched/sched_fcfs.c#L231) that is giving trouble for node-exclusive scheduling. When the node request is exclusive, this code shouldn't select the `node` type in this `else` branch. The node-level selection should only be done in the `if` branch.

For quick testing, I added the following conditional:

```diff
index 9e3764f..0dbda87 100644
--- a/sched/sched_fcfs.c
+++ b/sched/sched_fcfs.c
@@ -228,7 +228,8 @@ resrc_tree_t *select_resources (flux_t *h, resrc_api_ctx_t *rsapi,
          * defined. E.g., it might only stipulate a node with 4 cores
          * and omit the intervening socket.
          */
-        selected_tree = resrc_tree_new (selected_parent, resrc);
+        if (strcmp (resrc_type (resrc), "node") != 0)
+            selected_tree = resrc_tree_new (selected_parent, resrc);
         children = resrc_tree_children (found_tree);
         child_tree = resrc_tree_list_first (children);
         while (child_tree) {
```

W/ this, at least my reproducer behaves correctly:

```
quartz16{dahn}38: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux wreck ls
    ID NTASKS STATE               START   RUNTIME RANKS COMMAND
     1      8 exited 2018-03-29T22:53:00  1.002m  [0-7] sleep
     2      7 exited 2018-03-29T22:54:01 10.097s  [0-6] sleep
     3      6 exited 2018-03-29T22:54:11 10.077s  [0-5] sleep
     4      5 exited 2018-03-29T22:54:21 10.081s  [0-4] sleep
     5      4 exited 2018-03-29T22:54:31 10.087s  [0-3] sleep
     6      3 exited 2018-03-29T22:54:31 10.061s  [4-6] sleep
     7      2 exited 2018-03-29T22:54:41 10.067s  [4-5] sleep
     8      1 exited 2018-03-29T22:54:41 10.083s      6 sleep
```

```
quartz16{dahn}39: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.6.rank
lwj.0.0.6.rank.4.cores = 1
lwj.0.0.6.rank.5.cores = 1
lwj.0.0.6.rank.6.cores = 1
quartz16{dahn}40: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.8.rank
lwj.0.0.8.rank.6.cores = 1
```

@trws: I will need more validation, including its effect on core-level scheduling, and also make the code a bit more generic... But it seems worth a shot and maybe can be used for the production run over the weekend...
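One way to make the guard "a bit more generic", as mentioned above, would be to key the skip off whichever resource type the request claims exclusively instead of hard-coding the "node" string. The sketch below is self-contained with toy types; `is_exclusively_requested()` and its fields are assumptions for illustration, not the flux-sched resrc/resrc_reqst API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins: the real code operates on resrc_t and resrc_reqst_t. */
struct resrc { const char *type; };            /* e.g. "node", "core"    */
struct reqst { const char *exclusive_type; };  /* NULL => not exclusive  */

/* Generic form of the quick patch's condition: in the else branch, skip
 * attaching a resource whose type the request holds exclusively, rather
 * than special-casing strcmp (resrc_type (resrc), "node"). */
static bool is_exclusively_requested (const struct resrc *r,
                                      const struct reqst *req)
{
    return req->exclusive_type != NULL
        && strcmp (r->type, req->exclusive_type) == 0;
}

int main (void)
{
    struct resrc node = { "node" }, core = { "core" };
    struct reqst node_excl = { "node" }, core_level = { NULL };

    /* Node-exclusive request: only the node level is skipped.  Core-level
     * request: nothing is skipped, matching the existing behavior. */
    return (is_exclusively_requested (&node, &node_excl)
            && !is_exclusively_requested (&core, &node_excl)
            && !is_exclusively_requested (&node, &core_level)) ? 0 : 1;
}
```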
Well, it is kind of getting late, and thinking about this, the patch logic may not be quite complete. I will spend a bit more time tomorrow morning. But this IS the right bug site.
Wow, heroic effort @dongahn! Nice work! There is a bug in the wreck use of […]
Is that correct output? I can’t tell from when you submitted them, but it looks like jobs 1 and 2 are completely overlapped?
Yes, this is correct. Job 1 ran 1 min and exited. Then job 2 started right after and ran 10 sec. I still haven't had a chance to think about a complete solution though.
Gotcha. That’s definitely an improvement.
Yes, this is a bug cascaded from a bug in the scheduler. I sort of like the first semantics, with a big warning message... But I can see why you may want the second semantics, though.
I understand this problem better now. This is a deficiency within the scheduler for node-exclusive scheduling mode. I can see why this case wasn't covered, as the sched folks have focused on core-level scheduling for our testing coverage and the initial use cases. I think the better place to fix this problem is actually in […] The purpose of this […] But there is a deficiency in the code. When […] But unfortunately, the […] My current patch is: […]

I can't do a PR for this yet because I don't fully understand its impact on backfill schedulers. But for the FCFS scheduler that @trws uses, I think there is a good chance this will work. Tom, you are welcome to test this on your branch if you're interested. I need to work on various milestone reports this afternoon. I'll see if I can circle back and understand its impact on backfill.
I'm trying this out in the splash branch, will see what happens.
Verdict?
It seems to help. We're running it right now, but the predominant mode is actually using the new ncores functionality, which helps a lot. I'm actually getting coscheduling the way they want now.
Great! Please keep this open though. Need to double check this is safe with backfill scheduling before posting a PR.
It turns out this […] One can call […] When I circle around, I will try to patch this as much as possible. But it seems the future really should be the new […] For @trws' purpose, this should work fine though. That is, this commit, flux-framework/flux-sched@2317aaa in flux-sched PR #306.
FYI -- I think I patched this enough so […]
OK. This has been fixed in flux-framework/flux-sched#305.