SoS cluster job hangs #1216

Closed
gaow opened this issue Feb 19, 2019 · 16 comments

gaow commented Feb 19, 2019

We have discussed this before (in some other tickets), but I believe this has a different cause since many things have changed, so I'm opening a new ticket. Again, it is hard to provide an MWE, but the script is along the lines of the example in #1213. The symptom is classic: SoS hangs without submitting more jobs even though the cluster queue is empty. However, when I use ctrl-c this time, the error message is a lot longer. See it below:

hang.log.txt

Interestingly, multiple ctrl-c presses killed SoS, yet as you can see from the log file above, some more jobs in fact got submitted. It looks like some processes were zombies, and when killed they were freed up and used immediately for the next batch of submissions. None of this behavior has been observed before.

I'm on current master, but this problem seems to be especially serious in recent SoS versions. To get all my tasks submitted I have to repeatedly press ctrl-c when it hangs and start all over (with existing jobs skipped, of course). But it is rather inconvenient... please let me know (tomorrow) what other information you need to look into this.


gaow commented Feb 19, 2019

Actually, SoS 0.18.7 seems to be doing a lot better; it is the only version with which I was able to finish my jobs. @BoPeng if it is not immediately clear what's going on, I can make an example on the cluster for you to take a look at in more detail.


BoPeng commented Feb 19, 2019

Yes, that would be helpful. It would have been easier if I had stuck to the "working master" principle so that you could run git bisect to test, but unfortunately I pushed too many "should be working" commits that turned out to be broken.


gaow commented Feb 19, 2019

Thank you! No worries, we are still in alpha. Yes, the old version has its own problems: you can likely reproduce the ProcessKilled issue there. But at least it does not hang, so it can finish the job if no other issue occurs.


gaow commented Feb 25, 2019

The worker branch may have other unexpected behavior (still testing), but so far I do not see any hanging. Even if it hangs at some point in the future, I assume it will be for a different cause, so I'll open new tickets for that if needed.

gaow closed this as completed Feb 25, 2019

gaow commented Apr 10, 2019

Reopening this due to a new observation:

[MW] ps -u gaow | grep sos
32614 pts/46   00:01:01 sos
34731 pts/46   00:00:27 sos
34821 pts/46   00:01:22 sos
34832 pts/46   00:00:01 sos <defunct>
34947 pts/46   00:00:01 sos <defunct>

followed by the hanging behavior.
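
(Side note: a fuller ps invocation along the lines below would also show each process's parent PID and state, making it easier to tell which sos process the defunct children belong to. The command is only an illustration of the kind of check one could run, not output I captured.)

# show PID, parent PID, state (Z = defunct) and elapsed time for each sos process
ps -u gaow -o pid,ppid,stat,etime,cmd | grep '[s]os'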

gaow reopened this Apr 10, 2019

gaow commented Apr 11, 2019

Also interestingly, at some point the defunct processes disappear:

[MW] ps -u gaow | grep sos
 3476 pts/46   00:00:29 sos <defunct>
 3481 pts/46   00:00:29 sos <defunct>
32614 pts/46   00:03:05 sos
34731 pts/46   00:01:21 sos
34821 pts/46   00:02:12 sos
[MW] 
[MW] 
[MW] ps -u gaow | grep sos
 5628 pts/46   00:00:32 sos
 5828 pts/46   00:00:02 sos
32614 pts/46   00:03:45 sos
34731 pts/46   00:01:24 sos

The second check happens when a new step submits jobs; it seems to have somewhat recovered from the previous submission.


gaow commented Apr 11, 2019

I wrote the above message ~20 minutes ago but only just posted it. This is what it looks like now:

[MW] ps -u gaow | grep sos
 5628 pts/46   00:01:01 sos
 5828 pts/46   00:01:31 sos <defunct>
32614 pts/46   00:06:31 sos
34731 pts/46   00:02:57 sos

Still one process short (I expect 4+1 processes), and one process has newly become defunct.


BoPeng commented Apr 11, 2019

The log message shows more than 150 workers ... how did you call sos?


gaow commented Apr 11, 2019

Oh well, interestingly, as the submitter moves on to yet another step, the defunct process is not only eliminated but the number of processes also recovers:

[MW] ps -u gaow | grep sos

 5628 pts/46   00:01:03 sos
 5828 pts/46   00:01:31 sos <defunct>
32614 pts/46   00:06:59 sos
34731 pts/46   00:02:59 sos
[MW] 
[MW] 
[MW] ps -u gaow | grep sos
 5628 pts/46   00:02:43 sos
20105 pts/46   00:00:00 sos
32614 pts/46   00:09:44 sos
34731 pts/46   00:04:04 sos
51186 pts/46   00:00:04 sos
51202 pts/46   00:00:04 sos
[MW] ps -u gaow | grep sos
 5628 pts/46   00:02:43 sos
32614 pts/46   00:09:45 sos
34731 pts/46   00:04:04 sos
51186 pts/46   00:00:05 sos
51202 pts/46   00:00:05 sos
[MW] 
[MW] 
[MW] ps -u gaow | grep sos
 5628 pts/46   00:02:43 sos
20656 pts/46   00:00:00 sos
32614 pts/46   00:09:46 sos
34731 pts/46   00:04:04 sos
51186 pts/46   00:00:06 sos
51202 pts/46   00:00:06 sos
[MW] ps -u gaow | grep sos
 5628 pts/46   00:02:43 sos
32614 pts/46   00:09:49 sos
34731 pts/46   00:04:04 sos
51186 pts/46   00:00:07 sos
51202 pts/46   00:00:07 sos
[MW] ps -u gaow | grep sos
 5628 pts/46   00:02:43 sos
32614 pts/46   00:09:53 sos
34731 pts/46   00:04:05 sos
51186 pts/46   00:00:10 sos
51202 pts/46   00:00:10 sos

and you can see that at some point there are 6 sos processes ...

BTW this is running an MWE that I'll PM you soon.


gaow commented Apr 11, 2019

With the MWE we cannot reproduce the hang. In the past I have seen a hang caused by specifying a non-existent path for the *.err and *.out files of slurm jobs. In that scenario no error message is reported; the SoS submitter simply hangs. But that's more of a user error. The MWE is essentially the same as my real script in question, with all R actions replaced by trivial placeholder code. My real analysis script has reproduced the hang twice in a row. I wonder whether some user error triggers it, but if the complaint came from the R actions in the real data-analysis code, I'd expect SoS to quit on failure.
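
(A quick sanity check for the user-error scenario above would be to verify that the directory given for the slurm *.err/*.out files actually exists; the path below is only a placeholder.)

# hypothetical check: the directory passed to slurm for its -o/-e output must exist
test -d /path/to/slurm/logs || echo 'slurm output directory is missing'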

@BoPeng do you have any suggestions for what I should check when it hangs? sos status on all previously submitted jobs shows completed as expected, and I'm not sure what else to check. Do you think the defunct processes might play a part in it? (3 of the 5 processes are defunct when the hang happens.)


gaow commented Apr 12, 2019

I confirm the hang persists. There are in fact 9 processes; after I press ctrl-c, 8 remain. None of them are defunct, and I have to kill them manually.
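
(For reference, something along these lines would clean up the leftovers after ctrl-c; the user name is the one from the listings above and the commands are only a sketch.)

# list the sos processes that survived ctrl-c, then terminate them
pgrep -u gaow -x sos
kill $(pgrep -u gaow -x sos)   # escalate to kill -9 <pid> for any that ignore SIGTERM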


gaow commented Apr 13, 2019

After reverting patch #1248 the workflow completed without hanging. But I'm not sure whether there has been a specific code change for the hang I saw when I initially reopened this ticket 3 days ago -- it is not obvious from this ticket, but many improvements have landed over the past couple of days. Bottom line: I do not have an example now that reliably reproduces the hang. Not sure what to do from here. Should I close this and reopen it in the future when I have another example?


BoPeng commented Apr 13, 2019

I will close the ticket. It would be nice if you could keep this example somewhere so that we can test task queues with a large number of tasks, or even benchmark the performance of signature validation. Actually, I have a few test machines (though Windows based), so it would be helpful to migrate the workflow there if there are not too many dependencies.

BoPeng closed this as completed Apr 13, 2019

gaow commented Apr 13, 2019

Actually, I have a few test machines (though Windows based), so it would be helpful to migrate the workflow there if there are not too many dependencies.

That would be great! I will try again, replacing the code with simply sleep = X; saveRDS() to remove all dependencies, and pass you that version. That way we can control the sleep parameter, and there will be enough files to benchmark signature checks.
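
Roughly, each R action would be reduced to something like the sketch below (just an illustration of the placeholder; the sleep time and output file name are example values that would become workflow parameters):

# sleep for a configurable number of seconds, then write an empty RDS file so
# that output files still exist for signature checking
Rscript -e 'args <- commandArgs(TRUE); Sys.sleep(as.numeric(args[1])); saveRDS(NULL, args[2])' 30 placeholder.rds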


BoPeng commented Apr 13, 2019

OK, it appears that our test cases are not enough. I will wait for you to send me the (stress) test case and make sure it passes on my workstations before I make a new release.


gaow commented Apr 13, 2019

Here is the same workflow we've been running since last night, but with all the actual code replaced by trivial placeholder code:

issue_1216.tar.gz

This should not hang with the current master (I have not tested it). I'll make similar examples if more complicated cases hang down the line.
