SoS cluster job hangs #1216

Closed
gaow opened this issue Feb 19, 2019 · 16 comments

gaow commented Feb 19, 2019

We have discussed this before (in some other tickets), but I believe this has a different cause since many things have changed, so I'm opening a new ticket. Again, it is hard to provide an MWE, but the script is along the lines of the example in #1213. The symptom is classic: SoS hangs without submitting more jobs even though the cluster queue is empty. However, when I use ctrl-c this time, the error message is a lot longer. See it below:

hang.log.txt

Interestingly, multiple ctrl-c presses killed SoS, yet as you can see from the log file above, some more jobs in fact got submitted. It looks like some processes were zombies, and when killed they were freed up and used immediately for the next batch of submissions. None of this behavior has been observed before.

I'm on current master, but this problem seems to be especially serious in recent SoS versions. To get all my tasks submitted I have to repeatedly press ctrl-c when it hangs and start all over (with existing jobs skipped, of course). But it is rather inconvenient... please let me know (tomorrow) what other information you need to look into this.


gaow commented Feb 19, 2019

Actually, SoS 0.18.7 seems to be doing a lot better; it is the only version with which I was able to finish my jobs. @BoPeng if it is not immediately clear what's going on, I can make an example on the cluster for you to take a look at in more detail.


BoPeng commented Feb 19, 2019

Yes, that would be helpful. It would have been easier if I had stuck to the "working master" principle so that you could run git bisect to test, but unfortunately I pushed too many "should be working" commits that turned out to be broken.


gaow commented Feb 19, 2019

Thank you! No worries, we are still in alpha. Yes, the old version has its own problems: you can likely reproduce the ProcessKilled issue there. But at least it does not hang, so it can finish the job if no other issue occurs.


gaow commented Feb 25, 2019

The worker branch may have other unexpected behavior (still testing), but so far I do not see any hanging. Even if it hangs at some point in the future, I assume it will be for a different cause, so I'll open new tickets for that if needed.

gaow closed this as completed Feb 25, 2019

gaow commented Apr 10, 2019

Reopening this due to a new observation:

[MW] ps -u gaow | grep sos
32614 pts/46   00:01:01 sos
34731 pts/46   00:00:27 sos
34821 pts/46   00:01:22 sos
34832 pts/46   00:00:01 sos <defunct>
34947 pts/46   00:00:01 sos <defunct>

followed by the hanging behavior.
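
(Side note: a fuller ps invocation along the lines below would also show each process's parent PID and state, making it easier to tell which sos process the defunct children belong to. The command is only an illustration of the kind of check one could run, not output I captured.)

# show PID, parent PID, state (Z = defunct) and elapsed time for each sos process
ps -u gaow -o pid,ppid,stat,etime,cmd | grep '[s]os'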

gaow reopened this Apr 10, 2019

gaow commented Apr 11, 2019

Also interestingly, at some point the defunct processes disappear:

[MW] ps -u gaow | grep sos
 3476 pts/46   00:00:29 sos <defunct>
 3481 pts/46   00:00:29 sos <defunct>
32614 pts/46   00:03:05 sos
34731 pts/46   00:01:21 sos
34821 pts/46   00:02:12 sos
[MW] 
[MW] 
[MW] ps -u gaow | grep sos
 5628 pts/46   00:00:32 sos
 5828 pts/46   00:00:02 sos
32614 pts/46   00:03:45 sos
34731 pts/46   00:01:24 sos

The second check happens when a new step submits jobs; it seems to have somewhat recovered from the previous submission.


gaow commented Apr 11, 2019

I wrote the above message ~20 minutes ago but only just posted it. This is what it looks like now:

[MW] ps -u gaow | grep sos
 5628 pts/46   00:01:01 sos
 5828 pts/46   00:01:31 sos <defunct>
32614 pts/46   00:06:31 sos
34731 pts/46   00:02:57 sos

Still one process short (I expect 4+1 processes), and one process has newly become defunct.


BoPeng commented Apr 11, 2019

The log message shows more than 150 workers ... how did you call sos?


gaow commented Apr 11, 2019

Oh well, interestingly, as the submitter moves on to yet another step, the defunct process is not only eliminated but the number of processes also recovers:

[MW] ps -u gaow | grep sos

 5628 pts/46   00:01:03 sos
 5828 pts/46   00:01:31 sos <defunct>
32614 pts/46   00:06:59 sos
34731 pts/46   00:02:59 sos
[MW] 
[MW] 
[MW] ps -u gaow | grep sos
 5628 pts/46   00:02:43 sos
20105 pts/46   00:00:00 sos
32614 pts/46   00:09:44 sos
34731 pts/46   00:04:04 sos
51186 pts/46   00:00:04 sos
51202 pts/46   00:00:04 sos
[MW] ps -u gaow | grep sos
 5628 pts/46   00:02:43 sos
32614 pts/46   00:09:45 sos
34731 pts/46   00:04:04 sos
51186 pts/46   00:00:05 sos
51202 pts/46   00:00:05 sos
[MW] 
[MW] 
[MW] ps -u gaow | grep sos
 5628 pts/46   00:02:43 sos
20656 pts/46   00:00:00 sos
32614 pts/46   00:09:46 sos
34731 pts/46   00:04:04 sos
51186 pts/46   00:00:06 sos
51202 pts/46   00:00:06 sos
[MW] ps -u gaow | grep sos
 5628 pts/46   00:02:43 sos
32614 pts/46   00:09:49 sos
34731 pts/46   00:04:04 sos
51186 pts/46   00:00:07 sos
51202 pts/46   00:00:07 sos
[MW] ps -u gaow | grep sos
 5628 pts/46   00:02:43 sos
32614 pts/46   00:09:53 sos
34731 pts/46   00:04:05 sos
51186 pts/46   00:00:10 sos
51202 pts/46   00:00:10 sos

and you can see that at some point there are 6 sos processes ...

BTW this is running an MWE that I'll PM you soon.


gaow commented Apr 11, 2019

With the MWE we cannot reproduce the hang. In the past I have seen a hang caused by specifying a non-existent path for the *.err and *.out files of slurm jobs. In that scenario no error message is reported; the SoS submitter simply hangs. But that's more of a user error. The MWE is essentially the same as my real script in question, with all R actions replaced by trivial placeholder code. My real analysis script has reproduced the hang twice in a row. I wonder whether some user error triggers it, but if the complaint came from the R actions in the real data-analysis code, I'd expect SoS to quit on failure.
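
(A quick sanity check for the user-error scenario above would be to verify that the directory given for the slurm *.err/*.out files actually exists; the path below is only a placeholder.)

# hypothetical check: the directory passed to slurm for its -o/-e output must exist
test -d /path/to/slurm/logs || echo 'slurm output directory is missing'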

@BoPeng do you have any suggestions for what I should check when it hangs? sos status on all previously submitted jobs shows completed as expected, and I'm not sure what else to check. Do you think the defunct processes might play a part in it? (3 of the 5 processes are defunct when the hang happens.)


gaow commented Apr 12, 2019

I confirm the hang persists. There are in fact 9 processes; after I press ctrl-c, 8 remain. None of them are defunct, and I have to kill them manually.
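
(For reference, something along these lines would clean up the leftovers after ctrl-c; the user name is the one from the listings above and the commands are only a sketch.)

# list the sos processes that survived ctrl-c, then terminate them
pgrep -u gaow -x sos
kill $(pgrep -u gaow -x sos)   # escalate to kill -9 <pid> for any that ignore SIGTERM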


gaow commented Apr 13, 2019

After reverting patch #1248 the workflow completed without hanging. But I'm not sure whether there has been a specific code change for the hang I saw when I initially reopened this ticket 3 days ago -- it is not obvious from this ticket, but many improvements have landed over the past couple of days. Bottom line: I do not have an example now that reliably reproduces the hang. Not sure what to do from here. Should I close this and reopen it in the future when I have another example?


BoPeng commented Apr 13, 2019

I will close the ticket. It would be nice if you could keep this example somewhere so that we can test task queues with a large number of tasks, or even benchmark the performance of signature validation. Actually, I have a few test machines (though Windows based), so it would be helpful to migrate the workflow there if there are not too many dependencies.

BoPeng closed this as completed Apr 13, 2019

gaow commented Apr 13, 2019

Actually, I have a few test machines (though Windows based), so it would be helpful to migrate the workflow there if there are not too many dependencies.

That would be great! I will try again, replacing the code with simply sleep = X; saveRDS() to remove all dependencies, and pass you that version. That way we can control the sleep parameter, and there will be enough files to benchmark signature checks.
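
Roughly, each R action would be reduced to something like the sketch below (just an illustration of the placeholder; the sleep time and output file name are example values that would become workflow parameters):

# sleep for a configurable number of seconds, then write an empty RDS file so
# that output files still exist for signature checking
Rscript -e 'args <- commandArgs(TRUE); Sys.sleep(as.numeric(args[1])); saveRDS(NULL, args[2])' 30 placeholder.rds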


BoPeng commented Apr 13, 2019

OK, it appears that our test cases are not enough. I will wait for you to send me the (stress) test case and make sure it passes on my workstations before I make a new release.


gaow commented Apr 13, 2019

Here is the same workflow we've been running since last night, but with all the actual code replaced by trivial placeholder code:

issue_1216.tar.gz

This should not hang with the current master (I have not tested it). I'll make similar examples if more complicated cases hang down the line.
