SoS cluster job hangs #1216
Actually SoS 0.18.7 seems to be doing a lot better. I was only able to finish my jobs with that version. @BoPeng, if it is not immediately clear what's going on, I can make an example on the cluster for you to take a look at in more detail.
Yes, that would be helpful. It would be easier if I could stick to the "working master" principle so that you can run a …
Thank you! No worries, we are still in alpha. Yes, the old version has its own problems: you are likely to be able to reproduce the …
For …
Reopening it due to a current observation: … and the hanging behavior after it.
Also, interestingly, at some point the …
The second check happens when a new step submits jobs. It seems to have somewhat recovered from the previous submission.
I wrote the above message ~20 min ago but only just posted it. This is what it looks like now: …
Still one process short (expecting 4+1 processes), and one process has newly become defunct.
The log message shows more than 150 workers ... how did you call sos? |
Oh well, interestingly, as the submitter moves on to yet another step, the defunct process was not only eliminated but the number of processes also recovered: …
and you can see that at some point it has 6 … BTW, this is running an MWE that I'll PM you soon.
With the MWE we cannot reproduce the hang. I have seen in the past that a hang can be caused by specifying a non-existing path to write … @BoPeng, do you have any suggestions on what I should check when it hangs?
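For reference, the earlier non-existing-path hang came from a step roughly like the sketch below. This is a minimal, hypothetical reconstruction: the step name, the paths, and the task options are placeholders I am making up here, not the ones from the real workflow.

```
[check_path]
# hypothetical step: 'missing_dir' was never created, so the task cannot
# write its result and the submitter appears to stall instead of failing
output: 'missing_dir/out.txt'
task: queue='cluster', walltime='00:05:00', mem='1G', cores=1
sh: expand=True
    echo "done" > {_output}
```

When it hangs, checking that every path the tasks write to actually exists and is writable is at least one thing worth ruling out.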
I confirm the hang persists. It in fact has 9 processes. After I …
After reverting the patch #1248 the workflow completed without hanging. But I'm not sure whether there has been a specific code change addressing the hang behavior since I initially reopened this 3 days ago -- it is not obvious from this ticket, but many improvements have landed over the past couple of days. The bottom line is that I do not have an example now that reliably reproduces the hang. Not sure what to do from here. Should I close it and reopen in the future when I have another example?
I will close the ticket. It would be nice if you could keep this example somewhere so that we can test task queues with a large number of tasks, or even benchmark the performance of signature validation. Actually, I have a few test machines (though Windows based), so it would be helpful to migrate the workflow there if there are not too many dependencies.
That would be great! I will try again, replacing the code with simply …
OK, it appears that our test cases are not enough. I will wait for you to send me the (stress) test case and make sure it passes on my workstations before I make a new release.
Here is the same workflow we've been running since last night, but with all the actual code replaced by trivial code: … This should not hang using the current …
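Structurally, a trivialized stress workflow of this kind would look something like the sketch below. This is only an illustrative sketch: the step names, the task count of 200, and the resource options are placeholders rather than the values from the attached script.

```
[global]
parameter: n = 200          # hypothetical task count; the real workflow submits far more

[stress]
input: for_each=dict(i=range(n))
output: f'stress_unit_{i}.txt'
task: queue='cluster', walltime='00:05:00', mem='1G', cores=1
sh: expand=True
    # trivial stand-in for the real analysis code
    sleep 10
    echo "unit {i} done" > {_output}
```

Something along these lines could also serve as the standing test for task queues with a large number of tasks (and for the signature-validation benchmarks) mentioned above.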
We have had discussions about this before (in some other tickets), but I believe this is due to a different cause since many things have changed, so I am using a new ticket. Again, it is hard to provide an MWE, but the script is along the lines of the example in #1213. The symptom is classic: SoS hangs without submitting more jobs even though the cluster queue is empty. However, when I use ctrl-c this time, the error message is a lot longer. See it below: hang.log.txt

And interestingly, multiple ctrl-c killed SoS, yet as you can see from the log file above, some more jobs in fact got submitted. It looks like some processes were zombies, and when killed they were freed up and used immediately for the next batch of submissions. None of this behavior has been observed before. I am on the current master, but this problem seems to be especially serious in recent SoS versions. To get all my tasks submitted I have to repeatedly hit ctrl-c when it hangs and start all over (with existing jobs skipped, of course). But it is rather inconvenient. ... please let me know (tomorrow) what other information you need to look into this.