R_lite distribution failure causes all future jobs to stall at submitted or reserved #1425
This error from the updated wreck appears to happen when there are fewer tasks than nodes, among other situations. Once I see it, though, the instance is no longer capable of scheduling or running work, even if the scheduler is reloaded.

Comments
I think the transition to failed is the expected one with tasks < nodes after #1403. It sounds like sched's state machine doesn't like it? I'm curious, does unloading sched fix it, e.g. can you run with wreckrun then?
That seems to be part of it. If I use wreckrun immediate that seems to work, so it should be a sched thing, but unloading and re-loading doesn't seem to work for some reason.
It is an issue in sched. Its state machine doesn't currently expect this state transition. I think the job isn't removed from the pending queue, and the scheduler state probably gets into some sort of limbo. I don't understand why unloading/reloading doesn't work, though. If someone can tell me from what states a job can get into failed, that would help. In the next round, we probably want to document all of the state changes a job can go through somewhere, to firm up the contract.
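To make the "document all of the state changes" idea concrete, here is a minimal sketch of an explicit transition table; the state names and layout are hypothetical stand-ins, not flux-sched's actual states or code:

```c
/* Hypothetical sketch of an explicit job-state transition table;
 * these states and names are stand-ins, not flux-sched's code. */
#include <stdbool.h>

typedef enum {
    J_RESERVED, J_SUBMITTED, J_ALLOCATED, J_STARTING,
    J_RUNNING, J_COMPLETE, J_FAILED, J_NSTATES
} job_state_t;

/* valid[old][new] is true iff the transition old -> new is allowed. */
static const bool valid[J_NSTATES][J_NSTATES] = {
    [J_RESERVED]  = { [J_SUBMITTED] = true, [J_FAILED] = true },
    [J_SUBMITTED] = { [J_ALLOCATED] = true, [J_FAILED] = true },
    [J_ALLOCATED] = { [J_STARTING]  = true, [J_FAILED] = true },
    [J_STARTING]  = { [J_RUNNING]   = true, [J_FAILED] = true },
    [J_RUNNING]   = { [J_COMPLETE]  = true, [J_FAILED] = true },
};

bool transition_ok (job_state_t oldstate, job_state_t newstate)
{
    return valid[oldstate][newstate];
}
```

With a table like this, an unexpected transition becomes an explicit, checkable error instead of an unhandled case in a switch statement.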
Is the problem with reloading sched that sched doesn't pick up the existing jobs? Or do new jobs not run?
New jobs don't run. It may be that it tries to schedule the first job that's in reserved state and doesn't know what to do with it?
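A rough illustration of why one such job could stall everything behind it, assuming a strict first-come-first-served pass over the pending queue (all names here are hypothetical, not sched's actual code):

```c
/* Hypothetical sketch, not sched's actual code: with strict FCFS, a
 * job stuck at the head of the pending queue in a state the scheduler
 * does not handle blocks every job queued behind it. */
#include <stdbool.h>

typedef enum { J_RESERVED, J_SUBMITTED, J_ALLOCATED } job_state_t;

struct job {
    job_state_t state;
    struct job *next;
};

/* Stand-in for the real resource matcher; returns true on success. */
extern bool allocate_resources (struct job *job);

void schedule_pending (struct job **head)
{
    while (*head) {
        struct job *job = *head;
        if (job->state != J_SUBMITTED)
            break;  /* e.g. a leftover reserved job: nothing ever
                     * dequeues it, so all later submissions stall */
        if (!allocate_resources (job))
            break;  /* no resources free yet; retry on the next event */
        job->state = J_ALLOCATED;
        *head = job->next;  /* dequeue and continue down the queue */
    }
}
```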
Currently a reloaded sched won't pick up the existing jobs. (That should be part of future resilience work.)
Hmm... Which sched version is yours based on? A submitted job shouldn't start from reserved.
It does, but if there are waiting jobs when the oddness happens, they somehow end up in a reserved state rather than allocated, and then the new sched doesn't know what to do with them.
I don't see FAILED in the big switch. I think STARTING is the only state that will transition to FAILED?
ok. thanks. I will try to reproduce it as well tomorrow morning then.
Yeah... and it needs to be covered. BTW the switch statement is based on the old state of a job.
Is it submitted?
Sorry, @trws's report does show failed. It does look like it could transition to failed from submitted. So it sounds like we should allow either submitted or starting to transition to failed.
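A rough sketch of what that could look like, assuming the callback switches on the job's previous state as noted above; the helper names are hypothetical, and this is not the actual flux-sched change:

```c
typedef enum { J_SUBMITTED, J_STARTING, J_FAILED } job_state_t;

struct job { job_state_t state; };

/* Hypothetical helpers standing in for sched's queue/resource code. */
extern void pending_queue_remove (struct job *job);
extern void release_resources (struct job *job);

/* Dispatch on the job's OLD state and accept a transition to failed
 * from more than one origin state.  A sketch only, not the code from
 * flux-framework/flux-sched@69df09f. */
void job_state_change (struct job *job, job_state_t newstate)
{
    switch (job->state) {
    case J_SUBMITTED:
        if (newstate == J_FAILED)        /* e.g. R_lite distribution error */
            pending_queue_remove (job);  /* so later jobs don't stall behind it */
        break;
    case J_STARTING:
        if (newstate == J_FAILED)        /* launch failed after allocation */
            release_resources (job);
        break;
    default:
        break;
    }
    job->state = newstate;
}
```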
@trws: I added support for failed. The commit is flux-framework/flux-sched@69df09f. My test shows it handles the failed event emitted under the condition you described. But I couldn't reproduce the load/unload oddity. My guess is that it was a bug cascading from the lack of failed-event support. Please try.
This has been fixed in flux-framework/flux-sched#305.