[Feature Request][Launchers] Support partial failures #1377
Comments
The Launcher `launch()` API is synchronous; at some point in the far future we might support asynchronous launching, but for now I think it would make more sense to return a success indicator in `JobReturn`:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional, Sequence

from omegaconf import DictConfig

from hydra.plugins.plugin import Plugin


class JobResult(Enum):
    COMPLETED = 1
    FAILED = 2


@dataclass
class JobReturn:
    overrides: Optional[Sequence[str]] = None
    return_value: Any = None
    cfg: Optional[DictConfig] = None
    hydra_cfg: Optional[DictConfig] = None
    working_dir: Optional[str] = None
    task_name: Optional[str] = None
    job_result: Optional[JobResult] = None
    exception: Optional[Exception] = None  # set when job_result is FAILED; we could also use return_value instead


class Launcher(Plugin):
    def launch(
        self, job_overrides: Sequence[Sequence[str]], initial_job_idx: int
    ) -> Sequence[JobReturn]:
        ...
```
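A minimal sketch of how a sweeper could consume such a `Sequence[JobReturn]` and keep going past failures, assuming the fields above (`launcher` and the override lists here are just placeholders):

```python
# Sketch only: `launcher` is some Launcher implementation, the overrides are made up.
job_overrides = [["db=mysql"], ["db=postgresql"], ["db=sqlite"]]
returns = launcher.launch(job_overrides=job_overrides, initial_job_idx=0)

for ret in returns:
    if ret.job_result == JobResult.FAILED:
        # Log the failure and keep sweeping instead of aborting everything.
        print(f"job {ret.overrides} failed: {ret.exception!r}")
    else:
        print(f"job {ret.overrides} returned {ret.return_value}")
```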
Why have `JobResult`? It basically already looks like the `Future` interface. That said, a change like this seems pretty simple and should take a couple of lines at most in the submitit plugin; I can help with this one if need be.
Gotcha. I thought you were suggesting some formal async API.
It contains other things that can be useful (overrides, config, working directory, etc.). We can provide a `result()` method on it that would raise the exception.
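A rough sketch of what such a `result()` method might look like on `JobReturn` (illustrative only; just the two fields it touches are shown):

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class JobReturn:
    # Only the fields relevant to error handling are shown here.
    return_value: Any = None
    exception: Optional[Exception] = None

    def result(self) -> Any:
        # Future-style accessor: re-raise the stored exception if the job failed,
        # otherwise return the job's value.
        if self.exception is not None:
            raise self.exception
        return self.return_value
```

Callers that want to avoid the raise can still inspect the `exception` field first.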
I meant why `JobResult`.
Ohhh, yeah - I understood it as why `JobReturn`.
Then you're heading to async indeed :p
eventually :)
This is done. Issues have been created for follow-up tasks.
🚀 Feature Request
Motivation
When running a sweeper, if some (but not all) jobs fail, the sweeper will fail even though it could have continued.
Is your feature request related to a problem? Please describe.
Training neural networks on a cluster can lead to OOM errors, or timeouts of the jobs. In this case you could want to just move on to different settings instead of stopping the whole sweep.
Pitch
Describe the solution you'd like
The launchers could return `Future`-like objects (calling `result()` raises an exception if the job failed, but one can check the `exception()` method beforehand, which returns None or the exception, if one wants to avoid raising the error). This would allow the launchers to decide how to proceed with the exceptions.
Describe alternatives you've considered
Anything is fine as long as raising the exception is deferred to the caller.
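For reference, Python's standard `concurrent.futures.Future` already has exactly these semantics; here is a minimal sketch of a caller deferring the raise (`run_job` and the override lists are made up):

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(overrides):
    # Stand-in for one job of the sweep; may raise (OOM, timeout, ...).
    if "model=huge" in overrides:
        raise MemoryError("simulated OOM")
    return f"trained with {overrides}"


with ThreadPoolExecutor() as pool:
    futures = {pool.submit(run_job, ov): ov for ov in (["model=small"], ["model=huge"])}
    for fut, ov in futures.items():
        err = fut.exception()  # blocks until done; returns None or the exception, never raises
        if err is not None:
            print(f"job {ov} failed ({err!r}); continuing the sweep")
        else:
            print(f"job {ov} -> {fut.result()}")
```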
Are you willing to open a pull request? (See CONTRIBUTING)
No