
[Feature Request][Launchers] Support partial failures #1377

Closed
jrapin opened this issue Feb 4, 2021 · 8 comments

jrapin (Contributor) commented Feb 4, 2021

🚀 Feature Request

Motivation

When running a sweep, if some (but not all) jobs fail, the sweeper fails even though it could have continued.

Is your feature request related to a problem? Please describe.

Training neural networks on a cluster can lead to OOM errors or job timeouts. In such cases you may want to simply move on to other settings instead of stopping the whole sweep.

Pitch

Describe the solution you'd like

The launchers could return Future-like objects: calling result() raises an exception if the job failed, but one can check the exception() method beforehand, which returns None or the exception, if one wants to avoid raising the error. This would allow the caller to decide how to proceed with the exceptions.
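As a rough sketch of what such a container could look like (JobFuture is a hypothetical name used for illustration only, not something proposed in this issue):

from typing import Any, Optional

class JobFuture:
    # Hypothetical synchronous container: the job has already finished; this only defers raising.
    def __init__(self, value: Any = None, exc: Optional[BaseException] = None) -> None:
        self._value = value
        self._exc = exc

    def exception(self) -> Optional[BaseException]:
        # Return the stored exception (or None) without raising it.
        return self._exc

    def result(self) -> Any:
        # Raise the stored exception if the job failed, otherwise return its value.
        if self._exc is not None:
            raise self._exc
        return self._value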

Describe alternatives you've considered

Anything is fine as long as raising the exception is deferred to the caller.

Are you willing to open a pull request? (See CONTRIBUTING)

No

jrapin added the enhancement label Feb 4, 2021
omry (Collaborator) commented Feb 4, 2021

The launchers could return Future-like objects: calling result() raises an exception if the job failed, but one can check the exception() method beforehand, which returns None or the exception, if one wants to avoid raising the error. This would allow the caller to decide how to proceed with the exceptions.

The Launcher launch API is synchronous; at some point in the far future we might support asynchronous launching, but for now I think it would make more sense to return a success indicator in JobReturn.

from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional, Sequence

from omegaconf import DictConfig

from hydra.plugins.plugin import Plugin  # base class for Hydra plugins

class JobResult(Enum):
    COMPLETED = 1
    FAILED = 2

@dataclass
class JobReturn:
    overrides: Optional[Sequence[str]] = None
    return_value: Any = None
    cfg: Optional[DictConfig] = None
    hydra_cfg: Optional[DictConfig] = None
    working_dir: Optional[str] = None
    task_name: Optional[str] = None

    job_result: Optional[JobResult] = None
    # Set if job_result is FAILED; we could also use return_value for this instead.
    exception: Optional[Exception] = None

class Launcher(Plugin):
    def launch(
        self, job_overrides: Sequence[Sequence[str]], initial_job_idx: int
    ) -> Sequence[JobReturn]:
        ...
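For illustration only, a sweeper consuming this proposed API could skip failed jobs instead of aborting the whole sweep. This is just a sketch against the fields proposed above; launcher, batch, and idx are assumed to be in scope:

# Sketch only: relies on the job_result / exception fields proposed above.
returns = launcher.launch(job_overrides=batch, initial_job_idx=idx)
results = []
for ret in returns:
    if ret.job_result == JobResult.FAILED:
        # Record the failure and move on instead of stopping the whole sweep.
        print(f"Job with overrides {ret.overrides} failed: {ret.exception!r}; skipping")
        continue
    results.append(ret.return_value)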

jieru-hu self-assigned this Feb 4, 2021
jieru-hu added this to the Hydra 1.1.0 milestone Feb 5, 2021
jrapin (Contributor, Author) commented Feb 5, 2021

Why have JobResult if you have the exception? exception seems enough: it's None if there is no exception and not None if there is, and I guess return_value stays None if there was an exception.

This basically already looks like the Future API, except that you may fail to notice that the job failed when accessing the result if the job could legitimately have returned None. Using a "future-like" object does not mean that we need to be asynchronous; the object is just a container for a result which may already have been computed.

That said, a change like this seems pretty simple and should take a couple of lines at most in the submitit plugin; I can help with this one if need be.

omry (Collaborator) commented Feb 7, 2021

Gotcha. I thought you were suggesting some formal async API.

Why have JobResult if you have the exception?

It contains other things that can be useful (overrides, config, working directory etc).

We can provide a result() method on it that would raise the exception.
In any case, this will require small changes to all the sweepers. It's probably best if the same person does that at the same time as introducing the change.
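For example (a sketch, assuming the JobReturn fields proposed above rather than any final API), such a method could simply re-raise the stored exception:

    # Hypothetical method on the proposed JobReturn, mirroring concurrent.futures.Future.result().
    def result(self) -> Any:
        if self.exception is not None:
            raise self.exception
        return self.return_value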

jrapin (Contributor, Author) commented Feb 7, 2021

It contains other things that can be useful (overrides, config, working directory etc).

I meant why JobResult, not JobReturn (I think you understood it as the latter?). The JobResult enum seems straightforwardly inferable from whether exception is filled or not.

omry (Collaborator) commented Feb 7, 2021

Ohhh, yeah, I understood it as "why JobReturn".
Yes, JobResult is more of a future-compatibility thing: it will allow adding additional states (queued, running).
It's not really needed for this feature.

jrapin (Contributor, Author) commented Feb 8, 2021

Then you're heading to async indeed :p

omry (Collaborator) commented Feb 9, 2021

eventually :)

jieru-hu (Contributor) commented Apr 1, 2021

This is done. Issues have been created for follow-up tasks.
