
[Feature Request][Launchers] Support partial failures #1377

Closed
jrapin opened this issue Feb 4, 2021 · 8 comments

jrapin (Contributor) commented Feb 4, 2021

🚀 Feature Request

Motivation

When running a sweep, if some (but not all) jobs fail, the sweeper fails even though it could have continued.

Is your feature request related to a problem? Please describe.

Training neural networks on a cluster can lead to OOM errors or job timeouts. In such cases you may want to simply move on to other settings instead of stopping the whole sweep.

Pitch

Describe the solution you'd like

The launchers could return Future-like objects: calling result() raises an exception if the job failed, but one can check the exception() method beforehand, which returns None or the exception, if one wants to avoid raising the error. This would allow the caller to decide how to proceed with the exceptions.
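As a rough sketch of what such a container could look like (JobFuture is a hypothetical name used for illustration only, not something proposed in this issue):

from typing import Any, Optional

class JobFuture:
    # Hypothetical synchronous container: the job has already finished; this only defers raising.
    def __init__(self, value: Any = None, exc: Optional[BaseException] = None) -> None:
        self._value = value
        self._exc = exc

    def exception(self) -> Optional[BaseException]:
        # Return the stored exception (or None) without raising it.
        return self._exc

    def result(self) -> Any:
        # Raise the stored exception if the job failed, otherwise return its value.
        if self._exc is not None:
            raise self._exc
        return self._value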

Describe alternatives you've considered

Anything is fine as long as raising the exception is deferred to the caller.

Are you willing to open a pull request? (See CONTRIBUTING)

No

jrapin added the enhancement label Feb 4, 2021
omry (Collaborator) commented Feb 4, 2021

The launchers could return Future-like objects: calling result() raises an exception if the job failed, but one can check the exception() method beforehand, which returns None or the exception, if one wants to avoid raising the error. This would allow the caller to decide how to proceed with the exceptions.

The Launcher launch API is synchronous; at some point in the far future we might support asynchronous launching, but for now I think it would make more sense to return a success indicator in JobReturn.

from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional, Sequence

from omegaconf import DictConfig

from hydra.plugins.plugin import Plugin  # base class for Hydra plugins

class JobResult(Enum):
    COMPLETED = 1
    FAILED = 2

@dataclass
class JobReturn:
    overrides: Optional[Sequence[str]] = None
    return_value: Any = None
    cfg: Optional[DictConfig] = None
    hydra_cfg: Optional[DictConfig] = None
    working_dir: Optional[str] = None
    task_name: Optional[str] = None

    job_result: Optional[JobResult] = None
    # Set if job_result is FAILED; we could also use return_value for this instead.
    exception: Optional[Exception] = None

class Launcher(Plugin):
    def launch(
        self, job_overrides: Sequence[Sequence[str]], initial_job_idx: int
    ) -> Sequence[JobReturn]:
        ...
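For illustration only, a sweeper consuming this proposed API could skip failed jobs instead of aborting the whole sweep. This is just a sketch against the fields proposed above; launcher, batch, and idx are assumed to be in scope:

# Sketch only: relies on the job_result / exception fields proposed above.
returns = launcher.launch(job_overrides=batch, initial_job_idx=idx)
results = []
for ret in returns:
    if ret.job_result == JobResult.FAILED:
        # Record the failure and move on instead of stopping the whole sweep.
        print(f"Job with overrides {ret.overrides} failed: {ret.exception!r}; skipping")
        continue
    results.append(ret.return_value)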

jieru-hu self-assigned this Feb 4, 2021
jieru-hu added this to the Hydra 1.1.0 milestone Feb 5, 2021
jrapin (Contributor, Author) commented Feb 5, 2021

Why have JobResult if you have the exception? exception seems enough: it's None if there is no exception and not None if there is, and I guess return_value stays None if there was an exception.

This basically already looks like the Future API, except that you may fail to notice that the job failed when accessing the result if the job could legitimately have returned None. Using a "future-like" object does not mean that we need to be asynchronous; the object is just a container for a result which may already have been computed.

That said, a change like this seems pretty simple and should take a couple of lines at most in the submitit plugin; I can help with this one if need be.

omry (Collaborator) commented Feb 7, 2021

Gotcha. I thought you were suggesting some formal async API.

Why have JobResult if you have the exception?

It contains other things that can be useful (overrides, config, working directory etc).

We can provide a result() method on it that would raise the exception.
In any case, this will require small changes to all the sweepers. It's probably best if the same person does that at the same time as introducing the change.
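For example (a sketch, assuming the JobReturn fields proposed above rather than any final API), such a method could simply re-raise the stored exception:

    # Hypothetical method on the proposed JobReturn, mirroring concurrent.futures.Future.result().
    def result(self) -> Any:
        if self.exception is not None:
            raise self.exception
        return self.return_value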

jrapin (Contributor, Author) commented Feb 7, 2021

It contains other things that can be useful (overrides, config, working directory etc).

I meant why JobResult, not JobReturn (I think you understood it as the latter?). The JobResult enum seems straightforwardly inferable from whether exception is filled or not.

omry (Collaborator) commented Feb 7, 2021

Ohhh, yeah, I understood it as "why JobReturn".
Yes, JobResult is more of a future-compatibility thing: it will allow adding additional states (queued, running).
It's not really needed for this feature.

jrapin (Contributor, Author) commented Feb 8, 2021

Then you're heading to async indeed :p

omry (Collaborator) commented Feb 9, 2021

eventually :)

jieru-hu (Contributor) commented Apr 1, 2021

This is done. Issues have been created for follow-up tasks.
