-
-
Notifications
You must be signed in to change notification settings - Fork 656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Submitit] Restructure the plugin configuration #682
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR does not work at all for now, it just states would should be the configs in my opinion
Note that it's actually shorter, simpler and actually provides more options.
We'll discuss names later, some may require renaming
plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py
Outdated
Show resolved
Hide resolved
plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py
Outdated
Show resolved
Hide resolved
plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py
Outdated
Show resolved
Hide resolved
plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py
Outdated
Show resolved
Hide resolved
plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py
Outdated
Show resolved
Hide resolved
ask for review when ready. |
I wouldnt call it First thing to deal with in my opinion is what the structure of the conf should look like:
I tried to change it to something like the following, removing one level
but that makes the init of the plugin a bit more complicated. When that is dealt with, the second question will of course be "how to change from local to slurm" |
The plugin is instantiated using hydra.util.instantiate. There is an open rfc to eliminate that and support recursive instantiation (#566). Try something like this (note that I renamed *Conf to *Params to reflect that I want to use them as the params object. @dataclass
class BaseParams:
# Where does this belong? maybe in base?
# if not move it to one of the the two specific configs
folder: str = "${hydra.sweep.dir}/.${hydra.launcher.params.executor}"
...
@dataclass
class SlurmParams(BaseParams):
...
@dataclass
class LocalParams(BaseParams):
...
@dataclass
class SlurmConf(ObjectConf):
cls: ...
params: SlurmParams = SlurmParams()
@dataclass
class LocalConf(ObjectConf):
cls: ...
params: LocalParams = LocalParams()
# finally, register two different choices:
ConfigStore.instance().store(
group="hydra/launcher",
name="submitit_local",
node=LocalConf,
provider="submitit_launcher",
)
ConfigStore.instance().store(
group="hydra/launcher",
name="submitit_slurm",
node=SlurmConf,
provider="submitit_launcher",
) There is a similar example here (without the ObjectConf). |
An alternative style which might be easier to read is this: @dataclass
class BaseParams:
...
@dataclass
class SlurmConf(ObjectConf):
cls :str = "hydra_plugins.....SubmititLauncher"
@dataclass
class SlurmParams(BaseParams):
...
params: SlurmParams = SlurmParams()
@dataclass
class LocalConf(ObjectConf):
cls :str = "hydra_plugins.....SubmititLauncher"
@dataclass
class LocalParams(BaseParams):
...
params: LocalParams = LocalParams() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the help.
It's now working locally. There are a few questions left before I move to test on manually in slurm (see comments)
# cluster to use (currently either "slurm" or "local" are supported, | ||
# None defaults to an available cluster) | ||
cluster: str = "slurm" | ||
executor: ExecutorName = ExecutorName.slurm # can we get rid of this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed here? Is there another way to get this information since we now call submitit_local
or submitit_slurm
. That would prevent a user from setting it to the wrong value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now you are using this as the flag to gate slurm or local initialization logic.
An alternative I think will be much cleaner is to use a different class. I will comment below.
self.params = OmegaConf.structured( | ||
LocalParams if params["executor"].value == "local" else SlurmParams | ||
) | ||
for key, val in params.items(): | ||
OmegaConf.update(self.params, key, val if key != "executor" else val.value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can this be simplified?
} | ||
) | ||
init_keys = specific_init_keys | set(init_renamer) # used config keys | ||
executor = submitit.AutoExecutor(**init_params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now everything goes through AutoExecutor, whatever the choice of the executor. This makes it more maintainable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you can tell, it takes some trickery to prepare the correct dictionary for the AutoExecutor.
The reason is that the input config is no longer lining up with AutoExecutor parameters.
If you used SlurmExecutor and LocalExecutor directly, with corresponding SubmititSlurmLauncher and SubmititLocalLauncher as I suggested above I think you would be able to more easily match the init signature of each to what you want to pass the respective submitit launcher.
You can still use auto, but at least in this case the logic in each of the launchers will need to initialize a known mode which will simplify things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you used SlurmExecutor and LocalExecutor directly, with corresponding SubmititSlurmLauncher and SubmititLocalLauncher as I suggested above I think you would be able to more easily match the init signature of each to what you want to pass the respective submitit launcher.
I won't call SlurmExecutor
and LocalExecutor
directly, that would make the plugin harder to maintain. The alignment will actually be much simpler to maintain. Also, there's a PR on its way in submitit that would make it even easier.
plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py
Show resolved
Hide resolved
plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this more.
see my comments, I think we can simplify more by splitting into two different launcher classes.
they can inherit from your current launcher and have custom __init__()
methods.
plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py
Show resolved
Hide resolved
# cluster to use (currently either "slurm" or "local" are supported, | ||
# None defaults to an available cluster) | ||
cluster: str = "slurm" | ||
executor: ExecutorName = ExecutorName.slurm # can we get rid of this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now you are using this as the flag to gate slurm or local initialization logic.
An alternative I think will be much cleaner is to use a different class. I will comment below.
|
||
|
||
@dataclass | ||
class SubmititLauncherConf(ObjectConf): | ||
class LocalConf(ObjectConf): | ||
cls: str = "hydra_plugins.hydra_submitit_launcher.submitit_launcher.SubmititLauncher" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can define a launcher class for each mode and have and have the custom logic in it.
They can share a superclass or otherwise reuse most of the logic through other means.
cls: str = "hydra_plugins.hydra_submitit_launcher.submitit_launcher.SubmititLocalLauncher"
This will also allow you to named parameters in the __init__
of the class:
class SubmititLauncher(Launcher):
def __init__(self,
partition: Optional[str],
comment: Optional[str],
constraint: Optional[str],
exclude: Optional[str],
...
) -> None:
# forward directly to the implementation
# (my preference would be to SlurmExecutor and LocalExecutor at this point).
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not double down the name of each parameter both in the config and in the init. Naming them here would just add code for nothing in my opinion (we don't need type hints here since it's never called directly by a user, and all the checks can be efficiently performed through the config). So is there a shortcut to initialize the omega conf from a dictionary? I couldnt find any without requiring a for loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You asked how to simplify. this is my advice.
I don't think you will need to double much:
between config inheritance, and passing the actually needed parameters into each init there will be no duplication in my opinion.
regardless, it's your call.
Changing submitit can also help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but is there a better way to build the OmegaConf structure from a dict. Currently I am doing:
self.params = OmegaConf.structured(LocalParams)
for key, val in params.items():
OmegaConf.update(self.params, key, val)
bur I would think I am missing something (and it would not work with Enum, but fortunately there is no more Enum now).
Otherwise I'll use the Config directly but I prefer the flexibility of OmegaConf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try :
self.params = OmegaConf.structured(LocalParams(**params))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general answer to:
but is there a better way to build the OmegaConf structure from a dict
is:
OmegaConf.create({})
In this example you want to get the type safety from the dataclass.
Here is some context:
https://omegaconf.readthedocs.io/en/latest/structured_config.html#merging-with-other-configs
You can also try what I suggested, which is possibly a better match here.
} | ||
) | ||
init_keys = specific_init_keys | set(init_renamer) # used config keys | ||
executor = submitit.AutoExecutor(**init_params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you can tell, it takes some trickery to prepare the correct dictionary for the AutoExecutor.
The reason is that the input config is no longer lining up with AutoExecutor parameters.
If you used SlurmExecutor and LocalExecutor directly, with corresponding SubmititSlurmLauncher and SubmititLocalLauncher as I suggested above I think you would be able to more easily match the init signature of each to what you want to pass the respective submitit launcher.
You can still use auto, but at least in this case the logic in each of the launchers will need to initialize a known mode which will simplify things
|
||
# specify resources/parameters | ||
baseparams = set(dataclasses.asdict(BaseParams()).keys()) | ||
print(baseparams) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't forget the prints.
plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/config.py
Show resolved
Hide resolved
self.params = OmegaConf.structured( | ||
LocalParams if params["executor"].value == "local" else SlurmParams | ||
LocalParams if self._EXECUTOR == "local" else SlurmParams | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest adding a base class to express that the slurm launcher is not REALLY a local launcher.
# your current local:
class BaseSubmititLauncher(Launcher):
def _get_params_class() -> Type[BaseParams]:
raise NotImplementedError
... # the rest is the same.
class LocalSubmititLauncher(BaseSubmititLauncher)
def _get_params_class() -> Type[BaseParams]:
return LocalParams
class SlurmSubmititLauncher(BaseSubmititLauncher)
def _get_params_class() -> Type[BaseParams]:
return SlurmParams
You can go back to a DictConfig with something like this (and now you don't need the update loop)
self.params = OmegaConf.structured(self._get_params_class(**params))
for key, val in params.items(): | ||
OmegaConf.update(self.params, key, val if key != "executor" else val.value) | ||
OmegaConf.update(self.params, key, val) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am surprised that worked in the previous version when you use the actual object and not a DictConfig.
regardless, you can get rid of this loop with the example I gave you above.
plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py
Outdated
Show resolved
Hide resolved
@omry it's now manually tested and working on SLURM cluster. |
# cluster to use (currently either "slurm" or "local" are supported, | ||
# None defaults to an available cluster) | ||
cluster: str = "slurm" | ||
submitit_folder: str = "${hydra.sweep.dir}/.submitit/%j" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's talk about this for a moment:
With Hydra, each job is getting a working directory inside the sweep dir.
If we can keep all the files for the job in that directory it would be really good (users will have an easier time debugging issues for a particular job).
This may take some submitit changes, but is something to consider for the future.
Right now, if a job is failing the user will need to:
- Look at the log file in the job output directory.
- Look in the submitit output folder, figure out which files belong to the job that caused the issue somehow and check the output/error files.
If we managed to use the working directory here it would be better.
I suspect currently this is not possible without changes to submitit.
An alternative solution to this problem is to symlink the stdout and stderr from the submitit folder into the job working directory. But this can also be tricky.
This is not a feature request for this PR, but something to think about to improve the user experience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can keep all the files for the job in that directory it would be really good
Indeed (for now this is the same behavior as before, but I agree it is not convenient)
I suspect currently this is not possible without changes to submitit.
It would actually be possible if not using arrays, but that's not a good option. With arrays on the other hand we can't specify the exact folder of each run individually, it must be deduced from the parameters of the array. Hacking something is somehow possible but would require a lot of (mostly unhelpful) changes in submitit and would not be robust anyway.
The other option would be that hydra uses the folder provided by submitit (using the job_id instead of 0, 1, 2 etc). But that's probably not something you'd want either, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user can customize the sweep subdirectory through Hydra.
So yeah, I can't really use the submitit generated dir.
plugins/hydra_submitit_launcher/hydra_plugins/hydra_submitit_launcher/submitit_launcher.py
Outdated
Show resolved
Hide resolved
} | ||
) | ||
init_keys = specific_init_keys | {"submitit_folder"} | ||
executor = submitit.AutoExecutor(cluster=self._EXECUTOR, **init_params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line is confusing to me:
The auto executor is expecting the strings "local" or "slurm" as options for cluster?
``` | ||
|
||
Note that this plugin expects a valid environment in the target host. usually this means a shared file system between | ||
the launching host and the target host. | ||
|
||
Submitit supports 3 types of queues: auto, local and slurm. Its config looks like this | ||
Submitit actually implements 2 different launchers: `submitit_slurm` to run on a SLURM cluster, and `local` to test locally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about the following?
- Include links to config.py with dataclasses instead of copying them in.
- include the outputs of
$ python foo.py hydra/launcher=submitit_slurm --cfg hydra -p hydra.launcher
$ python foo.py hydra/launcher=submitit_local-cfg hydra -p hydra.launcher
When describing each of the two launchers.
I am still trying to discover what is the best way to document plugins with structured configs, and I think I like the above proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I saw that you already added similar output.
Also since the primary use case is SLURM we can focus on it.
This will output the current slurm parameters:
python foo.py hydra/launcher=submitit_slurm --cfg hydra -p hydra.launcher
and I recommend that instead of inlining the dataclasses we can just give a link.
Also, an example of how to override the launcher parameters from the command line would be helpful:
python foo.py hydra/launcher=submitit_slurm hydra.launcher.params.partition=prod
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I saw that you already added similar output.
Yes I find them the most interesting since they make it easier to discover settings than just stating the config classes.
and I recommend that instead of inlining the dataclasses we can just give a link.
I have shared feelings about this, since the interest of adding the config files there came from the description of each parameters. Then again, since it often ends up being outdated if not careful, it's probably better to just have the link.
So I just:
- removed the config classes
- added a link the the config classes
- added the discovery of submitit_local parameters
- added a commandline override example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point I will support field description in the help or --cfg. when that happens we will be able a better description here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. please respond to my suggestion to change the docs.
``` | ||
|
||
Note that this plugin expects a valid environment in the target host. usually this means a shared file system between | ||
the launching host and the target host. | ||
|
||
Submitit supports 3 types of queues: auto, local and slurm. Its config looks like this | ||
Submitit actually implements 2 different launchers: `submitit_slurm` to run on a SLURM cluster, and `local` to test locally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I saw that you already added similar output.
Also since the primary use case is SLURM we can focus on it.
This will output the current slurm parameters:
python foo.py hydra/launcher=submitit_slurm --cfg hydra -p hydra.launcher
and I recommend that instead of inlining the dataclasses we can just give a link.
Also, an example of how to override the launcher parameters from the command line would be helpful:
python foo.py hydra/launcher=submitit_slurm hydra.launcher.params.partition=prod
…auncher/submitit_launcher.py Co-authored-by: Omry Yadan <omry@fb.com>
ci/circleci: py36_win seems broken :s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reran the windows thing, probably transient infra issue.
# cluster to use (currently either "slurm" or "local" are supported, | ||
# None defaults to an available cluster) | ||
cluster: str = "slurm" | ||
submitit_folder: str = "${hydra.sweep.dir}/.submitit/%j" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user can customize the sweep subdirectory through Hydra.
So yeah, I can't really use the submitit generated dir.
# finally, register two different choices: | ||
ConfigStore.instance().store( | ||
group="hydra/launcher", | ||
name="submitit_local", | ||
node=LocalConf, | ||
provider="submitit_launcher", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can lose LocalConf and SlurmConf class definitions without losing functionality or clarity.
ConfigStore.instance().store(..., node=ObjectConf(cls: str ="XYZClass", params = XYZParams()))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have done something wrong but I get:
omegaconf.errors.UnsupportedValueType: Value 'LocalParams' is not a supported primitive type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is LocalParams a dataclass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its a BaseParams, which is a dataclass, does it need a decorator as well? I'm not really familiar with dataclasses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to create a minimal example showing the problem with OmegaConf directly.
this seems to work and is very close to what you are doing.
In [1]: from dataclasses import dataclass
In [9]: from typing import Any
In [2]: from omegaconf import OmegaConf
In [3]: @dataclass
...: class Foo:
...: num : int = 10
...:
In [4]: @dataclass
...: class LocalFoo(Foo):
...: name: str = "name"
...:
In [5]: OmegaConf.structured(LocalFoo)
Out[5]: {'num': 10, 'name': 'name'}
In [6]: OmegaConf.structured(LocalFoo())
Out[6]: {'num': 10, 'name': 'name'}
In [7]: OmegaConf.structured({"add-hoc":LocalFoo()})
Out[7]: {'add-hoc': {'num': 10, 'name': 'name'}}
In [10]: @dataclass
...: class ObjectConf:
...: cls: str = "abc"
...: params: Any = None
...:
In [11]: OmegaConf.structured(ObjectConf(cls="xyz", params=LocalFoo()))
Out[11]: {'cls': 'xyz', 'params': {'num': 10, 'name': 'name'}}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example from someone else (he didn't complain so I think it's working for him):
I increased the no-output timeout for windows. can you rebase and try again? |
going to land it, I am having some test failures with the existing submitit plugin after making some core changes and it will be best to fix them on top of your changes. |
One thing I noticed that is missing is a news fragment. especially for breaking changes like this one we should add them. |
Thanks for this and for merging ;) |
Motivation
Only use AutoExecutor for submitit plugin, since it is the recommended method
Todo
Have you read the Contributing Guidelines on pull requests?
Yes
Test Plan
To do