
feat(Pipeline): optimize() #230

Merged: 13 commits merged into main on Jan 26, 2024
Conversation

@eddiebergman (Contributor) commented Jan 22, 2024

Alright, buckle up, this is a bigger one (780 lines added).

The objective of this PR was to give pipeline.optimize(...) fast defaults for just, well, evaluating a pipeline, while still leaving it room to extend into future use cases rather than being locked into one.

Get me HPO results quickly

Let's start with the simplest procedure:

pipeline = ...
history = pipeline.optimize(...)  # <- What needs to go in here

The first thing is the target= argument to optimize(), which should be a function that, given a Trial and a Node (the pipeline), returns a Trial.Report; formally: Callable[[Trial, Node], Trial.Report]. This is the user's target function, to use HPO nomenclature.
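For illustration, a target function with that signature might look like the following sketch. The Trial, Report, and Node classes below are minimal stand-ins to make the example self-contained, not amltk's actual API:

```python
from dataclasses import dataclass


# Minimal stand-ins for amltk's Trial, Trial.Report and Node, just to
# illustrate the Callable[[Trial, Node], Trial.Report] shape.
@dataclass
class Report:
    trial_name: str
    results: dict


@dataclass
class Trial:
    name: str
    config: dict

    def success(self, **results) -> Report:
        # Package up the evaluation results into a report for this trial.
        return Report(trial_name=self.name, results=results)


@dataclass
class Node:
    name: str


def my_target(trial: Trial, pipeline: Node) -> Report:
    # A real target would configure and evaluate the pipeline with the
    # trial's sampled config; we fake a score for illustration.
    score = trial.config.get("x", 0) ** 2
    return trial.success(score=score)


report = my_target(Trial(name="t0", config={"x": 3}), Node(name="pipe"))
print(report.results["score"])  # 9
```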

The target= can also be an EvaluationProtocol, which defines a pre-packaged evaluation flow, so to speak. For example, a future addition would allow something like:

pipeline.optimize(
    target=SklearnCVProtocol(X, y, splitter=...),  # User can use predefined evaluation protocols
    ...
)
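The nice property of this design is that a protocol object and a plain function reduce to the same callable shape. Here is a toy sketch of that idea; ToyCVProtocol and evaluate are hypothetical names for illustration, not amltk's API:

```python
from typing import Any, Callable


class ToyCVProtocol:
    """Illustrative stand-in for an evaluation protocol; not amltk's API."""

    def __init__(self, X: Any, y: Any, n_splits: int = 3):
        self.X, self.y, self.n_splits = X, y, n_splits

    def __call__(self, trial: Any, node: Any) -> dict:
        # A real protocol would fit and score `node` on each CV split;
        # we return dummy per-split scores for illustration.
        return {"scores": [0.0] * self.n_splits}


def evaluate(target: Callable, trial: Any, node: Any) -> dict:
    # Plain functions and protocol objects look identical from here.
    return target(trial, node)


print(evaluate(ToyCVProtocol(X=None, y=None), trial=None, node=None))
# {'scores': [0.0, 0.0, 0.0]}
```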

Choosing an Optimizer and Creating one

The next parts revolve around the optimizer, namely optimizer= to select one, and the seed=, working_dir= and metrics= it will expect. These constitute the changes to the Optimizer class. By default, optimizer= is None, and we simply try to find some installed optimizer and use that. There could be extra work to also detect what's compatible with the pipeline, but users with a custom search space definition will likely be able to pass in the specific optimizer they want anyway.

Scheduling

The next things to consider were about task execution, namely the Scheduler and running it. The defaults n_workers: int = 1 and scheduler: Scheduler | None = None mean that the optimization runs in local processes with 1 worker. Both of these parameters can be freely changed. There are also a host of parameters related to error control, plugins, max_trials= and more, which are documented in the code.
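The execution model amounts to "run up to max_trials trials across n_workers workers". A rough sketch of that shape, assuming nothing about amltk's internals (its Scheduler dispatches to local processes by default; a thread pool is used here only to keep the example self-contained):

```python
from concurrent.futures import ThreadPoolExecutor


def run_trials(target, configs, n_workers=1, max_trials=None):
    """Evaluate `target` over `configs` with a worker pool; order is preserved."""
    configs = configs[:max_trials]  # crude stand-in for a max_trials= limit
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(target, configs))


reports = run_trials(lambda cfg: cfg["x"] * 2, [{"x": i} for i in range(4)], n_workers=2)
print(reports)  # [0, 2, 4, 6]
```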

Finer Control

The last major part to understand is setup_only: bool = False. We assume the default behaviour is as in the code snippet above: "just get me some HPO runs". However, more advanced users (me) might want to have multiple pipelines set up to run HPO without starting them yet, caring more about setting up the entire flow.

pipeline = ...
scheduler = pipeline.optimize(..., setup_only=True)

In this case, the pipeline is ready to be optimized and will run as soon as I call scheduler.run() where notably, I am now in control of the scheduler and the run() call.


Other use cases considered:

  • You want to use your own History object which may be shared between pipeline optimization runs. optimize(history=my_history). This won't lock or cause race conflicts as all callbacks happen async in the main process.
  • You may want custom callbacks, this is handled through on_begin which allows you to do a host of custom things which have been documented.
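The setup_only flow, the shared history, and on_begin-style callbacks can be pictured together with a toy sketch. Everything below (ToyScheduler, the callback shape) is illustrative only, not amltk's API:

```python
# Toy sketch of setup_only=True plus a shared history and an
# on_begin-style callback; illustrative only, not amltk's API.
class ToyScheduler:
    def __init__(self, trials, history, callbacks=()):
        self.trials = trials
        self.history = history        # may be shared across schedulers
        self.callbacks = callbacks

    def run(self):
        for callback in self.callbacks:   # on_begin-style hooks fire first
            callback(self)
        for trial in self.trials:
            # All appends happen in the main process, so a shared
            # history needs no locking.
            self.history.append(trial)


shared_history = []
scheduler = ToyScheduler(
    ["trial-0", "trial-1"],
    shared_history,
    callbacks=[lambda s: print(f"starting {len(s.trials)} trials")],
)
# Nothing has run yet -- we decide when to call .run()
scheduler.run()
print(shared_history)  # ['trial-0', 'trial-1']
```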

@LennartPurucker (Collaborator) left a comment

Minor points, but one big one: does this have no tests?

Review comments on:
  • src/amltk/_util.py
  • src/amltk/pipeline/node.py
@eddiebergman (Contributor, Author)

Not yet, I wanted to flesh it out properly before testing it. Also, the docs were taking too long to render, so I have a PR coming up that fixes that first.

@eddiebergman (Contributor, Author) commented Jan 23, 2024

Any advice on what to test would be appreciated. It's the weird kind of function that merely pieces things together but doesn't implement much, i.e. I'd mainly be testing that the pieces do indeed match up; the one concrete thing would be the heuristic.

More concretely, what should I test about this function that wouldn't be caught by the tests of its individual pieces?

@aron-bram (Collaborator) left a comment
Easy to read and clean!
I don't actually see what you would need to test here to be honest, assuming that the classes in use here are already tested individually, which they seem to be.

Review comments on:
  • src/amltk/evalutors/evaluation_protocol.py
  • src/amltk/optimization/optimizer.py
@eddiebergman eddiebergman force-pushed the feat-pipeline-optimizer branch from 908b4d8 to 6874f42 Compare January 26, 2024 16:36
@eddiebergman (Contributor, Author) commented Jan 26, 2024

Thank you for the reviews @LennartPurucker @aron-bram! I made changes based on your comments.

Feel free to review post-mortem for any issues. After adding some tests and verifying that it seems to work as intended, I will merge once the automated tests pass.

Note: Ignore what I said about testing... I found small variable name bugs which caused behaviors not to occur. Live by the tests, die by the tests

@eddiebergman eddiebergman merged commit bded378 into main Jan 26, 2024
6 checks passed
@eddiebergman eddiebergman deleted the feat-pipeline-optimizer branch January 26, 2024 17:02
@@ -146,8 +146,7 @@ def create(
!!! note

Subclasses should override this with more specific configuration
but these 3 arguments should be all that's necessary to create
the optimizer.
but these arguments should be all that's necessary to create the optimizer.
A collaborator commented:

Future proofing docs 🙂

@@ -1083,7 +1083,7 @@ def register_optimization_loop( # noqa: C901, PLR0915, PLR0912
                 walltime_limit=process_walltime_limit,
                 cputime_limit=process_cputime_limit,
             )
-            plugins = (*_plugins, plugin)
+            _plugins = (*_plugins, plugin)
A collaborator commented:

I also tend to easily make these types of mistakes...

Labels: feature (A new feature)

Successfully merging this pull request may close these issues:

  • [Feat] Easy `pipeline.optimize(task_to_optimize, metric, **scheduler_arguments)`

3 participants