
refactor(runtime): Migrate metahyper to neps.runtime #71

Merged · 6 commits · Apr 29, 2024

Conversation

eddiebergman
Contributor

@eddiebergman eddiebergman commented Apr 25, 2024

This is a semi-large PR, but the main change is to integrate metahyper so that it is just part of neps. As such, I renamed it to neps/runtime.py.

Here's the synopsis that is included at the top of the file:

"""Module for the runtime of a single instance of NePS running.
An important advantage of NePS with a running instance per worker and no
multiprocessing is that we can reliably use globals to store information such
as the currently running configuration, without interfering with other
workers which have launched.
This allows us to have a global `Trial` object which can be accessed
using `import neps.runtime; neps.get_in_progress_trial()`.

---

This module primarily handles the worker loop where important concepts are:
* **State**: The state of optimization is all of the configurations, their results and
 the current state of the optimizer.
* **Shared State**: Whenever a worker wishes to read or write any state, they will _lock_ the
 shared state, declaring themselves as operating on it. At this point, no other worker can
 access the shared state.
* **Optimizer Hydration**: This is the process through which an optimizer instance is _hydrated_
 with the Shared State so it can make a decision, i.e. for sampling. Equally, we _serialize_
 the optimizer when writing it back to the Shared State.
* **Trial Lock**: When evaluating a configuration, a worker must _lock_ it to declare itself
 as evaluating it. This communicates to other workers that this configuration is in progress.
 
### Loop
We mark lines with `+` as the worker having locked the Shared State and `~` as the worker
having locked the Trial. The trial lock `~` is allowed to fail, in which case all steps
with a `~` are skipped and the loop continues.

1. + Check exit conditions
2. + Hydrate the optimizer
3. + Sample a new Trial
4. Unlock the Shared State
5. ~ Obtain a Trial Lock
6. ~ Set the global trial for this worker to the current trial
7. ~ Evaluate the trial
8. ~+ Lock the shared state
9. ~+ Write the results of the config to disk
10. ~+ Update the optimizer if required (used budget for evaluating trial)
11. ~ Unlock the shared state
12. Unlock Trial Lock
"""

Some major points about runtime.py:

  • Due to the NePS worker setup, there is always only one trial being evaluated per worker, which can be accessed through a global. Previously this was done only for the tblogger. I tried to make this more explicit, as a runtime API that can be used through neps. This could be useful for simplifying the user API, as shown in the example below. This way, we do not need to inject anything into the function signature itself, which prevents backwards-incompatible changes when we want to add new features, and provides a single access point for all the features we can offer a user within a given trial. Discussed with @Neeratyoy
```python
import neps

def myfunc(**config):
    neps.log({"hello": "world"})
    neps.save_checkpoint(my_model)
    neps.trial.previous_pipeline_directory
    neps.trial.previous_fidelity_evaluated_at
```
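A minimal sketch of how a per-worker global can back such an API, assuming one running instance per worker and no multiprocessing, as the docstring states. The names `_in_progress_trial`, `set_in_progress_trial`, and `get_in_progress_trial` are illustrative, not necessarily the actual neps.runtime functions:

```python
# Sketch: a module-level global holding the single in-progress trial.
# This is safe only because each worker is its own process with no
# multiprocessing, as described in the runtime docstring.
from typing import Any, Optional

_in_progress_trial: Optional[Any] = None


def set_in_progress_trial(trial: Any) -> None:
    """Called by the runtime just before evaluating a trial."""
    global _in_progress_trial
    _in_progress_trial = trial


def get_in_progress_trial() -> Any:
    """Called by user-facing helpers (e.g. logging, checkpointing)."""
    if _in_progress_trial is None:
        raise RuntimeError("No trial is currently being evaluated.")
    return _in_progress_trial
```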
  • The major functionality that leaked into other parts of NePS was read(), which would load in the configs and results. This now needs to be done explicitly through the SharedState, which better indicates what reading actually requires, as well as demanding an explicit declaration of locking the shared state, so it's evident that this is not a cheap operation and that it may block.
  • Changed the nomenclature of pending and pending free, of which the latter was a subset of the former. Now each trial is in one of 4 non-overlapping states. To be clear, the Trial is mostly for bookkeeping inside of runtime and does not leak into the rest of NePS; however, it might be useful for optimizers to be aware of it, as it contains more information than just the ConfigResult:
```python
@property
def state(self) -> Trial.State:
    if not empty_file(self.result_file):
        return Trial.State.COMPLETE
    elif self.lock.is_locked():
        return Trial.State.IN_PROGRESS
    elif not empty_file(self.config_file):
        return Trial.State.PENDING
    else:
        return Trial.State.CORRUPTED
```
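To make the property above self-contained, here is a sketch of the four-state enum it returns, in the same priority order the property checks them; the actual Trial.State in neps/runtime.py may define these differently:

```python
# Sketch of the four non-overlapping trial states; illustrative only.
from enum import Enum, auto


class State(Enum):
    COMPLETE = auto()     # result file exists and is non-empty
    IN_PROGRESS = auto()  # a worker currently holds this trial's lock
    PENDING = auto()      # config written to disk, not yet picked up
    CORRUPTED = auto()    # none of the above: something went wrong
```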

While doing so, quite a few minor changes were made beyond code structure:

  • Removed the Sampler abstraction; it was only used by BaseSampler. Moved all Sampler methods into BaseSampler.
  • Lots of typing fixes throughout. The major one, which touched a lot of files, is that def load_results(..., evaluating: dict[str, SearchSpace]) was previously typed as pending_evaluations: dict[str, ConfigResult], which was incorrect.
  • Removed the YamlSerializer class. There are now just serialize() and deserialize() functions which do the same thing. The automatic config transformation it performed via sampler.load_config() inside its deserialize() was removed and is now done explicitly on the result of deserialize().
  • Unfortunately, it seems some formatting happened accidentally (sorry). This should be fixed by migrating away from pylint, isort, and black, replacing them with ruff alone.
  • Created a neps/types.py file for commonly used simple types. I would hope to remove the string and dict typing that exists in a lot of places, as it becomes very unmaintainable at some point. (SMAC had a lot of silent issues due to this.)
  • Reworked _Locker to accept arguments such as timeout and polling. There's also an environment variable that can be used to control this if really required. The biggest change is that all operations are done within a with block, to prevent the possibility of accidentally not unlocking; this also handles cases where exceptions are thrown inside the block. As a side note, locking time is not really what hurts NePS in HPO benchmarking with cheap evaluations, although lower locking time certainly helps. It turns out serialization is the slowest part, as described further in #64 ([Optim] Consider pickle for optimizer state file in run (with option to toggle)).
  • I turned the following deprecations into explicit errors (as it's been a while):
```python
evaluation_fn_params = inspect.signature(evaluation_fn).parameters
if "previous_working_directory" in evaluation_fn_params:
    raise RuntimeError(
        "the argument: 'previous_working_directory' was deprecated. "
        f"In the function: '{evaluation_fn.__name__}', please "
        "use 'previous_pipeline_directory' instead. "
    )
if "working_directory" in evaluation_fn_params:
    raise RuntimeError(
        "the argument: 'working_directory' was deprecated. "
        f"In the function: '{evaluation_fn.__name__}', please "
        "use 'pipeline_directory' instead. "
    )
```
```python
# New solution
with shared_state.lock(poll=_poll, timeout=_timeout):
    trial.write_to_disk()
    if account_for_cost:
        assert eval_cost is not None
        with sampler.using_state(shared_state.optimizer_state_file):
            sampler.used_budget += eval_cost
```
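The _Locker rework can be illustrated with a polling file lock that is always used via with, so it is released even when the body raises. This is a hedged sketch: the class name, the poll/timeout parameters, and the lock-file mechanism are assumptions mirroring the PR description, not the actual implementation.

```python
# Sketch of a polling file lock. Parameter names (poll, timeout) mirror
# the PR description; the actual _Locker implementation may differ.
import os
import time
from contextlib import contextmanager


class Locker:
    def __init__(self, path: str, poll: float = 0.1, timeout: float = 5.0):
        self.path = path
        self.poll = poll
        self.timeout = timeout

    @contextmanager
    def lock(self):
        deadline = time.monotonic() + self.timeout
        while True:
            try:
                # O_EXCL makes creation atomic: it fails if the file exists,
                # so whoever creates the file holds the lock.
                fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                break
            except FileExistsError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"Could not lock {self.path!r}")
                time.sleep(self.poll)
        try:
            yield
        finally:
            os.close(fd)
            os.remove(self.path)  # released even if the body raised
```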

@eddiebergman changed the title from "refactor(runtime): Integrate metahyper closely" to "refactor(runtime): Migrate metahyper to neps.runtime" on Apr 25, 2024
@eddiebergman force-pushed the metahyper-to-runtime branch from 41e1a3d to 96c7889 on April 29, 2024 at 15:59
@eddiebergman merged commit 335a1a3 into master on Apr 29, 2024
11 checks passed
@eddiebergman deleted the metahyper-to-runtime branch on April 29, 2024 at 16:01