parallel trial execution #380

Open
bpkroth opened this issue Jun 1, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

bpkroth (Contributor) commented Jun 1, 2023

Would like to be able to execute multiple experiment trials in parallel

bpkroth added the enhancement (New feature or request) label on Jun 1, 2023
bpkroth (Contributor, Author) commented Oct 3, 2023

A possible method would be to use the async + background event loop thread technique demonstrated in #510.
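
For reference, here is a minimal sketch of the general async + background event loop thread pattern (not the actual code from #510; `_run_trial` and `submit_trial` are hypothetical stand-ins):

```python
import asyncio
import concurrent.futures
import threading

# Run an asyncio event loop on a background thread so the (synchronous)
# optimization loop can submit trials without blocking on each one.
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()


async def _run_trial(trial_config: dict) -> dict:
    """Stand-in coroutine for setting up, running, and tearing down one trial."""
    await asyncio.sleep(1)  # placeholder for the actual environment operations
    return {"config": trial_config, "score": 0.0}


def submit_trial(trial_config: dict) -> concurrent.futures.Future:
    """Schedule a trial on the background loop; returns a Future for its result."""
    return asyncio.run_coroutine_threadsafe(_run_trial(trial_config), _loop)


# Usage: submit several trials concurrently, then collect results when needed.
futures = [submit_trial({"vm_size": size}) for size in ("small", "large")]
results = [future.result() for future in futures]
```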

bpkroth (Contributor, Author) commented Nov 29, 2023

Related: #463 (First class scheduler support)

bpkroth (Contributor, Author) commented Nov 29, 2023

A few thoughts on design possibilities for this:

  • We need some sort of "pending trial configs" abstraction.
    • A simple way to implement this would be to provide a small class that interfaces with the storage backend and looks up pending trials/configs based on their status.
      #676 (Scheduling queue: Allow submitting trials with future timestamps and use watermarks to retrieve configs and results) starts this work.
      • Note that this would support injecting new trials/configs manually, scheduling repeats (e.g., for confidence when handling noisy results), or mixing ensembles of different "suggestion" modes (e.g., SMAC, FLAML, LLM, etc.), since it separates the optimization method from the trial/config scheduling (see below).
  • We need a notion of a "worker" (terminology to be determined).
    • At a high level, this could be a forked child process that manages running one trial to completion and returning the results to the orchestrator (i.e., the optimization loop in run.py).
    • A top-level mlos_bench CLI config option will be needed to determine how many of these worker slots should be active (e.g., max 10 parallel trials).
  • Trial configs would need to be able to make use of a new built-in variable (e.g., $worker_instance_id) in order to template out VM names, for instance, so there are no overlaps between parallel trial instances.
  • We need a "scheduler" component that matches a pending trial/config to a worker.
    • This should support pluggable policies so that, when a worker becomes idle (returns a status from a finished execution), we can determine which of the pending trials to assign to it vs. generating a new suggestion.
    • Additionally, for restart and error handling, we will need to store which worker_id was assigned which trial in the backend storage.
  • The run.py optimization loop will need to be adjusted so that the main process acts purely as an orchestrator. Something like Python's multiprocessing.Pool could work well for this. (A rough sketch combining these pieces follows after this list.)
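
A rough sketch of how these pieces might fit together, assuming hypothetical names (`PendingTrialQueue`, `run_trial`, `orchestrate`) rather than mlos_bench's actual APIs:

```python
import multiprocessing
from typing import Callable, Optional


class PendingTrialQueue:
    """Looks up pending trials/configs (by status) and records worker assignments.
    In mlos_bench this would sit on top of the storage backend; here it is an
    in-memory stub."""

    def __init__(self) -> None:
        self._pending: list = []    # list of (trial_id, config) pairs
        self._assigned: dict = {}   # trial_id -> worker_instance_id

    def add(self, trial_id: int, config: dict) -> None:
        self._pending.append((trial_id, config))

    def next_pending(self) -> Optional[tuple]:
        return self._pending.pop(0) if self._pending else None

    def assign(self, trial_id: int, worker_instance_id: int) -> None:
        # Persisting this mapping in backend storage enables restart/error handling.
        self._assigned[trial_id] = worker_instance_id


def run_trial(args: tuple) -> tuple:
    """Worker entry point: runs a single trial to completion in a child process."""
    worker_instance_id, trial_id, config = args
    # A built-in variable like $worker_instance_id would be templated into
    # resource names (e.g., VM names) so parallel trials do not collide.
    config = {**config, "vm_name": f"trial-vm-{worker_instance_id}"}
    result = {"score": 0.0}  # stand-in for the real setup/run/teardown
    return trial_id, result


def orchestrate(queue: PendingTrialQueue,
                suggest: Callable[[], dict],
                max_workers: int = 10,
                num_trials: int = 20) -> None:
    """Main-process orchestrator: match pending (or newly suggested) configs to workers."""
    with multiprocessing.Pool(processes=max_workers) as pool:
        submitted = []
        for trial_id in range(num_trials):
            pending = queue.next_pending()
            # Pluggable scheduling policy (simplified): prefer an already-pending
            # trial, otherwise ask the optimizer for a new suggestion.
            config = pending[1] if pending else suggest()
            worker_instance_id = trial_id % max_workers
            queue.assign(trial_id, worker_instance_id)
            submitted.append(
                pool.apply_async(run_trial, ((worker_instance_id, trial_id, config),)))
        for async_result in submitted:
            trial_id, result = async_result.get()
            # Results would be recorded in storage and fed back to the optimizer here.
            print(trial_id, result)
```

A real scheduler would reuse worker slots as they become idle rather than assigning them round-robin up front, but the division of responsibilities (pending queue, worker process, orchestrator) is the same.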
