parallel trial execution #380

Open
bpkroth opened this issue Jun 1, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

bpkroth (Contributor) commented Jun 1, 2023

Would like to be able to execute multiple experiment trials in parallel

bpkroth added the enhancement (New feature or request) label on Jun 1, 2023
bpkroth (Contributor, Author) commented Oct 3, 2023

A possible method would be to use the async + background event loop thread technique demonstrated in #510.
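
For reference, here is a minimal sketch of the general async + background event loop thread pattern (not the actual code from #510; `_run_trial` and `submit_trial` are hypothetical stand-ins):

```python
import asyncio
import concurrent.futures
import threading

# Run an asyncio event loop on a background thread so the (synchronous)
# optimization loop can submit trials without blocking on each one.
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()


async def _run_trial(trial_config: dict) -> dict:
    """Stand-in coroutine for setting up, running, and tearing down one trial."""
    await asyncio.sleep(1)  # placeholder for the actual environment operations
    return {"config": trial_config, "score": 0.0}


def submit_trial(trial_config: dict) -> concurrent.futures.Future:
    """Schedule a trial on the background loop; returns a Future for its result."""
    return asyncio.run_coroutine_threadsafe(_run_trial(trial_config), _loop)


# Usage: submit several trials concurrently, then collect results when needed.
futures = [submit_trial({"vm_size": size}) for size in ("small", "large")]
results = [future.result() for future in futures]
```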

bpkroth (Contributor, Author) commented Nov 29, 2023

Related: #463 (First class scheduler support)

bpkroth (Contributor, Author) commented Nov 29, 2023

A few thoughts on design possibilities for this:

  • We need some sort of "pending trial configs" abstraction.
    • A simple way to implement this would be to provide a small class that interfaces with the storage backend and looks up pending trials/configs based on their status.
      #676 (Scheduling queue: Allow submitting trials with future timestamps and use watermarks to retrieve configs and results) starts this work.
      • Note that this would support injecting new trials/configs manually, scheduling repeats (e.g., for confidence when handling noisy results), or mixing ensembles of different "suggestion" modes (e.g., SMAC, FLAML, LLM, etc.), since it separates the optimization method from the trial/config scheduling (see below).
  • We need a notion of a "worker" (terminology to be determined).
    • At a high level, this could be a forked child process that manages running one trial to completion and returning the results to the orchestrator (i.e., the optimization loop in run.py).
    • A top-level mlos_bench CLI config option will be needed to determine how many of these worker slots should be active (e.g., max 10 parallel trials).
  • Trial configs would need to be able to make use of a new built-in variable (e.g., $worker_instance_id) in order to template out VM names, for instance, so there are no overlaps between parallel trial instances.
  • We need a "scheduler" component that matches a pending trial/config to a worker.
    • This should support pluggable policies so that, when a worker becomes idle (returns a status from a finished execution), we can determine which of the pending trials to assign to it vs. generating a new suggestion.
    • Additionally, for restart and error handling, we will need to store which worker_id was assigned which trial in the backend storage.
  • The run.py optimization loop will need to be adjusted so that the main process acts purely as an orchestrator. Something like Python's multiprocessing.Pool could work well for this. (A rough sketch combining these pieces follows after this list.)
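
A rough sketch of how these pieces might fit together, assuming hypothetical names (`PendingTrialQueue`, `run_trial`, `orchestrate`) rather than mlos_bench's actual APIs:

```python
import multiprocessing
from typing import Callable, Optional


class PendingTrialQueue:
    """Looks up pending trials/configs (by status) and records worker assignments.
    In mlos_bench this would sit on top of the storage backend; here it is an
    in-memory stub."""

    def __init__(self) -> None:
        self._pending: list = []    # list of (trial_id, config) pairs
        self._assigned: dict = {}   # trial_id -> worker_instance_id

    def add(self, trial_id: int, config: dict) -> None:
        self._pending.append((trial_id, config))

    def next_pending(self) -> Optional[tuple]:
        return self._pending.pop(0) if self._pending else None

    def assign(self, trial_id: int, worker_instance_id: int) -> None:
        # Persisting this mapping in backend storage enables restart/error handling.
        self._assigned[trial_id] = worker_instance_id


def run_trial(args: tuple) -> tuple:
    """Worker entry point: runs a single trial to completion in a child process."""
    worker_instance_id, trial_id, config = args
    # A built-in variable like $worker_instance_id would be templated into
    # resource names (e.g., VM names) so parallel trials do not collide.
    config = {**config, "vm_name": f"trial-vm-{worker_instance_id}"}
    result = {"score": 0.0}  # stand-in for the real setup/run/teardown
    return trial_id, result


def orchestrate(queue: PendingTrialQueue,
                suggest: Callable[[], dict],
                max_workers: int = 10,
                num_trials: int = 20) -> None:
    """Main-process orchestrator: match pending (or newly suggested) configs to workers."""
    with multiprocessing.Pool(processes=max_workers) as pool:
        submitted = []
        for trial_id in range(num_trials):
            pending = queue.next_pending()
            # Pluggable scheduling policy (simplified): prefer an already-pending
            # trial, otherwise ask the optimizer for a new suggestion.
            config = pending[1] if pending else suggest()
            worker_instance_id = trial_id % max_workers
            queue.assign(trial_id, worker_instance_id)
            submitted.append(
                pool.apply_async(run_trial, ((worker_instance_id, trial_id, config),)))
        for async_result in submitted:
            trial_id, result = async_result.get()
            # Results would be recorded in storage and fed back to the optimizer here.
            print(trial_id, result)
```

A real scheduler would reuse worker slots as they become idle rather than assigning them round-robin up front, but the division of responsibilities (pending queue, worker process, orchestrator) is the same.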
