
add hyper-parameter optimization api #2532

Closed

schlichtanders opened this issue Sep 24, 2019 · 4 comments
Labels
discussion (requires active participation to reach a conclusion), feature request (Requesting a new feature)

Comments

@schlichtanders

Dear DVC folk,

Motivation

You mention it yourselves in your documentation: fully versioned hyperparameter optimization comes to mind when using DVC.

A Little Research

I did some quick research, and it quickly became apparent that this needs a DVC-specific implementation.

All the existing hyperparameter optimizers, like Python's hyperopt,

  • assume their own hyper-parameter API for how the hyperoptimization orchestration process communicates with the individual algorithm
  • and distribute the computation using their own distribution machinery

Suggestions on how to integrate with DVC

It seems to me the following is needed for hyperparameter optimization to be a natural addition to DVC:

  1. each triggered hyperoptimization orchestration should have its own git branch subfolder

  2. each single hyperoptimization run should have its own subbranch under that subfolder

  3. a file-based hyper-parameter API, probably based on JSON

    • i.e. the hyper-parameter configurations should be stored in a file format
      • e.g. Spearmint uses a custom JSON format, SMAC a completely custom file format
    • and, in addition, also the chosen parameters for a concrete run
      • everything I have found so far either passes the hyperparameters Python-internally as arguments to a function, or on the command line as arguments to a script... so there is no convention to copy, but in any case it is just a dictionary of values.
    • using a common JSON format would enable easy tracking/comparison of different parameters across hyperoptimization git branches, similar to how dvc metrics already works.
    • and the final run itself could easily be written as a .dvc routine by calling dvc repro
  4. it would be unbelievably awesome not to reinvent the wheel entirely, but to provide wrappers around existing hyperoptimization packages like hyperopt, SMAC, or others

    the principal idea is simple: instead of running a concrete algorithm with the specific framework, you run a wrapper (see the sketch after this list) which

    1. checks out a new hyperoptimization branch
    2. grabs the hyperparameters from the framework-specific API (e.g. as command-line args) and writes them into the new JSON file format
    3. runs dvc repro myalgorithm.dvc on the previously specified routine myalgorithm.dvc
    4. commits everything on the branch
    5. somehow finds out the winner of the hyperoptimization, creates a specific branch for it, and commits everything nicely.

    wrapping existing optimization frameworks has several advantages

    • less code to maintain, and only against stable APIs
    • a monitoring web UI and other tooling for evaluating or live-inspecting the hyperoptimization may already be available
    • the community could contribute new wrappers
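A minimal sketch of what one such wrapper step could look like. The file name params.json, its layout, and the helper name run_trial are assumptions for illustration, not an existing DVC or framework convention; only the git and dvc commands themselves are real:

```python
import json
import subprocess

def run_trial(params: dict, trial_id: int, base_branch: str = "hyperopt") -> None:
    """One trial: new subbranch, write params as JSON, dvc repro, commit."""
    branch = f"{base_branch}/{trial_id}"
    # 1. check out a new hyperoptimization subbranch
    subprocess.run(["git", "checkout", "-b", branch], check=True)
    # 2. write the hyperparameters in the proposed common JSON format
    #    (file name and layout are assumptions, not a DVC convention)
    with open("params.json", "w") as f:
        json.dump(params, f, indent=2, sort_keys=True)
    # 3. rerun the previously specified routine
    subprocess.run(["dvc", "repro", "myalgorithm.dvc"], check=True)
    # 4. commit everything on the subbranch
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", f"hyperopt trial {trial_id}"], check=True)
```

Step 5 (selecting the winner) is deliberately left out here; it only needs the per-branch metrics, which dvc metrics can already provide.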

Of course, more details will pop up while actually implementing this, e.g. how to integrate hyperoptimization with .dvc pipeline files as neatly as possible (for instance, we may want to commit both the single run.dvc as well as a hyperopt.dvc to the same repository -- these need to interact seamlessly).

What do you think about this suggested approach?

@dmpetrov (Member)

Hi @schlichtanders, thank you for the detailed description - an interesting perspective on hyperparameters and good ideas!

We are discussing experimentation scenarios in DVC, and it looks like DVC needs special support for some cases. A recent discussion example - #2379. I'd love to discuss this from the point of view of the hyperparameter tuning case and hyperparameter optimization packages.

Could you please clarify a few things:

  1. Are you thinking about creating a branch AND a subfolder in the branch for each of the experiments?
  2. Do you mean generating an experiment JSON file for each of the runs?
  3. What is the sequence? My understanding: the wrapper "asks" a hyperoptimization package for the next set of params through its API, generates a new branch with a proper JSON config, and runs it. So, at the end of the day, you will have N branches with runs and a final (N+1)-th branch with the final result (you will probably clean up some of the branches). Is my understanding correct?

The major question I have is: why do we need two abstractions, branches AND subfolders? Additional questions:
Q1. Can we use only branches? Experiment-as-a-branch is well supported in DVC.
Q2. Can we use only subfolders? I know that experiment-as-a-folder is not supported in DVC yet.
Q3. Which abstraction would you prefer (if experiment-as-a-folder were supported)?
Q4. Which of the branches OR subfolders would you prefer to keep/commit into Git?
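To make the sequence in question 3 concrete: with hyperopt's actual fmin API, the driving loop could look roughly like the sketch below. run_trial and read_metric are hypothetical helpers along the lines of the earlier sketch (branch + JSON config + dvc repro + commit, then reading the run's metric file); the hyperopt calls themselves are real API:

```python
from hyperopt import fmin, hp, tpe

# illustrative search space; hp.uniform / hp.loguniform are real hyperopt primitives
space = {
    "learning_rate": hp.loguniform("learning_rate", -7, 0),
    "dropout": hp.uniform("dropout", 0.0, 0.5),
}

trial_id = 0

def objective(params):
    global trial_id
    trial_id += 1
    run_trial(params, trial_id)          # hypothetical: branch + params.json + dvc repro + commit
    return read_metric("metrics.json")   # hypothetical: fmin minimizes this value

# hyperopt "asks" for the next set of params max_evals times
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20)
# afterwards: N = 20 run branches exist; a final branch can merge the best one
```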

@ghost added the feature request, research, awaiting response, and discussion labels and removed the research label on Sep 26, 2019
@pared (Contributor)

pared commented Oct 7, 2019

Related: https://discuss.dvc.org/t/best-practice-for-hyperparameters-sweep/244

@schlichtanders (Author)

schlichtanders commented Oct 13, 2019

I made some progress and created a small example; however, I currently have no time to complete it.

Nevertheless, here is the link:
https://github.com/schlichtanders/dvc_hyperopt_example

The idea is simple: after defining two helper functionalities, a hyperparameter search is just a little wrapper script which calls another .dvc file.

The two helpers:

  • bin/git_push_set_upstream++.py, which pushes a local branch to the remote by adding an incremental integer suffix.

    E.g. if your branch is "myhyperoptimizationbranch", it would be pushed as "myhyperoptimizationbranch/1" if it is the first one, or as "myhyperoptimizationbranch/43" if "myhyperoptimizationbranch/1" through "myhyperoptimizationbranch/42" already exist on the remote.

  • bin/git_merge_hyperoptimization.py (which should rather be named dvc_merge_hyperoptimization), which takes a hyperoptimization branch prefix, looks at all subbranches, and merges the best one according to a given metric using dvc metrics.

    E.g. pointed at "myhyperoptimizationbranch", it uses dvc metrics to get the metric information of "myhyperoptimizationbranch/1" through "myhyperoptimizationbranch/43" and merges the best one.
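A sketch of how the incremental-suffix push could work (the actual script in the linked repo may differ; the function name here is made up, the git commands are real):

```python
import subprocess

def push_with_incremental_suffix(prefix: str, remote: str = "origin") -> str:
    """Push HEAD as <prefix>/<n+1>, where n is the highest suffix on the remote."""
    # list remote branches under the prefix, e.g. refs/heads/myhyperoptimizationbranch/42
    out = subprocess.run(
        ["git", "ls-remote", "--heads", remote, f"{prefix}/*"],
        check=True, capture_output=True, text=True,
    ).stdout
    suffixes = [
        int(ref.rsplit("/", 1)[1])
        for ref in out.splitlines()
        if ref.rsplit("/", 1)[1].isdigit()
    ]
    branch = f"{prefix}/{max(suffixes, default=0) + 1}"
    subprocess.run(
        ["git", "push", "--set-upstream", remote, f"HEAD:{branch}"], check=True
    )
    return branch
```

For the merge helper, dvc metrics show with its --all-branches flag gives the per-branch metric values from which the best subbranch can be picked.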

I hope I find time in November/December to finish this and answer all your questions.

@alexvoronov
I had two thoughts related to a potential API for hyperparameters, on how to choose whether or not to store the resulting models ("treat it as cache" and "treat it as an optimal decision"). I posted them in another thread: #2379 (comment)

If the API allowed such flexibility, the exact decision could easily be delegated to other libraries. Unfortunately, I don't have anything more concrete than this wish/feature request yet.

@efiop removed the awaiting response label on Nov 6, 2019
@efiop closed this as completed on May 3, 2021
@iterative locked and limited conversation to collaborators on May 3, 2021

This issue was moved to a discussion.

You can continue the conversation there.
