
Introduce hyper parameters and config #3393

Closed
dmpetrov opened this issue Feb 24, 2020 · 59 comments
@dmpetrov (Member)

For an ML experiment, it is important to know not only the metrics but also the parameters that were used to produce them. Today there is no training/processing-parameter concept in DVC, which creates a problem when a user needs to visualize an experiment, for example in some UI.

A common workaround is to track parameters as metrics. However, the meaning of metrics is different: all the UI tools (including dvc metrics diff) show deltas, and deltas do not make sense for some types of params. For example, a delta for learning rate might be ok to see (the values are still comparable), but a delta for the number of layers (32, 64 or 128) does not make sense, and the same goes for non-numeric params like strings.

Also, config/parameters are a prerequisite for experiment management (#2799 or CI/CD scenarios), where DVC (or other automation tools) needs to change the training according to provided parameters.

Another benefit of DVC "understanding" parameters: it can use this information during repro. For example, DVC can realize that a stage which depends on the config file config.json should not be run despite a change to the config file, because the particular params the stage uses were not changed.

We need to introduce an experiment config/params file with a fixed structure that DVC can understand.

Open questions:

  1. Filename. config.json, dvcconfig.json, params.json.
  2. File format: json, text config/ini (name=value), Hydra, ... We can run a survey.
  3. How to track param dependency for stages. We can introduce a new type of dependency: param. If it is given then the stage depends on the file and on particular params values. Like dvc run -p learning_rate -p LL5_levels ....
  4. DVC should probably support groups of params. A param name pattern could be used: dvc run -p 'processing.*' ... (see the sketch below).
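
A quick sketch of how such pattern-based grouping might behave (illustrative only; the helper and param names are my own, not DVC internals):

import fnmatch

params = {"processing.resize": 256, "processing.crop": 224, "learning_rate": 0.005}

# Expand a pattern like 'processing.*' into the concrete params it covers.
tracked = [name for name in params if fnmatch.fnmatch(name, "processing.*")]
print(tracked)  # ['processing.resize', 'processing.crop']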
@dmpetrov dmpetrov added feature request Requesting a new feature discussion requires active participation to reach a conclusion labels Feb 24, 2020
@pared pared pinned this issue Feb 24, 2020
@elgehelge (Contributor) commented Feb 24, 2020

Great initiative @dmpetrov. What do you imagine the files contain? Is it stuff like required=true or is it types like learning_rate=float or is it default values like learning_rate=3.5e-5?

[Edit] Ah, so this is not about being able to run a pipeline with command line parameters? You are just talking about adding parameters as a concept, which would be just a file, right?

@elgehelge (Contributor)

I think putting parameters in a file might be a step in the wrong direction, but I need to think a little more 🤔

@dmpetrov (Member, Author)

@elgehelge yes, it is about config with options that you can read from your code.

Having a config file that many ML pipeline stages use (and depend on) is quite a usual approach. We just need to let each stage specify the subset of params it depends on.
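
For illustration, a minimal sketch of that usual approach, assuming a YAML params file read with PyYAML (file and key names are examples only):

import yaml

# All stages read the same shared params file...
with open("params.yaml") as f:
    params = yaml.safe_load(f)

# ...but each stage uses (and should only depend on) its own subset.
learning_rate = params["learning_rate"]
num_classes = params["num_classes"]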

I'd love to hear other options!

@elgehelge (Contributor) commented Feb 25, 2020

Okay. A config/params file could be a good first step, but not having parameters as commandline arguments would make day-to-day development a bit cumbersome: edit file, run repro, edit file, run repro, and so on... It would also be hard to plug an external job-running tool into this.

I do like the idea of having a file of default parameters though. But the important word here is default, since it would still be nice to be able to overwrite them using the commandline, but maybe this is another issue that comes later 😅

  1. File name: config.json might conflict with other project-related configs. And any name containing the word "config" makes me think that it contains settings. I am all for params.json.
  2. File format: First I thought json. But then I saw that ini files have groups, which I find super useful. For instance, at a previous job we hardcoded 3 different sets of default parameters into the code: one for training a tiny model (for debugging purposes), one for training a model for a small segment of clients, and one for training a full model. It could also be that for external compute or CI you would like to train for longer, so you could have params specifically for this. And I guess there are plenty more use-cases where you might want more than one set of defaults. For this reason I like ini; however, there are three strong reasons not to pick it: (1) there is no formal standard, (2) there are no types, (3) arrays are annoying to define. So, while researching this I found TOML, which seems to solve all the problems related to ini (https://en.wikipedia.org/wiki/TOML).
  3. Yes! So the dvc file could look something like this maybe??
md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py
deps:
- md5: 53184cb9a5fcf12de2460a1545ea50d4
  path: train_model.py
- md5: 809776eaeef854443d39e4f76bd64c84.dir
  path: data/split
- json: '0.0005'
  param: learning_rate
outs:
- md5: 47b3781954c760e1bd19ac9e56e141b1
  path: model/mymodel.pkl
  4. I love it.

Together with "the build cache" (#1234) I think this would work quite well.

[Edit] Removed my discussion about commandline support for parameters; let's take that somewhere else.

@elgehelge (Contributor)

While taking a step in this direction, we might also want to consider going all the way and adding environment variables as a dependency. Maybe like so?

deps:
- env: TF_ADJUST_SATURATION_FUSED
  value: 1

It is not something I find important myself, but it might be worth considering.

@dmpetrov (Member, Author)

@elgehelge thank you for the feedback. All very helpful.

  1. Filename. Agree that config is too broad. params or dvcparams are better options.
  2. ini and json are definitely the most common cases. We actually can support both: params.conf or params.json (error if both are present).

Note, for automation scenarios and visualization, we need a command to show all the params in a unified way like:

$ dvc params
learning_rate = 0.005
num_classes = 1000
augment.VerticalFlip = True
augment.Rotate = [20, 60, 90]
process.UseOpenCL = True

# The actual params file content:
$ cat params.conf
learning_rate = 0.005
num_classes = 1000

[augment]
VerticalFlip = True
Rotate = [20, 60, 90]

[process]
UseOpenCL = True

@elgehelge also, to clear up some misunderstanding about dvc run/repro: we need a way of defining params in dvc run and of changing params in dvc repro.

Examples:

| Action | Command |
| --- | --- |
| Define a dependency on a param in the default params file params.conf | dvc run -p learning_rate -d train.py -d images/ -o model.pkl python train.py |
| Define a dependency on a param in a custom params file my_param.conf | dvc run -P my_param.conf -p learning_rate -d train.py -d images/ -o model.pkl python train.py |
| Change a param in the current workspace and repro | dvc repro model.pkl.dvc -p learning_rate 0.004 |
| Change a param in a branch, repro and commit | dvc repro model.pkl.dvc -p learning_rate 0.004 --branch new_lr --commit-message "try new learning rate 0.004" |

Note, in both dvc repro examples the command depends on the params file. However, it should NOT retrain the model if the params file changed without a change to the corresponding param.
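
Conceptually, the granular check described in the note above could look something like this (a hypothetical helper assuming a YAML params file, not DVC's actual implementation):

import yaml

def tracked_params_changed(path, tracked, recorded):
    """Re-run only if a param the stage depends on differs from the
    value recorded in the .dvc file; unrelated edits are ignored."""
    with open(path) as f:
        current = yaml.safe_load(f)
    return any(current.get(name) != recorded.get(name) for name in tracked)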

PS: sorry for being too detailed. I'm trying to provide as many details as possible to make sure everybody is on the same page.

@elgehelge (Contributor)

@dmpetrov Great examples. Two comments.

  • Is there really a need for the "Change a param in branch, repro and commit" action? Would it be nicer to just make a new branch, do dvc repro model.pkl.dvc -p learning_rate 0.004, and then commit? In a real-world scenario I think the most common flow would be a bit different: imagine that you just used the default learning rate, and now you want to tune it. First, you add support for a new parameter in your code, and then you add the parameter as a dependency. So this cannot be done in a single command anyways.

  • Maybe I was not being clear in my previous comment. I think it is important to consider the use-case of having multiple "parameter sets". One example could be this: let's say you have a model that takes a whole day to train, but you would still like a CI to test that the code can run. In that case we would like a parameter set for a small fast model, and a parameter set for the real deployable one. Using toml this could look like:

[bigmodel]
learning_rate = 0.005
epochs = 25
data.Languages = ["english", "spanish", "chinese"]
data.MaxSamples = -1

[testmodel]
learning_rate = 0.005
epochs = 2
data.Languages = ["english"]
data.MaxSamples = 1000

And using json, you would need to have two different files, and thus two different stages. I am not saying one is better than the other, we just have to discuss it.

There are probably many other use-cases for having multiple parameter sets, like having multiple customers who require different model settings (maybe the data needs different preprocessing).

@efiop added the "product: VSCode" label Mar 3, 2020
@dmpetrov (Member, Author) commented Mar 6, 2020

Would it be nicer to just make a new branch, do dvc repro model.pkl.dvc -p learning_rate 0.004, and then commit? .... First, you add support for a new parameter in your code, and then you add the parameter as a dependency. So, this cannot be done in a single command anyways.

@elgehelge, yes, this is the scenario "Change a param in current workspace and repro". You are right that this is a default mode for brand new parameters or functionality.

While the last scenario "Change a param in branch, repro and commit" is automation on top of the previous one. It is needed when a parameter was already extracted and you do hyper param search in an automated or semi-automated way - from a script for example or through some service.

I think it is important to consider the use-case of having multiple "parameter sets"

Very good point! It was not mentioned in the description. Happy to discuss this.

My thoughts... In the current proposal, params can easily be overridden: dvc repro -p learning_rate 0.004. This is not enough for the use case you brought up. To support it, we could override the entire params file, like dvc repro --param-file params_test.conf -p learning_rate 0.004, while a plain dvc repro would work with the file defined in dvc run --param-file params_prod.conf, or with the default params.conf.

@elgehelge what do you think about this proposal?

  1. Is it ok to keep separate files, not separate sections? What is your preference?
  2. Do we need a default params.conf, or should we force users to specify it in each dvc run?
  3. Any other ideas?

@elgehelge (Contributor)

  1. Yes, I think multiple files will support most use-cases, including future ones that we might not be able to imagine now.
  2. I think a default is nice. Most projects will probably only need one file, so the tool is easier to get started with if there is a standardized place to put your parameters.

I wonder if I could try helping out and get to know the code a little. Maybe I would need some pointers on where to start 😅

@dmpetrov (Member, Author) commented Mar 6, 2020

@elgehelge sure! It would be great if you could take this issue.
This is a bit of a heavy issue, so let's chat quickly - I'll give you a bit more context around the process and point you to the right people who can help. Please ping me in Discord to schedule the chat (or shoot an email to dmitry@iterative.ai).

@casperdcl (Contributor)

Regarding conf/ini vs json, I'd say if you want any level of manual user editing, yaml is best :)

@elgehelge (Contributor) commented Mar 10, 2020

Regarding conf/ini vs json, I'd say if you want any level of manual user editing, yaml is best :)

Yes, yaml is an option too. I don't have enough experience with these different formats, so I don't have an opinion. One argument could also be that the .dvc file is yaml, and some consistency might be good.

dvc repro model.pkl.dvc -p learning_rate 0.004

What about data types in this cmd? Is it going to be a problem with 3 vs "3"?

@casperdcl (Contributor)

I assume some sort of intelligent try: float(value) except: value type thing would be nice.
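
Spelled out, such coercion might look like this (a sketch only; the int/bool fallbacks are my additions):

def coerce(value):
    """Best-effort typing of a CLI-provided param value."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    return value  # fall back to the raw string

coerce("3")      # -> 3 (int)
coerce("0.004")  # -> 0.004 (float)
coerce("relu")   # -> 'relu'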

@casperdcl (Contributor)

Also what about this format?

deps:
- md5: 53184cb9a5fcf12de2460a1545ea50d4
  path: train_model.py
- md5: 809776eaeef854443d39e4f76bd64c84.dir
  path: data/split
- params:
  - learning_rate: 0.0005
  - filters: [32, 32, 64, 128, 64, 32, 1]
  - dropout: false
  - activation: "relu"

@elgehelge (Contributor)

I talked with @efiop on Discord. We were actually talking about

deps:
  - param: myparam
    value: paramvalue

However, the suggestion given by @casperdcl is more readable, and it would also allow params.yaml to look similar to the .dvc file.

@efiop (Contributor) commented Mar 11, 2020

I guess the main question here is how to specify the params configs that we need to read. Combining our suggestions it might look something like:

deps:
  - path: params.conf
    params:
    - learning_rate: 0.0005
    - filters: [32, 32, 64, 128, 64, 32, 1]
    - dropout: false
    - activation: "relu"

which, actually, looks pretty good to me (but then again I'm not a DS). @dmpetrov @elgehelge how does it look to you guys?

@casperdcl (Contributor)

starting to look a lot like #3439 (comment)

deps:
  - path: params.conf
    md5: 1q2w3e4r5t6y7u8i9o0p9o8i7u6y5t4r
    filter: python extract_params_default.py --learning-rate 0.0005 --filters '32, 32, 64, 128, 64, 32, 1' --dropout False --activation relu

@dmpetrov (Member, Author) commented Mar 11, 2020

Just to make sure we are all on the same page.

  1. [Granular/conditional dependency] We introduce a config file that can be used in many stages, and a stage triggers repro (with respect to the config) only if a specific subset of the params was changed.
  2. [Default params file] The config file format can be config/ini, json or yaml. The major question is: what's the default name and format?
  3. [Custom params file] Users can redefine the config file name for a particular stage (or each of the stages).
  4. [Trigger for granular/conditional dependency] The subset of the params that a stage uses should be specified in the .dvc file, like:
deps:
  - path: params.conf
    params:
    - learning_rate: 0.0005
    - filters: [32, 32, 64, 128, 64, 32, 1]
    - dropout: false
    - activation: "relu"

Does anything need to be changed in these statements?

@dmpetrov (Member, Author)

What about data types in this cmd? Is it going to be a problem with 3 vs "3"?

Good question. Do we need types for params at all? We can say they are all strings, because there is no reason to compare them in some advanced way, in contrast to metrics.

@elgehelge Do you have a good scenario in mind where the param type is important and can give us more value?
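
For what it's worth, a tiny illustration of the all-strings view (my example): string comparison still detects every change, it is just conservative about equivalent rewrites.

old, new = "0.005", "0.0050"
print(old != new)  # True -> the stage would re-run, though the values are numerically equal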

@dmpetrov (Member, Author)

starting to look a lot like #3439 (comment)

🤔 you are right, @casperdcl. It looks like another type of granular/conditional dependency.

@elgehelge (Contributor) commented Mar 11, 2020

Okay, where to start? There are multiple things I would like to achieve, and they are all related.

  1. Convenience. Instead of having your preprocessing and model parameters spread over many files and hardcoded in your code, it would be convenient to collect them in one place, but in a way that does not invalidate all of your stages when just a single parameter is changed. Today multiple parameter files (one for each stage) would be possible, but it's a little less convenient. As @casperdcl points out, this is indeed related to Support function specific dependencies #3439, since it is about depending on a subset of a file. Since we are no longer talking about what I would call "parameterised pipelines", I like the idea of not introducing a new -p option, but instead just extending the -d option to something like -d somefile.yaml::learning_rate

However, I am still envisioning "parameterised pipelines", which would help us achieve:

  2. Hyperparameter tuning. Sometimes I would like to add a new parameter and try several values of that parameter. It would be nice if the actual values of the parameters were included in the .dvc file for reproducibility, but I would like not to have to change any files during the process. I think this requires command line arguments to be involved. This is about changing the actual parameters at runtime, and thus related to...

  3. Automation tools. We would like automation tools to be able to e.g. run two different versions of your pipeline (e.g. one model for phones and one for a web app). Or maybe you have a tool for starting jobs, where you would like to start many jobs with various parameters; maybe you are using a tool for automatic hyperparameter search.

Would the following sequence of commands make sense in your world? Maybe "pipeline arguments" is a better name than "parameters"?

Define data prep stage:
dvc run -p augment.rotate=False -d prepare_data.py -d images/ -o prep_data.csv python prepare_data.py --rotate {augment.rotate}
Define model training stage:
dvc run -p learning_rate=0.004 -d train.py -d prep_data.csv -o model.pkl python train.py --lr {learning_rate}
Now run the pipeline with different parameters/arguments:
dvc repro model.pkl.dvc -p learning_rate=0.01 -p augment.rotate=True

@elleobrien commented Mar 11, 2020

Just jumping in with a quick question: how would the parameters stored in the params file relate to what is hard-coded in, say, my tensorflow model training script? I can't imagine you'd want to automate the extraction of this metadata from the code (since you'd have to build a metadata extractor for each different modeling framework); would these parameters necessarily be arguments to the script that defines and trains your model?

In other words... how are you going to ensure that these parameters match what's in your model training code, other than specifying them as arguments to scripts?

@dmpetrov (Member, Author) commented Mar 17, 2020

I created a survey in a private data science community, ods.ai.

Q: Which file format do you use for config files / hyperparameter files?

  1. conf/ini
  2. yaml
  3. json
  4. other (please clarify)

Results so far (updated):

  • conf/ini: 2
  • yaml: 37
  • json: 8
  • other: (6 for python)

I'll update it in ~12 hours. But it seems like params.yaml is a good default option.

Update: yaml is the most commonly used format according to the survey. @jorgeorpinel let's use params.yaml by default.

@Suor (Contributor) commented Mar 17, 2020

I have a question: why do we want this to be ini or json if we use yaml everywhere else?

On adding params to the .dvc file, I would say it is a no-go for a couple of reasons:

  • those params are meant to affect more than one stage
  • we are going to deprecate .dvc files for run commands in favor of storing the whole DAG in one DVC-file (#1871); not sure when that will land though, and the design might still change

An alternative to a separate file is to add a section to the new pipeline/"stage collection" file. See example here regarding metrics.

@casperdcl (Contributor)

Regarding conf/ini vs json, I'd say if you want any level of manual user editing, yaml is best :)

It looks like #1871 is potentially blocking both #3439 and this (#3393), or at least confusing their implementation. Maybe it would make more logical sense to tackle #1871 first?

@dmpetrov (Member, Author)

On adding params to .dvc file

@Suor this params.yaml is not a dvc-file, but just a params file. In dvc-files we need to save the subset of params that the stage depends on, and their values, to identify changes.

@Suor (Contributor) commented Mar 17, 2020

@Suor this params.yaml is not a dvc-file, but just a params file. In dvc-files we need to save the subset of params that the stage depends on, and their values, to identify changes.

So there won't be one params file, but lots of params spread around? Or is this like dep checksums?

@dmpetrov (Member, Author)

Or is this like dep checksums?

Exactly. One params file, params.yaml, with dep-checksums spread around dvc-files (until the single file is introduced).

@Suor (Contributor) commented Mar 17, 2020

Then we have a conflict with collecting stages into a single file again. Since that file won't contain any checksums, param values will go to an accompanying .lock file and not be really visible. Is that an issue?

@dmpetrov (Member, Author)

Is that an issue?

Not at all. However, I'd expect to have the params list in the dvc-file, close to the stage definition, but to extract the values (think checksums) to the .lock file.

@Suor (Contributor) commented Mar 17, 2020

@dmpetrov adding values to the single stages file will bring issues. It will mean the file is not entirely human-controlled anymore, which will result in:

  • need to use slow yaml lib (ruamel.yaml) to handle that
  • merge conflicts on generated data
  • implementation will need to collect deps from two separate places: lock and stages file
  • no way, or at least it's harder, to abstract out and switch the "front-end", i.e. switch yaml stage files to python stage files

@dmpetrov (Member, Author)

adding values to the single stages file will bring issues.

You are right. But I assume that dvc run still writes to the dvc-file and keeps *.lock files, so we have the same set of issues there. I don't see how the params file makes it more complicated; I'd appreciate it if you could clarify.

@elgehelge (Contributor)

I think we should tackle #1871 first. Or at least not hold back on it. I will continue with this and make sure it is in a state where all the [must have]'s are satisfied, but I still don't think it should be merged before #1871 is done and merged in. At that point we will have a better understanding of how these might conflict.

@elgehelge (Contributor) commented Mar 19, 2020

5. [must have] To override the default file you can just introduce a capital -P process_params.yaml. In most cases, only one params file will be used. (@dmpetrov)

Instead of adding yet another option, maybe this is something that should be handled by an environment variable?

efiop pushed a commit that referenced this issue Mar 19, 2020
efiop added a commit that referenced this issue Mar 19, 2020
@efiop self-assigned this Mar 20, 2020
@dmpetrov (Member, Author)

maybe this is something that should be handled by an environment variable?

Env variables break reproducibility.

@elgehelge (Contributor)

maybe this is something that should be handled by an environment variable?

Env variables break reproducibility

Not regarding parameters since these are part of the dvc file.

As a side note, I would argue that environment variables should be supported as a dependency (but this is not a requirement for my argument to hold).

@elgehelge (Contributor)

Just read up on MLFlow projects. In my opinion they have done it the right way.

entry_points:
  main:
    parameters:
      data_file: path
      regularization: {type: float, default: 0.1}
    command: "python train.py -r {regularization} {data_file}"
  validate:
    parameters:
      data_file: path
    command: "python validate.py {data_file}"

A few great things about this approach:

  • Not having to point at an external file
  • Each parameter is closely coupled with the step (entrypoint) that uses the parameter
  • They do not force a default value

@shcheklein unpinned this issue Apr 10, 2020
@dmpetrov (Member, Author)

Just read up on MLFlow projects. In my opinion they have done it the right way.

@elgehelge To my mind, this approach goes against the best practice of having a config file with hyperparams. Probably 80%+ of projects have a config. It would not be easy to convince people to use this specific format. Also, I have heard complaints about duplication in MLflow pipelines: in your code example you repeated regularization twice, you need to repeat it again when you parse the args, and one more time when you output it: mlflow.log_param("regularization", args.regularization).

In general, MLflow Projects does not seem to be getting enough traction (compared to MLflow Tracking) to use it as a good example.

@dmpetrov (Member, Author)

We did a great job implementing params (#3515, kudos to @elgehelge) and the multi-stage dvc file, which tracks param values through dvc.lock (thanks to @skshetry).
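
For reference, a param dependency ends up recorded per stage in dvc.lock roughly like this (illustrative snippet; the exact layout has evolved across DVC versions):

train:
  cmd: python train.py
  deps:
  - path: train.py
    md5: <hash>
  params:
    params.yaml:
      learning_rate: 0.005
  outs:
  - path: model.pkl
    md5: <hash>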

Closing this issue. Please LMK if there is something we need to extract to a separate one.
