Introduce hyper parameters and config #3393
Great initiative @dmpetrov. What do you imagine the files would contain? Is it stuff like …? [Edit] Ah, so this is not about being able to run a pipeline with command-line parameters? You are just talking about adding parameters as a concept, which would just be a file, right?
I think putting parameters in a file might be a step in the wrong direction, but I need to think a little more 🤔
@elgehelge yes, it is about a config with options that you can read from your code. Having a config file that many ML pipeline stages use (and depend on) is quite a common approach. We just need a way to depend on a subset of the params instead of the whole file. I'd love to hear other options!
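For illustration, such a params file might look like this (names and values are hypothetical, just a sketch of the idea):

learning_rate: 0.0005
epochs: 20
process:
  threshold: 0.98
  bow_size: 15000

Each stage would then declare which of these keys it actually depends on.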
Okay. A config/params file could be a good first step, but not having parameters as command-line arguments would make day-to-day development a bit cumbersome: edit file, … I do like the idea of having a file of default parameters though. But the important word here is default, since it would still be nice to be able to override them using the command line, but maybe this is another issue that comes later 😅
Together with "the build cache" (#1234) I think this would work quite well. [Edit] Removed my discussion about command-line support for parameters, let's take that somewhere else.
While taking a step in this direction, we might also want to consider going all the way and adding environment variables as a dependency. Maybe like so? …
It is not something I find important myself, but it might be worth considering.
@elgehelge thank you for the feedback. All very helpful.
Note, for automation scenarios and visualization, we need a command to show all the params in a unified way like:
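For illustration, such a unified view might look like this (the command name and output format here are hypothetical, not a committed design):

$ dvc params show
Stage     Param           Value
process   threshold       0.98
train     learning_rate   0.0005
train     epochs          20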
@elgehelge also, there is some misunderstanding in the Examples: …
Note, in both … PS: sorry for being too detailed. I'm trying to provide as many details as possible to make sure everybody is on the same page.
@dmpetrov Great examples. Two comments.
And using … There are probably many other use-cases for having multiple parameter sets? Like having multiple customers which require different model settings (maybe the data needs different preprocessing).
@elgehelge, yes, this is the scenario "Change a param in current workspace and repro". You are right that this is the default mode for brand-new parameters or functionality, while the last scenario, "Change a param in branch, repro and commit", is automation on top of the previous one. It is needed when a parameter was already extracted and you do hyper-param search in an automated or semi-automated way, from a script for example or through some service.
Very good point! It was not mentioned in the description. Happy to discuss this. My thoughts: in the current proposal, the params could be easily overwritten … @elgehelge what do you think about this proposal?
I wonder if I could try helping out and get to know the code a little. Maybe I would need some pointers about where to start 😅
@elgehelge sure! It would be great if you can take this issue.
Regarding …
Yes, …
What about data types in this cmd? Is it going to be a problem with 3 and "3"?
I assume some sort of intelligent …
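For context: YAML and JSON both distinguish quoted from unquoted scalars, so the same-looking value can parse to different types, e.g.:

epochs: 3        # parses as the integer 3
epochs: "3"      # parses as the string "3"
dropout: false   # parses as a boolean, not the string "false"

A type-aware comparison would treat 3 and "3" as different values, even though the user probably means the same thing.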
Also what about this format?

deps:
- md5: 53184cb9a5fcf12de2460a1545ea50d4
  path: train_model.py
- md5: 809776eaeef854443d39e4f76bd64c84.dir
  path: data/split
- params:
  - learning_rate: 0.0005
  - filters: [32, 32, 64, 128, 64, 32, 1]
  - dropout: false
  - activation: "relu"
I talked with @efiop on Discord. We were actually talking about …
However, the suggestion given by @casperdcl is more readable, and it would also allow …
I guess the main question here is how to specify the params configs that we need to read. Combining our suggestions it might look something like:
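For illustration only, assuming the -p flag from the earlier examples and the usual -d/-o flags of dvc run, one possible shape:

dvc run -d train_model.py -d data/split \
        -p learning_rate -p filters -p dropout -p activation \
        -o model.pkl \
        python train_model.py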
which, actually, looks pretty good to me (but then again I'm not a DS). @dmpetrov @elgehelge how does it look to you guys?
starting to look a lot like #3439 (comment)

deps:
- path: params.conf
  md5: 1q2w3e4r5t6y7u8i9o0p9o8i7u6y5t4r
  filter: python extract_params_default.py --learning-rate 0.0005 --filters '32, 32, 64, 128, 64, 32, 1' --dropout False --activation relu
Just to make sure we are all on the same page: …
Does anything need to be changed in these statements?
Good question. Do we need types for params at all? We could say all of them are strings, because there is no reason to compare them in some advanced way, in contrast to metrics. @elgehelge do you have a good scenario in mind where a param type is important and can give us more value?
🤔 you are right, @casperdcl. It looks like another type of granular/conditional dependency.
Okay. Where to start? There are multiple different things I would like to achieve, and they kind of relate.
However, I am still envisioning "parameterised pipelines", which would help us achieve: …
Would the following run of commands make sense in your world?? Maybe "pipeline arguments" is a better name than "parameters"? Define a data prep stage:
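For illustration, one possible shape of such commands (the syntax here is hypothetical, for discussion only):

# define the stage with a declared parameter
dvc run -n prepare -p split_ratio -d data/raw -o data/split python prepare.py

# later, reproduce it with an overridden value
dvc repro prepare --param split_ratio=0.8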
Just jumping in with a quick question: how would the parameters stored in the … In other words, how are you going to ensure that these parameters match what's in your model training code, other than specifying them as arguments to scripts?
I created a survey in a private data science community, ods.ai. Q: What file format do you use for config / hyper-param files?
Results so far (updated): …
I'll update it in ~12 hours. But it seems like … Update: …
I have a question: why do we want this to be INI or JSON if we use YAML everywhere else? On adding params to …
An alternative to a separate file is to add a section to the new pipeline/"stage collection" file. See the example here regarding metrics.
@Suor this …
So there won't be one params file, but lots of params spread around? Or is this like dep checksums?
Exactly. One params file …
Then we have a conflict with collecting stages into a single file again. As that file won't contain any checksums, param values will also need to go to an accompanying file.
Not at all. However, I'd expect to have the params list in the dvc-file, close to a stage definition, but extract the values (think checksums) to an accompanying file.
@dmpetrov adding values to the single stages file will bring issues. It will mean the file is not entirely human-controlled anymore, which will result in: …
You are right. But I assume that …
I think we should tackle #1871 first, or at least not hold back on that. I will continue with this and make sure it is in a state where all the [must have]'s are satisfied, but I still don't think it should be merged before #1871 is done and merged in. At that time we will have a better understanding of how these might conflict.
Instead of adding yet another option, maybe this is something that should be handled by an environment variable?
* Work in progress
* Added file parsing and name validation + adjusted schema
* Exceptions on bad input
* Support multiple parameters
* Support multi …'s in …
* Restyled by black
Co-authored-by: elgehelge <helgemunkjacobsen@gmail.com>
Co-authored-by: Restyled.io <commits@restyled.io>
Env variables break reproducibility.
Not regarding parameters, since these are part of the dvc file. As a side note, I would argue that environment variables should be supported as a dependency (but this is not a requirement for my argument to hold).
Just read up on MLflow Projects. In my opinion they have done it the right way.

entry_points:
  main:
    parameters:
      data_file: path
      regularization: {type: float, default: 0.1}
    command: "python train.py -r {regularization} {data_file}"
  validate:
    parameters:
      data_file: path
    command: "python validate.py {data_file}"

A few great things about this approach: …
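For background on how this is used (standard MLflow behavior, added here for illustration): an entry point defined this way is invoked with parameter overrides on the command line, e.g.

mlflow run . -P regularization=0.2

which is what makes the parameters first-class in the project definition.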
@elgehelge To my mind, this approach goes against the best practice of having a config file with hyper params. Probably 80%+ of projects have a config; it would not be easy to convince people to use this specific format. Also, I have heard complaints about duplication in MLflow pipelines; in your code example you repeated data_file in both entry points. In general, MLflow Projects does not seem to be getting enough traction (compared to MLflow Tracking) to use it as a good example.
We did a great job by implementing the params #3515 (kudos to @elgehelge) and the multi-stage dvc file, which tracks param values through …
Closing this issue. Please LMK if there is something we need to extract to a separate one.
For an ML experiment, it is important to know the metrics as well as the parameters that were used to get those metrics. Today there is no training/processing parameter concept in DVC, which creates a problem when a user needs to visualize an experiment, for example in some UI.
A common workaround is to track parameters as metrics. However, the meaning of metrics is different. All the UI tools (including dvc metrics diff) need to show deltas, and deltas do not make sense for some types of params. For example, a delta for the learning rate might be OK to see (the values themselves would still be better), but a delta for the number of layers (32, 64 or 128) does not make sense, and the same goes for non-numeric params like strings.

Also, config/parameters are a prerequisite for experiment management (#2799 or CI/CD scenarios), when DVC (or other automation tools) needs to change the training according to the provided parameters.
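To make the workaround above concrete: a diff over params tracked as metrics would produce something like this (hypothetical values and layout), where the Change column is meaningful for learning_rate but meaningless for activation:

$ dvc metrics diff
Path         Metric          Old     New     Change
params.json  learning_rate   0.001   0.0005  -0.0005
params.json  activation      relu    tanh    n/a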
Another benefit of "understanding" parameters: DVC can use this information during repro. For example, DVC can realize that a step process, which depends on the config file config.json, should not be run despite a change to the config file, because the params it uses were not changed.

We need to introduce an experiment config file / parameters file with a fixed structure that DVC can understand.
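A minimal sketch of such a granular dependency in a dvc-file (the field layout is assumed for illustration, in line with the formats discussed above):

deps:
- path: config.json
  params:
  - learning_rate
  - epochs

A change to any other key in config.json would then not invalidate this stage.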
Open questions:
- File name: config.json, dvcconfig.json, or params.json?
- How to specify params in commands: dvc run -p learning_rate -p LL5_levels ..., or with wildcards like dvc run -p 'processing.*' ...?