Model parameter tracking #2379
@hhoeflin hi, and thanks for sharing your use case!
This is a problem we are aware of. Hyperparameter search is part of many people's workflow, and unfortunately we have not yet established how we would like to tackle it. Would you be interested in discussing this subject? Your intent to put parameters inside a yaml file is perfectly understandable, but what about the produced outputs? If you produce a few models, differing by the params used to train them, would you like to preserve all of them, or choose the best one (according to one of the metrics) and preserve only it?
Yes, I would definitely be interested in discussing this subject more. First, I would characterize my parameters as parameters that hold for the entire "branch", or at least the current commit. I assume the setting you describe refers to fitting, e.g., 10 models with 10 different learning rates, where even the performance after each epoch could be a separate metric? For me, the metric files should be flexible and support all of this. Of course, outputs can get very big rather quickly, which is why I think for such a scope an export into csv files would be needed, so that the results can easily be plotted and explored in other tools.
Yes, in extreme cases I was thinking about exploring parameters like
That is a good point. To compare results, the user would probably want to explore, with some visualization tool (for example TensorBoard), how training went across epochs, right? What about preserving the "best model" (for convenience, let's say we already know how to choose the best)?
At the moment this really depends on the application. Often, disk space permitting, I keep all of them, at least for a while; later I may only keep one. My reference to "best", however, was also intended in regard to metrics. I think metrics are currently intended to be numerical, right? When describing models as they evolve through epochs, the "parameters" such as the epoch may be a mix of numerical and string values. In any case, metrics should support storing results both per epoch and for the best model. From this perspective, the epoch is actually just another parameter for an existing model. I think one difficulty here is also how to impose structure so that a metrics json file could reliably be serialized to a csv file.
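Just to make the idea concrete, here is a minimal sketch of what such a structured metrics file and its csv export could look like. This is only an assumption about the layout (one record per epoch, parameters such as `epoch` stored alongside numeric metrics); the file names, keys, and values are made up for illustration.

```python
# Hypothetical example: one metrics record per epoch, each carrying its parameters,
# flattened into a csv that external tools can plot and explore directly.
import csv
import json

records = [
    {"learning_rate": 0.01, "epoch": 1, "val_loss": 0.82, "val_acc": 0.71},
    {"learning_rate": 0.01, "epoch": 2, "val_loss": 0.64, "val_acc": 0.78},
    {"learning_rate": 0.01, "epoch": "best", "val_loss": 0.64, "val_acc": 0.78},
]

with open("metrics.json", "w") as f:  # the file DVC would track as a metric
    json.dump(records, f, indent=2)

with open("metrics.csv", "w", newline="") as f:  # flat export for plotting in other tools
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
```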
@hhoeflin Actually, AFAIK the type of a metric does not matter, because we do not provide any logic related to metrics that would be type-related. The thing is that introducing some logic around hyperparameter search would probably make us introduce behaviour based on metric values.
For me the important thing is not so much logic around hyperparameter search, but the ability to
I have just started thinking about these issues and don't really have good answers yet.
Also, for some time already there has been a hanging issue that might be related to this particular use case:
I see why someone would want to do this, but for me this is not a priority for the following reason:
There are already different tools to do hyperparameter optimization: https://en.wikipedia.org/wiki/Hyperparameter_optimization#Open-source_software I don't think DVC will bring any value by trying to solve this problem again; it is better to seek a smooth integration with other tools. As I understand from what @hhoeflin expressed here #2379 (comment), the idea is to have something like a tracking UI where you can see the input parameters next to the output metrics for each run. I think this is a really valid concern and an interesting one (cc: @iterative/engineering). By the way, @hhoeflin, you can use MLflow alongside DVC.
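As a rough illustration of the "MLflow alongside DVC" idea (not an official recipe): DVC keeps versioning the data and model files, while MLflow records the hyperparameters and resulting metrics per run. The parameter names and metric values below are made up.

```python
# Hypothetical sketch: log hyperparameters and metrics to MLflow while the data
# and model artifacts themselves stay under DVC control.
import mlflow

params = {"learning_rate": 0.01, "batch_size": 64}  # illustrative values
with mlflow.start_run():
    for name, value in params.items():
        mlflow.log_param(name, value)
    # ... train the model here; the model file itself would be a DVC-tracked output ...
    mlflow.log_metric("auc", 0.91)  # placeholder result
```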
@MrOutis
@hhoeflin thank you for bringing up this issue - it is an important scenario which we are thinking about. There are still a few open questions, and I'd like to make sure I understand the question correctly. Could you please clarify a few things:
btw... what command did you run to get this error:
For me, statement 1) is correct. If hyperparameter search could be captured, that would be nice though. Hope that helps! P.S. I used "dvc metrics add"
Thank you, @hhoeflin! It is clear with (1). Re (2) - currently DVC handles "static" metrics pretty well; however, if you'd like to see a graph of the differences, you need to implement this yourself. We need to think more about supporting this scenario - one of the open questions. Re (3) - so, you'd like to have commits for each set of parameters. Is that correct? If so, there is a discussion about this in #1691 and it looks like it will be implemented soon. Today we show only the metrics difference between branches/commits, without any association to parameters. To track metrics history properly, we need to have this metrics-params association (this needs to be done based on some config baseline). It looks like we need to introduce a new concept of a (model) config and config changes (parameters) into DVC. @hhoeflin do you agree with this? Do you see any simpler solution without introducing this concept?
For (3), this is what I was thinking. Thinking about it, I am not sure such rather complex behaviour necessarily needs to be provided by a command-line tool. An alternative approach could be to expose/expand/provide examples of python classes that can iterate within branches, as well as across the entire repo, and allow the user to perform operations on selected files - e.g., iterate through a repo, read all yaml files of a certain type, return them as a dict with (branch, commit) as the key, and leave it to the user what to do with this.
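A minimal sketch of what such a helper could look like, assuming GitPython and PyYAML are available. The function name, file name, and overall API are purely illustrative, not an existing DVC interface.

```python
# Hypothetical sketch: collect a YAML parameter file across all branches and commits
# and return {(branch, commit_sha): parsed_yaml}, leaving further processing to the user.
import io

import yaml
from git import Repo


def collect_params(repo_path=".", filename="params.yaml"):
    """Return the parsed parameter file for every commit on every branch that has it."""
    repo = Repo(repo_path)
    results = {}
    for branch in repo.branches:
        for commit in repo.iter_commits(branch):
            try:
                blob = commit.tree / filename
            except KeyError:
                continue  # this commit does not contain the file
            data = yaml.safe_load(io.BytesIO(blob.data_stream.read()))
            results[(branch.name, commit.hexsha)] = data
    return results
```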
If you are talking about the metrics-params association - it looks like the association has to be created for proper visualization anyway (python or command line). I think we should implement this in DVC first and then extract it into an API. @hhoeflin, one more question for you if you don't mind :)
@hhoeflin one of the workarounds that comes to my mind is to just dump into the "final" metrics file (json, csv, tsv - the format does not matter for DVC, and it provides some interface to work with all these formats) all input/global parameters along with the actual metrics (like AUC). This way you will be able to see them with dvc metrics show. Is this a reasonable approach, or would we still be missing something?
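For illustration, a minimal sketch of this workaround, assuming the input parameters live in a flat params.yaml mapping; the file names, keys, and metric values are not prescribed by DVC and are made up here.

```python
# Hypothetical sketch of the workaround above: write input parameters and output
# metrics into one "final" metrics file, so that showing the metrics also shows the params.
import json

import yaml

with open("params.yaml") as f:  # global/input parameters, assumed to be a flat mapping
    params = yaml.safe_load(f)

metrics = {"auc": 0.91, "accuracy": 0.87}  # placeholder values produced by the training stage

with open("metrics.json", "w") as f:  # single file to track as a DVC metric
    json.dump({**params, **metrics}, f, indent=2)
```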
I was looking for best practices on how to organize hyperparameter search and was pointed to this nice thread. From the discussion here I see that I'm not alone, but I also see that there is no "silver bullet" yet. Here is what I'm missing: I'd like to separate a generic workflow/pipeline/DAG definition from each of its "instances" (with the hashes and all). Reading through the links in this thread, it looks like some combination of snakemake for pipeline definition, with the addition of makepp's hash-based dependency tracking, all built on top of DVC, might lead to some sort of solution for me.
I have a few thoughts about which models to save; I wonder if they resonate with anyone else too:

1. Cache metaphor

Hyperparameter values can be seen as a key, and the model can be seen as a stored value, in some hashtable or cache. If there is no stored model, we (re)compute the model and its metrics (e.g. if we come up with a new metric that we need for all models we have tried so far). What to store and what to recompute depends on the application (e.g. we can store the 5 best models seen so far, or everything, or nothing). If the model file is tracked by DVC, then the .dvc file has enough information to recompute the model. Those model files are parametrized by specific values of the hyperparameters. Here I come back to a generic workflow/pipeline that can create instances for specific parameters (see the sketch after this comment).

2. Optimize what to store and what to recompute

We can assign a "price" to computing time and to storage (and to data transfer, storage duration, etc.). If storage is very cheap, we save all models. If storage is expensive, we recompute everything. We can also look at the time it took to compute the model, and at the resulting model size, to decide whether to store the model, or to delete it and recompute it in the future if needed.
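A minimal sketch of the "cache metaphor" above: the hyperparameter values form the key, the trained model is the cached value, and a miss triggers recomputation. All names (train_model, the cache directory) are illustrative, not an existing DVC API.

```python
# Hypothetical sketch: treat hyperparameters as a cache key and the model as the cached value.
import hashlib
import json
import pickle
from pathlib import Path

CACHE_DIR = Path("model_cache")


def model_key(hyperparams: dict) -> str:
    """Deterministic key derived from the hyperparameter values."""
    blob = json.dumps(hyperparams, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


def get_or_train(hyperparams: dict, train_model):
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{model_key(hyperparams)}.pkl"
    if path.exists():                      # cache hit: reuse the stored model
        return pickle.loads(path.read_bytes())
    model = train_model(**hyperparams)     # cache miss: (re)compute the model
    path.write_bytes(pickle.dumps(model))  # decide here whether storing is worth the space
    return model
```

Whether to keep the stored value or evict it (point 2 above) then becomes a policy on top of this lookup, e.g. based on training time versus model size.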
Hi! I am also on the lookout for a way to track my parameters with dvc. TL;DR: I need a few parameters tracked for the download > explore > preprocess stages, and I use MLflow for hyperparameter optimization tracking. This seems different from what others need, who want DVC to handle the hyperparameter optimization itself. My current approach: I have a shell script which produces the pipelines (with many stages; I only show one here):
Then, I have a folder which contains the parameters (and params/auto for those that get generated by stages; for example, params/auto/explore.yaml contains the mean and std of the MNIST dataset, which is used for preprocessing, and then later, when packaging the model into a docker container, to also scale the input data to what the model expects). This works; however, the whole grep'n'sed is very error prone. I need to make three changes (-d params/auto/explore.yaml, mean=$..., and later in the command -P mean=$mean) to implement a new parameter. If you are interested in what I have: https://github.com/ssYkse/mlflowworkshop
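Not part of the workflow above, just a rough idea of how the grep'n'sed step could be replaced with a small python helper. The yaml layout and the "-P name=value" argument style are assumed from the description; everything else is illustrative.

```python
# Hypothetical sketch: read the generated parameter file and build "-P name=value"
# arguments programmatically instead of extracting each value with grep/sed.
import shlex

import yaml

with open("params/auto/explore.yaml") as f:  # assumed layout, e.g. {"mean": 0.1307, "std": 0.3081}
    params = yaml.safe_load(f)

extra_args = " ".join(
    f"-P {name}={shlex.quote(str(value))}" for name, value in params.items()
)
print(extra_args)  # interpolate into the command that launches the stage
```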
All of these are implemented in the coming DVC 1.0:
Closing. Please let me know if something is not implemented yet.
I usually save various parameters for my deep learning projects in yaml files (e.g., learning rate ranges to search, how to pre-process the data for training/testing, etc.). It would be nice to have an easy way to track and show them. I wanted to use dvc metrics show for that, which would allow me to show the input parameters for a run next to the output metrics for that run. I thought that would be very handy for tracking what has actually changed.
However, when I try to add that parameter yaml file with dvc metrics add, I get an error:
ERROR: failed to add metric file 'my_file.yaml' - unable to find DVC-file with output 'my_file.yaml'
This error is understandable - as this is a parameter file, there is no previous step that produced it. So this may need to be handled somewhat differently.
Does such a feature make sense from your perspective? Would it be possible to add this?
Thanks