ML experiments and hyperparameters tuning #2799
Comments
I think I almost-but-not-quite understand the aim here. I feel like I'm missing some key concept.
This seems to be almost a contradiction - the most robust way to "keep track" is to commit separately.
This could be satisfied, for example, by a bash script looping through param choices, with @nteract/papermill for notebook users. I think it would be quite hard to write a tool to do this in a language/platform-agnostic way. It's hard enough with ... To be all-encompassing we'd have to wind up supporting multiple ways of passing in params: env vars, CLI args, ...
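For notebook users, a minimal sketch of what such a looping script could look like (the notebook name, the lr parameter, and the output layout are made up for illustration; papermill's -p flag injects parameters):

```bash
#!/usr/bin/env bash
# Hypothetical sweep: run the same notebook once per learning-rate choice.
for lr in 0.001 0.01 0.1; do
  mkdir -p "experiments/lr_${lr}"
  papermill train.ipynb "experiments/lr_${lr}/output.ipynb" -p lr "${lr}"
done
```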
Again a
Would need to create a formal metrics specification, or at least be very intelligent about automatically interpreting and visualising whatever the end-users throw at us.
Not sure how "best" can be automated with "not necessarily the highest metrics"
All could be handled by the bash script.
Probably part of any potential formal metrics spec.
and
Really seems like end-users writing bash/batch scripts would solve this. Overall I feel like this has two requirements:
I'd be against designing (1) from scratch owing to: ... Also, vaguely related: it may be worth considering org-wide project boards (https://github.com/orgs/iterative/projects) for managing epics as well as cross-repo issues (e.g. iterative/dvc.org#765 and iterative/example-versioning#5) |
@casperdcl good questions but let's start with the major one:
Let's imagine you are jumping to a hyperparameter tuning stage. You need to run a few experiments. You don't know in advance how many experiments are needed. Usually it takes 10-20, but it might easily take 50-100. Questions:
|
Ah I think we were both not using accurate language :) You do indeed want to commit results in some form (metrics for each experiment/summary of metrics/metadata to allow easy reproduction of experiments - which could just be the looping script). You don't necessarily want to commit runs (saved models, generated tweaked source code). And when I said commit separately I should've just said commit. (Separately implies multiple commits, which isn't necessary unless you want to save each model and its outputs... which may actually still be useful. 1. Run multiple experiments 2. Save each in a separate branch commit 3. Collate metrics and use them to delete most branches. No clear advantage of this over multiple dirs. Maybe if you want to save the 2 best models on two different branches which will then fork?) I think the rest of my comment dealt with the multi-dir, single-commit approach anyway (which as I understand is what you also intended). |
Yeah :) Sorry, I put the description in a very abstract form so as not to push toward any particular solution. This abstract form leaves a lot of room for different interpretations, which is probably the root cause of the misunderstanding. To be clear, I don't see any other solutions besides dirs yet, but it would be great if we can consider other options. I definitely want to give the ability to commit the results (both metrics as well as runs) but not necessarily all the results (up to the user).
👍 |
Doesn't sound clean to me. It would be a very messy commit, and if the experiments involve code changes it would be way easier to have a commit for each; this way you can ... Additionally, if we have a git commit for each experiment we want to save, then it would be very easy to save associated artifacts too.
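In that commit-per-experiment flow, a single experiment would boil down to something like the following (the branch name, stage file, and the parameter being changed are placeholders, not a prescribed workflow):

```bash
# One experiment = one branch with one commit recording code, params, and artifacts.
git checkout -b exp/lr-0.01
# ...edit the hyperparameter, then re-run the pipeline and record everything:
dvc repro some_stage.dvc
git add -A
git commit -m "Experiment: lr=0.01"
git checkout master   # back to the mainline before starting the next experiment
```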
We might simply
Not necessarily: if we make a dir copy for each experiment, then that would be a different dvc repo, and we won't need any parallel processing for a single repo. |
What worries me the most is the weight of the project. If we decide to go with the dir approach, we need to either make a repo copy for each experiment or somehow link/use dvc-controlled artifacts from the original repo. I think that a copy is fine for a first version, but later we need to come up with something that would not duplicate our artifacts. That would probably align with parallel execution plans too. |
About the whole dir copy thing... My main concern is this all seems very language-, code layout-, and OS-specific and best left to the user to figure out. I think it would be helpful if we gave a concrete example of how ... I feel like trying to create an app to automate this process in generic scenarios is a bit like trying to create an app to help people use a computer. Sounds more like a manual/course than a product. |
@Suor you are right, but ideally, it should be a user's choice - some folks are very against 50 commits and it would be great to provide some options to avoid this (if we can :) ). In the dir-per-experiment paradigm, all the experiments might be easily saved in a single commit with all the artifacts (changed files and outputs) since they are separated. What do you think about this approach? ADDED:
Yeap. An additional, experiment-specific option might be helpful, like
First, it looks like we have slightly different opinions regarding implementation. I assume that we copy all the artifacts into an experiment dir, which gives us the ability to commit experiments (one by one or in bulk), while you assume that we clone the repo into a dir. We can discuss the pros and cons of these methods; I won't be surprised if we find more options. Thus, it depends on the implementation. If it is a separate repo as a dir, then we cannot commit it in the main repo - in this case you are right: the separate commits mentioned above will be required. If we run in a separate dir with no cloning (just copying and instantiating data artifacts), then parallel-run support might be required. |
@pared you are right. I don't think we can afford to make a copy of data artifacts. So, there is only one option - the most complicated one, unfortunately. |
Exactly. Notebook is kind of a specific language. I'd suggest building a language-agnostic version first based on config files or code file changes - copy all code in a dir, instantiate all the data files and run an experiment. Later we can introduce something more language/Notebook specific.
Totally! We definitely need an example. This issue was created to initiate the discussion and collect the initial set of requirements. But the development process of MVP should be example-driven.
I see this as an attempt to help users use one of the "best practices" - save all the experiments (in dirs :) ) and compare the results. |
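A rough sketch of that manual, language-agnostic "experiment per dir" practice (the directory names, params file, and stage file are illustrative; none of this is an existing DVC command):

```bash
# Hypothetical "one dir per experiment" flow inside an existing project.
mkdir -p experiments/exp-001
cp -r src params.yaml some_stage.dvc experiments/exp-001/   # copy code, config, and stage files
cd experiments/exp-001
dvc checkout some_stage.dvc   # instantiate the data artifacts from the shared cache
# tweak a hyperparameter in the copied config, then re-run:
dvc repro some_stage.dvc
```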
What about trying to automatically generate a Git submodule for experiments?
1. Somehow mark code or data files as "under experimentation".
2. Watch those files and make a commit every time they're written (similar to IPython Notebook checkpoints).
3. Tell DVC to stop watching this experiment.
And do we have any ideas on what the interface would look like? Another command, a separate tool, a UI?
If this is the case, perhaps a file-linking system or UI that shows the user a growing set of virtual dirs simultaneously, one per experiment - based either on the single-commit, multiple-dir strategy, the Git submodule, or something else. |
@dmpetrov Do you think that we could restrict (at least in the beginning) the experiments feature to systems where linking is possible? That would eliminate the risk of experiments eating up disk space. Also, in that case the implementation does not seem too hard: we would just need to create a repo with a default *-link cache type and point the cache to the |
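Both knobs mentioned here already exist in dvc config; a sketch of what that per-experiment setup could look like (the experiments/exp-001 path and the idea of initializing a separate repo there are assumptions from this thread, not current behaviour):

```bash
# Inside a hypothetical per-experiment workspace that is its own DVC repo:
cd experiments/exp-001
git init && dvc init
# Prefer links so checked-out data is not copied:
dvc config cache.type "reflink,hardlink,symlink"
# Point at the original project's cache so artifacts are shared, not duplicated
# (the path is relative to this repo's .dvc/ directory):
dvc config cache.dir ../../../.dvc/cache
```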
@jorgeorpinel Maybe I didn't get the idea, but a Git submodule means a separate repo. So we end up having 50 Git repositories instead of 50 commits. It looks like an even heavier approach than what we currently have.
Initially, the command-line one. I see that as part of repro. Line |
No, just one submodule with a single copy of the source code, and 50 commits in it. Although now that I think about it, it's similar to just making a branch, and the latter is probably easier... |
@pared No restriction is needed. We should use the link type that was specified by the user. My point is: we cannot create a copy if the user prefers reflinks. Also, I don't think we need to create any repo. Experiments should work in an existing repo. |
@dmpetrov I agree that experiments should work in an existing repo. What I had in mind by "creating the new repo" was that we would store each experiment as a "copy" of the current repo in some special directory, like |
@pared yes, it is very likely we will need to store a copy of the current repo. It might be directly in the project root dir |
By the way, I have not seen anyone mention MLflow. I haven't tried it myself yet, but the description promises to manage the ML lifecycle, including experimentation and reproducibility. How did they solve this issue? Any chance to just integrate with/build on top of that or some other similar tool? Or an API for integrating third-party ML lifecycle tools? |
I thought of those dirs as copies of a git/dvc repo. So if you commit its state, probably to a branch or a separate commit, you can access all the artifacts easily. It will work with gc seamlessly and so on. A copy of a dvc repo also retains all the functionality; you may ... Do you suggest a copy of everything in a subdir, but still being the same git/dvc repo? And then committing the whole structure? Not sure how this will work, but I haven't thought about that much. And yes, if that is the same repo, you most probably need parallelized runs. The thing is, with subdirs in a single repo we can't refer to different versions of an artifact by changing ... Also, how do you mainline some experiment then? Do we need some specific dvc command for that?
Checking out artifacts is an issue both implementations have. We can simply check out artifacts for a new copy if we use fast links. But if we use copies, we might want to make some lightweight links to already checked-out copies in the original dir. This could be ignored, or at least wait for a while, though. We have
I don't see any advantage of a git submodule over a simple clone. Why should we complicate this?
I see that a basic building block is creating a dir copy (a clone or just a copy) and checking out artifacts there. Maybe
Or maybe it's ok to bundle it from the start, like @dmitry envisions. Not sure:

```
dvc experiment <experiment-name> some_stage.dvc
# or
dvc exp <experiment-name> some_stage.dvc
# or even
dvc try <experiment-name> some_stage.dvc
```

We will need commands to manage all these, probably. If these are just dirs then we can commit everything as is, which is a plus. But we will still need something to diff, compare metrics, mainline an experiment. Since these are just dirs (and clones are mostly dirs too) we get some of these for free, which I like a lot:
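For instance, plain filesystem tools plus existing DVC commands already cover part of this (a sketch; the experiments/exp-00N dirs and the params/metrics file names are assumptions):

```bash
# Hypothetical layout: experiments/exp-001 and experiments/exp-002 are plain dirs.
# Compare the hyperparameters that were changed:
diff -u experiments/exp-001/params.yaml experiments/exp-002/params.yaml
# Show all metrics tracked anywhere in the workspace (covers both copies):
dvc metrics show
# Promote ("mainline") an experiment by copying its files back, then commit:
cp -r experiments/exp-002/* . && git add -A && git commit -m "Mainline exp-002"
```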
|
Regarding the MLflow idea - it looks like an augmented conda env.yml file which supports tracking input CLI params, and you need to use their Python API for logging outputs/results/metrics. They do have a nice web UI for visualising said logs, though. |
The thing is, if you clone a Git repo inside a Git repo and add it, Git just ignores the inner repo's contents. I think it stages a dummy empty file with the name of the embedded repo's dir. So we may be forced to use submodules, depending on the specific needs. Here's Git's output when you clone a repo inside a repo and stage it:
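The quoted Git output didn't survive the thread, but the behaviour is easy to reproduce (the repo URL and paths are only for illustration); Git stages a "gitlink" pointer to the inner repo's commit rather than its files:

```bash
# Reproducing the scenario: an inner clone staged inside an outer repo.
git init outer && cd outer
git clone https://github.com/iterative/example-versioning inner
git add inner   # Git warns about an embedded repository and stages only a
                # gitlink entry pointing at inner's current commit, not its contents.

# The submodule alternative records the same pointer plus a .gitmodules entry:
# git submodule add https://github.com/iterative/example-versioning inner
```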
|
@Suor it looks like we are on the same page with that.
Right. Yes, I think we should consider this subdir-in-the-same-repo option. This allows a user to commit many subdirs in a single commit or just remove subdirs using a regular
If you copy a whole structure and change paths in dvc-files, it should not be an issue, except in the cases when a whole path was used, like
🤷♂️ The option I like the most so far:
I don't think we need to invent something new here. We should use the same data file linking strategy as specified in the repo. From the file management point of view, the experiment subdirs play the same role as branches and commits and should use the same strategy.
Yeah. I'd prefer to create and execute an experiment as a single, simple command. No matter if it is
Exactly! We will get a lot of stuff for free. More than that - it should align well with data scientists' intuition of creating dirs for experiments. The last command ( |
@alexvoronov the integration itself is a good idea. Unfortunately, DVC experiments cannot be built on top of MLflow because MLflow has a different purpose and focuses on metrics visualization. But the visualization part can be nicely implemented on top of existing solutions. There are a few more MLflow analogs: Weights & Biases, Comet ML and others. It would be great to create a unified integration with these tools. @casperdcl brought up a good point about conda env.yml. It might be another integration. We should definitely keep the UI and visualization in mind, but I would not start with that. |
@dmpetrov how would this work? What I have in mind is:
What I don't like about incorporating experiments into ... What if I want to prepare a few experiment "drafts" by editing my repo, and then, at the end of the day, just
The first two could be joined into one with some flag like ... I want to get back to creating the experiment directory:
I think it should not be in the project root dir.
If experiments will be in a dedicated directory (
|
@pared first of all let me make the disclaimer that I have not followed this discussion very carefully and I am not sure that I understand all the ideas presented here. So, it is quite possible that I don't know what I am talking about.
Using a command like
If these experiments are going to be managed transparently (meaning that the users only use |
@dashohoxha
So in few words |
So, basically you want to clone all the data and DVC-files to an experiment directory, which can use the same cache (... It is not clear whether you modify the pipeline (or the parameters) of an experiment before you create it or after you create it, and how you are going to do it. [By the way, |
Closed since experiments functionality has been implemented internally (but not released/finalized). At this point, it will be more useful for specific experiments related issues to be handled in their own tickets. UPDATE: See https://github.com/iterative/dvc/wiki/Experiments |
Hi! Question on this @pmrowla, did |
|
Sounds good. I assume then there's no way to explore/pick run-caches and "promote" them into registered experiments as of now. I.e. they're basically separate features. |
Yeah, they are really two separate features. |
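For anyone landing here later, a rough sketch of how the experiments feature surfaces in current DVC releases (the parameter name and experiment id are placeholders; check dvc exp --help for the exact flags in your version):

```bash
# Run an experiment with a parameter override, tracked outside regular Git history
dvc exp run --set-param train.lr=0.01
# Compare experiments and their metrics/params in a table
dvc exp show
# Promote the chosen experiment into the workspace, or turn it into a branch
dvc exp apply exp-abc12
dvc exp branch exp-abc12 lr-0.01
```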
UPDATE: Skip to #2799 (comment) for a summary and updated requirements, and #2799 (comment) for the beginning of the implementation discussion.
Problem
There are a lot of discussions on how to manage ML experiments with DVC. Today's DVC design allows ML experiments through Git-based primitives such as commits and branches. This works nicely for large ML experiments where code writing and testing are required. However, this model is too heavy for the hyperparameter tuning stage, when the user makes dozens of small, one-line changes in config or code. Users don't want to have dozens of Git commits or branches.
Requirements
A lightweight abstraction needs to be created in DVC to support tiny, hyperparameter-style experiments without Git commits. The hyperparameter tuning stage can be considered a separate user activity outside of the Git workflow. But the result of this activity still needs to be managed by Git, preferably in a single commit.
High-level requirements for the hyperparameter tuning stage:
dvc gc might be needed.
What should NOT be covered by this feature?
This feature is NOT about hyperparameter grid-search. In most cases, hyperparameter tuning is done by users manually, using "smart" assumptions and hypotheses about the hyperparameter space. Grid-search can be implemented on top of this feature/command using bash, for example.
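To illustrate, such a grid-search wrapper could be as small as the sketch below, where dvc experiment is one of the hypothetical command spellings discussed in this thread (not an existing command) and the parameter names are placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical grid search on top of a per-experiment command.
for lr in 0.001 0.01 0.1; do
  for depth in 4 8; do
    # Edit hyperparameters in place (GNU sed syntax; params.yaml is illustrative):
    sed -i "s/^lr:.*/lr: ${lr}/; s/^max_depth:.*/max_depth: ${depth}/" params.yaml
    dvc experiment "lr${lr}-depth${depth}" some_stage.dvc
  done
done
```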
bash might also be a requirement for this feature request.
Possible implementations
This is an open question, but many data scientists create a directory for each of their experiments. In some cases, people create directories for a group of experiments, with the individual experiments inside. We can use some of these ideas/practices to better align with users' experience and intuition.
Actions
This is a high-level feature request (epic). The requirements and an initial design need to be discussed and more feature requests need to be created. @iterative/engineering please share your feedback. Is something missing here?
EDITED:
Related issues
#2379
#2532
#1018 can be relevant (?)
Discussion