Add experimental queue experiments from csv command #1120

Merged
mattseddon merged 4 commits into master from add-queue-experiments-from-csv
Dec 6, 2021

Conversation

mattseddon
Member

@mattseddon mattseddon commented Dec 3, 2021

1/5 master <- this <- #1122 <- #1124 <- #1123 <- #1125

From getting demos ready and going through the pain of manually queuing experiments, I assume that someone working with the extension would want a quick interface for queuing multiple experiments. I considered using a quick pick and other formats to get the data in, but the time-tested CSV seemed to be the simplest way to start a conversation.

Demo

Screen.Recording.2021-12-03.at.3.15.51.pm.mov

@mattseddon mattseddon added the product PR that affects product label Dec 3, 2021
@mattseddon mattseddon self-assigned this Dec 3, 2021
@mattseddon mattseddon force-pushed the add-queue-experiments-from-csv branch from 832c39c to 01ed540 Compare December 3, 2021 04:28
@sroy3
Contributor

sroy3 commented Dec 3, 2021

This looks like a very cool feature. CSV is not only simple, but a really fast way of entering values and editing. I think I'll use this often in the future.

@shcheklein
Member

Cool stuff, @mattseddon! This is one of the pain points ML engineers and data scientists would have. I think it's fine to merge this; eventually we'll need to have some UI/UX way alongside this to run batch experiments. The regular things are grid search (right from the table you would specify start, end, step for a param or set of params and it would generate a "grid" of experiments), random search, etc. We'll need to support those right from the table.
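
To make the grid search idea concrete, here is a minimal sketch of expanding start/end/step ranges into a "grid" of experiments. The `ParamRange` shape and the `expandRange` and `buildGrid` helpers are hypothetical names used for illustration only; nothing like this exists in the extension or in DVC as far as this thread shows.

```typescript
// Hypothetical shape: a numeric range for a single parameter.
interface ParamRange {
  start: number
  end: number
  step: number
}

// Expand one range into its discrete values (inclusive of `end`).
const expandRange = ({ start, end, step }: ParamRange): number[] => {
  if (step <= 0) {
    throw new Error('step must be positive')
  }
  const values: number[] = []
  // The small epsilon guards against floating point error in the division.
  const count = Math.floor((end - start) / step + 1e-9) + 1
  for (let i = 0; i < count; i++) {
    // toPrecision trims floating point noise such as 0.30000000000000004.
    values.push(Number((start + i * step).toPrecision(12)))
  }
  return values
}

// Cartesian product of all ranges: one object of param values per experiment.
const buildGrid = (
  ranges: Record<string, ParamRange>
): Record<string, number>[] => {
  let grid: Record<string, number>[] = [{}]
  for (const paramName of Object.keys(ranges)) {
    const next: Record<string, number>[] = []
    for (const experiment of grid) {
      for (const value of expandRange(ranges[paramName])) {
        next.push({ ...experiment, [paramName]: value })
      }
    }
    grid = next
  }
  return grid
}

// Example: 3 values of lr x 2 values of weight_decay.
const grid = buildGrid({
  lr: { start: 0.0001, end: 0.0005, step: 0.0002 },
  weight_decay: { start: 0.01, end: 0.02, step: 0.01 }
})
```

Each entry in `grid` would correspond to one queued experiment. The number of entries grows multiplicatively with every added parameter, which is one reason grid search gets expensive quickly.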

CSV/TSV/JSON support sounds great anyway for me. I think that maybe even DVC should support this. cc @dberenbaum @daavoo, what do you think?

@daavoo
Contributor

daavoo commented Dec 3, 2021

CSV/TSV/JSON support sounds great anyway for me. I think that maybe even DVC should support this. cc @dberenbaum @daavoo, what do you think?

For DVC, I have mixed feelings.

On the one hand, it could make sense as a language-agnostic format for defining experiments.

On the other hand, without a proper UI to fill in the file (unlike VS Code, which could build a UI on top), I don't see how this format would be more user-friendly than having an Experiments API in Python or even writing a bash/python/X script (similar to https://dvc.org/blog/hyperparam-tuning).


A little bit off-topic (maybe should be moved to a separate discussion):

There are quite a few methods for hyperparameter tuning which are more efficient (and effective) than Grid/Manual/Random search. If people are not using those, it's because they lack a happy path (i.e. existing libraries might be hard to set up, or it is tedious to adapt the training code to work with them).

That's why many are working on building UI/UX for hyperparameter tuning and/or integrations with other tools. These are 2 examples of Bayesian Search: https://docs.wandb.ai/guides/sweeps/quickstart and https://docs.valohai.com/howto/tasks/bayesian/

I would carefully consider how much time to invest in building tools around Grid/Manual/Random search.

In my opinion (I might be biased from DVCLive perspective), the happy path for hyperparameter tuning in DVC should focus on providing a Python API to queue/launch experiments and work on integrations with known libraries (iterative/dvclive#118).

})
})

export const waitForLock = async (cwd: string): Promise<void> => {
Member Author

[F] This is very much a temporary solution to the complicated problem of every queued experiment firing a data update. We could implement something along the lines of #948 (comment) but we would need to:

  1. add some kind of mechanism that sends all events to the data update queue
  2. queue the experiment(s)
  3. stop sending data update events to the queue
  4. run the queue
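
One possible reading of the four steps above, sketched under assumptions: events are diverted into a holding queue while experiments are being queued, then the held events are run once. `DataUpdateQueue`, `UpdateHandler` and `queueAll` are hypothetical names for illustration. This is not the `waitForLock` approach in this PR and not the design from #948.

```typescript
// Hypothetical handler that refreshes data for a given event.
type UpdateHandler = (event: string) => void

class DataUpdateQueue {
  private collecting = false
  private readonly queued: string[] = []
  private readonly handler: UpdateHandler

  constructor(handler: UpdateHandler) {
    this.handler = handler
  }

  // Step 1: start sending all events to the queue instead of handling them.
  public startCollecting(): void {
    this.collecting = true
  }

  // Step 3: stop sending events to the queue.
  public stopCollecting(): void {
    this.collecting = false
  }

  public notify(event: string): void {
    if (!this.collecting) {
      this.handler(event)
      return
    }
    // Duplicate events are collapsed so each update runs only once.
    if (this.queued.indexOf(event) === -1) {
      this.queued.push(event)
    }
  }

  // Step 4: run everything that was held back, then empty the queue.
  public run(): void {
    const events = this.queued.splice(0, this.queued.length)
    events.forEach(event => this.handler(event))
  }
}

// Step 2 sits between steps 1 and 3: queue each experiment in turn.
const queueAll = async (
  updates: DataUpdateQueue,
  queueOne: (args: string[]) => Promise<void>,
  experiments: string[][]
): Promise<void> => {
  updates.startCollecting()
  try {
    for (const args of experiments) {
      await queueOne(args)
    }
  } finally {
    updates.stopCollecting()
    updates.run()
  }
}
```

The `finally` block is a design choice in this sketch: held updates are still replayed if queuing one of the experiments fails, so the UI would not be left stale.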

lr,weight_decay
0.0001,0.02
0.00075,0.01
0.0005,
Member Author

[F] I have committed this for an integration test, we can also use it to set up demos.
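
As a rough illustration of how a file like this could map onto queued experiments, the sketch below turns each row into `-S param=value` arguments that could be appended to `dvc exp run --queue`. The `csvToParamOverrides` helper is a hypothetical name. Skipping empty cells (as in the last row) is an assumption and may not match how the command in this PR treats them. The parsing is deliberately naive and does not handle quoted fields.

```typescript
// Parse CSV text (header row of param names, one experiment per row) into
// lists of `-S name=value` arguments, one list per experiment.
const csvToParamOverrides = (csv: string): string[][] => {
  const [header, ...rows] = csv
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(line => line.length > 0)

  if (!header) {
    return []
  }

  const paramNames = header.split(',').map(cell => cell.trim())

  return rows.map(row => {
    const cells = row.split(',')
    const args: string[] = []
    paramNames.forEach((paramName, i) => {
      const value = (cells[i] || '').trim()
      // Assumption: an empty cell means "leave this param unchanged".
      if (value !== '') {
        args.push('-S', `${paramName}=${value}`)
      }
    })
    return args
  })
}

const fixture = 'lr,weight_decay\n0.0001,0.02\n0.00075,0.01\n0.0005,\n'
const overrides = csvToParamOverrides(fixture)
```

With the fixture above, the third row only overrides `lr`, so its argument list is `['-S', 'lr=0.0005']`.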

@dberenbaum
Contributor

Nice!

CSV/TSV/JSON support sounds great anyway for me.

I have no problem with this as a quick way to test out queuing lots of experiments (and I might find it useful for demos and stuff), but I agree it might not add much for users in DVC. It's pretty easy for users to do this themselves in whatever format they prefer, and I'm not sure how to integrate it into the CLI.

If I am coming up with new ideas on the fly, I'm not sure adding them to CSV saves enough time over adding them directly to the queue (and some of them might not be parameter changes). If I have lots of combinations of parameters to try (like in hyperparameter tuning), I'm more likely to define some criteria programmatically than write manually in a CSV.

There's a related issue in iterative/dvc#5615, which has a lot of 👍. I think adding more options to manage the queue is likely to be more useful than adding experiments from a CSV (other than for internal testing like it's being used here).

eventually we'll need to have some UI/UX way alongside this to run batch experiments. The regular things are grid search (right from the table you would specify start, end, step for a param or set of params and it would generate a "grid" of experiments), random search, etc. We'll need to support those right from the table.

A little bit off-topic (maybe should be moved to a separate discussion):

Lots to discuss about how to do more complex experiment batches and hyperparameter tuning. I agree the discussion probably is better had elsewhere, but let's definitely have this discussion before we start doing anything on it.

@mattseddon mattseddon force-pushed the add-queue-experiments-from-csv branch from ca31a8d to 7acec17 Compare December 6, 2021 22:41
@codeclimate

codeclimate bot commented Dec 6, 2021

Code Climate has analyzed commit 7acec17 and detected 1 issue on this pull request.

Here's the issue category breakdown:

Category Count
Duplication 1

The test coverage on the diff in this pull request is 92.4% (85% is the threshold).

This pull request will bring the total coverage in the repository to 96.7% (0.0% change).

View more on Code Climate.

@mattseddon
Member Author

@dberenbaum @daavoo I agree with all of your points, and I agree that the entire approach needs a lot of work.

I do think it is important to reiterate that the extension is trying to hide as much complexity from the user as possible. We also want to lower the barrier to entry for data scientists who aren't as au fait with the terminal. We should keep that in mind when we come back to this.

Thank you both.

@mattseddon mattseddon merged commit 8b13a3e into master Dec 6, 2021
@mattseddon mattseddon deleted the add-queue-experiments-from-csv branch December 6, 2021 22:53
@dberenbaum
Contributor

Well now I can't leave it alone 😄

With that CLI avoidance in mind, what makes sense to me as we get more towards improved UX is to focus on how to set up a single experiment to queue, rather than how to submit a whole batch of experiments. For example, there could be a GUI that shows all the parameters, and users can edit their values and click "queue experiment" without needing to edit yaml files or run terminal commands.

As @daavoo alluded, when it comes to larger scale hyperparameter tuning, users probably can't avoid doing this programmatically since there are so many experiments to queue, the methods of defining the hyperparameter space are too varied, and frequently they can't even all be queued up front because later iterations base their values on the results of previous iterations.
