Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Epic: exp queue CLI #7592

Closed
4 tasks done
pmrowla opened this issue Apr 19, 2022 · 23 comments
Closed
4 tasks done

Epic: exp queue CLI #7592

pmrowla opened this issue Apr 19, 2022 · 23 comments
Assignees
Labels
A: experiments Related to dvc exp

Comments

@pmrowla
Copy link
Contributor

pmrowla commented Apr 19, 2022

Meta-issue for initial planned exp queue issues. (This does not cover "nice to haves" as outlined in the notion proposal)

dvc-task/exp backend tasks:

exp queue CLI tasks:

@pmrowla pmrowla added the A: experiments Related to dvc exp label Apr 19, 2022
@pmrowla pmrowla added this to DVC Apr 19, 2022
@pmrowla pmrowla moved this to Backlog in DVC Apr 19, 2022
@pmrowla pmrowla moved this from Backlog to Todo in DVC Apr 19, 2022
@daavoo
Copy link
Contributor

daavoo commented Apr 19, 2022

Should we close #5615 in favor of this one?

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 19, 2022

I think we can leave it open for now, this is more just tasking for the current planned work, but the original discussion contains some more features that are currently unplanned/out of scope

@pmrowla pmrowla moved this from Todo to In Progress in DVC Apr 29, 2022
@pmrowla
Copy link
Contributor Author

pmrowla commented May 18, 2022

@dberenbaum queue start/stop/remove/kill are all available in the dev branch for testing now. The status commands and stop --kill will be added in follow-up PRs.

@karajan1001 karajan1001 moved this from In Progress to Review In Progress in DVC Jun 7, 2022
@dberenbaum
Copy link
Collaborator

This is looking good now! I wanted to summarize some of the questions/issues I still have:

  1. What to do with existing commands and flags? Specifically, dvc exp run --run-all and dvc exp remove --queue?
  2. How can we check if the queue is active/running/processing?
  3. Should we show both queued and failed experiments in dvc exp show and have flags to show/hide them?
  4. How should queued experiments be named and should they match the final experiment name?

@dberenbaum
Copy link
Collaborator

I would say the first two are critical to resolve before making it public.

Also, there's one more general bug I have seen: sometimes the queue does not reflect the order in which experiments were added. Let me know if you need more info.

@pmrowla
Copy link
Contributor Author

pmrowla commented Jun 10, 2022

  1. What to do with existing commands and flags? Specifically, dvc exp run --run-all and dvc exp remove --queue?

I think the existing commands should essentially be aliases to the new behavior, and formally deprecated, so when the user runs them we would output something like

dvc exp run --run-all is deprecated and will be removed in a future DVC release. Please use dvc queue start instead.

For the aliased behavior:

  • exp run --temp should just be exp run --queue + queue start + queue logs <exp>
  • exp run --run-all should just be queue start. I don't think we should try to duplicate the existing log behavior for --run-all since it wasn't particularly useful anyways (instead we can just output a message saying use queue logs to view logs for a specific experiment)
  • exp remove --queue should be something like queue remove --all
    • We still need to add an --all flag for queue remove, but since we also show finished experiments in queue status now we need distinct flags for removing "all queued experiments" vs "all failed/successful experiments" vs "all queued+completed experiments" (see further comments below)
  1. How can we check if the queue is active/running/processing?

For now I think we can just add this to queue status (start the output with line(s) listing the number of running workers). Eventually it will probably be worth it to have a queue ps or queue proc command for getting details about each worker, but for now status should at least indicate whether or not the queue worker is actively running.

@pmrowla
Copy link
Contributor Author

pmrowla commented Jun 10, 2022

Also, there's one more general bug I have seen: sometimes the queue does not reflect the order in which experiments were added. Let me know if you need more info.

This is an underlying celery issue, the basic filesystem broker which we use doesn't absolutely guarantee that tasks will be executed in FIFO order (but in most cases it will end up running things in FIFO order). This is something we can fix upstream, but I think we should get any other issues on the DVC side sorted out first.

@pmrowla
Copy link
Contributor Author

pmrowla commented Jun 10, 2022

  1. Should we show both queued and failed experiments in dvc exp show and have flags to show/hide them?

This is also related to the queue remove behavior question (from #7855 (comment)):

Will this also include remove support for successful experiments?

There's a couple questions that need to be clarified here:

  • do we need a method for removing a successful exp so it doesn't show up in queue status but keeping the actual experiment data (so it stays visible in exp show)?
  • or should queue remove <successful exp> also imply exp remove to remove the exp data (and vice versa)

Basically we need to define what exactly is the difference between exp remove and queue remove.


On the queue side of things, removing a failed or successful exp is meant to be used when the logs are no longer needed (it's a cleanup step). This is separate from removing the experiment git/dvc data (i.e. exp remove).

So maybe we should just have a dvc queue clean command that is specifically for cleaning up logs/status entries for finished experiments?

This way we could have:

  • exp remove: only works on valid (successful) experiments, removes git + dvc data for that experiment
  • queue remove: only works on queued (not yet run) experiments, removes queue entry
  • queue clean: removes/prunes status entries + log data for finished experiments (may need flags to clean only successful or only failed exps?)

In the queue status UI we could generate two separate tables for "queued/running" and "finished" status information to make it more obvious why queue remove only applies to the "queued" exps, and why clean only applies to the "finished" exps.

And in exp show we could display only successful exps by default, and have flags for showing queued/failed exps to make it more obvious why exp remove only applies to successful exps.

@pmrowla
Copy link
Contributor Author

pmrowla commented Jun 10, 2022

  1. How should queued experiments be named and should they match the final experiment name?

This is also related to the bug where generated final exp names for checkpoint exps can't be used with queue kill and queue remove (the queued exp isn't associated with the separate generated exp name).

We probably need to just revisit exp automatic naming, the simplest solution here would be to just generate random exp names when they are queued instead of using the current method of naming exps based on the pipeline result. However, this means we would lose the "prevent duplicate exps" behavior which we have now (but in previous discussions it's been raised that this may not be desirable behavior in the first place).

Basically, if a user runs exp run twice with the same set of parameters, is it ok for us to generate two separate experiments in git (even though they have the same result)? On the DVC side we would still be using run-cache, so the subsequent runs wouldn't be fully reproduced/duplicated. But exp show/queue status/etc would all show two entries.

If having duplicate exps is acceptable then we can just auto generate names right away, so queued exp names will always match the final ones.

@karajan1001
Copy link
Contributor

  • exp remove --queue should be something like queue remove --all

    • We still need to add an --all flag for queue remove, but since we also show finished experiments in queue status now we need distinct flags for removing "all queued experiments" vs "all failed/successful experiments" vs "all queued+completed experiments" (see further comments below)

This one had ready been ready, it calls clear in celery_queue

@dberenbaum
Copy link
Collaborator

For the aliased behavior:

* `exp run --temp` should just be `exp run --queue` + `queue start` + `queue logs <exp>`

What if there are other experiments in the queue?

* `exp run --run-all` should just be `queue start`. I don't think we should try to duplicate the existing log behavior for `--run-all` since it wasn't particularly useful anyways (instead we can just output a message saying use `queue logs` to view logs for a specific experiment)

It could be unexpected if users are running jobs where they assume exp run --run-all will block until the experiments finish.

* `exp remove --queue` should be something like `queue remove --all`

👍

We don't claim that dvc exp is stable in 2.x, so there are no hard requirements to maintain existing behavior, but obviously we want to avoid breaking changes if feasible.

@dberenbaum
Copy link
Collaborator

For now I think we can just add this to queue status (start the output with line(s) listing the number of running workers).

👍

@dberenbaum
Copy link
Collaborator

This way we could have:

* `exp remove`: only works on valid (successful) experiments, removes git + dvc data for that experiment

* `queue remove`: only works on queued (not yet run) experiments, removes queue entry

* `queue clean`: removes/prunes `status` entries + log data for finished experiments (may need flags to clean only successful or only failed exps?)

In the queue status UI we could generate two separate tables for "queued/running" and "finished" status information to make it more obvious why queue remove only applies to the "queued" exps, and why clean only applies to the "finished" exps.

Do we need queue clean and two separate tables in queue status? Why not just have queue remove do what you suggested for queue clean? There just won't be any logs for experiments still in the queue. Can we consolidate it into queue remove [--queued] [--failed] [--succeeded] [--all]?

And in exp show we could display only successful exps by default, and have flags for showing queued/failed exps to make it more obvious why exp remove only applies to successful exps.

👍

@dberenbaum
Copy link
Collaborator

4. How should queued experiments be named and should they match the final experiment name?

Having separate queue and exp names actually makes much more sense now that the queue is more distinct, and it might even be helpful to keep them separate given the discussion above about the differences between queued and completed experiments. I think it's fine to keep it for now if we clean up all the queue UI so that it clearly references queue IDs and tasks rather than experiments or revs.

We probably need to just revisit exp automatic naming, the simplest solution here would be to just generate random exp names when they are queued instead of using the current method of naming exps based on the pipeline result. However, this means we would lose the "prevent duplicate exps" behavior which we have now (but in previous discussions it's been raised that this may not be desirable behavior in the first place).

Basically, if a user runs exp run twice with the same set of parameters, is it ok for us to generate two separate experiments in git (even though they have the same result)? On the DVC side we would still be using run-cache, so the subsequent runs wouldn't be fully reproduced/duplicated. But exp show/queue status/etc would all show two entries.

Let's discuss in #7879 since I think it's out of scope for this epic.

@pmrowla
Copy link
Contributor Author

pmrowla commented Jun 13, 2022

From planning discussion:

  • dvc queue logs should support following output from multiple experiments
  • exp run --temp should match current behavior (to run one-off background experiments) and will not be considered part of the new "queue" (queue status/logs command will not work with --temp)
  • autogenerated checkpoint exp names not working with queue commands will be filed as separate issue but not addressed right away

@dberenbaum
Copy link
Collaborator

dberenbaum commented Jun 13, 2022

  • dvc queue logs should support following output from multiple experiments

To clarify, I don't think dvc queue logs needs to accept multiple experiment names. I think --run-all should automatically follow all experiments in the queue. A nice-to-have would be for dvc queue logs to do the same if no experiments are provided.

  • autogenerated checkpoint exp names not working with queue commands will be filed as separate issue but not addressed right away

#7889

Other items:

  • dvc exp remove --queue should act as an alias of queue commands with a deprecation warning.
  • dvc exp show should only include successfully completed experiments by default but should have options to include queued/failed items from the queue. (edit: exp show --queued/--failed)

Finally, are there any docs issues open for dvc queue yet?

@dberenbaum
Copy link
Collaborator

Once dvc exp run --temp/--run-all are working as expected, I think we could merge to main and delete the dvc-task-dev branch. Any concerns?

@karajan1001
Copy link
Contributor

dvc queue logs should support following output from multiple experiments

To clarify, I don't think dvc queue logs needs to accept multiple experiment names. I think --run-all should automatically follow all experiments in the queue. A nice-to-have would be for dvc queue logs to do the same if no experiments are provided.

If our support for the multi-worker is only testing, following multi-output simultaneously might not need for now. Here the requirement is that we automatically follow the currently running experiment.

exp run --temp should match current behavior (to run one-off background experiments) and will not be considered part of the new "queue" (queue status/logs command will not work with --temp)

So for --temp we will run something in the background not put it into the queue. And for --queue disregard whether --temp is provided, will be put into the celery queue for future operation?

Finally, are there any docs issues open for dvc queue yet?

I don't remember there was any of this.

@pmrowla
Copy link
Contributor Author

pmrowla commented Jun 14, 2022

So for --temp we will run something in the background not put it into the queue. And for --queue disregard whether --temp is provided, will be put into the celery queue for future operation?

Yes, --queue and --temp will be mutually exclusive now. --temp will be used for standalone background runs and --queue will put the experiment into the main celery queue

@pmrowla
Copy link
Contributor Author

pmrowla commented Jun 14, 2022

docs issue: iterative/dvc.org#3658

@karajan1001
Copy link
Contributor

@dberenbaum @pmrowla
During my usage, I found another problem. The dvc queue move (--all) didn't work on failed/succeed tasks. The user experience is quite split here. That queue status showed something but when we dvc queue move it. it says No tasks found in queue named ['xx']. Beside dvc queue move --all didn't affect them. It's a user experience problem, just like what in dvc exp show.

@dberenbaum
Copy link
Collaborator

@karajan1001 Do you mean dvc queue remove? If so, good catch. We still need options for remove --failed, remove --success/succeeded, and remove --queued. Then remove --all should combine those.

@jorgeorpinel
Copy link
Contributor

Hi. It would be great to have a summary of the changes to existing commands and options (e.g. exp run --temp and --queue becoming mutually exclusive now). Added a task about that to iterative/dvc.org#3658. Thanks

karajan1001 added a commit to karajan1001/dvc that referenced this issue Jun 30, 2022
Followed from iterative#7592

1. Seperate remove method to a new file.
2. Add queued, failed, processed flags to remove.
3. Add new unit tests for queue status --queue/fail/success
4. Implement methods to remove done tasks.
5. add some new unit and functional test for celery_queue.remove
6. bump into dvc-task version 0.0.13
pmrowla pushed a commit that referenced this issue Jul 5, 2022
Followed from #7592

1. Seperate remove method to a new file.
2. Add queued, failed, processed flags to remove.
3. Add new unit tests for queue status --queue/fail/success
4. Implement methods to remove done tasks.
5. add some new unit and functional test for celery_queue.remove
6. bump into dvc-task version 0.0.13
pmrowla pushed a commit that referenced this issue Jul 6, 2022
Followed from #7592

1. Seperate remove method to a new file.
2. Add queued, failed, processed flags to remove.
3. Add new unit tests for queue status --queue/fail/success
4. Implement methods to remove done tasks.
5. add some new unit and functional test for celery_queue.remove
6. bump into dvc-task version 0.0.13
pmrowla pushed a commit that referenced this issue Jul 11, 2022
Followed from #7592

1. Seperate remove method to a new file.
2. Add queued, failed, processed flags to remove.
3. Add new unit tests for queue status --queue/fail/success
4. Implement methods to remove done tasks.
5. add some new unit and functional test for celery_queue.remove
6. bump into dvc-task version 0.0.13
pmrowla pushed a commit that referenced this issue Jul 12, 2022
Followed from #7592

1. Seperate remove method to a new file.
2. Add queued, failed, processed flags to remove.
3. Add new unit tests for queue status --queue/fail/success
4. Implement methods to remove done tasks.
5. add some new unit and functional test for celery_queue.remove
6. bump into dvc-task version 0.0.13
@pmrowla pmrowla closed this as completed Jul 12, 2022
Repository owner moved this from Review In Progress to Done in DVC Jul 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: experiments Related to dvc exp
Projects
No open projects
Archived in project
Development

No branches or pull requests

5 participants