remove: output file/dir by path #2357
Changes from all commits
cdf1bde
405dcf3
63d864e
d47b688
a4921ad
895cf55
ba1bdf5
1daeb5f
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,27 +9,30 @@ optionally delete them). | |
usage: dvc remove [-h] [-q | -v] [--outs] targets [targets ...] | ||
|
||
positional arguments: | ||
targets stages (found in dvc.yaml) or .dvc files to remove. | ||
targets Tracked files/directories, stage names (found in | ||
dvc.yaml), or .dvc files to remove. | ||
``` | ||
|
||
## Description | ||
|
||
Safely removes `.dvc` files or stages from `dvc.yaml`. This includes deleting | ||
the corresponding `.gitignore` entries (based on the `outs` fields removed). | ||
Safely removes tracked data (by file name, stage name, or `.dvc` file path). | ||
This includes deleting the corresponding `.gitignore` entries. | ||
|
||
> `dvc remove` doesn't remove files from the DVC <abbr>cache</abbr> or | ||
> [remote storage](/doc/command-reference/remote). Use `dvc gc` for that. | ||
|
||
It takes one or more stage names (see `-n` option of `dvc run`) or `.dvc` file | ||
names as `targets`. | ||
It takes one or more stage names (see `-n` option of `dvc run`), `.dvc` file | ||
names or tracked files/directories as `targets`. | ||
|
||
If there are no stages left in `dvc.yaml` after the removal, then both | ||
`dvc.yaml` and `dvc.lock` are deleted. `.gitignore` is also deleted if there are | ||
no more entries left in it. | ||
|
||
Note that the actual <abbr>output</abbr> files or directories of the stage | ||
(`outs` field) are not removed by this command, unless the `--outs` option is | ||
used. | ||
Note that, when using a stage name as the target, the actual <abbr>output</abbr> files | ||
or directories of the stage (`outs` field) are not removed by this command, | ||
unless the `--outs` option is used, which will remove **all** of them. | ||
Alternatively, you can use the names of individual <abbr>output</abbr> files or
Review comment: but in this case we don't do anything to the .dvc or dvc.yaml? This mixed semantics confuses me a bit, to be honest: with some targets we touch dvc.yaml, but not data ... with some targets we touch data, but not DVC files. Also, it feels to me that even if we merge this (and the DVC core one), we'll need to get back to the whiteboard pretty soon with this.
Review comment: Overall, not sure if this has enough value to do this now at least - it seems it's pretty much a replacement to sorry, that we didn't get back to you initially in the DVC core ticket (or maybe I missed the discussion).
Review comment: I agree with the problem of mixed semantics @shcheklein. Sadly I don't think I'm in position to explain I think the original use cases were:
But maybe I misinterpreted the use cases along the implementation.
Review comment: Yeah, sorry for the lack of clarity here after you've already put in so much work @daavoo.
I just tested, and it looks like .dvc is removed with this command but dvc.yaml won't be touched. Is that right @daavoo? There might be enough straightforward utility here to merge the PR if it is limited to making The rest might be valuable, too, but it seems like the semantics need to be straightened out in iterative/dvc#5791 first.
Review comment: If you consider that we need to take a step back and put this and the dvc PR on pause, or even discard them in favor of waiting to find the right semantics, I would not mind it at all. I wanted to dive into the source code and your thoughtful development process, and I did; don't feel obligated to try to merge this just because of the work I already put in. |
||
directories of a stage as `targets`. | ||
|
||
💡 Refer to [Undo Adding Data](/doc/user-guide/how-to/stop-tracking-data) to see | ||
how it helps replace data that is tracked by DVC. | ||
|
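The `.gitignore` rule stated in the Description above (drop the entry, and delete the file once no entries remain) can be illustrated with a minimal Python sketch. This is not DVC's implementation; the helper name `drop_gitignore_entry` and its return value are assumptions made purely for illustration.

```python
import tempfile
from pathlib import Path


def drop_gitignore_entry(gitignore: Path, entry: str) -> bool:
    """Hypothetical helper (not part of DVC): remove `entry` from `gitignore`.

    Deletes the file when no non-blank lines remain, mirroring the rule
    described in the docs. Returns True if the file still exists afterwards.
    """
    if not gitignore.exists():
        return False
    kept = [
        line
        for line in gitignore.read_text().splitlines()
        if line.strip() != entry
    ]
    if any(line.strip() for line in kept):
        gitignore.write_text("\n".join(kept) + "\n")
        return True
    gitignore.unlink()  # nothing left to ignore, so remove the file itself
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        gi = Path(tmp) / ".gitignore"
        gi.write_text("/data.csv\n/model.h5\n")
        print(drop_gitignore_entry(gi, "/model.h5"))  # file kept, one entry left
        print(drop_gitignore_entry(gi, "/data.csv"))  # last entry gone, file deleted
```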
@@ -47,61 +50,127 @@ how it helps replace data that is tracked by DVC. | |
|
||
- `-v`, `--verbose` - displays detailed tracing information. | ||
|
||
## Example: remove a .dvc file | ||
## Example: remove stage outputs | ||
|
||
Let's imagine we have `foo.csv` and `bar.csv` files, that are already | ||
[tracked](/doc/command-reference/add) by DVC: | ||
Let's imagine we have a `train` stage in `dvc.yaml`, and corresponding files in | ||
the <abbr>workspace</abbr>: | ||
|
||
```yaml | ||
train: | ||
cmd: python train.py data.csv | ||
deps: | ||
- data.csv | ||
- train.py | ||
outs: | ||
- logs | ||
- model.h5 | ||
``` | ||
|
||
```dvc | ||
$ ls | ||
bar.csv bar.csv.dvc foo.csv foo.csv.dvc | ||
dvc.lock dvc.yaml data.csv data.csv.dvc model.h5 logs train.py | ||
|
||
$ cat .gitignore | ||
/foo.csv | ||
/bar.csv | ||
/data.csv | ||
/model.h5 | ||
/logs | ||
``` | ||
|
||
This removes `foo.csv.dvc` and double checks that its entry is gone from | ||
`.gitignore`: | ||
Using `dvc remove` on the stage name will remove the stage from `dvc.yaml`, and | ||
corresponding entries from `.gitignore`. With the `--outs` option, the actual | ||
files and directories are deleted too (`logs/` and `model.h5` in this example): | ||
|
||
```dvc | ||
$ dvc remove foo.csv.dvc | ||
$ dvc remove train --outs | ||
|
||
$ ls | ||
bar.csv bar.csv.dvc foo.csv | ||
dvc.lock dvc.yaml data.csv data.csv.dvc train.py | ||
|
||
$ cat .gitignore | ||
/bar.csv | ||
/data.csv | ||
``` | ||
|
||
> The same procedure applies to tracked directories. | ||
> Notice that the dependencies (`data.csv` and `train.py`) are not deleted. | ||
|
||
## Example: remove a stage and its output | ||
## Example: remove a specific stage output | ||
Review comment on lines -77 to +95: Not sure we need this example AND the next one ("remove specific data" from .dvc files) |
||
|
||
Let's imagine we have a `train` stage in `dvc.yaml`, and corresponding files in | ||
the <abbr>workspace</abbr>: | ||
Assuming we have the same initial <abbr>workspace</abbr> as before: | ||
|
||
```yaml | ||
train: | ||
cmd: python train.py data.py | ||
cmd: python train.py data.csv | ||
daavoo marked this conversation as resolved.
|
||
deps: | ||
- data.csv | ||
- train.py | ||
outs: | ||
- model | ||
- logs | ||
- model.h5 | ||
``` | ||
|
||
```dvc | ||
$ ls | ||
dvc.lock dvc.yaml foo.csv foo.csv.dvc model train.py | ||
dvc.lock dvc.yaml data.csv data.csv.dvc model.h5 logs train.py | ||
|
||
$ cat .gitignore | ||
/data.csv | ||
/model.h5 | ||
/logs | ||
``` | ||
|
||
Using `dvc remove` on the stage name will remove that entry from `dvc.yaml`, and | ||
its outputs from `.gitignore`. With the `--outs` option, its outputs are also | ||
deleted (just the `model` file in this example): | ||
`dvc remove` can also be used on **individual** <abbr>outputs</abbr> of a | ||
stage (by file name): | ||
|
||
```dvc | ||
$ dvc remove train --outs | ||
$ dvc remove model.h5 | ||
|
||
$ ls | ||
dvc.lock dvc.yaml foo.csv foo.csv.dvc train.py | ||
dvc.lock dvc.yaml data.csv data.csv.dvc logs train.py | ||
|
||
$ cat .gitignore | ||
/data.csv | ||
/logs | ||
``` | ||
|
||
> Notice that the dependencies (`data.csv` and `train.py`) are not deleted. | ||
The `model.h5` file is removed from the <abbr>workspace</abbr> and `.gitignore`, | ||
but note that `dvc.yaml` is not updated. | ||
|
||
## Example: remove specific data | ||
|
||
Assuming we have the same initial <abbr>workspace</abbr> as before: | ||
|
||
```yaml | ||
train: | ||
cmd: python train.py data.csv | ||
deps: | ||
- data.csv | ||
- train.py | ||
outs: | ||
- logs | ||
- model.h5 | ||
``` | ||
|
||
```dvc | ||
$ ls | ||
dvc.lock dvc.yaml data.csv data.csv.dvc model.h5 logs train.py | ||
|
||
$ cat .gitignore | ||
/data.csv | ||
/model.h5 | ||
/logs | ||
``` | ||
|
||
Using `dvc remove` on a tracked file name will remove the corresponding `.dvc` | ||
file and `.gitignore` entry: | ||
|
||
```dvc | ||
$ dvc remove data.csv | ||
|
||
$ ls | ||
dvc.lock dvc.yaml data.csv model.h5 logs train.py | ||
|
||
$ cat .gitignore | ||
/model.h5 | ||
/logs | ||
``` | ||
|
||
> The same procedure applies to tracked directories. |
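The examples in this file describe four kinds of target, each touching different things, which is the mixed semantics raised in review. The Python sketch below only tabulates what the proposed docs state; it does not reflect DVC's code. The names `Effect` and `remove_effect`, and the target-kind labels, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effect:
    """What one `dvc remove` call changes, as described in the proposed docs."""

    metafile_changed: bool         # .dvc file deleted or stage dropped from dvc.yaml
    gitignore_entry_removed: bool  # matching .gitignore entry deleted
    workspace_data_deleted: bool   # the data file or directory itself deleted


def remove_effect(target_kind: str, outs: bool = False) -> Effect:
    """Hypothetical lookup of the documented behavior per target kind."""
    if target_kind == "stage_name":
        # Stage outputs are only deleted when --outs is passed.
        return Effect(True, True, outs)
    if outs:
        # The diff describes --outs only for stage names; not modelled here.
        raise NotImplementedError("--outs is only documented for stage names")
    documented = {
        "dvc_file": Effect(True, True, False),      # e.g. data.csv.dvc
        "tracked_path": Effect(True, True, False),  # e.g. data.csv
        "stage_output": Effect(False, True, True),  # e.g. model.h5
    }
    return documented[target_kind]


if __name__ == "__main__":
    for kind in ("stage_name", "dvc_file", "tracked_path", "stage_output"):
        print(kind, remove_effect(kind))
```

Laid out this way, `stage_output` is the only kind that deletes data while leaving the metafile untouched.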
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -30,7 +30,7 @@ corresponding `.gitignore` entry). The data file is now no longer being tracked | |
after this: | ||
|
||
```dvc | ||
$ dvc remove data.csv.dvc | ||
$ dvc remove data.csv | ||
|
||
$ git status | ||
Untracked files: | ||
Review comment: it won't be untracked, right? it will become missing? |
||
|
Review comment: I think it would be great to have an example (maybe just extending the one below) that shows how to remove the data from the cache, although I don't know that current gc functionality is sufficient to really be useful in this scenario.
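One possible shape for such an example, as a hedged Python sketch: it assembles `dvc remove` followed by `dvc gc --workspace` and only executes them when explicitly asked. The function names are hypothetical. Whether the `gc` scopes are adequate here is the open question above, since `--workspace` keeps only cache data referenced from the current workspace and can therefore also drop data referenced only by other commits or branches.

```python
import subprocess
from typing import List


def cache_cleanup_commands(dvc_file: str) -> List[List[str]]:
    """Hypothetical helper: commands to stop tracking a file, then prune the cache."""
    return [
        ["dvc", "remove", dvc_file],   # delete the .dvc file and .gitignore entry
        ["dvc", "gc", "--workspace"],  # drop cache data unused by the workspace
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command; execute it only when dry_run is False."""
    shown = []
    for cmd in commands:
        line = "$ " + " ".join(cmd)
        print(line)
        shown.append(line)
        if not dry_run:
            # `dvc gc` asks for confirmation interactively before deleting.
            subprocess.run(cmd, check=True)
    return shown


if __name__ == "__main__":
    run(cache_cleanup_commands("data.csv.dvc"))  # dry run: prints, executes nothing
```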