remove: output file/dir by path #2357

Closed
wants to merge 8 commits into from
133 changes: 101 additions & 32 deletions content/docs/command-reference/remove.md
Expand Up @@ -9,27 +9,30 @@ optionally delete them).
usage: dvc remove [-h] [-q | -v] [--outs] targets [targets ...]

positional arguments:
targets stages (found in dvc.yaml) or .dvc files to remove.
targets Tracked files/directories, stage names (found in
dvc.yaml), or .dvc files to remove.
```

## Description

Safely removes `.dvc` files or stages from `dvc.yaml`. This includes deleting
the corresponding `.gitignore` entries (based on the `outs` fields removed).
Safely removes tracked data (by file name, stage name, or `.dvc` file path).
This includes deleting the corresponding `.gitignore` entries.

> `dvc remove` doesn't remove files from the DVC <abbr>cache</abbr> or
> [remote storage](/doc/command-reference/remote). Use `dvc gc` for that.

Comment on lines 21 to 23
Contributor

I think it would be great to have an example (maybe just extending the one below) that shows how to remove the data from the cache, although I don't know that current gc functionality is sufficient to really be useful in this scenario.

It takes one or more stage names (see `-n` option of `dvc run`) or `.dvc` file
names as `targets`.
It takes one or more stage names (see `-n` option of `dvc run`), `.dvc` file
names or tracked files/directories as `targets`.

If there are no stages left in `dvc.yaml` after the removal, then both
`dvc.yaml` and `dvc.lock` are deleted. `.gitignore` is also deleted if there are
no more entries left in it.
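
The cleanup rule in the paragraph above (drop the entry, then delete `.gitignore` if nothing is left) could be approximated by hand as in the sketch below. This is only an illustration with standard shell tools in a scratch directory; it is not DVC's implementation, and the `workdir` variable and `.gitignore.new` file name are made up for the example.

```shell
set -eu

# Work in a throwaway directory so no real project files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Assumed starting point: a .gitignore with a single entry.
printf '/data.csv\n' > .gitignore

# Drop the exact line for the removed target
# (-v invert match, -x whole line, -F fixed string).
# grep exits non-zero when nothing is left to print, hence "|| true".
grep -vxF '/data.csv' .gitignore > .gitignore.new || true
mv .gitignore.new .gitignore

# -s is true only for a non-empty file: delete .gitignore once it is empty.
if [ ! -s .gitignore ]; then
  rm -f .gitignore
fi
```

A real `.gitignore` may also hold comments or unrelated patterns, in which case the file would not be empty and would stay in place.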

Note that the actual <abbr>output</abbr> files or directories of the stage
(`outs` field) are not removed by this command, unless the `--outs` option is
used.
Note that, when using stage name as target, the actual <abbr>output</abbr> files
or directories of the stage (`outs` field) are not removed by this command,
unless the `--outs` option is used which will remove **all** of them.
Alternatively, you can pass the names of individual <abbr>output</abbr> files or
Member

but in this case we don't do anything to the .dvc or dvc.yaml?

this mixed semantics confuses me a bit to be honest

with some targets we touch dvc.yaml, but not data ... with some targets we touch data, but not DVC files

also, with --outs it is not clear how it should behave when a file/directory path is provided

It feels to me that even if we merge this (and the DVC core one), we'll need to go back to the whiteboard on dvc remove pretty soon, which is long overdue cc @efiop @dberenbaum

Member

Overall, I'm not sure this has enough value to do now, at least: it seems to be pretty much a replacement for rm -rf (+ a git command to remove the .gitignore entry), unless I'm missing something? @daavoo what use case did you have in mind?

Sorry that we didn't get back to you initially in the DVC core ticket (or maybe I missed the discussion).

Contributor Author

@daavoo daavoo Apr 14, 2021

> Overall, I'm not sure this has enough value to do now, at least: it seems to be pretty much a replacement for rm -rf (+ a git command to remove the .gitignore entry), unless I'm missing something? @daavoo what use case did you have in mind?
>
> Sorry that we didn't get back to you initially in the DVC core ticket (or maybe I missed the discussion).

I agree with the problem of mixed semantics, @shcheklein. Sadly, I don't think I'm in a position to explain the why (an intended use case), only the how (what the new behavior is); I just picked a Good First Issue ticket to force myself to dive into the source code 😅.

I think the original use cases were:

But maybe I misinterpreted the use cases during the implementation.

Contributor

Yeah, sorry for the lack of clarity here after you've already put in so much work @daavoo.

> but in this case we don't do anything to the .dvc or dvc.yaml?

I just tested, and it looks like .dvc is removed with this command but dvc.yaml won't be touched. Is that right @daavoo?

There might be enough straightforward utility here to merge the PR if it is limited to making dvc remove data.xml equivalent to dvc remove data.xml.dvc to support iterative/dvc#2575 (comment).

The rest might be valuable, too, but it seems like the semantics need to be straightened out in iterative/dvc#5791 first.

Contributor Author

@daavoo daavoo Apr 15, 2021

If you think we need to take a step back and put this and the dvc PR on pause, or even discard them in favor of waiting to find the right semantics, I would not mind at all.

I wanted to dive into the source code and your thoughtful development process, and I did; don't feel obligated to merge this just because of the work I have already put in.

directories of a stage as `targets`.

💡 Refer to [Undo Adding Data](/doc/user-guide/how-to/stop-tracking-data) to see
how it helps replace data that is tracked by DVC.
Expand All @@ -47,61 +50,127 @@ how it helps replace data that is tracked by DVC.

- `-v`, `--verbose` - displays detailed tracing information.

## Example: remove a .dvc file
## Example: Remove stage outputs

Let's imagine we have `foo.csv` and `bar.csv` files, that are already
[tracked](/doc/command-reference/add) by DVC:
Let's imagine we have a `train` stage in `dvc.yaml`, and corresponding files in
the <abbr>workspace</abbr>:

```yaml
train:
cmd: python train.py data.csv
deps:
- data.csv
- train.py
outs:
- logs
- model.h5
```

```dvc
$ ls
bar.csv bar.csv.dvc foo.csv foo.csv.dvc
dvc.lock dvc.yaml data.csv data.csv.dvc model.h5 logs train.py

$ cat .gitignore
/foo.csv
/bar.csv
/data.csv
/model.h5
/logs
```

This removes `foo.csv.dvc` and double checks that its entry is gone from
`.gitignore`:
Using `dvc remove` on the stage name will remove the stage from `dvc.yaml`, and
corresponding entries from `.gitignore`. With the `--outs` option, the actual
files and directories are deleted too (`logs/` and `model.h5` in this example):

```dvc
$ dvc remove foo.csv.dvc
$ dvc remove train --outs

$ ls
bar.csv bar.csv.dvc foo.csv
dvc.lock dvc.yaml data.csv data.csv.dvc train.py

$ cat .gitignore
/bar.csv
/data.csv
```

> The same procedure applies to tracked directories.
> Notice that the dependencies (`data.csv` and `train.py`) are not deleted.

## Example: remove a stage and its output
## Example: remove a specific stage output
Comment on lines -77 to +95
Contributor

Not sure we need this example AND the next one ("remove specific data" from .dvc files)


Let's imagine we have a `train` stage in `dvc.yaml`, and corresponding files in
the <abbr>workspace</abbr>:
Assuming we have the same initial <abbr>workspace</abbr> as before:

```yaml
train:
cmd: python train.py data.py
cmd: python train.py data.csv
deps:
- data.csv
- train.py
outs:
- model
- logs
- model.h5
```

```dvc
$ ls
dvc.lock dvc.yaml foo.csv foo.csv.dvc model train.py
dvc.lock dvc.yaml data.csv data.csv.dvc model.h5 logs train.py

$ cat .gitignore
/data.csv
/model.h5
/logs
```

Using `dvc remove` on the stage name will remove that entry from `dvc.yaml`, and
its outputs from `.gitignore`. With the `--outs` option, its outputs are also
deleted (just the `model` file in this example):
`dvc remove` can also be used on **individual** <abbr>outputs</abbr> of a
stage (by file name):

```dvc
$ dvc remove train --outs
$ dvc remove model.h5

$ ls
dvc.lock dvc.yaml foo.csv foo.csv.dvc train.py
dvc.lock dvc.yaml data.csv data.csv.dvc logs train.py

$ cat .gitignore
/data.csv
/logs
```

> Notice that the dependencies (`data.csv` and `train.py`) are not deleted.
`model.h5` file is removed from the <abbr>workspace</abbr> and `.gitignore`,
but note that `dvc.yaml` is not updated.
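
For comparison, the end state described in this example can be reproduced by hand with standard shell tools, which is roughly the "rm -rf plus a .gitignore edit" equivalence raised in the review. The sketch below is only an approximation of the observable result in a scratch directory, not how `dvc remove` is implemented; the `workdir` variable and `.gitignore.new` file name are hypothetical.

```shell
set -eu

# Work in a throwaway directory so no real project files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Recreate the relevant part of the example workspace.
mkdir logs
touch model.h5 data.csv train.py
printf '/data.csv\n/model.h5\n/logs\n' > .gitignore

# Delete the individual output from the workspace.
rm -rf model.h5

# Remove its exact line from .gitignore
# (-v invert match, -x whole line, -F fixed string).
grep -vxF '/model.h5' .gitignore > .gitignore.new || true
mv .gitignore.new .gitignore

cat .gitignore
```

The final `cat` prints the two remaining entries, `/data.csv` and `/logs`, matching the `.gitignore` listing in the example. Nothing here touches `dvc.yaml`, which mirrors the behavior noted above.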

## Example: remove specific data

Assuming we have the same initial <abbr>workspace</abbr> as before:

```yaml
train:
cmd: python train.py data.csv
deps:
- data.csv
- train.py
outs:
- logs
- model.h5
```

```dvc
$ ls
dvc.lock dvc.yaml data.csv data.csv.dvc model.h5 logs train.py

$ cat .gitignore
/data.csv
/model.h5
/logs
```

Using `dvc remove` on a tracked file name will remove the corresponding `.dvc`
file and `.gitignore` entry:

```dvc
$ dvc remove data.csv

$ ls
dvc.lock dvc.yaml data.csv model.h5 logs train.py

$ cat .gitignore
/model.h5
/logs
```

> The same procedure applies to tracked directories.
2 changes: 1 addition & 1 deletion content/docs/user-guide/how-to/stop-tracking-data.md
Expand Up @@ -30,7 +30,7 @@ corresponding `.gitignore` entry). The data file is now no longer being tracked
after this:

```dvc
$ dvc remove data.csv.dvc
$ dvc remove data.csv

$ git status
Untracked files:
Member

it won't be untracked, right? it will become missing?
