Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove: output file/dir by path #2357

Closed
wants to merge 8 commits into from
Closed

Conversation

daavoo
Copy link
Contributor

@daavoo daavoo commented Apr 6, 2021

Docs for: iterative/dvc#5757

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @daavoo ! That's an important change that we never had time to start making.

From reading this update (and not taking a look at the DVC PR) the way end users would read - I have a few questions. Mostly, it's not clear what semantics this operation has. I put some comments in the PR, it would be great if we could go and improve the docs/terminology.

@daavoo
Copy link
Contributor Author

daavoo commented Apr 8, 2021

Thanks for the review, I realized that I need to change many more parts of the doc page in order to better explain the change.

Comment on lines 109 to 112
## Example: remove a single output file by name

Imagine the same <abbr>workspace</abbr> as before but now we have multiple
`outs` for the `train` stage:
Copy link
Contributor

@jorgeorpinel jorgeorpinel Apr 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## Example: remove a single output file by name
Imagine the same <abbr>workspace</abbr> as before but now we have multiple
`outs` for the `train` stage:
## Example: remove a stage output by file name
Imagine the we have multiple <abbr>outputs</abbr> for a `train` stage:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why focus the example on stage outputs? What about making it about tracked data in general? Showing both removing an added dataset (dir), as well as a stage output (a model), for example. And/or update a previous example to mention the alternative to targeting .dvc files.

Copy link
Contributor Author

@daavoo daavoo Apr 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are totally right, I was too focused on stage outputs because that was the "use case" described in the issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happens. And we can go with this for now (just needs some fixes as indicated by Ivan) and I'll change it later. Unless you don't mind reworking it in a more general form along the way 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documenting this change is being definitely more challenging than implementing it xD.

@jorgeorpinel After reading iterative/dvc#2575 (comment) , I'm thinking that maybe we should remove the example dvc remove data.csv.dvc in favor of dvc remove data.csv, given that the behavior is now the same but new users might don't be aware of the concept of .dvc files?

Copy link
Contributor Author

@daavoo daavoo Apr 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing section ## Example: remove a tracked file kind of overlaps in content with the separate page # How to Stop Tracking Data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @daavoo, I can wrap this up once iterative/dvc#5757 is merged.

@daavoo
Copy link
Contributor Author

daavoo commented Apr 13, 2021

After trying to better explain the semantics I have now the feeling that the underlying problem might be that the current ui of dvc remove is capable of doing too many different things, as described by @jorgeorpinel in iterative/dvc#5791

/logs
```

The <abbr>output</abbr> file is actually being removed from the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daavoo besides removing it from the .gitignore, does it mean that we have pretty much rm -f?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently. Is that a bug @shcheklein ? It feels like it should be smart enough to edit dvc.yaml in this case.

@@ -30,7 +30,7 @@ corresponding `.gitignore` entry). The data file is now no longer being tracked
after this:

```dvc
$ dvc remove data.csv.dvc
$ dvc remove data.csv

$ git status
Untracked files:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it won't be untracked, right? it will become missing?

Note that, when using stage name as target, the actual <abbr>output</abbr> files
or directories of the stage (`outs` field) are not removed by this command,
unless the `--outs` option is used which will remove **all** of them.
Alternatively, you can the names of individual <abbr>output</abbr> files or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but in this case we don't do anything to the .dvc or dvc.yaml?

this mixed semantics confuses me a bit to be honest

with some targets we touch dvc.yaml, but not data ... with some targets we touch data, but not DVC files

also, --outs that is not clear how it should be have when file/directory path is provided

It feels to me that even if we merge this (and the DVC core one), we'll need to get back to the whiteboard pretty soon with this dvc remove that we are long overdue cc @efiop @dberenbaum

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, not sure if this has enough value to do this now at least - it seems it's pretty much a replacement to rm -rf (+git command to remove .gitignore), unless I'm missing something? @daavoo what is your use case that you had in mind?

sorry, that we didn't get back to you initially in the DVC core ticket (or may be I missed the discussion).

Copy link
Contributor Author

@daavoo daavoo Apr 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, not sure if this has enough value to do this now at least - it seems it's pretty much a replacement to rm -rf (+git command to remove .gitignore), unless I'm missing something? @daavoo what is your use case that you had in mind?

sorry, that we didn't get back to you initially in the DVC core ticket (or may be I missed the discussion).

I agree with the problem of mixed semantics @shcheklein . Sadly I don't think I'm in position to explain why (an intended use case) but rather how (what the new behavior is); I just picked a Good First Issue ticket to force myself to dive into the source code 😅 .

I think the original use cases were:

But maybe I misinterpreted the use cases along the implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry for the lack of clarity here after you've already put in so much work @daavoo.

but in this case we don't do anything to the .dvc or dvc.yaml?

I just tested, and it looks like .dvc is removed with this command but dvc.yaml won't be touched. Is that right @daavoo?

There might be enough straightforward utility here to merge the PR if it is limited to making dvc remove data.xml equivalent to dvc remove data.xml.dvc to support iterative/dvc#2575 (comment).

The rest might be valuable, too, but it seems like the semantics need to be straightened out in iterative/dvc#5791 first.

Copy link
Contributor Author

@daavoo daavoo Apr 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you consider that we need to take a step back and put this and the dvc P.R. on pause or even discarding them in favor of waiting to find the right semantics, I would not mind it at all.

I wanted to dive into the source code and your thoughtful development process and I did; don't feel obligated to try to merge this just because of the work i already put.

Comment on lines 21 to 23
> `dvc remove` doesn't remove files from the DVC <abbr>cache</abbr> or
> [remote storage](/doc/command-reference/remote). Use `dvc gc` for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be great to have an example (maybe just extending the one below) that shows how to remove the data from the cache, although I don't know that current gc functionality is sufficient to really be useful in this scenario.

Copy link
Contributor

@jorgeorpinel jorgeorpinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy edits on the examples (will commit these)

content/docs/command-reference/remove.md Outdated Show resolved Hide resolved
content/docs/command-reference/remove.md Outdated Show resolved Hide resolved
content/docs/command-reference/remove.md Outdated Show resolved Hide resolved
content/docs/command-reference/remove.md Outdated Show resolved Hide resolved
content/docs/command-reference/remove.md Outdated Show resolved Hide resolved
content/docs/command-reference/remove.md Outdated Show resolved Hide resolved
Comment on lines -77 to +95
## Example: remove a stage and its output
## Example: remove a specific stage output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we need this example AND the next one ("remove specific data" from .dvc files)

@jorgeorpinel
Copy link
Contributor

We can take this over once iterative/dvc#5757 is done. @daavoo if you don't mind just reminding us when/if that happens. Thanks!

@dberenbaum
Copy link
Contributor

@jorgeorpinel Thanks for taking a look! I think this PR is on pause pending iterative/dvc#5791 (see #2357 (comment)).

@jorgeorpinel jorgeorpinel changed the title remove: Add example about removing single output file by name remove: output file/dir by path Apr 28, 2021
@jorgeorpinel jorgeorpinel added the ⌛ status: wait-core-merge Waiting for related product PR merge/release label May 6, 2021
@shcheklein
Copy link
Member

Closing this as stale for now, let's get back to this when core repo is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
⌛ status: wait-core-merge Waiting for related product PR merge/release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants