Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diff: consider including modified git-tracked files #3385

Closed
shcheklein opened this issue Feb 23, 2020 · 12 comments
Closed

diff: consider including modified git-tracked files #3385

shcheklein opened this issue Feb 23, 2020 · 12 comments
Labels
feature request Requesting a new feature

Comments

@shcheklein
Copy link
Member

shcheklein commented Feb 23, 2020

If the purpose of the dvc diff is to take a glance at what has changed in my iteration vs HEAD, or compare two experiments, it would be really convenient to include output of git diff --name-only in the output.

E.g. when I run dvc diff baseline-experiment bigrams-experiment, I get output like:

Modified:
    auc.metric
    data/features/
    data/features/test.pkl
    data/features/train.pkl
    model.pkl

files summary: 0 added, 0 deleted, 4 modified

it's fine, but can be misleading or not very informative. In this specific case there is also a change in the script train.py itself which is an essential part of the pipeline. I think, it would make to see at the name here.

Any though @iterative/engineering ?

UPDATE: Giving a second thought to this, I see that it's not even about DVC-tracked (cached) files vs Git-tracked files but DVC outputs vs outputs+dependencies vs all files (DVC and Git).

So, three options:

  1. Only changes to DVC "outputs" (e.g. including the Git-tracked metric file like in the example) above;
  2. Changes to DVC outputs and dependencies (that we take from DVC-files);
  3. All Git and DVC files.
@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Feb 23, 2020
@casperdcl
Copy link
Contributor

casperdcl commented Feb 23, 2020

I'd vote 2.

3 if specified via a --git flag, embellishing git-tracked files with an asterisk/colour

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Feb 24, 2020

I'd vote for 2. as well, but maybe also 3. if dvc install has been run ("full Git integration mode").


Other possibly unnecessary comments

the purpose of the dvc diff is to take a glance at what has changed in my iteration vs HEAD

This seems like the tricky case: the default behavior, where dvc diff (without args) compares current workspace (working tree, including "local changes that you have made but not yet committed") -minus- HEAD.

it's not even about DVC-tracked (cached) files vs Git-tracked files but DVC outputs vs outputs+dependencies vs all files (DVC and Git).

I think it's actually about [ DVC-tracked but (Git) uncommitted changes in current workspace ] -minus- HEAD (current behavior)
vs. any of the other options (2. and 3.)

This is in contrast to dvc status which checks workspace changes not yet tracked by DVC.
Related comment: iterative/dvc.org#953 (comment)

@dmpetrov dmpetrov added feature request Requesting a new feature and removed triage Needs to be triaged labels Feb 24, 2020
@dmpetrov
Copy link
Member

Now we have:
0. Diff only for data files and dirs changes.

All the options (probably exclude (1)) might be very convenient in different cases. The major question is - what should be used as a default. If we include Git file changes in the default output then we need to provide clear marks in the output what is data file/dir and what is not. If the default is (0) then it might be ok not to supply this mark (in JSON-format it might be still valuable).

@dmpetrov
Copy link
Member

I'd vote for (2) if we come up with a good formating for Git/Data files. Otherwise - (0).
In any case, it worth to keep all (probably except (1) ) through options.

@shcheklein
Copy link
Member Author

@dmpetrov 0 - meaning no changes to metric files (no matter cached or not)?

@dmpetrov
Copy link
Member

@dmpetrov 0 - meaning no changes to metric files (no matter cached or not)?

It supposed to work for changed data files including metrics, right?

@shcheklein
Copy link
Member Author

@dmpetrov I'm just trying to understand what is the difference between 0 and 1 then.

@dmpetrov
Copy link
Member

0 is the current behavior

@shcheklein
Copy link
Member Author

@dmpetrov yep, that's what I meant by 1 :) may be I missed something important about the current behavior though, but I didn't mean to change it - 1 is the current one, at least the way I understood it initially.

@efiop
Copy link
Contributor

efiop commented Feb 28, 2020

In this specific case there is also a change in the script train.py itself which is an essential part of the pipeline

But it is not part of the "data" part of dvc, it is part of the pipelines, for which dvc status would show that it has changes. I think you are mixing up git status and git diff here, they are not the same things.

@shcheklein
Copy link
Member Author

But it is not part of the "data" part of dvc

What is "data" part of DVC though? Is a metric (non cached) file is? Will new config files be?

I think you are mixing up git status and git diff

one of the biggest difference between them is their purpose which also dictates their CLI interface for example. status is a summary of a default diff (workspace agains last commit), you can put a simplification like this. But with diff you can compare (and even show as a summary) to arbitrary objects. My point that even for git they might overlap but have specific needs and different interfaces, outputs, etc.

I do see what you want to achieve with this - keep dvc diff tight only to the pure data management, I'm just not sure if it exactly what we want and need from a product perspective. Thus I had the question above - what is the use case for CLI? what do I want to see when I compare two tags? (and obviously status can not do that by definition).

@efiop
Copy link
Contributor

efiop commented Dec 8, 2023

Closing as stale. Doesn't seem like there is any demand for this after vscode found a workaround.

@efiop efiop closed this as not planned Won't fix, can't repro, duplicate, stale Dec 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Requesting a new feature
Projects
None yet
Development

No branches or pull requests

5 participants