Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get rid of .dvc files? #4278

Closed
jorgeorpinel opened this issue Jul 24, 2020 · 25 comments
Closed

Get rid of .dvc files? #4278

jorgeorpinel opened this issue Jul 24, 2020 · 25 comments
Labels
discussion requires active participation to reach a conclusion feature request Requesting a new feature

Comments

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jul 24, 2020

TLDR: Skip rant and go to the last bullet and code blocks.

If we're planning a non-backward compatible change in the near future, maybe it would be a good time to consider completely moving from .dvc files to dvc.yaml/dvc.lock. Or implement and leave the old .dvc file approach as optional for backward compatibility (for some time at least).

Some context in iterative/dvc.org#1384 (review)

My motivation to suggest this is mostly conceptual right now but maybe it has some very practical implications too? Leaving that open to discussion (cc @iterative/engineering).

In DVC 1.x we created the pipelines file dvc.yaml which contains all the stages. From that point on .dvc files stopped being "stage files" and they only remain as placeholders for data files. They're no longer considered any kind of stage (we even removed the terms "stage file" and "orphan stage" from docs already). This has caused headaches in docs when explaining commands that use or affect both .dvc AND dvc.yaml or .lock files such as status, checkout, repro (pretty important ones), because we constantly need to mention both "stages and .dvc files" or "dvc.yaml and .dvc file", etc.

The options I see here are:

  • Create some concept that encompasses both .dvc files and stages — already discussed with Ivan and we came up with "DVC file" but we don't love it because it's too similar to ".dvc file" so will probably cause confusion. This solution is purely docs-related and implies no action here (close this issue).

  • Rethink some of these commands a little and make sure by default they only use/affect stages, and have a separate set of commands, or explicit option requirements for them to use .dvc files. To me this seems overkill

  • Get rid of .dvc files! Why can't dvc.yaml (and lock) be used for this? It's a matter of introducing a new top section. Example below

dvc.yaml

data:
- corpus.csv
- dataset/

stages:
  cleanup:
    cmd: python clean.py corpus.csv df.h5
    deps:
    - corpus.csv
    outs:
    - df.h5
  ...

dvc.lock

outs:
- md5: 6137cde4893c59f76f005a8123d8e8e6
  path: df.h5
- md5: cde4876f0r5137c59f8e6a8423d8e936.dir
  path: dataset/

cleanup:
  cmd: python clean.py corpus.csv df.h5
  deps:
  - path: corpus.csv
    md5: 6137cde4893c59f76f005a8123d8e8e6
  outs:
  - path: df.h5
    md5: f40e3db3e1aa25562945045864a28deb
  ...

Something like that.

An additional advantage of this would be that it reduces the possible confusions between .dvc/ (internal dir) and .dvc (files).

@jorgeorpinel jorgeorpinel added feature request Requesting a new feature discussion requires active participation to reach a conclusion labels Jul 24, 2020
@skshetry
Copy link
Member

I have already suggested this on #3936 with a clear deprecation plan, with the same motivation as yours (simplifying ui/ux and docs).

@skshetry
Copy link
Member

@jorgeorpinel, could also say dvc.yaml is a workflow, and .dvc is a DVC file?

@jorgeorpinel
Copy link
Contributor Author

Ah yes, could be seen as a duplicate of that one Saugat indeed, thanks for pointing it out.

could also say dvc.yaml is a workflow, and .dvc is a DVC file?

We could but the point is I'm trying to avoid having to mention 2 things every time we're explaining the commands that affect both dvc.* and .dvc files... If there was a term that can encompass both (only idea so far is "DVC files") that would be ideal for now. Something like "internal files" or something...

@shcheklein
Copy link
Member

My thoughts on this. No matter how do I feel about two types of files and the pain dealing with them, I would say that "I'm trying to avoid having to mention 2 things every time we're explaining the commands that affect both dvc.* and .dvc files" should not be a strong argument when we make DVC core decisions. Usability, use case, user workflow (including things like merge, working with data only) - should come first.

If for some reason it's better to have explicit .dvc when people deal with data - let's keep it. And find some elegant way to describe them in docs.

@jorgeorpinel
Copy link
Contributor Author

I agree. I don't think that simplifying docs writing itself is a reason at all. But it tends to signal that something conceptually may need more work i.e. the workflow can be simplified for users.

If for some reason it's better to have explicit .dvc when people deal with data - let's keep it.

Great question. I guess it's a well known strategy e.g. used by Git-LFS and others (replace large files with placeholders) and it makes the data somewhat visible when you do ls even before dvc pull or similar.

On the other hand that's what we have dvc list . for too. And since we already went for the dvc.yaml file we could totally commit to that strategy and centralize everything in that almost single metafile to codify the entire DS project.

@jorgeorpinel
Copy link
Contributor Author

Reopening for now as discussion seems active and more specific here.

@antonkulaga
Copy link

I am in favour of having everything in dvc yaml and dvc lock, otherwise you have a state where you have some files tracked by dvc.lock/yaml system and some by .dvc files. Also, .dvc files create a lot of headache for the newcomers to which I share my ML workflow as they often con

@bobertlo
Copy link
Contributor

bobertlo commented Aug 2, 2020

I personally like having the .dvc files. It makes sense in my head for the workflow. I wish there was like a -g flag in dvc add to automatically add them to the pending commit.

I track a lot of different datasets and I think git mechanics when reverting/merging should be considered.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Aug 16, 2020

Good points. But currently we have a strange mix of placeholders (.dvc files) for basic tracking of *some* data (as Anton mentioned), with config files (dvc.yaml) that describe a workflow, AND lockfiles to track the *other* data (dvc.lock).

I think we should either go all in for dvc.yaml/lock, or create a new lockfile for .dvc files (not storing file hashes in them). Perhaps the same dvc.lock file could contain everything — but at that point why not get rid of .dvc files altogether? Alternatively, getting rid of dvc.lock and creating .dvc files for all data specified in dvc.yaml would also be more consistent.

I wish there was like a -g flag in dvc add to automatically add them to the pending commit.

Good idea, you can create a feature request @bobertlo 🙂

git mechanics when reverting/merging should be considered.

dvc.yaml files are easy to merge. Probably easier than .dvc files right now. Lock files can always be just regenerated.
p.s. Ruslan is writing a guide about that, actually, check it out/ leave feedback if you'd like: iterative/dvc.org#1684

@casperdcl
Copy link
Contributor

casperdcl commented Aug 16, 2020

about the -g, --git flag: seems duplicate of #3446 and #4330. I recall having a lot of discussion about adding this flag to some commands (add, diff, push, ...) but not sure if there was any conclusion.

@bobertlo
Copy link
Contributor

@jorgeorpinel I hate to admit but I think I did musunderstand the relationship between all these files slightly. I must admit that moving a single hash from the .dvc file to the dvc.yaml file would not cause much, if any, distress for users based on git mechanics.

@casperdcl I opened one of those issues after mentioning it here. I think there is another older issue with this mentioned that I saw later but it appeared to have been closed for lack of interest/feedback iirc?

@casperdcl
Copy link
Contributor

I opened one of those issues after mentioning it here. I think there is another older issue with this mentioned that I saw later but it appeared to have been closed for lack of interest/feedback iirc?

yes it's fine to open a new discussion.

@karajan1001
Copy link
Contributor

karajan1001 commented Sep 13, 2020

In my opinion,

  1. .dvc files are placeholders of data, they are nodes of the computational graph.
  2. While stages represent relationships between data, they are edges of the computational graph.
  3. Both of the nodes and edges are stable in most of my use cases, The only changing thing is the value of the nodes.

Two solutions:

  1. Deprecate dvc.lock, nodes info only stored in .dvc files.
  2. In dvc.yaml, values of deps or outs changed to .dvc nodes.
    deps:
    - corpus.csv
    outs:
    - df.h5

--->

    deps:
    - corpus.csv.dvc
    outs:
    - df.h5.dvc

or

  1. Deprecate .dvc files, nodes info all stored in dvc.lock.
  2. In dvc.yaml, values of deps or outs must to be values in dvc.lock.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Sep 15, 2020

Hmmm interesting analysis @karajan1001. I think the intention is that stages the nodes in DAGs that represent data pipelines though. They're connected by their outputs and dependencies, so that relationship would be the edges.

In any case, the problem persists that dependencies and outputs can be defined in dvc.yaml + dvc.lock, or fully in .dvc files, which is kind of inconsistent, and your solutions do address this 👍

But your first solution idea would basically spread dvc.yaml+.lock into dvc.yaml+multiple lockfiles which seems also somewhat inconsistent.

So I still vote for the original suggestion (your 2nd solution above): let's get rid of .dvc files 😃

  1. In dvc.yaml, values of deps or outs must to be values in dvc.lock.

p.s. This is already the case 🙂

@shcheklein
Copy link
Member

In any case, the problem persists that dependencies and outputs can be defined in dvc.yaml + dvc.lock, or fully in .dvc files

It's not exactly correct though to my mind. .dvc should not be defining outputs and dependencies. They define files and directories.

@jorgeorpinel
Copy link
Contributor Author

Well, we call them outputs and dependencies though, e.g. external outputs (dvc add --external) and external dependencies (dvc import).

@shcheklein
Copy link
Member

Well, we call them outputs and dependencies though, e.g. external outputs (dvc add --external) and external dependencies (dvc import).

Yep, I guess, that's the problem of the terminology and docs.

@karajan1001
Copy link
Contributor

Well, we call them outputs and dependencies though, e.g. external outputs (dvc add --external) and external dependencies (dvc import).

They are placeholders or nodes, and a node can be at both sides of a directed edge, (dependencies or outputs)

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Sep 15, 2020

.dvc should not be defining outputs and dependencies. They define files and directories.
They are placeholders

So even if we revisit the terminology and make it clear that .dvc files are simply placeholders for tracked files/dirs, they still don't cover all tracked files/dirs. Many (often most) are defined in dvc.yaml+lock (as outputs), and again, that's the confusing inconsistency I see which could be easily fixed by adding a raw data section in dvc.yaml/lock instead of using .dvc files.

@karajan1001
Copy link
Contributor

We have only two kinds of entities, nodes or edges. They used to be all stored in .dvc file mixed. Currently, some of the nodes are stored in .dvc file and some of the nodes are in dvc.lock while all of the edges are in dvc.yaml. So we have to make the nodes storing consistent.

One of the advantages of saving them in a .dvc files is that .dvc files stay at the path where the original data stay. It helps to keep the whole project structure (without these placeholders some path might be empty and can't add to Git) and is more convenient for us to image the structure and explore the paths without downloading all of the data or in a Github or Gitlab website.

@shcheklein
Copy link
Member

Many (often most) are defined in dvc.yaml+lock (outputs), and again, that's the confusing inconsistency.

Yes, we made decision somewhat deliberately - kinda trading engineering elegancy (keep everything in the dvc.lock or keep all files/dirs in .dvc) to pipelines functionality simplicity. We decided to keep .dvc still to keep it simple and explicit for users that do not use/need pipelines.

There is a clear difference though between those files/directories - those involved into pipelines are outputs.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Sep 15, 2020

OK. And doesn't a data: section in dvc.yaml achieve the same goal? I.e. do you oppose the proposed change, @shcheklein?

If it achieves the same result, and improves the consistency, making DVC less confusing and easier to understand, I see that as an improvement to be considered.

The one counter that has been expressed is about folder structure but again, we already don't reflect the original data structure when it comes to outputs, which are concentrated in dvc.yaml files that can be anywhere 🙂

@shcheklein
Copy link
Member

OK. And doesn't a data: section in dvc.yaml achieve the same goal? I.e. do you oppose the proposed change, @shcheklein?

It is a possible solution, but it complicates the data management (non-pipeline) usage. Imagine a simple case: a single data.xml file and you don't care about pipelines whatsoever. Now you have a single explicit placeholder data.xml.dvc right in the place as you would have data.xml as @karajan1001 mentioned.

If we decide to go with dvc.yaml - we'll have two files instead. Also, since we allow multiple dvc.yamls (one per directory), which one should we modify on dvc add data.xml? It becomes less explicit. It creates a bigger surface for merge conflicts when you deal with multiple independent artifacts.

There are some benefits to this approach as well (e.g. consistency), not saying that it doesn't have any ground to it. My point, probably, is that we should be doing this change for some strong reason if we decide to do this.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Sep 15, 2020

it complicates the data management (non-pipeline) usage. Imagine a simple case

Agree. But we wouldn't deprecate .dvc files, they would be left as optional mechanism both for simple scenarios and for backwards compatibility.

💡 We could even create a command to move data from .dvc files to dvc.yaml/lock (and vice versa?) — which could double as an auto-migration command for 0.x users (I believe there's even a feature request for that somewhere).

since we allow multiple dvc.yamls (one per directory), which one should we modify on dvc add data.xml?

Good Q. Either the one in the cwd, or the one in the project's root would seem like intuitive default behavior to me. An option could modify this.

we should be doing this change for some strong reason

Sure. That's pretty subjective to me and you and/or a team consensus are much more able to judge on this. Feel free to review my rant in the description of this issue as well as Saugat's #3936. Also, I'm not suggesting we go for this immediately, just something to seriously consider and possibly bring to the attention of the team.

@jorgeorpinel
Copy link
Contributor Author

Alright well, since the discussion here seems to have come to some conclusions, I'll actually close this now. I added a summary about it in #3936 (comment). Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion requires active participation to reach a conclusion feature request Requesting a new feature
Projects
None yet
Development

No branches or pull requests

7 participants