-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get rid of .dvc files? #4278
Comments
I have already suggested this on #3936 with a clear deprecation plan, with the same motivation as yours (simplifying ui/ux and docs). |
@jorgeorpinel, could also say |
Ah yes, could be seen as a duplicate of that one Saugat indeed, thanks for pointing it out.
We could but the point is I'm trying to avoid having to mention 2 things every time we're explaining the commands that affect both dvc.* and .dvc files... If there was a term that can encompass both (only idea so far is "DVC files") that would be ideal for now. Something like "internal files" or something... |
My thoughts on this. No matter how do I feel about two types of files and the pain dealing with them, I would say that "I'm trying to avoid having to mention 2 things every time we're explaining the commands that affect both dvc.* and .dvc files" should not be a strong argument when we make DVC core decisions. Usability, use case, user workflow (including things like merge, working with data only) - should come first. If for some reason it's better to have explicit .dvc when people deal with data - let's keep it. And find some elegant way to describe them in docs. |
I agree. I don't think that simplifying docs writing itself is a reason at all. But it tends to signal that something conceptually may need more work i.e. the workflow can be simplified for users.
Great question. I guess it's a well known strategy e.g. used by Git-LFS and others (replace large files with placeholders) and it makes the data somewhat visible when you do On the other hand that's what we have |
|
I am in favour of having everything in dvc yaml and dvc lock, otherwise you have a state where you have some files tracked by dvc.lock/yaml system and some by .dvc files. Also, .dvc files create a lot of headache for the newcomers to which I share my ML workflow as they often con |
I personally like having the .dvc files. It makes sense in my head for the workflow. I wish there was like a I track a lot of different datasets and I think git mechanics when reverting/merging should be considered. |
Good points. But currently we have a strange mix of placeholders (.dvc files) for basic tracking of *some* data (as Anton mentioned), with config files (dvc.yaml) that describe a workflow, AND lockfiles to track the *other* data (dvc.lock). I think we should either go all in for dvc.yaml/lock, or create a new lockfile for .dvc files (not storing file hashes in them). Perhaps the same dvc.lock file could contain everything — but at that point why not get rid of .dvc files altogether? Alternatively, getting rid of dvc.lock and creating .dvc files for all data specified in dvc.yaml would also be more consistent.
Good idea, you can create a feature request @bobertlo 🙂
dvc.yaml files are easy to merge. Probably easier than .dvc files right now. Lock files can always be just regenerated. |
@jorgeorpinel I hate to admit but I think I did musunderstand the relationship between all these files slightly. I must admit that moving a single hash from the @casperdcl I opened one of those issues after mentioning it here. I think there is another older issue with this mentioned that I saw later but it appeared to have been closed for lack of interest/feedback iirc? |
yes it's fine to open a new discussion. |
In my opinion,
Two solutions:
--->
or
|
Hmmm interesting analysis @karajan1001. I think the intention is that stages the nodes in DAGs that represent data pipelines though. They're connected by their outputs and dependencies, so that relationship would be the edges. In any case, the problem persists that dependencies and outputs can be defined in dvc.yaml + dvc.lock, or fully in .dvc files, which is kind of inconsistent, and your solutions do address this 👍 But your first solution idea would basically spread dvc.yaml+.lock into dvc.yaml+multiple lockfiles which seems also somewhat inconsistent. So I still vote for the original suggestion (your 2nd solution above): let's get rid of .dvc files 😃
p.s. This is already the case 🙂 |
It's not exactly correct though to my mind. |
Well, we call them outputs and dependencies though, e.g. external outputs ( |
Yep, I guess, that's the problem of the terminology and docs. |
They are placeholders or nodes, and a node can be at both sides of a directed edge, (dependencies or outputs) |
So even if we revisit the terminology and make it clear that .dvc files are simply placeholders for tracked files/dirs, they still don't cover all tracked files/dirs. Many (often most) are defined in dvc.yaml+lock (as outputs), and again, that's the confusing inconsistency I see which could be easily fixed by adding a raw data section in dvc.yaml/lock instead of using .dvc files. |
We have only two kinds of entities, nodes or edges. They used to be all stored in One of the advantages of saving them in a |
Yes, we made decision somewhat deliberately - kinda trading engineering elegancy (keep everything in the dvc.lock or keep all files/dirs in .dvc) to pipelines functionality simplicity. We decided to keep .dvc still to keep it simple and explicit for users that do not use/need pipelines. There is a clear difference though between those files/directories - those involved into pipelines are outputs. |
OK. And doesn't a If it achieves the same result, and improves the consistency, making DVC less confusing and easier to understand, I see that as an improvement to be considered. The one counter that has been expressed is about folder structure but again, we already don't reflect the original data structure when it comes to outputs, which are concentrated in dvc.yaml files that can be anywhere 🙂 |
It is a possible solution, but it complicates the data management (non-pipeline) usage. Imagine a simple case: a single data.xml file and you don't care about pipelines whatsoever. Now you have a single explicit placeholder If we decide to go with There are some benefits to this approach as well (e.g. consistency), not saying that it doesn't have any ground to it. My point, probably, is that we should be doing this change for some strong reason if we decide to do this. |
Agree. But we wouldn't deprecate .dvc files, they would be left as optional mechanism both for simple scenarios and for backwards compatibility. 💡 We could even create a command to move data from .dvc files to dvc.yaml/lock (and vice versa?) — which could double as an auto-migration command for 0.x users (I believe there's even a feature request for that somewhere).
Good Q. Either the one in the cwd, or the one in the project's root would seem like intuitive default behavior to me. An option could modify this.
Sure. That's pretty subjective to me and you and/or a team consensus are much more able to judge on this. Feel free to review my rant in the description of this issue as well as Saugat's #3936. Also, I'm not suggesting we go for this immediately, just something to seriously consider and possibly bring to the attention of the team. |
Alright well, since the discussion here seems to have come to some conclusions, I'll actually close this now. I added a summary about it in #3936 (comment). Thanks! |
If we're planning a non-backward compatible change in the near future, maybe it would be a good time to consider completely moving from .dvc files to dvc.yaml/dvc.lock. Or implement and leave the old .dvc file approach as optional for backward compatibility (for some time at least).
My motivation to suggest this is mostly conceptual right now but maybe it has some very practical implications too? Leaving that open to discussion (cc @iterative/engineering).
In DVC 1.x we created the pipelines file dvc.yaml which contains all the stages. From that point on .dvc files stopped being "stage files" and they only remain as placeholders for data files. They're no longer considered any kind of stage (we even removed the terms "stage file" and "orphan stage" from docs already). This has caused headaches in docs when explaining commands that use or affect both .dvc AND dvc.yaml or .lock files such as status, checkout, repro (pretty important ones), because we constantly need to mention both "stages and .dvc files" or "dvc.yaml and .dvc file", etc.
The options I see here are:
Create some concept that encompasses both .dvc files and stages — already discussed with Ivan and we came up with "DVC file" but we don't love it because it's too similar to ".dvc file" so will probably cause confusion. This solution is purely docs-related and implies no action here (close this issue).
Rethink some of these commands a little and make sure by default they only use/affect stages, and have a separate set of commands, or explicit option requirements for them to use .dvc files. To me this seems overkill
Get rid of .dvc files! Why can't dvc.yaml (and lock) be used for this? It's a matter of introducing a new top section. Example below
dvc.yaml
dvc.lock
An additional advantage of this would be that it reduces the possible confusions between
.dvc/
(internal dir) and.dvc
(files).The text was updated successfully, but these errors were encountered: