dvc diff or some dataset management tooling #770
Comments
Hi @villasv ! Great idea! We will be sure to get to implementing it after 0.9.8. Thanks, |
Thank you @villasv. The idea is great! But we should understand that in many cases dvc works at GB scale, where diff has no meaning. The command might be abused and many users can come up with a conclusion that Let's check file size before How do you guys think about introducing the limit? |
@dmpetrov I thought about introducing a limit, but there is actually no good reason to do that, since we can simply leave it for user to decide whether he wants to kill |
btw... there are no just throwing out some idea - hide the FS specific and SCM\Git specific commands under a special command like |
I think that performance is a valid concern, and probably some aggressive warnings should be issued beforehand since dvc knows the file sizes instead of simply forbidding a certain file size. Perhaps some people will still want to One extra care is that the previous revision from the |
Relevant new development: https://youtu.be/fw6P6VFPo24 |
@villasv I agree with you. File size limitation is definitely not the way to go. We might add a warning, but considering other tools (e.g. plain diff and git diff) don't do that, I think that we are pretty safe not printing anything as well. If the operation takes too long, the user could just CTRL+C it as usual :) Missing local cache is also a great point and we will be sure to account for that (e.g. maybe something like proposed --fetch for Thanks for the link! |
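For reference, the "warn instead of forbid" option discussed above could look roughly like the sketch below. The threshold value, the function name and the message wording are all assumptions made for illustration; nothing in this thread fixes them, and this is not DVC code.

```python
import os

# Hypothetical threshold; the thread does not settle on any specific value.
DEFAULT_THRESHOLD_BYTES = 1024 ** 3  # 1 GiB


def size_warning(path, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Return a warning string if the file is larger than the threshold, else None."""
    size = os.path.getsize(path)
    if size > threshold_bytes:
        return f"warning: {path} is {size} bytes; a content diff may be slow"
    return None
```

The caller decides what to do with the message (print it, ask for confirmation, or ignore it), so the diff itself is never blocked.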
Just a heads up. I investigated the library I mentioned (tdda) but it wasn't of much help. I made a small script that achieves what I want for CSV and JSONLines. I'm pretty sure if this was incorporated in DVC it would be a bit cleaner, because I wouldn't need to invoke so many subprocesses, but I decided to implement it standalone instead of inside a PR just for a Proof of Concept, and maybe we don't really want to put something so file-type-specific and formatting-opinionated into DVC (yet). Perhaps a dvc plugin? The first of its kind? My goal was to inspect each line separately and aggregate additions ( Examples using my own project:
The script:
|
Unfortunately, because I invoke
:-) but that's really minor |
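The script itself is not preserved in this thread. As a rough illustration of the idea described above (treat each line of a CSV or JSON Lines file as a record and aggregate additions and removals), here is a minimal sketch. The function name and the returned keys are assumptions, and it compares lines as multisets, so reordered lines count as unchanged.

```python
from collections import Counter


def summarize_line_changes(old_lines, new_lines):
    """Count added, removed and unchanged lines, ignoring line order."""
    old = Counter(old_lines)
    new = Counter(new_lines)
    added = new - old        # lines that occur more often in the new version
    removed = old - new      # lines that occur more often in the old version
    unchanged = old & new    # lines present in both versions
    return {
        "added": sum(added.values()),
        "removed": sum(removed.values()),
        "unchanged": sum(unchanged.values()),
    }
```

A changed line shows up as one removal plus one addition, which is usually enough to tell whether a small code change touched a small part of the data.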
Hi @villasv ! This looks amazing! I see no reason to make it a plugin, since it looks well suited to be a part of core dvc functionality. Would you like to file a PR with Thanks, |
@villasv Btw, we have introduced a community chat recently at https://dvc.org/chat and we would be very honored to have you there 🙂 |
Big thanks to @django-kz for contributing |
Hi, I just discovered this issue mentioned on Discord. Here's an idea: Not ideal but we could provide a short guide in the docs for now on how to use |
That would be amazing @jorgeorpinel ! Great idea! |
I don't like putting this hack into the official docs, to be honest. Especially if we are thinking about implementing this in the future. I would document this workaround in this ticket and maybe put a link from the docs. |
Or a trick I’ve seen around: a question and semi-official-but-ok-to-become-dated answer on Stack Overflow |
OK here's a quick example on how to do it based on https://github.com/iterative/example-get-started (assumes the project has been cloned and user moved into the local repo). Also, we'll focus on files EDIT: The behavior of
$ dvc pull -aT
Preparing to download data from 'https://remote.dvc.org/get-started'
...
$ dvc diff HEAD^
dvc diff from 30a96ce to 6c73875
...
diff for 'model.pkl'
-model.pkl with md5 a66489653d1b6a8ba989799367b32c43
+model.pkl with md5 3863d0e317dee0a55c4e59d2ec0eef33
...
diff for 'auc.metric'
-auc.metric with md5 f58e5ccd66bf1195b53f458e7f619ab8
...
$ mkdir /tmp/a
$ cp model.pkl /tmp/a/model.pkl
$ cp auc.metric /tmp/a/auc.metric
$ git checkout HEAD^
...
$ dvc checkout
[##############################] 100% Checkout finished!
$ mkdir /tmp/b
$ cp model.pkl /tmp/b/model.pkl
$ cp auc.metric /tmp/b/auc.metric
$ diff /tmp/a/model.pkl /tmp/b/model.pkl
Binary files /tmp/a/model.pkl and /tmp/b/model.pkl differ
Finally, use
$ diff /tmp/a/auc.metric /tmp/b/auc.metric
1c1
< 0.602818
---
> 0.588426
UPDATE: Link to this comment added to https://dvc.org/doc/command-reference/diff in iterative/dvc.org@9ce552a
Hi, I have read the docs and this thread but I still have problems with I have a directory with several binary files under control of dvc. Whenever I execute
The output is not really helpful because it doesn't provide any information what exactly has changed in the directory. Of course it shows me that some files have changed but I knew it anyway. Since I work with binary files I don't think I need line-by-line diff but rather list of changed binary files. The possible output for
P.S.: |
@nik123 If the directory was |
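To make the request above concrete (a list of changed files inside a tracked directory rather than a line-by-line diff), here is a small sketch. It assumes that each version of the directory is available as a list of entries with `relpath` and `md5` keys; that input shape and both function names are assumptions for illustration, not a documented DVC interface.

```python
def load_manifest(entries):
    """Turn a list of {"relpath": ..., "md5": ...} entries into a mapping."""
    return {entry["relpath"]: entry["md5"] for entry in entries}


def diff_manifests(old, new):
    """Compare two {relative_path: md5} mappings and classify each path."""
    added = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    modified = sorted(p for p in old if p in new and old[p] != new[p])
    return {"added": added, "deleted": deleted, "modified": modified}
```

Because only hashes are compared, this works the same way for binary files and never reads file contents.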
Copied from Discord, qna channel:
|
Hi @ammarasmro
Right, this is discussed above. There are probably thousands of data formats so this is not easy as with regular diff which only works on plain text files. Do you have a more specific use case? E.g. a certain data format or a set of formats? Thanks |
So the particular use case that raised this issue was a text dataset. The process we, the ML team, have used, is that after the data team gets data in any format, they process it and we get it as CSV, TSV, TXT, JSON, .md files. So a lot of our experiments at least start with a text data file. We also use other formats like |
Would be great to have line-by-line comparison at least for "text-based" files (TXT, CSV, JSON, MD, YAML, TOML, XML, ...). By the way, great job with |
@ammarasmro and @stefanocoretta for now you can refer to the #770 (comment) above for a procedure to do that. It's not clear that we want such a feature for really large files, even when they're plaintext. But yes, maybe! |
a request from a high-priority client:
|
Should it fall within DVC's feature set? Maybe we can find better diff tools to recommend instead. |
The request was mostly about an easy way to compare any text files versioned by DVC and not so much about showing data format-specific diff. |
Maybe new users who are familiar with Git. But would DS/ ML Engs expect a printed line-to-line comparison of huge dataset versions? Maybe |
It makes sense to me as a feature request. Probably not as the default, but if people specifically want to see the diff of the contents of the files, I see no reason we shouldn't let them. It's more an issue of warning the users it could be expensive and/or unhelpful, and I'm not sure when we could prioritize this since there are pretty straightforward workarounds. |
In terms of implementation, maybe some sort of hidden DVC experiments (custom Git refs) can be used and then literally |
We shouldn't be staging things in git that are supposed to be gitignored, even in hidden refs. I think what we want here is just something equivalent to git's .gitattributes + difftool support: https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes So you would tell DVC what external diff tool you want it to use for what file extensions. And then based on file extension, we would just automatically call the correct tool depending on what file the user is trying to diff. This also lets you extend support for diffing beyond text files. So you could do things like pass image formats into gui image diffing tools, and arbitrary binary files into binary diff tools, etc |
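A sketch of the extension-to-tool dispatch described above. The mapping, the function name and the `my-image-diff` tool are hypothetical placeholders; DVC exposes no such configuration in this thread, and the point is only to show how a file extension could select an external command.

```python
from pathlib import Path

# Hypothetical user configuration: file extension -> external diff command.
# "my-image-diff" is a placeholder name, not a real tool.
DIFF_TOOLS = {
    ".csv": ["diff", "-u"],
    ".json": ["jd"],
    ".png": ["my-image-diff"],
}


def pick_diff_command(old_path, new_path, tools=DIFF_TOOLS, default=("cmp",)):
    """Build the argv for the diff tool registered for the file's extension."""
    suffix = Path(new_path).suffix.lower()
    tool = tools.get(suffix, list(default))
    return [*tool, str(old_path), str(new_path)]
```

The returned list can be handed to `subprocess.run`, so binary formats and GUI tools are handled the same way as text diffs.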
@alex000kim If these are only single-entry Create
import sys

import ruamel.yaml as yaml
from dvc.repo import Repo

# File passed as the first command-line argument
dvcfile = sys.argv[1]
with open(dvcfile) as f:
    content = yaml.safe_load(f)
# md5 of the first output listed in that file
md5 = content["outs"][0]["md5"]
repo = Repo()
# Resolve the md5 to the object's path in the local cache
path = repo.odb.local.oid_to_path(md5)
with open(path) as f:
    print(f.read())
Then add to
And finally add to
This is hacky and won't work for directories, but a dev could probably make it robust and working with directories without too much effort. We could consider providing external diff drivers for git that diff .dvc and dvc.lock files, starting with showing output similar to what |
Great, thanks for the example, however hacky :)
Understood. |
A relatively simple way to support tooling like this without worrying about plugins for DVC itself would be to allow
dvc get --rev REV-HASH -o /dev/stdout . data/output.json --force | jd data/output.json
Right now this command fails because it's attempting to write a temp file to /dev and can't. It seems like it could be made to work without relying on
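Until something like that works from the shell, one possible alternative is to do the comparison in Python. The sketch below reads a tracked text file at a given revision with `dvc.api.read` and diffs it against another text with the standard `difflib` module. The helper names are made up for illustration, and this assumes the file is small enough to hold in memory.

```python
import difflib


def read_revision(path, rev, repo=None):
    """Read a DVC-tracked text file at a given Git revision."""
    import dvc.api  # imported lazily so the diff helper works without DVC

    return dvc.api.read(path, repo=repo, rev=rev)


def unified_text_diff(old_text, new_text, old_label="old", new_label="new"):
    """Return a unified diff of two texts as a single string."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_lines, new_lines, fromfile=old_label, tofile=new_label)
    )
```

For example, `unified_text_diff(read_revision("data/output.json", "REV-HASH"), open("data/output.json").read())` would compare the version at `REV-HASH` with the working copy.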
dvc plays really well with git, but one thing that I still miss in a data version control system that I really value in source version control systems is the tooling to inspect patches. Because data should be a deterministic reproducible output from the source code, almost all important changes are found in the code history. But frequently I also want to inspect what changed in the data.
Concrete scenario: let's say that I have an input csv and a transformation script that outputs a sanitized version of that input file. Then, I make a very small change in the sanitizing strategy, and run the command again. I can see that the output file changed hash, so I know that my code indeed changed behavior. But 99.999% of the output data points stayed the same, just a very minor portion of the file changed.
How can I inspect the data points that changed? Or the files in a very large output directory that changed? I can't git diff those files anymore because they're ignored. I believe that this is achievable with dvc.