remote: base: don't checkout existing file #2358
@shcheklein
What do you think?
@pared so, I'm not sure the term "remove" is correct here. From the user's perspective, they should see the same set of files. The only difference is that for some of them we don't copy/link them from the cache again. Can we do checkout and protection as two separate steps? Protection/linking can run afterwards for files that are clearly not protected/linked. Only if it's needed during that phase should we write a message (not sure about WARNING) that we are replacing the file with a link (effectively removing it in the workspace). It should be clear that we are not completely removing it; in certain cases we are just replacing it with a link.
I think we are drifting a bit from the original problem. In this PR, we are supposed to prevent unnecessarily removing and checking out a file that already exists. The special case we have to handle is, I don't think the discussion of whether we could protect after checkout is in scope for this issue, since logic related to What I think we could do is try to detect whether the particular file we are supposed to check out is protected, and if it is not, don't remove it, but protect it. In that case, such a file would not behave as in
@pared I think I was fine with your initial logic (detect unprotected files and restore an appropriate link, even if that requires removing the file from the workspace). My concern is that this ticket is not only about optimization but also about a misleading warning in two different cases. It would be great to use a separate message in this case, like protecting the file or relinking the file. Btw, should the same apply to an unprotected DVC project, when we have a copy of the file and not a link? Should we go and restore the link in this case as part of the Then it becomes a more general check, not only for the sake of protecting files, but also for saving some space, for example. It can be useful when we change the cache type. We are "drifting" a bit again, but I think it's fine if it gives more insights or leads us to a better implementation :)
It was actually part of the reason why this change wasn't implemented earlier. Currently we have a naive approach where checkout simply re-does the link, but after this patch it won't. So, as you've said earlier, there is indeed more to this ticket if we want to solve it properly. This patch will make dvc not remove duplicates, which would be an issue. So I wouldn't merge this as is.
@efiop Possible cases:
@pared Yes, correct. And the main thing that we need to add to your current patch, it looks like, is link type detection, so we can correctly decide whether we need to relink the file.
@efiop So I guess, in 3rd case, I should detect whether already existing file is of same link type as cache:
A little summary to be clear:
1) Need to check whether the link type matches; if the file is a copy but we have links, relink it with a link to avoid duplication. Also need to verify that the file is no longer protected.
2) Same as before: need to check link types to avoid duplication. Need to verify that the file is protected.
3) And this one becomes a regular case for 2).
@efiop, OK, my previous response was wrong. I'll work on adjusting this change to handle link types.
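A minimal sketch of the link type check summarized above, assuming a local filesystem where both files exist. `detect_link_type` and `needs_relink` are hypothetical names used for illustration only, not DVC's API. Reflinks are reported as `copy` because they cannot be told apart from plain copies with these checks.

```python
import os


def detect_link_type(path, cache_path):
    """Classify how `path` relates to `cache_path`: "symlink", "hardlink" or "copy"."""
    if os.path.islink(path):
        # Simplification: the symlink target is not verified here.
        return "symlink"
    if os.path.samefile(path, cache_path):
        # Same inode and device, and not a symlink: a hardlink to the cache file.
        return "hardlink"
    # Copies and reflinks both have their own inode, so they look alike here.
    return "copy"


def needs_relink(path, cache_path, cache_type):
    """Return True if the existing file should be replaced by a `cache_type` link."""
    # Treat a configured reflink cache as "copy", since the two are indistinguishable.
    expected = "copy" if cache_type == "reflink" else cache_type
    return detect_link_type(path, cache_path) != expected
```

With a check like this, a workspace copy in a project configured for hardlinks would be flagged for relinking, while a file that is already a hardlink to the cache would be left alone.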
3f29575 to 5a3731c
Co-Authored-By: Ruslan Kuprieiev <kupruser@gmail.com>
dvc/remote/local/__init__.py
# `hardlink` or a `symlink`, we don't care about reflinks, because
# they are indistinguishable from copy anyway.

if not self.changed(
Btw, this function would work perfectly fine for non-local remotes too; it is just that `_is_same_link_as_cache` would return `True`, as all they support currently is `copy`.
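To illustrate the point about non-local remotes, here is a hedged sketch of how such a check could default to `True` when `copy` is the only supported link type. The class names, `is_same_link_as_cache`, and `can_skip_checkout` are hypothetical stand-ins, not the code from this PR.

```python
import os


class CopyOnlyRemote:
    """Hypothetical stand-in for a non-local remote that only supports `copy`."""

    def is_same_link_as_cache(self, path, cache_path):
        # With `copy` as the only link type, any existing file already matches.
        return True


class LocalRemote(CopyOnlyRemote):
    """Hypothetical local remote whose cache type may be a link."""

    def __init__(self, cache_type):
        self.cache_type = cache_type

    def is_same_link_as_cache(self, path, cache_path):
        is_symlink = os.path.islink(path)
        if self.cache_type == "symlink":
            return is_symlink
        shares_inode = os.path.samefile(path, cache_path)
        if self.cache_type == "hardlink":
            return not is_symlink and shares_inode
        # `copy` and `reflink`: the file must be independent of the cache file.
        return not is_symlink and not shares_inode


def can_skip_checkout(remote, path, cache_path, changed):
    """Skip checkout only for an unchanged file whose link type matches the cache."""
    return not changed and remote.is_same_link_as_cache(path, cache_path)
```

The design choice shown here is that the caller stays the same for every remote, and only the link comparison differs between them.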
Looks good! A few more comments down below.
Co-Authored-By: Ruslan Kuprieiev <kupruser@gmail.com>
Co-Authored-By: Ruslan Kuprieiev <kupruser@gmail.com>
LGTM.
Thanks!
This reverts commit 5ec1d27. Save tests to try make them work, and `_get_cache_type()` to be used later.
Fix iterative#2016 without all the complications of iterative#2358.
Have you followed the guidelines in our Contributing document?
Does your PR affect documented changes or does it add new functionality that should be documented? If yes, have you created a PR for dvc.org documenting it or at least opened an issue for it? If so, please add a link to it.
Fixes #2016