Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote: base: don't checkout existing file #2358

Merged
merged 33 commits into from
Sep 6, 2019
Merged

Conversation

pared
Copy link
Contributor

@pared pared commented Aug 2, 2019

  • Have you followed the guidelines in our
    Contributing document?

  • Does your PR affect documented changes or does it add new functionality
    that should be documented? If yes, have you created a PR for
    dvc.org documenting it or at
    least opened an issue for it? If so, please add a link to it.


Fixes #2016

dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
@pared
Copy link
Contributor Author

pared commented Aug 8, 2019

@shcheklein
I've been thinking what should happen when dealing with protected cache.
We have 2 options:

  • we have protected link, in such case we probably should not remove it.
  • the link was unprotected, so now it is copy of file in cache, and should probably be removed as
    dvc add shows intention of re-adding file under dvc control, so it should be handled as "new" file.

What do you think?

@pared pared requested a review from shcheklein August 9, 2019 09:58
@shcheklein
Copy link
Member

@pared so, I'm not sure the "remove" term is correct here. From the user perspective they should see the same set of files. The only difference is that for some of them we don't copy/link them from the cache again. Can we just do checkout and protecting as two different steps? Protection/linking can be run afterwards for files that are clearly not protected/linked.

And only if it's needed during that phase we can write a message (not sure about WARNING) that we are replacing the file with a link (effectively removing in the workspace). It should be clear that we are not completely removing it. It's just in certain cases we are replacing it with a link.

@pared
Copy link
Contributor Author

pared commented Aug 12, 2019

@shcheklein

The only difference is that for some of them we don't copy/link them from the cache again. Can we just do checkout and protecting as two different steps?

I think we are drifting a bit from original problem. In this PR, we are supposed to prevent unnecessary removing and checking out file that already exists. The special case we have to handle is, what should happen when we have identical file, but it is unprotected?

I don't think discussion if we could protect after checkout is scope of this issue, since logic related to protect is embedded inside RemoteLOCAL._do_link and changing that would require some other changes.

What I think we could do, is to try to detect if particular file we are supposed to checkout is protected, and if it is not, don't remove it, but protect it. In that case, such file would not behave as in normal checkout case.
Why? Because in normal checkout, when we use some linking from cache, file is linked to its counterpart in cache. In case described above (unprotected file) this file is no longer a link, its copy of file from cache. So, protecting it will make it protected but it will not be a link from cache, which it should be in this case.

@shcheklein
Copy link
Member

@pared I think I was fine with your initial logic (detect unprotected files and restore an appropriate link - even if requires removing it from the workspace). My concern that this ticket is not only about optimization but also about a misleading warning in two different cases. Would be great to use a separate message in this case - like protecting the file, or relinking the file.

Btw, should the same apply to the unprotected DVC project, when we have a copy of the file not a link? Should we go and restore the link in this case as part of the dvc checkout? It feels reasonable to me. @efiop can you give more insights if we are doing this already and what do you think about this?

Then it becomes a more general check, not only for the sake of protecting files, but saving some space for example. It can be useful, when we change cache type. We are "drifting" a bit again, but I think it's fine if it gives more insights or leads us to a better implementation :)

@efiop
Copy link
Contributor

efiop commented Aug 13, 2019

@shcheklein

Btw, should the same apply to the unprotected DVC project, when we have a copy of the file not a link? Should we go and restore the link in this case as part of the dvc checkout? It feels reasonable to me. @efiop can you give more insights if we are doing this already and what do you think about this?

It was actually part of the reason why this change wasn't implemented earlier. Currently we have a naive approach where checkout simply re-does the link, but after this patch it won't. So, as you've said earlier, there is indeed more to this ticket if we want to solve it properly. This patch will make dvc not remove duplicates, which would be an issue. So I wouldn't merge this as is.

@pared
Copy link
Contributor Author

pared commented Aug 13, 2019

@efiop
Is my understanding of this issue right?
Problem: We don't want to remove already existing file on checkout, if we know that it is already under dvc control.

Possible cases:

  • We have unprotected repo -> no need to checkout, maybe inform user that file already exists
  • Protected repo, user wants to checkout protected file -> same as above
  • protected repo, unprotected file -> we should relink file from cache, so it has same properties as newly added file

@efiop
Copy link
Contributor

efiop commented Aug 13, 2019

@pared Yes, correct. And the main thing that we need to add to your current patch is link type detection, it looks like, so we could correctly decide if we need to relink the file.

@pared
Copy link
Contributor Author

pared commented Aug 13, 2019

@efiop
Ok, so we are on the same page.

So I guess, in 3rd case, I should detect whether already existing file is of same link type as cache:

  • if it is, we don't have to checkout
  • if its not, we need to relink the file

@efiop
Copy link
Contributor

efiop commented Aug 13, 2019

A little summary to be clear:

We have unprotected repo -> no need to checkout, maybe inform user that file already exists

Need to check if link type matches, if it is a copy but we have links, relink with a link to avoid dublication. Also need to verify that the file is no longer protected.

Protected repo, user wants to checkout protected file -> same as above

Same as before, need to check link types to avoid duplication. Need to verify that the file is protected.

protected repo, unprotected file -> we should relink file from cache, so it has same properties as newly added file

And this one becomes a regular case for 2).

@pared
Copy link
Contributor Author

pared commented Aug 13, 2019

@efiop, ok, my previous response was wrong. Ill work on adjusting this change to handle link types.

@pared pared force-pushed the 2016 branch 4 times, most recently from 3f29575 to 5a3731c Compare August 16, 2019 10:41
@pared pared changed the title [WIP] remote: base: don't checkout existing file remote: base: don't checkout existing file Aug 16, 2019
dvc/remote/local/__init__.py Outdated Show resolved Hide resolved
dvc/remote/local/__init__.py Outdated Show resolved Hide resolved
dvc/remote/local/__init__.py Outdated Show resolved Hide resolved
dvc/remote/local/__init__.py Outdated Show resolved Hide resolved
dvc/remote/local/__init__.py Outdated Show resolved Hide resolved
dvc/remote/local/__init__.py Outdated Show resolved Hide resolved
dvc/remote/local/__init__.py Outdated Show resolved Hide resolved
# `hardlink` or a `symlink`, we don't care about reflinks, because
# they are indistinguishable from copy anyway.

if not self.changed(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, this function would work perfectly fine for non-local remotes too, it is just that _is_same_link_as_cache would return True as all they support currently is copy.

dvc/remote/local/__init__.py Outdated Show resolved Hide resolved
dvc/remote/local/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! A few more comments down below.

pared and others added 2 commits September 6, 2019 09:28
Co-Authored-By: Ruslan Kuprieiev <kupruser@gmail.com>
dvc/remote/base.py Outdated Show resolved Hide resolved
Co-Authored-By: Ruslan Kuprieiev <kupruser@gmail.com>
dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@pared pared changed the title [WIP] remote: base: don't checkout existing file remote: base: don't checkout existing file Sep 6, 2019
@efiop
Copy link
Contributor

efiop commented Sep 6, 2019

Thanks!

@efiop efiop merged commit 5ec1d27 into iterative:master Sep 6, 2019
Suor added a commit to Suor/dvc that referenced this pull request Dec 2, 2019
This reverts commit 5ec1d27.

Save tests to try make them work, and `_get_cache_type()` to be used
later.
Suor added a commit to Suor/dvc that referenced this pull request Dec 2, 2019
@pared pared deleted the 2016 branch December 17, 2019 13:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Misleading warning (Removing before checkout)
4 participants