Skip to content

Conversation

@fabiosantoscode
Copy link
Contributor

@fabiosantoscode fabiosantoscode commented Jan 16, 2020

This makes dvc update update files from git repositories. git-tracked imported files were actually already handled.

Fixes #2976


  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • 📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@fabiosantoscode fabiosantoscode changed the title update: handle git-tracked imported files update: handle imported files from non-DVC git repositories Jan 16, 2020
@fabiosantoscode fabiosantoscode requested a review from efiop January 16, 2020 20:31
@shcheklein shcheklein requested a review from skshetry January 16, 2020 21:15

def test_update_git_tracked(tmp_dir, dvc, erepo_dir):
with erepo_dir.chdir():
erepo_dir.scm.repo.index.remove([".dvc"], r=True)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the need of removing .dvc from the index itself? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this issue is to check that we can dvc import and then dvc update from a git repository. So I'm purging DVC and committing that. My reasoning is that when the repo is cloned so that it can be imported or updated, its .dvc directory being in the index makes a difference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using new git_dir dir helper now.

with erepo_dir.chdir():
erepo_dir.scm.repo.index.remove([".dvc"], r=True)
shutil.rmtree(".dvc")
erepo_dir.scm_gen("file", "first version")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could do:

Suggested change
erepo_dir.scm_gen("file", "first version")
erepo_dir.scm_gen("file", "first version", commit="first version")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that but would still need a commit to remove .dvc in a way that can be cloned as a regular git repo

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also simplify this?

Suggested change
erepo_dir.scm_gen("file", "first version")
erepo_dir.scm_gen("file", "first version", commit="first version")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now done through git_dir fixture

stage = dvc.imp(fspath(erepo_dir), "file", "file")

# Just to make sure it doesn't crash
dvc.update(stage.path)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would this crash?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably defensive testing, to check that there's no error even if the file is unchanged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. The purpose here is to increase coverage a bit.

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fabiosantoscode , I think the test could be simpler, without the need of removing files:

def test_update_git_tracked(tmp_dir, dvc, erepo_dir):
    with erepo_dir.chdir():
        erepo_dir.scm_gen("file", "first version", commit="first")

    dvc.imp(fspath(erepo_dir), "file", "file")

    assert (tmp_dir / "file").read_text() == "first version"

    with erepo_dir.chdir():
        erepo_dir.scm_gen("file", "second version", commit="second")

    clean_repos()

    dvc.update("file.dvc")

    assert (tmp_dir / "file").read_text() == "second version"

I might be wrong, tho)
would appreciate hearing your comments about it.

Copy link
Collaborator

@skshetry skshetry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I agree with @MrOutis regarding test being a bit convoluted (see his example). I tried it locally and works amazingly well. 👍


with erepo_dir.chdir():
erepo_dir.scm.repo.index.remove(["file"])
os.remove("file")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to remove file, .scm_gen() overwrites.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining this, @skshetry)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't know. I'll try that out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

stage = dvc.imp(fspath(erepo_dir), "file", "file")

# Just to make sure it doesn't crash
dvc.update(stage.path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably defensive testing, to check that there's no error even if the file is unchanged?

@skshetry
Copy link
Collaborator

skshetry commented Jan 17, 2020

The following is failing for me for some reasons:

#! /usr/bin/env sh

set -ex

repo=$(mktemp -d)
upstream_repo=$(mktemp -d)

cd $repo
git init && dvc init

cd $upstream_repo
git init && dvc init
echo "foo" >> foo
git add foo
git commit -m "first version"

cd $repo
dvc import $upstream_repo foo

cd $upstream_repo
echo "foo" >> foo
git commit -am "second version"

cd $repo
dvc update foo.dvc
dvc update foo.dvc

with following error (at the end, i.e last dvc update):

ERROR: failed to update 'foo.dvc'. - config file error: Section 'remote "auto-generated-upstream"' already exists. Use `-f|--force` to overwrite section with new value.

If i do not dvc init in the upstream repo, then, it works perfectly fine.
cc @MrOutis, probably, I am doing something wrong?

@ghost
Copy link

ghost commented Jan 17, 2020

#3172 (comment)

@skshetry , interesting! I guess that's another error, since it happens on DVC initialized repositories.
I guess the error comes while trying to checkout the file.

To be honest, not sure what's going on here, @skshetry, I would need to take a look later :)

@efiop
Copy link
Contributor

efiop commented Jan 17, 2020

@skshetry That is a nice catch! Might be some issues with our caching mechanism, where it tries to auto-create a remote twice here https://github.com/iterative/dvc/blob/master/dvc/external_repo.py#L91 . Haven't looked deeply into it yet though.

@skshetry
Copy link
Collaborator

skshetry commented Jan 17, 2020

@efiop, how about this patch? Should _external_repo() complain about this?

diff --git a/dvc/external_repo.py b/dvc/external_repo.py
index 9ff2f2a4..dd4c30ce 100644
--- a/dvc/external_repo.py
+++ b/dvc/external_repo.py
@@ -92,6 +92,7 @@ def _external_repo(url=None, rev=None, cache_dir=None):
                         "auto-generated-upstream",
                         original_repo.cache.local.cache_dir,
                         default=True,
+                        force=True,
                         level=Config.LEVEL_LOCAL,
                     )
                 finally:

Another question would be, how come the cloned repo got reused twice? It's two different repo.

@efiop
Copy link
Contributor

efiop commented Jan 17, 2020

@skshetry Yep, that would work 🙂 But that might be us covering up a bug, will need to look why that happened in the first place.

@fabiosantoscode
Copy link
Contributor Author

@efiop want me to have a look in scope of this PR?

@fabiosantoscode
Copy link
Contributor Author

@skshetry nice catch!

I had a quick look and if I change dvc.external_repo._default_remote_set to always return True this doesn't happen. Probably means that this function is not finding the default remote, and then it fails to add a default remote.

Maybe we can try to add the remote and catch the resulting error instead of doing an if and checking for the default remote existing? It's probably more pythonic and less prone to race conditions.

@fabiosantoscode
Copy link
Contributor Author

@MrOutis I simplified the test with most of your example, but still need to remove the .dvc directory since this test is for pure-git repositories. DVC repositories are already tested in the same file.

Maybe we could have a new fixture erepo_raw_git_dir?

@skshetry
Copy link
Collaborator

@fabiosantoscode, I guess, the fix is out of the scope of this PR (and, not introduced by the changes).

@fabiosantoscode
Copy link
Contributor Author

@skshetry makes sense. I'll stop investigating for now.

For the record, if you add this line to the top of DependencyRepo.status() it also fixes the problem:

from dvc.external_repo import clean_repos; clean_repos()

@skshetry
Copy link
Collaborator

skshetry commented Jan 17, 2020

Maybe we could have a new fixture erepo_raw_git_dir?

I think, it makes sense to create a git-only repo fixture, as we are supporting more and more non-DVC git repositories.

But, for now, as it's only a single test, I guess, you could do the following:

from tests.dir_helpers import TmpDir
path = TmpDir()

with path.chdir():
    path.scm_gen({"file": "file"}, "first version", commit="first version")

This is done here, you could take it as an example:
https://github.com/iterative/dvc/blob/049367e4751238363ee110d6c867d5b74de67009/tests/func/test_ignore.py#L136-L139

@fabiosantoscode
Copy link
Contributor Author

fabiosantoscode commented Jan 17, 2020

@skshetry this doesn't work. TmpDir doesn't get an scm by default, the only one that does is tmp_dir. I'll add a new fixture because it's easier, and there's at least one other test where it can be used.

edit: I get TypeError: Can't use scm for this temporary dir. Did you forget to use "scm" fixture?

@skshetry
Copy link
Collaborator

skshetry commented Jan 17, 2020

@skshetry this doesn't work. TmpDir doesn't get an scm by default, the only one that does is tmp_dir. I'll add a new fixture because it's easier, and there's at least one other test where it can be used.

@fabiosantoscode, sorry for misleading. Didn't really try.
TmpDir does not have access to scm as we just scm when scm fixture is present.

@fabiosantoscode
Copy link
Contributor Author

@skshetry @MrOutis should be a lot cleaner now with the git_dir fixture yielding a simple git repository.

DeepSource is complaining about me changing the .scm property of the TmpDir, does it like java-style setters?

@skshetry
Copy link
Collaborator

skshetry commented Jan 17, 2020

DeepSource is complaining about me changing the .scm property of the TmpDir, does it like java-style setters?

@fabiosantoscode, looked through it, it's fine. Deepsource complains old stuff from the file that we have changed. We do the same scm injection on other fixtures.



@pytest.fixture
def git_dir(tmp_path_factory, monkeypatch):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest renaming it so that we know its "external git dir".
Also, if we are already initializing git repository here, how about refactoring erepo_dir so that it uses this fixture? If you consider my point valid, path generation should have logic basing on request.fixturenames (e.g. https://github.com/iterative/dvc/blob/482473aa69fd97ebde8bdd2ae23b0c703158f765/tests/dir_helpers.py#L223)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's very helpful to make this one react to other things. For example, the test test_update_git_tracked which I've added, needs DVC on the tmp_dir, but not on the external git dir. This would make this fixture useless for that test, since it would have DVC anyway.

We could refactor erepo_dir to use this fixture, but that would mean two commits, and the tests are pretty slow as it is right now.

erepo_dir.scm.add([".dvc", "file.dvc"])
erepo_dir.scm.commit("version with dvc")
new_rev = erepo_dir.scm.get_rev()
external_git_dir.dvc_gen("file", "second version")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the commit kwarg is equivalent to add + commit.

Suggested change
external_git_dir.dvc_gen("file", "second version")
external_git_dir.dvc_gen("file", "second version", commit="version with dvc")

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the idea of introducing a whole new dir_helper for a single test, but this goes far from the scope of this PR. As for now, I can't offer any immediate alternative, so let's roll with what we already have)

Left a small comment that you don't need to address, but take it into account for future contributions

@ghost ghost requested a review from Suor January 18, 2020 00:17
@Suor
Copy link
Contributor

Suor commented Jan 18, 2020

I suggest stopping on this until #3124 is resolved, at least external_repo() and cached_clone() are somehow unified, that one will either solve this issue automatically or make fixing it a no-brainer.

Trying to fix it using cached_clone() is a way to multiply a mess we have, i.e. counterproductive. So I suggest closing this PR.

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said above, let's close it. A note on a new fixture stays relevant though.



@pytest.fixture
def external_git_dir(tmp_path_factory, monkeypatch):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making a new fixture for a single test is an overkill, doing some copy-paste inside is even worse. You may simply use erepo_dir and remove .dvc dir there.

@efiop
Copy link
Contributor

efiop commented Jan 19, 2020

Closing in favor of #3190 . Sorry for wasted effort, @fabiosantoscode !

@efiop efiop closed this Jan 19, 2020
Suor added a commit to Suor/dvc that referenced this pull request Jan 19, 2020
Some things are fixed along the way:
- no cache shared between incompatible funcs
- some incosistent exceptions are now consistent
- import updates for git files now works both for dvc and git repos
- "auto-generated-upstream already exists" fixed
- dvc.api.open()/read() works with git repos now

Fixes treeverse#3124, treeverse#2976. Closes treeverse#3172.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

update: Handle Git-tracked import-ed files

5 participants