Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix import issue without remote config in the target #3120

Merged
merged 1 commit into from
Jan 19, 2020

Conversation

sharidas
Copy link

@sharidas sharidas commented Jan 12, 2020

Fix the import issue when target does not have
remote config.

Signed-off-by: Sujith H sujith.h@gmail.com

  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • 📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@sharidas sharidas changed the title Fix import issue without remote config in the target [WIP] Fix import issue without remote config in the target Jan 12, 2020
@sharidas
Copy link
Author

This PR is still a work in progress. In this PR I have tried to capture the NoRemoteError execption and then look for the file/dir inside the dvc repo where we are importing. If the file/dir is found then no further action required as its available in the cache.

I have not added condition to check: what if the cache is not available in the reciever of the import. I need to check the same.

Any feedback regarding the approach is appreciated.

dvc/dependency/repo.py Outdated Show resolved Hide resolved
@sharidas sharidas changed the title [WIP] Fix import issue without remote config in the target Fix import issue without remote config in the target Jan 15, 2020
@sharidas
Copy link
Author

Would appreciate review for this PR.

@shcheklein
Copy link
Member

@sharidas should we add a test for this?

try:
repo.cloud.pull(out.get_used_cache())
except NoRemoteError:
# Check if the file is available in the cache
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is describing what the code is doing, but I would appreciate more a comment about why it is doing such thing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the comment. Let me know if this looks ok?

@sharidas
Copy link
Author

@sharidas should we add a test for this?

Updated the PR by adding test.

erepo_dir.scm_gen({src: "hello"}, commit="add a regular file")
os.remove(erepo_dir.dvc.config.config_file)

with tmp_dir.chdir():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are already inside of a tmp_dir, no need to chdir. 🙂

Comment on lines 67 to 70
f = open(dst, "w")
f.write("hello")
f.close()
tmp_dir.dvc.add([fspath(tmp_dir / dst)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are adding by full path, which will make dvc treat it as an external output. What you want is:

tmp_dir.dvc_gen({dst: "hello"})

looks like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: using open + write + close is a bad style even without pathlib, where you should use with open() instead.

f.write("hello")
f.close()
tmp_dir.dvc.add([fspath(tmp_dir / dst)])
os.remove(tmp_dir / dst)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
os.remove(tmp_dir / dst)
(tmp_dir / dst).unlink()

except NoRemoteError:
# It would not be good idea to raise exception if the
# file is already present in the cache
if os.path.exists(self.repo.cache.local.get(out.checksum)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use

Suggested change
if os.path.exists(self.repo.cache.local.get(out.checksum)):
if not self.repo.cache.local.changed_cache(out.checksum):

to make it fully work with directories and be more robust in general. There is nothing wrong with your implementation, just an obscure internal stuff that is more appropriate here.

dst = "some_file_imported"

erepo_dir.scm_gen({src: "hello"}, commit="add a regular file")
os.remove(erepo_dir.dvc.config.config_file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I don't think this affects the test at all, because dvc import clones erepo and non-commited changes won't be cloned anyways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what happens in this test currently is that dvc automatically sets default remote to erepo_dir's .dvc/cache and pulls from there. https://github.com/iterative/dvc/blob/master/dvc/external_repo.py#L91 Need to somehow disable it for this test or workaround it. 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiop If my understanding is correct https://github.com/iterative/dvc/blob/master/tests/dir_helpers.py#L282-L283 is the code which adds remote configuration. Or am I missing something. I didn't quite get the point here. I was thinking: if removal of config from the .dvc/config from the erepo would help to replicate the scenario, while writing the test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sharidasan Oh, good point! I've totally forgotten about that one. We used to use that before https://github.com/iterative/dvc/blob/master/dvc/external_repo.py#L91 logic. But still, as you can see that file is commited and here you are only removing it from the workspace, not committing it, so clone of this erepo would still have that config intact.

@sharidas
Copy link
Author

@efiop I have updated the change again. In the latest change set I have introduced a new fixture ( similar to erepo_dir ) which differs from erepo_dir where no config data is present. So when the erepo is cloned an empty config file is created. Let me know if this looks kind of ok ...

Comment on lines 298 to 313
@pytest.fixture
def erepo_dir_no_remote_config(tmp_path_factory, monkeypatch):
from dvc.repo import Repo

path = TmpDir(fspath_py35(tmp_path_factory.mktemp("erepo")))

# Chdir for git and dvc to work locally
with path.chdir():
_git_init()
path.dvc = Repo.init()
path.scm = path.dvc.scm
path.scm.commit("init dvc")

path.dvc.close()

return path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately removing that config part still won't help, as https://github.com/iterative/dvc/blob/master/dvc/external_repo.py#L91 is still in place, which automatically sets it back when cloning 🙁

How about we simply mock cloud.pull to raise NoRemoteError in our test?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am bit confused here. When I execute:

python -m tests tests/func/test_import.py::test_import_cached_file

I do see a config file under location /tmp/pytest-of-sujith/pytest-0/popen-gw0/erepo0/.dvc/config in my machine. The file is actually empty. So while cloning it sets back but the content is empty.. Kindly correct me if my obeservation is wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sharidasan It creates .dvc/config.local though, not .dvc/config.

Comment on lines +95 to +102
try:
repo.cloud.pull(out.get_used_cache())
except NoRemoteError:
# It would not be good idea to raise exception if the
# file is already present in the cache
if not self.repo.cache.local.changed_cache(out.checksum):
return out
raise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a side note: we might later consider making pull in general not require remote if everything is already locally available. This is out of focus for this PR, current implementation works great for what we need it for 👍 So no actions required here.

except NoRemoteError:
# It would not be good idea to raise exception if the
# file is already present in the cache
if not self.repo.cache.local.changed_cache(out.checksum):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a correct logic for dir outs, it should be just out.changed_cache().

However, I don't understand the overall approach, we just created that repo it has empty cache. What's the point?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, this is for local import/shared cache case. I would say we should change .pull() implementation to check local cache first and only go for upstream when something is missing in local cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor please see #3120 (comment) . out.change_cache() would be better, true.



@pytest.fixture
def erepo_dir_no_remote_config(tmp_path_factory, monkeypatch):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an overkill creating a fixture for single test. Any of the following would be better:

  • use existing erepo_dir, drop upstream in a test
  • change erepo_dir to not have upstream, fix any issues it causes (if that is not too hard)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor Def don't need that new fixture, but change erepo_dir to not have upstream won't help, see #3120 (comment)

except NoRemoteError:
# It would not be good idea to raise exception if the
# file is already present in the cache
if not self.repo.cache.local.changed_cache(out.checksum):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, this is for local import/shared cache case. I would say we should change .pull() implementation to check local cache first and only go for upstream when something is missing in local cache.

@sharidas
Copy link
Author

I would say we should change .pull() implementation to check local cache first and only go for upstream when something is missing in local cache.

This case is related get/import failure when remote config is missing. Meaning when we get/import the file/dir from the remote url. I am sharing my understanding with respect to the ticket here :)

Fix the import issue when target does not have
remote config.

Signed-off-by: Sujith H <sujith.h@gmail.com>
@sharidas
Copy link
Author

I have updated the test. Also verified the expected code flow to trigger the excecption for NoRemoteError. Let me know if this change set looks ok? @efiop @shcheklein

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sharidas ! 🙏

@efiop efiop merged commit cf01357 into iterative:master Jan 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

get/import: don't require default remote for get/import in certain cases
5 participants