-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix import issue without remote config in the target #3120
Conversation
This PR is still a work in progress. In this PR I have tried to capture the I have not added condition to check: what if the cache is not available in the reciever of the import. I need to check the same. Any feedback regarding the approach is appreciated. |
e3e9520
to
db76bc5
Compare
db76bc5
to
bf3b069
Compare
Would appreciate review for this PR. |
@sharidas should we add a test for this? |
dvc/dependency/repo.py
Outdated
try: | ||
repo.cloud.pull(out.get_used_cache()) | ||
except NoRemoteError: | ||
# Check if the file is available in the cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is describing what the code is doing, but I would appreciate more a comment about why it is doing such thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the comment. Let me know if this looks ok?
bf3b069
to
777a493
Compare
Updated the PR by adding test. |
tests/func/test_import.py
Outdated
erepo_dir.scm_gen({src: "hello"}, commit="add a regular file") | ||
os.remove(erepo_dir.dvc.config.config_file) | ||
|
||
with tmp_dir.chdir(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are already inside of a tmp_dir, no need to chdir. 🙂
tests/func/test_import.py
Outdated
f = open(dst, "w") | ||
f.write("hello") | ||
f.close() | ||
tmp_dir.dvc.add([fspath(tmp_dir / dst)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are adding by full path, which will make dvc treat it as an external output. What you want is:
tmp_dir.dvc_gen({dst: "hello"})
looks like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note: using open + write + close is a bad style even without pathlib, where you should use with open()
instead.
tests/func/test_import.py
Outdated
f.write("hello") | ||
f.close() | ||
tmp_dir.dvc.add([fspath(tmp_dir / dst)]) | ||
os.remove(tmp_dir / dst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
os.remove(tmp_dir / dst) | |
(tmp_dir / dst).unlink() |
dvc/dependency/repo.py
Outdated
except NoRemoteError: | ||
# It would not be good idea to raise exception if the | ||
# file is already present in the cache | ||
if os.path.exists(self.repo.cache.local.get(out.checksum)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use
if os.path.exists(self.repo.cache.local.get(out.checksum)): | |
if not self.repo.cache.local.changed_cache(out.checksum): |
to make it fully work with directories and be more robust in general. There is nothing wrong with your implementation, just an obscure internal stuff that is more appropriate here.
tests/func/test_import.py
Outdated
dst = "some_file_imported" | ||
|
||
erepo_dir.scm_gen({src: "hello"}, commit="add a regular file") | ||
os.remove(erepo_dir.dvc.config.config_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I don't think this affects the test at all, because dvc import clones erepo and non-commited changes won't be cloned anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what happens in this test currently is that dvc automatically sets default remote to erepo_dir's .dvc/cache and pulls from there. https://github.com/iterative/dvc/blob/master/dvc/external_repo.py#L91 Need to somehow disable it for this test or workaround it. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@efiop If my understanding is correct https://github.com/iterative/dvc/blob/master/tests/dir_helpers.py#L282-L283 is the code which adds remote configuration. Or am I missing something. I didn't quite get the point here. I was thinking: if removal of config from the .dvc/config from the erepo would help to replicate the scenario, while writing the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sharidasan Oh, good point! I've totally forgotten about that one. We used to use that before https://github.com/iterative/dvc/blob/master/dvc/external_repo.py#L91 logic. But still, as you can see that file is commited and here you are only removing it from the workspace, not committing it, so clone of this erepo would still have that config intact.
777a493
to
25a797a
Compare
@efiop I have updated the change again. In the latest change set I have introduced a new fixture ( similar to |
tests/dir_helpers.py
Outdated
@pytest.fixture | ||
def erepo_dir_no_remote_config(tmp_path_factory, monkeypatch): | ||
from dvc.repo import Repo | ||
|
||
path = TmpDir(fspath_py35(tmp_path_factory.mktemp("erepo"))) | ||
|
||
# Chdir for git and dvc to work locally | ||
with path.chdir(): | ||
_git_init() | ||
path.dvc = Repo.init() | ||
path.scm = path.dvc.scm | ||
path.scm.commit("init dvc") | ||
|
||
path.dvc.close() | ||
|
||
return path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately removing that config
part still won't help, as https://github.com/iterative/dvc/blob/master/dvc/external_repo.py#L91 is still in place, which automatically sets it back when cloning 🙁
How about we simply mock cloud.pull
to raise NoRemoteError in our test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am bit confused here. When I execute:
python -m tests tests/func/test_import.py::test_import_cached_file
I do see a config file under location /tmp/pytest-of-sujith/pytest-0/popen-gw0/erepo0/.dvc/config
in my machine. The file is actually empty. So while cloning it sets back but the content is empty.. Kindly correct me if my obeservation is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sharidasan It creates .dvc/config.local
though, not .dvc/config
.
try: | ||
repo.cloud.pull(out.get_used_cache()) | ||
except NoRemoteError: | ||
# It would not be good idea to raise exception if the | ||
# file is already present in the cache | ||
if not self.repo.cache.local.changed_cache(out.checksum): | ||
return out | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a side note: we might later consider making pull
in general not require remote if everything is already locally available. This is out of focus for this PR, current implementation works great for what we need it for 👍 So no actions required here.
except NoRemoteError: | ||
# It would not be good idea to raise exception if the | ||
# file is already present in the cache | ||
if not self.repo.cache.local.changed_cache(out.checksum): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a correct logic for dir outs, it should be just out.changed_cache()
.
However, I don't understand the overall approach, we just created that repo it has empty cache. What's the point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see, this is for local import/shared cache case. I would say we should change .pull()
implementation to check local cache first and only go for upstream when something is missing in local cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Suor please see #3120 (comment) . out.change_cache() would be better, true.
tests/dir_helpers.py
Outdated
|
||
|
||
@pytest.fixture | ||
def erepo_dir_no_remote_config(tmp_path_factory, monkeypatch): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an overkill creating a fixture for single test. Any of the following would be better:
- use existing
erepo_dir
, drop upstream in a test - change
erepo_dir
to not have upstream, fix any issues it causes (if that is not too hard)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Suor Def don't need that new fixture, but change erepo_dir to not have upstream
won't help, see #3120 (comment)
except NoRemoteError: | ||
# It would not be good idea to raise exception if the | ||
# file is already present in the cache | ||
if not self.repo.cache.local.changed_cache(out.checksum): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see, this is for local import/shared cache case. I would say we should change .pull()
implementation to check local cache first and only go for upstream when something is missing in local cache.
This case is related get/import failure when remote config is missing. Meaning when we get/import the file/dir from the remote url. I am sharing my understanding with respect to the ticket here :) |
Fix the import issue when target does not have remote config. Signed-off-by: Sujith H <sujith.h@gmail.com>
25a797a
to
1b7eda2
Compare
I have updated the test. Also verified the expected code flow to trigger the excecption for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sharidas ! 🙏
Fix the import issue when target does not have
remote config.
Signed-off-by: Sujith H sujith.h@gmail.com
❗ Have you followed the guidelines in the Contributing to DVC list?
📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.
❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏