Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

import/get directory with git files and dvc outputs #3087

Closed
JPFrancoia opened this issue Jan 8, 2020 · 16 comments
Closed

import/get directory with git files and dvc outputs #3087

JPFrancoia opened this issue Jan 8, 2020 · 16 comments
Assignees
Labels
bug Did we break something? p1-important Important, aka current backlog of things to do research

Comments

@JPFrancoia
Copy link

DVC version: 0.80.0
DVC installed with Brew, on Mac OS

I'm trying to import/fetch external data.
I just did a:

dvc import https://github.com/my_company/my_repo data

which gives me a data folder and a data.dvc files in my current folder. Now, I'd like to pull the files contained in the data folder. The files are stored in a S3 bucket.

dvc pull -R data.dvc

But I immediately get this error:

ERROR: unexpected error - 'rev_lock'

Am I using the right workflow? How do I solve this issue?

Cheers

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Jan 8, 2020
@efiop
Copy link
Contributor

efiop commented Jan 8, 2020

Hi @JPFrancoia !

Your workflow is correct, looks like there is a bug on our side. Could you show log for dvc pull -R data.dvc -v, please? And dvc version log too, please.

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Jan 8, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Jan 8, 2020
@JPFrancoia
Copy link
Author

JPFrancoia commented Jan 9, 2020

Hi @efiop ,

Here is the log for the pull:

DEBUG: PRAGMA user_version;
DEBUG: fetched: [(3,)]
DEBUG: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
DEBUG: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
DEBUG: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
DEBUG: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
DEBUG: PRAGMA user_version = 3;
DEBUG: SELECT count from state_info WHERE rowid=?
DEBUG: fetched: [(12202,)]
DEBUG: UPDATE state_info SET count = ? WHERE rowid = ?
ERROR: unexpected error - 'rev_lock'
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/Cellar/dvc/0.80.0/libexec/lib/python3.7/site-packages/dvc/main.py", line 48, in main
    ret = cmd.run()
  File "/usr/local/Cellar/dvc/0.80.0/libexec/lib/python3.7/site-packages/dvc/command/data_sync.py", line 30, in run
    recursive=self.args.recursive,
  File "/usr/local/Cellar/dvc/0.80.0/libexec/lib/python3.7/site-packages/dvc/repo/__init__.py", line 32, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/usr/local/Cellar/dvc/0.80.0/libexec/lib/python3.7/site-packages/dvc/repo/pull.py", line 28, in pull
    recursive=recursive,
  File "/usr/local/Cellar/dvc/0.80.0/libexec/lib/python3.7/site-packages/dvc/repo/fetch.py", line 45, in _fetch
    recursive=recursive,
  File "/usr/local/Cellar/dvc/0.80.0/libexec/lib/python3.7/site-packages/dvc/repo/__init__.py", line 266, in used_cache
    filter_info=filter_info,
  File "/usr/local/Cellar/dvc/0.80.0/libexec/lib/python3.7/site-packages/dvc/stage.py", line 1055, in get_used_cache
    cache.update(out.get_used_cache(*args, **kwargs))
  File "/usr/local/Cellar/dvc/0.80.0/libexec/lib/python3.7/site-packages/dvc/output/base.py", line 416, in get_used_cache
    cache.external[dep.repo_pair].add(dep.def_path)
  File "/usr/local/Cellar/dvc/0.80.0/libexec/lib/python3.7/site-packages/dvc/dependency/repo.py", line 37, in repo_pair
    return d[self.PARAM_URL], d[self.PARAM_REV_LOCK] or d[self.PARAM_REV]
KeyError: 'rev_lock'
------------------------------------------------------------


Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

And for the dvc version:

DVC version: 0.80.0
Python version: 3.7.6
Platform: Darwin-19.2.0-x86_64-i386-64bit
Binary: False
Package: brew
Cache: reflink - True, hardlink - True, symlink - True

But I also realized just now that my data folder in the original repo doesn't have a .dvc file, because I added the files contained in this folder with dvc add -R data/ (I want a dvc file for each file in this folder, not just one dvc file for the folder). Does it matter? If recursively importing a repository without a dvc file isn't supported, maybe a different error message would suffice?

@pared
Copy link
Contributor

pared commented Jan 9, 2020

Reproduction script:

#!/bin/bash

rm -rf src_repo repo storage
mkdir src_repo repo storage

maindir=$(pwd)

pushd src_repo
git init --quiet && dvc init -q

dvc remote add -q -d str $maindir/storage
mkdir data
echo 1 >> data/1
echo 2 >> data/2

dvc add -q -R data
dvc push -q
git add -A
git commit -am "commit"
popd

set -e
set -x

pushd repo
dvc init --no-scm --quiet
dvc import $maindir/src_repo data
dvc pull -R data

Expected behavior:
We should either prevent user from importing directories containing .dvc files or do not create .dvc file for imported directory and allow dvc pull -R for imported dir. Now we end up with data dir that contains .dvc files and data.dvc - that should not happen.

@pared
Copy link
Contributor

pared commented Jan 9, 2020

@JPFrancoia that definitely looks like bug on our side, I would recommend importing particular .dvc files one by one until we fix it. Would it be feasible for you?

@JPFrancoia
Copy link
Author

yes it is. I can also create a .dvc file for the directory in the original repo, it doesn't matter too much. I was just wondering why the feature wouldn't work.

or do not create .dvc file for imported directory and allow dvc pull -R for imported dir.

My preference would be this solution.

@pared
Copy link
Contributor

pared commented Jan 9, 2020

My preference would be this solution.

Got it, I can totally understand that it can be the desired behavior. I don't see any obvious reasons why shouldn't we allow that, @efiop, what do you think?

@pared
Copy link
Contributor

pared commented Jan 9, 2020

Side note: if we are to support this import -R we need to remember that .gitignore will need to be imported/created too, playing around with repro script shows that we might end up with data sources not ignored by git.

@efiop
Copy link
Contributor

efiop commented Jan 10, 2020

@pared Ok, so from formal standpoint that import is correct, as there is a git directory data and we support those. And the formal bug here is that dvc import for git file/directory doesn't record rev_lock, which is a 100% bug, which happened because flow for git files doesn't record rev_lock like we do for cached files in DependencyREPO.fetch(). So to solve it we need to do https://github.com/iterative/dvc/blob/0.80.0/dvc/dependency/repo.py#L69 for git files too in https://github.com/iterative/dvc/blob/0.80.0/dvc/dependency/repo.py#L98 .

Ok, now back to the general issue of dvc allowing you to import a git directory that contains some cached dvc outputs, but not bringing the outputs along with it. I would say that intuition tells me that it should download git files as well as cached dvc outputs, because that is a natural state of a dvc repo. But it is also understandable that maybe users would want to sometimes bring only git files or dv cfiles by that filter, for which we could introduce special flags later, but I wouldn't introduce those until requested.

Also, in retrospective, we might've been better off by only allowing get/import for git files when a special flag --git is specified, but that feels less intuitive for me than the solution I've described above, where you think of a repo that you are importing from as a fully checked-out dvc and git repo, where git and dvc-tracked files already present.

@efiop efiop added bug Did we break something? p1-important Important, aka current backlog of things to do labels Jan 10, 2020
@efiop
Copy link
Contributor

efiop commented Jan 10, 2020

or do not create .dvc file for imported directory and allow dvc pull -R for imported dir.

@pared Not sure I understand this one, could you elaborate, please?

@pared
Copy link
Contributor

pared commented Jan 10, 2020

@efiop That was wrong on my side, described behavior would be similar to import-url and is plainly wrong in the context of import. The whole point of import is to keep a note of .dvc files.

[EDIT]
@efiop, damn it, sorry, again wrong. When you dvc import {url} data that has been added with dvc add -R data it will create data.dvc which will describe data folder, which should not take place at all. Run my repro script without the last line and you will see.

@efiop efiop added p0-critical and removed p1-important Important, aka current backlog of things to do labels Jan 13, 2020
@efiop
Copy link
Contributor

efiop commented Jan 13, 2020

@efiop, damn it, sorry, again wrong. When you dvc import {url} data that has been added with dvc add -R data it will create data.dvc which will describe data folder, which should not take place at all.

@pared Not sure I follow, why it shouldn't take place? In my opinion dvc import url data should create data.dvc, that would have git and dvc-outputs inside.

@pared
Copy link
Contributor

pared commented Jan 13, 2020

@efiop The problem is that target directory conains *.dvc files inside, so I think that us creating .dvc file upon import for this dir is actually a bug.

@pared pared self-assigned this Jan 13, 2020
@efiop
Copy link
Contributor

efiop commented Jan 14, 2020

For the record: p0 part is :

@pared Ok, so from formal standpoint that import is correct, as there is a git directory data and we support those. And the formal bug here is that dvc import for git file/directory doesn't record rev_lock, which is a 100% bug, which happened because flow for git files doesn't record rev_lock like we do for cached files in DependencyREPO.fetch(). So to solve it we need to do https://github.com/iterative/dvc/blob/0.80.0/dvc/dependency/repo.py#L69 for git files too in https://github.com/iterative/dvc/blob/0.80.0/dvc/dependency/repo.py#L98 .

@efiop
Copy link
Contributor

efiop commented Jan 14, 2020

Submitted #3151 to unblock another ticket , but it is not solving the pull issue though. In repo_pair it expects url and rev and rev_lock to be present, which might not be the case (e.g. hand-edited dvc-file with intentionally removed rev_lock). Need to look into that.

efiop added a commit to efiop/dvc that referenced this issue Jan 21, 2020
URL should be required and rev or rev_lock should be optional.

Related to iterative#3087
@efiop efiop added p1-important Important, aka current backlog of things to do research and removed c5-half-a-day awaiting response we are waiting for your reply, please respond! :) labels Jan 21, 2020
@efiop efiop removed their assignment Jan 30, 2020
@pared
Copy link
Contributor

pared commented Feb 11, 2020

Note:
It's estimated as research because it is unclear how to handle update in case of importing
directory containing both git and dvc-tracked files.
Directories containing .dvc files should not be allowed to be import-ed and get-ed.

@efiop efiop changed the title Can't pull a .dvc file after dvc import import/get directory with git files and dvc outputs Feb 12, 2020
@efiop efiop self-assigned this Apr 21, 2020
@efiop efiop assigned pmrowla and unassigned efiop Apr 28, 2020
pmrowla added a commit to pmrowla/dvc that referenced this issue May 14, 2020
@pmrowla
Copy link
Contributor

pmrowla commented May 25, 2020

After the changes for #3811 we have proper support for directories with mixed git files and DVC outputs (and dvcfiles are filtered out on import). So with a slightly modified reproduction script:

#!/bin/bash

rm -rf src_repo repo storage
mkdir src_repo repo storage

maindir=$(pwd)

pushd src_repo
git init --quiet && dvc init -q

dvc remote add -q -d str $maindir/storage
mkdir data
echo 1 >> data/1
echo 2 >> data/2

dvc add -q -R data
dvc push -q

echo 3 >> data/3

git add -A
git commit -am "commit"
popd

set -e
set -x

pushd repo
dvc init --no-scm --quiet
dvc import $maindir/src_repo data
dvc pull -R data

with the source repo

src_repo
└── data
    ├── 1
    ├── 1.dvc
    ├── 2
    ├── 2.dvc
    └── 3

(where 3 is a git versioned file)

running dvc import path_to/src_repo data generates

repo
├── data
│   ├── 1
│   ├── 2
│   └── 3
└── data.dvc

so both DVC outs and the git-only file are imported, and we only generate data.dvc for the entire imported directory.

adding an import -R parameter should not be necessary with these changes, marking this as closed.

@pmrowla pmrowla closed this as completed May 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? p1-important Important, aka current backlog of things to do research
Projects
None yet
Development

No branches or pull requests

4 participants