Bug: `dvc add` fails with a modified file (or directory) at the end of a list of files #2886

jaredsampson · 2019-12-03T18:35:48Z

Using DVC 0.71.0 on RHEL6, installed via conda, configured for system-wide hard links (not sure if it's relevant):

$ dvc --version
0.71.0

$ cat ~/.config/dvc/config
[cache]
protected = true
type = hardlink

I have come across what appears to be a bug, where attempting to dvc add a directory previously under DVC version control whose contents have changed, results in an error, but only when adding along with a list of files, and doesn't occur if the command is repeated (i.e. after all the other files have been added). I have reproduced this in several project directories and via the following minimal example (courtesy of @MrOutis):

dvc init --no-scm
mkdir data
echo "foo" > data/foo
dvc add data
echo "bar" > bar.txt
dvc unprotect data
echo "change" > data/foo
dvc add bar.txt data

This results in the following output:

WARNING: Output 'data' of 'data.dvc' changed because it is 'modified'
100%|██████████|Add                                       2.00/2.00 [00:02<00:00,  1.08s/file]
ERROR: file/directory 'data' is specified as an output in more than one stage: data.dvc
    data.dvc

But if the command is re-run:

$ dvc add bar.txt data
Stage is cached, skipping.
100% Add                                                  1.00/1.00 [00:01<00:00,  1.96s/file]

So it appears dvc is somehow mishandling the list of files. Of course, the expected behavior is that it will add the directory successfully on the first try.

Thanks for any effort to track down the source of the bug.

The text was updated successfully, but these errors were encountered:

ghost · 2019-12-03T18:37:03Z

Thanks for reporting this, @jaredsampson !

Including the stack trace:

ERROR: file/directory 'data' is specified as an output in more than one stage: data.dvc
    data.dvc
------------------------------------------------------------
Traceback (most recent call last):
  File "/dvc/command/add.py", line 25, in run
    fname=self.args.file,
  File "/dvc/repo/__init__.py", line 36, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/dvc/repo/add.py", line 50, in add
    repo.check_modified_graph(stages)
  File "/dvc/repo/__init__.py", line 177, in check_modified_graph
    self._collect_graph(self.stages + new_stages)
  File "/dvc/repo/__init__.py", line 327, in _collect_graph
    raise OutputDuplicationError(str(out), stages)
dvc.exceptions.OutputDuplicationError: file/directory 'data' is specified as an output in more than one stage: data.dvc
    data.dvc
------------------------------------------------------------

jaredsampson · 2019-12-03T18:44:51Z

I get the same behavior with a flat file foo.txt instead of data/foo as well:

dvc init --no-scm
mkdir data
echo "foo" > foo.txt
dvc add foo.txt
echo "bar" > bar.txt
dvc unprotect foo.txt
echo "change" > foo.txt
dvc add foo.txt bar.txt

yields

WARNING: Output 'foo.txt' of 'foo.txt.dvc' changed because it is 'modified'
100%|██████████|Add                                       2.00/2.00 [00:01<00:00,  1.51file/s]
ERROR: file/directory 'foo.txt' is specified as an output in more than one stage: foo.txt.dvc
    foo.txt.dvc

However, reversing the order of the files in the last command, everything works fine:

$ dvc add foo.txt bar.txt
WARNING: Output 'foo.txt' of 'foo.txt.dvc' changed because it is 'modified'
100% Add                                                  2.00/2.00 [00:01<00:00,  1.47file/s]

ghost · 2019-12-03T18:47:23Z

Thanks for the additional info, @jaredsampson !

The problem seems to be related to the way we collect the stages.

efiop · 2019-12-03T18:59:43Z

#!/bin/bash

set -e
set -x

rm -rf myrepo
mkdir myrepo
cd myrepo

dvc init --no-scm

mkdir data

echo "foo" > foo.txt
dvc add foo.txt

echo "bar" > bar.txt
dvc unprotect foo.txt

echo "change" > foo.txt
dvc add foo.txt bar.txt

doesn't reproduce it for me 🙁I suspect that the original issue might be cause by us not deduping the targets in dvc/repo/add.py, but I haven't looked into it closely enough.

ghost · 2019-12-03T19:01:57Z

@efiop , change the order of the last call and it should work:

- dvc add foo.txt bar.txt
+ dvc add bar.txt foo.txt

ghost · 2019-12-03T19:07:12Z

If someone is willing to give it a try, here's a test that you can play with:

diff --git a/tests/func/test_add.py b/tests/func/test_add.py
index 6d51f233..7328c84d 100644
--- a/tests/func/test_add.py
+++ b/tests/func/test_add.py
@@ -24,7 +24,7 @@ from dvc.system import System
 from dvc.utils import file_md5
 from dvc.utils import LARGE_DIR_SIZE
 from dvc.utils import relpath
-from dvc.utils.compat import range
+from dvc.utils.compat import range, pathlib
 from dvc.utils.stage import load_stage_file
 from tests.basic_env import TestDvc
 from tests.utils import get_gitignore_content
@@ -649,3 +649,10 @@ def test_escape_gitignore_entries(git, dvc_repo, repo_dir):
     dvc_repo.add(fname)
 
     assert ignored_fname in get_gitignore_content()
+
+
+def test_adding_several_files_after_one_has_been_modified(dvc_repo):
+    # https://github.com/iterative/dvc/issues/2886
+    dvc_repo.add('foo')
+    pathlib.Path('foo').write_text('change')
+    dvc_repo.add(['bar', 'foo'])

Might need revisiting the graph building process 😬

dmpetrov · 2020-02-03T05:32:18Z

@iterative/engineering it seems like an important bug. Should we make it p0?

pared · 2020-02-14T10:40:05Z

can confirm that problem still exists on master:

#!/bin/bash

rm -rf repo
mkdir repo

pushd repo
git init --quiet
dvc init -q

set -ex

mkdir data

echo foo>>data/foo
dvc add -q data

echo bar>>bar
echo change>>data/foo

dvc add bar data

pared · 2020-02-17T17:10:52Z

Ok, so little investigation helped me to narrow down the issue.
The bug shows up when we are re-add-ing existing DVC-tracked output, but only after calling
dvc.stages first. For example:
Latest version i checked with: 0.86.1

def test_re_add(tmp_dir, scm, dvc):
    tmp_dir.dvc_gen( {"file": "file content"})

    tmp_dir.gen(
            {"file": "modified content"}
    )
    dvc.stages
    dvc.add(["file"])

We will get the same OutputDuplicationError.

Why this is happening?
repo.stages is cached property, so repo is storing file.dvc. The problem is that when we are re-add-ing some file, the chain of calls is repo.add -> repo._create_stages -> Stage.create which detects that file.dvc exists and removes it, so it can be created again.

So now we don't have file.dvc because it was removed by Stage.create, we have file.dvc in memory as "new" stage that we are creating right now, and because repo's stages were not invalidated, we also have repo.stages==[file.dvc]. That is why we try to check modified graph here during add we have self.stages==[file.dvc] and new_stages==[file.dvc]. While, actually self.stages should not return it (since it was removed by Stage.create.

TLDR
When re-adding we remove stage file, but we don't invalidate cached repo.stages and end up with repo._collect_graph(['file.dvc', 'file.dvc']) which has to fail.

Solution: invalidate repo.stages when removing the stage file.

efiop · 2020-02-17T18:24:11Z

dvc.stages
dvc.add(["file"])

But we don't do that during the dvc add, right? So what specifically has caused the issue?

pared · 2020-02-17T18:48:46Z

@efiop

But we don't do that during the dvc add, right? So what specifically has caused the issue?

Well thats where the devil is. We call stages on repo.check_modified_graph that is why original issue was so strange:

If we re-add only modified artifact, stages will be called only after file.dvc was removed on Stage.create
If we do dvc add other_thingy file (where file exists and has been updated) repo.stages will be called when handling other_thingy, cached, and cause problems when handling updated file.

efiop · 2020-02-17T19:05:38Z

@pared Got it, makes sense now. So our issue is that we don't reset after creating a stage inside repo.add(), right? I suppose that issue was introduced while moving that multitarget logic from CLI to API.

pared · 2020-02-17T20:08:21Z

@efiop
not only repo.add()
To generalize discussion here and under #3349 :
Whenever we Stage.create pre-existing stage and then repo.check_modified_graph (add, imp, run) we are vulnerable to cached stages.
examples:

def test_run_update(tmp_dir, dvc):
    dvc.run(outs=["file"], cmd="echo content > file")
    dvc.stages
    dvc.run(outs=["file"], cmd="echo new_content > file")

def test_imp_update(tmp_dir, dvc, erepo_dir):
    with erepo_dir.chdir():
        erepo_dir.dvc_gen({"file": "file content"}, commit="commit")

    dvc.imp(fspath(erepo_dir), "file")

    dvc.stages
    dvc.imp(fspath(erepo_dir), "file")

pared · 2020-02-20T12:41:00Z

@jaredsampson the fix is done and should be included in next release

efiop · 2020-02-20T13:57:37Z

@pared @jaredsampson Released in 0.86.4, please upgrade and give it a try 🙂

jaredsampson · 2020-02-21T21:34:44Z

Terrific. Thanks for the fix!

Because of the way we collect stages and cache them, we were not able to collect them for the `add` without removing them from the workspace. As doing so, we'd have two same/similar stages - one collected from the workspace and the other just created from the `dvc add` in-memory. This would raise errors during graph checks, so we started to delete them and reset them (which is very recently, see iterative#2886 and iterative#3349). By deleting the file before we even do any checks, we are making DVC fragile, and results in data loss for the users with even simple mistakes. This should make it more reliable and robust. And, recently, we have started to keep state of a lot of things, that by resetting them on each stage, we waste a lot of performance, especially on gitignores. We cache the dulwich's IgnoreManager, which when resetted too many times, will waste a lot of our time just collecting them again next time (see iterative#6227). It's hard to say how much this improves, as this very much depends on no. of gitignores in the repo (which can be assumed to be quite in number for a dvc repo) and the amount of files that we are adding (eg: `-R` adding a large directory). On a directory with 10,000 files (in a datadet-registry repo), creating stages on `dvc add -R` went from 64 files/sec to 1.1k files/sec.

* add: do not delete stage files before add Because of the way we collect stages and cache them, we were not able to collect them for the `add` without removing them from the workspace. As doing so, we'd have two same/similar stages - one collected from the workspace and the other just created from the `dvc add` in-memory. This would raise errors during graph checks, so we started to delete them and reset them (which is very recently, see #2886 and #3349). By deleting the file before we even do any checks, we are making DVC fragile, and results in data loss for the users with even simple mistakes. This should make it more reliable and robust. And, recently, we have started to keep state of a lot of things, that by resetting them on each stage, we waste a lot of performance, especially on gitignores. We cache the dulwich's IgnoreManager, which when resetted too many times, will waste a lot of our time just collecting them again next time (see #6227). It's hard to say how much this improves, as this very much depends on no. of gitignores in the repo (which can be assumed to be quite in number for a dvc repo) and the amount of files that we are adding (eg: `-R` adding a large directory). On a directory with 10,000 files (in a datadet-registry repo), creating stages on `dvc add -R` went from 64 files/sec to 1.1k files/sec. * add tests * make the test more specific

Because of the way we collect stages and cache them, we were not able to collect them for the `add` without removing them from the workspace. As doing so, we'd have two same/similar stages - one collected from the workspace and the other just created from the `dvc add` in-memory. This would raise errors during graph checks, so we started to delete them and reset them (which is very recently, see iterative#2886 and iterative#3349). By deleting the file before we even do any checks, we are making DVC fragile, and results in data loss for the users with even simple mistakes. This should make it more reliable and robust. And, recently, we have started to keep state of a lot of things, that by resetting them on each stage, we waste a lot of performance, especially on gitignores. We cache the dulwich's IgnoreManager, which when resetted too many times, will waste a lot of our time just collecting them again next time (see iterative#6227). It's hard to say how much this improves, as this very much depends on no. of gitignores in the repo (which can be assumed to be quite in number for a dvc repo) and the amount of files that we are adding (eg: `-R` adding a large directory). On a directory with 10,000 files (in a datadet-registry repo), creating stages on `dvc add -R` went from 64 files/sec to 1.1k files/sec.

triage-new-issues bot added the triage Needs to be triaged label Dec 3, 2019

efiop added the p1-important Important, aka current backlog of things to do label Dec 3, 2019

triage-new-issues bot removed the triage Needs to be triaged label Dec 3, 2019

efiop added bug Did we break something? triage Needs to be triaged labels Dec 3, 2019

triage-new-issues bot removed the triage Needs to be triaged label Dec 3, 2019

jaredsampson changed the title ~~Bug: dvc add fails with a modified directory after a list of files.~~ Bug: dvc add fails with a modified file (or directory) at the end of a list of files Dec 3, 2019

weekly-digest bot mentioned this issue Dec 8, 2019

Weekly Digest (1 December, 2019 - 8 December, 2019) #2919

Closed

efiop added p0-critical Critical issue. Needs to be fixed ASAP. and removed p1-important Important, aka current backlog of things to do labels Feb 3, 2020

pared added the research label Feb 11, 2020

pared self-assigned this Feb 14, 2020

pared mentioned this issue Feb 17, 2020

Stage: create: reset repo when removing existing stage #3349

Merged

3 tasks

efiop closed this as completed in #3349 Feb 19, 2020

skshetry mentioned this issue Jun 28, 2021

add: do not delete stage files before add #6239

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: `dvc add` fails with a modified file (or directory) at the end of a list of files #2886

Bug: `dvc add` fails with a modified file (or directory) at the end of a list of files #2886

jaredsampson commented Dec 3, 2019

ghost commented Dec 3, 2019 •

edited by ghost

Loading

jaredsampson commented Dec 3, 2019

ghost commented Dec 3, 2019

efiop commented Dec 3, 2019

ghost commented Dec 3, 2019 •

edited by ghost

Loading

ghost commented Dec 3, 2019

dmpetrov commented Feb 3, 2020

pared commented Feb 14, 2020

pared commented Feb 17, 2020 •

edited

Loading

efiop commented Feb 17, 2020

pared commented Feb 17, 2020 •

edited

Loading

efiop commented Feb 17, 2020

pared commented Feb 17, 2020

pared commented Feb 20, 2020

efiop commented Feb 20, 2020

jaredsampson commented Feb 21, 2020

Bug: dvc add fails with a modified file (or directory) at the end of a list of files #2886

Bug: dvc add fails with a modified file (or directory) at the end of a list of files #2886

Comments

jaredsampson commented Dec 3, 2019

ghost commented Dec 3, 2019 • edited by ghost Loading

jaredsampson commented Dec 3, 2019

ghost commented Dec 3, 2019

efiop commented Dec 3, 2019

ghost commented Dec 3, 2019 • edited by ghost Loading

ghost commented Dec 3, 2019

dmpetrov commented Feb 3, 2020

pared commented Feb 14, 2020

pared commented Feb 17, 2020 • edited Loading

efiop commented Feb 17, 2020

pared commented Feb 17, 2020 • edited Loading

efiop commented Feb 17, 2020

pared commented Feb 17, 2020

pared commented Feb 20, 2020

efiop commented Feb 20, 2020

jaredsampson commented Feb 21, 2020

Bug: `dvc add` fails with a modified file (or directory) at the end of a list of files #2886

Bug: `dvc add` fails with a modified file (or directory) at the end of a list of files #2886

ghost commented Dec 3, 2019 •

edited by ghost

Loading

ghost commented Dec 3, 2019 •

edited by ghost

Loading

pared commented Feb 17, 2020 •

edited

Loading

pared commented Feb 17, 2020 •

edited

Loading