
Stage: create: reset repo when removing existing stage #3349

Merged
efiop merged 1 commit into iterative:master from pared:2886_add_bug on Feb 19, 2020

Conversation

@pared (Contributor) commented Feb 17, 2020

  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • 📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏
Fixes #2886

dvc/stage.py Outdated
@@ -599,6 +599,8 @@ def create(repo, accompany_outs=False, **kwargs):
             raise StageFileAlreadyExistsError(stage.relpath)
 
         os.unlink(path)
+        # Removing stage file makes `repo.stages` invalid, need to reset.
+        repo.reset()
Contributor:

I don't think we should do it here. We should call repo._reset in things like repo.add/run/imp/etc. I know that we are using reset in the wrapper above, but that was justified, and this one feels like it spreads the context too much.

Contributor:

Or am I missing something?

@pared (Author):

I agree that it does not feel right to put it here; I will try to find another solution.

@@ -489,7 +489,7 @@ def checkout(self, *args, **kwargs):
     def fetch(self, *args, **kwargs):
         return self._fetch(*args, **kwargs)
 
-    def _reset(self):
+    def reset(self):
Contributor:

Why make it "public" though?

@pared (Author):

No need for that.

@efiop (Contributor) commented Feb 17, 2020

@pared Looks like you've forgotten to mention the ticket that this is closing :)

@efiop (Contributor) commented Feb 17, 2020

Also, let's add a test. 🙂

@efiop (Contributor) commented Feb 17, 2020

From discussion in #2886 it seems like only dvc add for multiple targets is affected, right? Other things like import/run/etc are not? If so, looks like we only need to add a _reset() call in add().

@pared (Author) commented Feb 17, 2020

@efiop, that is actually not true; as I inspect the code, run seems to be affected too.
The situation is the same: we run check_modified_graph, and if we call stages before it, it will fail.
Example:

def test_run_update(tmp_dir, dvc):
    dvc.run(outs=["file"], cmd="echo content > file")
    dvc.stages  # populates and caches the stage list
    dvc.run(outs=["file"], cmd="echo new_content > file")  # fails on the graph check

Edit:
and I am pretty sure that I can also trigger that in imp_url

Edit2:

def test_imp_update(tmp_dir, dvc, erepo_dir):
    with erepo_dir.chdir():
        erepo_dir.dvc_gen({"file": "file content"}, commit="commit")

    dvc.imp(fspath(erepo_dir), "file")

    dvc.stages  # populates and caches the stage list
    dvc.imp(fspath(erepo_dir), "file")  # re-import fails on the graph check

@pared (Author) commented Feb 17, 2020

So to summarize: the problem occurs when we re-create a stage and then run check_modified_graph. Every use of this combination of methods is vulnerable to cached stages. Need to think about how to handle that.

@efiop (Contributor) commented Feb 18, 2020

@pared Looks like there is a miscommunication here. You are showing tests that are not really replicating what happens in the original issue, right?

repo.stages
repo.add(...)

is not the same as what we get in that error, where we call repo.add(["bar", "data"]). I understand that you could trigger that behaviour with something synthetic like calling repo.stages before repo.run/add/imp, but I'm trying to understand what happened in the original issue.

@efiop (Contributor) commented Feb 18, 2020

Ok, so what happens is we have

for target in targets:
    stages = _create_stages...
    repo.check_modified_graph(stages)

and when data.dvc already exists and you pass ["bar", "data"] as targets, dvc collects repo.stages (including the pre-existing data.dvc) in check_modified_graph while processing bar. On the next iteration, when processing data, it sees data.dvc both in repo.stages and in stages, and so it detects a collision. If data is specified as the first entry in targets, the bug doesn't happen: repo.stages gets collected after _create_stages, at which point data.dvc doesn't exist (it gets deleted if it already exists), so repo.stages doesn't contain data.dvc and check_modified_graph(["data.dvc"]) doesn't detect any issues.

And so this particular issue doesn't happen with run/imp/etc, because they don't accept a list of targets.
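
To make that ordering dependence concrete, here is a minimal, self-contained model of the failure (a hedged sketch: the class and method names are simplified stand-ins, not DVC's actual internals):

class Repo:
    def __init__(self, dvc_files):
        self._dvc_files = set(dvc_files)  # .dvc files currently on disk
        self._stages_cache = None

    @property
    def stages(self):
        # Collected lazily and then cached -- the root of the ordering bug.
        if self._stages_cache is None:
            self._stages_cache = sorted(self._dvc_files)
        return self._stages_cache

    def create_stage(self, target):
        path = target + ".dvc"
        self._dvc_files.discard(path)  # an existing file is unlinked here,
        return path                    # but _stages_cache stays stale

    def check_modified_graph(self, new_stages):
        overlap = set(new_stages) & set(self.stages)
        if overlap:
            raise RuntimeError("output duplication: %s" % overlap)

repo = Repo(["data.dvc"])           # data.dvc already exists in the workspace
for target in ["bar", "data"]:      # same order as the failing `dvc add bar data`
    stage = repo.create_stage(target)
    repo.check_modified_graph([stage])  # raises on "data": cache kept data.dvc

Reversing the target order builds the cache only after data.dvc has been unlinked, which is exactly why the first-entry case described above passes.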

@pared (Author) commented Feb 18, 2020

@efiop

> You are showing tests that are not really replicating what happens in the original issue, right?

I believe I do. Yes, you are right that they do not accept targets, but it is still possible to replicate the same problem (OverlappingOutputPaths) even without multiple targets. The original cause is that we cache repo.stages and don't invalidate it when we delete the stage file in Stage.create. This is the root of the problem and needs to be fixed, because right now, simply calling repo.stages somewhere in the code makes repo.check_modified_graph unpredictable.
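
For context, the invalidation that _reset performs boils down to dropping a memoized value. A minimal sketch, assuming stages is cached via something like funcy.cached_property, which stores the computed result in the instance __dict__ (DVC's real Repo does more than this):

from funcy import cached_property

class Repo:
    @cached_property
    def stages(self):
        # Expensive: walks the workspace and parses every stage file.
        return self._collect_stages()

    def _reset(self):
        # cached_property keeps the computed value in the instance
        # __dict__, so popping it forces re-collection on the next access.
        self.__dict__.pop("stages", None)

    def _collect_stages(self):
        ...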

@efiop (Contributor) commented Feb 18, 2020

@pared Ah, I see. I guess the miscommunication is that I'm talking about the CLI, where you won't be able to replicate the issue anywhere except dvc add target1 target2, right? But your point about invalidating stages when something gets deleted/overwritten is a great one; that would indeed make the API more robust. I wonder if the current solution could be improved so as not to mix up the context, or if that is inevitable. It is probably caused by the create method doing too much.

@efiop (Contributor) commented Feb 18, 2020

@pared The reason I'm trying to clarify that your tests are not strictly the same as the original issue is that your test cases could be solved by resetting in the locked decorator before grabbing the lock, which would be an elegant solution for them. But the original issue won't be solved by that, because there are other things going on: we are talking about method internals.
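
The decorator-level fix being floated here would look roughly like this (a hypothetical simplification; DVC's real locked wrapper does more than this):

import functools

def locked(method):
    @functools.wraps(method)
    def wrapper(repo, *args, **kwargs):
        repo._reset()    # drop caches left over from earlier API calls
        with repo.lock:  # then take the repo lock and do the work
            return method(repo, *args, **kwargs)
    return wrapper

That covers the synthetic repo.stages-then-repo.add sequences, but not a cache that goes stale in the middle of a single add() call, which is what the original issue hits.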

@pared (Author) commented Feb 18, 2020

@efiop, ah, that is right: for the original issue, resetting in the locked decorator will not solve the problem.
It seems to me that the most "universal" solution is the one I proposed originally, that is, resetting the repo when unlinking the stage file. It takes care of both situations, at the cost of entangling Repo with Stage. Maybe we should reconsider the amount of responsibility that Stage has? I would need to take a deeper look at it to propose some alternatives.

@efiop (Contributor) commented Feb 18, 2020

@pared Or maybe we could combine reset() in locked with a reset inside add()? 🙂 That would not spread the context and would solve both issues. What do you think?

Though a reset in there (in repo.add) will create a performance hit too. Another way is to search for and replace the modified stage in repo.stages inside repo.add(). That is probably faster than resetting, and might be a good solution in addition to reset() in the locked decorator.
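
The search-and-replace variant might look something like this hypothetical helper (it reuses the cached-__dict__ assumption from the sketch above; using path as the stage's identity key is an assumption too):

def replace_cached_stage(repo, new_stage):
    # Only fix up the cache if it has actually been populated.
    stages = repo.__dict__.get("stages")
    if stages is None:
        return
    repo.__dict__["stages"] = [
        new_stage if old.path == new_stage.path else old for old in stages
    ]

It avoids a full re-collection, at the price of every code path that rewrites a stage file having to remember to call it.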

@pared (Author) commented Feb 19, 2020

@efiop I think you are right about resetting during add. Basically, any call to Stage.create invalidates the current repo.stages, because, well, we are creating a new stage.
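
Plugged into the toy model from the earlier sketch, that direction looks like this (again a sketch of the idea, not the exact merged diff):

repo = Repo(["data.dvc"])
for target in ["bar", "data"]:
    stage = repo.create_stage(target)
    repo._stages_cache = None           # the reset: next access re-collects
    repo.check_modified_graph([stage])  # passes in either target order now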

@efiop (Contributor) left a comment

Thanks!

@efiop efiop merged commit 9b2ef00 into iterative:master Feb 19, 2020
@pared pared deleted the 2886_add_bug branch March 24, 2020 09:36
skshetry added a commit to skshetry/dvc that referenced this pull request Jun 28, 2021
Because of the way we collect stages and cache them, we were not able to
collect them for `add` without removing them from the workspace.
Otherwise, we'd have two same/similar stages: one collected from the
workspace and the other just created in-memory by `dvc add`.

This would raise errors during graph checks, so we started to delete them
and reset them (a fairly recent change; see iterative#2886 and iterative#3349).

By deleting the file before we even do any checks, we make DVC fragile,
and even simple mistakes can result in data loss for users.
This change should make it more reliable and robust.

Also, we have recently started to keep state for a lot of things, so
resetting it on each stage wastes a lot of performance, especially on
gitignores. We cache dulwich's IgnoreManager, and when it is reset too
many times, we waste a lot of time just collecting it again
(see iterative#6227).

It's hard to say how much this improves things, as it depends heavily on
the number of gitignores in the repo (which can be assumed to be quite
high for a DVC repo) and the number of files being added
(e.g. `-R` adding a large directory).

On a directory with 10,000 files (in a dataset-registry repo),
creating stages with `dvc add -R` went from 64 files/sec to 1.1k files/sec.
efiop pushed a commit that referenced this pull request Jul 3, 2021
* add: do not delete stage files before add

* add tests

* make the test more specific
skshetry added a commit to skshetry/dvc that referenced this pull request Apr 27, 2022

Successfully merging this pull request may close these issues:

Bug: dvc add fails with a modified file (or directory) at the end of a list of files