
Stage: create: reset repo when removing existing stage #3349

Merged
efiop merged 1 commit into iterative:master from pared:2886_add_bug on Feb 19, 2020

Conversation

@pared (Contributor) commented Feb 17, 2020

  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • 📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏
Fixes #2886

dvc/stage.py Outdated
@@ -599,6 +599,8 @@ def create(repo, accompany_outs=False, **kwargs):
             raise StageFileAlreadyExistsError(stage.relpath)
 
         os.unlink(path)
+        # Removing stage file makes `repo.stages` invalid, need to reset.
+        repo.reset()
Contributor:

I don't think we should do it here. We should call repo._reset in things like repo.add/run/imp/etc. I know that we are using reset in the wrapper above, but that was justified, and this one feels like it spreads the context too much.

Contributor:

Or am I missing something?

@pared (Author):

I agree that it does not feel right to put it here; I will try to find another solution.

@@ -489,7 +489,7 @@ def checkout(self, *args, **kwargs):
     def fetch(self, *args, **kwargs):
         return self._fetch(*args, **kwargs)
 
-    def _reset(self):
+    def reset(self):
Contributor:

Why make it "public" though?

@pared (Author):

No need for that.

@efiop (Contributor) commented Feb 17, 2020

@pared Looks like you've forgotten to mention the ticket that this is closing :)

@efiop (Contributor) commented Feb 17, 2020

Also, let's add a test. 🙂

@efiop (Contributor) commented Feb 17, 2020

From discussion in #2886 it seems like only dvc add for multiple targets is affected, right? Other things like import/run/etc are not? If so, looks like we only need to add a _reset() call in add().

@pared (Author) commented Feb 17, 2020

@efiop, that is actually not true; as I inspect the code, run seems to be affected too.
The situation is the same: we run check_modified_graph, and if we call stages before it, it will fail.
Example:

def test_run_update(tmp_dir, dvc):
    dvc.run(outs=["file"], cmd="echo content > file")
    dvc.stages  # populates and caches the stage list
    dvc.run(outs=["file"], cmd="echo new_content > file")  # fails on the graph check

Edit:
and I am pretty sure that I can also trigger that in imp_url

Edit2:

def test_imp_update(tmp_dir, dvc, erepo_dir):
    with erepo_dir.chdir():
        erepo_dir.dvc_gen({"file": "file content"}, commit="commit")

    dvc.imp(fspath(erepo_dir), "file")

    dvc.stages  # populates and caches the stage list
    dvc.imp(fspath(erepo_dir), "file")  # re-import fails on the graph check

@pared (Author) commented Feb 17, 2020

So to summarize: the problem occurs when we re-create a stage and then run check_modified_graph. Every use of this combination of methods is vulnerable to cached stages. Need to think about how to handle that.

@efiop (Contributor) commented Feb 18, 2020

@pared Looks like there is a miscommunication here. You are showing tests that are not really replicating what happens in the original issue, right?

repo.stages
repo.add(...)

is not the same as what we get in that error, where we call repo.add(["bar", "data"]). I understand that you could trigger that behaviour with something synthetic like calling repo.stages before repo.run/add/imp, but I'm trying to understand what happened in the original issue.

@efiop (Contributor) commented Feb 18, 2020

Ok, so what happens is we have

for target in targets:
    stages = _create_stages...
    repo.check_modified_graph(stages)

and when data.dvc already exists and you pass ["bar", "data"] as targets, dvc collects repo.stages (including the pre-existing data.dvc) in check_modified_graph while processing bar. On the next iteration, when processing data, it sees data.dvc both in repo.stages and in stages, and so it detects a collision. If data is specified as the first entry in targets, the bug doesn't happen: repo.stages gets collected after _create_stages, at which point data.dvc doesn't exist (it gets deleted if it already exists), so repo.stages doesn't contain data.dvc and check_modified_graph(["data.dvc"]) doesn't detect any issues.

And so this particular issue doesn't happen with run/imp/etc, because they don't accept a list of targets.
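
To make that ordering dependence concrete, here is a minimal, self-contained model of the failure (a hedged sketch: the class and method names are simplified stand-ins, not DVC's actual internals):

class Repo:
    def __init__(self, dvc_files):
        self._dvc_files = set(dvc_files)  # .dvc files currently on disk
        self._stages_cache = None

    @property
    def stages(self):
        # Collected lazily and then cached -- the root of the ordering bug.
        if self._stages_cache is None:
            self._stages_cache = sorted(self._dvc_files)
        return self._stages_cache

    def create_stage(self, target):
        path = target + ".dvc"
        self._dvc_files.discard(path)  # an existing file is unlinked here,
        return path                    # but _stages_cache stays stale

    def check_modified_graph(self, new_stages):
        overlap = set(new_stages) & set(self.stages)
        if overlap:
            raise RuntimeError("output duplication: %s" % overlap)

repo = Repo(["data.dvc"])           # data.dvc already exists in the workspace
for target in ["bar", "data"]:      # same order as the failing `dvc add bar data`
    stage = repo.create_stage(target)
    repo.check_modified_graph([stage])  # raises on "data": cache kept data.dvc

Reversing the target order builds the cache only after data.dvc has been unlinked, which is exactly why the first-entry case described above passes.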

@pared (Author) commented Feb 18, 2020

@efiop

> You are showing tests that are not really replicating what happens in the original issue, right?

I believe I do. Yes, you are right that they do not accept targets, but it is still possible to replicate the same problem (OverlappingOutputPaths) even without multiple targets. The original cause is that we cache repo.stages and don't invalidate it when we delete the stage file in Stage.create. This is the root of the problem and needs to be fixed, because right now, simply calling repo.stages somewhere in the code makes repo.check_modified_graph unpredictable.
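
For context, the invalidation that _reset performs boils down to dropping a memoized value. A minimal sketch, assuming stages is cached via something like funcy.cached_property, which stores the computed result in the instance __dict__ (DVC's real Repo does more than this):

from funcy import cached_property

class Repo:
    @cached_property
    def stages(self):
        # Expensive: walks the workspace and parses every stage file.
        return self._collect_stages()

    def _reset(self):
        # cached_property keeps the computed value in the instance
        # __dict__, so popping it forces re-collection on the next access.
        self.__dict__.pop("stages", None)

    def _collect_stages(self):
        ...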

@efiop (Contributor) commented Feb 18, 2020

@pared Ah, I see. I guess the miscommunication is that I'm talking about the CLI, where you won't be able to replicate the issue anywhere except dvc add target1 target2, right? But your point about invalidating stages when something gets deleted/overwritten is a great one; that would indeed make the API more robust. I wonder if the current solution could be improved so as not to mix up the context, or if that is inevitable. It is probably caused by the create method doing too much.

@efiop (Contributor) commented Feb 18, 2020

@pared The reason I'm trying to clarify that your tests are not strictly the same as the original issue is that your test cases could be solved by resetting in the locked decorator before grabbing the lock, which would be an elegant solution for them. But the original issue won't be solved by that, because there are other things going on: we are talking about method internals.
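
The decorator-level fix being floated here would look roughly like this (a hypothetical simplification; DVC's real locked wrapper does more than this):

import functools

def locked(method):
    @functools.wraps(method)
    def wrapper(repo, *args, **kwargs):
        repo._reset()    # drop caches left over from earlier API calls
        with repo.lock:  # then take the repo lock and do the work
            return method(repo, *args, **kwargs)
    return wrapper

That covers the synthetic repo.stages-then-repo.add sequences, but not a cache that goes stale in the middle of a single add() call, which is what the original issue hits.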

@pared (Author) commented Feb 18, 2020

@efiop, ah, that is right: for the original issue, resetting in the locked decorator will not solve the problem.
It seems to me that the most "universal" solution is the one I proposed originally, that is, resetting the repo when unlinking the stage file. It takes care of both situations, at the cost of entangling Repo with Stage. Maybe we should reconsider the amount of responsibility that Stage has? I would need to take a deeper look at it to propose some alternatives.

@efiop (Contributor) commented Feb 18, 2020

@pared Or maybe we could combine reset() in locked with a reset inside add()? 🙂 That would not spread the context and would solve both issues. What do you think?

Though a reset in there (in repo.add) will create a performance hit too. Another way is to search for and replace the modified stage in repo.stages inside repo.add(). That is probably faster than resetting, and might be a good solution in addition to reset() in the locked decorator.
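
The search-and-replace variant might look something like this hypothetical helper (it reuses the cached-__dict__ assumption from the sketch above; using path as the stage's identity key is an assumption too):

def replace_cached_stage(repo, new_stage):
    # Only fix up the cache if it has actually been populated.
    stages = repo.__dict__.get("stages")
    if stages is None:
        return
    repo.__dict__["stages"] = [
        new_stage if old.path == new_stage.path else old for old in stages
    ]

It avoids a full re-collection, at the price of every code path that rewrites a stage file having to remember to call it.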

@pared (Author) commented Feb 19, 2020

@efiop I think you are right about resetting during add. Basically, any call to Stage.create invalidates the current repo.stages, because, well, we are creating a new stage.
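
Plugged into the toy model from the earlier sketch, that direction looks like this (again a sketch of the idea, not the exact merged diff):

repo = Repo(["data.dvc"])
for target in ["bar", "data"]:
    stage = repo.create_stage(target)
    repo._stages_cache = None           # the reset: next access re-collects
    repo.check_modified_graph([stage])  # passes in either target order now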

@efiop (Contributor) left a comment

Thanks!

@efiop efiop merged commit 9b2ef00 into iterative:master Feb 19, 2020
@pared pared deleted the 2886_add_bug branch March 24, 2020 09:36
skshetry added a commit to skshetry/dvc that referenced this pull request Jun 28, 2021
Because of the way we collect stages and cache them, we were not able to
collect them for `add` without removing them from the workspace.
Otherwise, we'd have two same/similar stages: one collected from the
workspace and the other just created in-memory by `dvc add`.

This would raise errors during graph checks, so we started to delete them
and reset them (a fairly recent change; see iterative#2886 and iterative#3349).

By deleting the file before we even do any checks, we make DVC fragile,
and even simple mistakes can result in data loss for users.
This change should make it more reliable and robust.

Also, we have recently started to keep state for a lot of things, so
resetting it on each stage wastes a lot of performance, especially on
gitignores. We cache dulwich's IgnoreManager, and when it is reset too
many times, we waste a lot of time just collecting it again
(see iterative#6227).

It's hard to say how much this improves things, as it depends heavily on
the number of gitignores in the repo (which can be assumed to be quite
high for a DVC repo) and the number of files being added
(e.g. `-R` adding a large directory).

On a directory with 10,000 files (in a dataset-registry repo),
creating stages with `dvc add -R` went from 64 files/sec to 1.1k files/sec.
efiop pushed a commit that referenced this pull request Jul 3, 2021
* add: do not delete stage files before add

* add tests

* make the test more specific
skshetry added a commit to skshetry/dvc that referenced this pull request Apr 27, 2022

Successfully merging this pull request may close these issues:

Bug: dvc add fails with a modified file (or directory) at the end of a list of files