-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stage: create: reset repo when removing existing stage #3349
Conversation
dvc/stage.py
Outdated
@@ -599,6 +599,8 @@ def create(repo, accompany_outs=False, **kwargs): | |||
raise StageFileAlreadyExistsError(stage.relpath) | |||
|
|||
os.unlink(path) | |||
# Removing stage file makes `repo.stages` invalid, need to reset. | |||
repo.reset() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should do it here. We should repo._reset
in things like repo.add/run/imp/etc
. I know that we are using reset
in the wrapper above, but that was justified and this one feels like spreading the context too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it does not feel well to put it here, I will try to find another solution.
dvc/repo/__init__.py
Outdated
@@ -489,7 +489,7 @@ def checkout(self, *args, **kwargs): | |||
def fetch(self, *args, **kwargs): | |||
return self._fetch(*args, **kwargs) | |||
|
|||
def _reset(self): | |||
def reset(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make it "public" though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for that.
@pared Looks like you've forgotten to methion the ticket that this is closing :) |
Also, let's add a test. π |
From discussion in #2886 it seems like only |
@efiop, that is actually not true, as I am inspecting the code,
Edit: Edit2:
|
So to summarize: the problem occurs when we are re- |
@pared Looks like there is a miscommunication here. You are showing tests that are not really replicating what happens in the original, issue, right?
is not the same as what we get in that error, where we |
Ok, so what happens is we have
and when And so this particular issue doesn't happen on |
I believe I do. Yes, you are right, that they do not accept targets, but it is still possible to replicate same |
@pared Ah, i see. I guess the miscommunication is that I'm talking about CLI and in it you won't be able to replicate the issue anywhere except in |
@pared The reason I'm trying to clarify that your tests are not strictly the same as the original issue is because your test cases could be solved by resetting in |
@efiop, ah that is right, for the original issue resetting in locked decorator will not solve the problem. |
@pared Or maybe we could combine Though, |
@efiop I think you are right about resetting during add. Basically any call for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Because of the way we collect stages and cache them, we were not able to collect them for the `add` without removing them from the workspace. As doing so, we'd have two same/similar stages - one collected from the workspace and the other just created from the `dvc add` in-memory. This would raise errors during graph checks, so we started to delete them and reset them (which is very recently, see iterative#2886 and iterative#3349). By deleting the file before we even do any checks, we are making DVC fragile, and results in data loss for the users with even simple mistakes. This should make it more reliable and robust. And, recently, we have started to keep state of a lot of things, that by resetting them on each stage, we waste a lot of performance, especially on gitignores. We cache the dulwich's IgnoreManager, which when resetted too many times, will waste a lot of our time just collecting them again next time (see iterative#6227). It's hard to say how much this improves, as this very much depends on no. of gitignores in the repo (which can be assumed to be quite in number for a dvc repo) and the amount of files that we are adding (eg: `-R` adding a large directory). On a directory with 10,000 files (in a datadet-registry repo), creating stages on `dvc add -R` went from 64 files/sec to 1.1k files/sec.
* add: do not delete stage files before add Because of the way we collect stages and cache them, we were not able to collect them for the `add` without removing them from the workspace. As doing so, we'd have two same/similar stages - one collected from the workspace and the other just created from the `dvc add` in-memory. This would raise errors during graph checks, so we started to delete them and reset them (which is very recently, see #2886 and #3349). By deleting the file before we even do any checks, we are making DVC fragile, and results in data loss for the users with even simple mistakes. This should make it more reliable and robust. And, recently, we have started to keep state of a lot of things, that by resetting them on each stage, we waste a lot of performance, especially on gitignores. We cache the dulwich's IgnoreManager, which when resetted too many times, will waste a lot of our time just collecting them again next time (see #6227). It's hard to say how much this improves, as this very much depends on no. of gitignores in the repo (which can be assumed to be quite in number for a dvc repo) and the amount of files that we are adding (eg: `-R` adding a large directory). On a directory with 10,000 files (in a datadet-registry repo), creating stages on `dvc add -R` went from 64 files/sec to 1.1k files/sec. * add tests * make the test more specific
Because of the way we collect stages and cache them, we were not able to collect them for the `add` without removing them from the workspace. As doing so, we'd have two same/similar stages - one collected from the workspace and the other just created from the `dvc add` in-memory. This would raise errors during graph checks, so we started to delete them and reset them (which is very recently, see iterative#2886 and iterative#3349). By deleting the file before we even do any checks, we are making DVC fragile, and results in data loss for the users with even simple mistakes. This should make it more reliable and robust. And, recently, we have started to keep state of a lot of things, that by resetting them on each stage, we waste a lot of performance, especially on gitignores. We cache the dulwich's IgnoreManager, which when resetted too many times, will waste a lot of our time just collecting them again next time (see iterative#6227). It's hard to say how much this improves, as this very much depends on no. of gitignores in the repo (which can be assumed to be quite in number for a dvc repo) and the amount of files that we are adding (eg: `-R` adding a large directory). On a directory with 10,000 files (in a datadet-registry repo), creating stages on `dvc add -R` went from 64 files/sec to 1.1k files/sec.
β Have you followed the guidelines in the Contributing to DVC list?
π Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.
β Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.
Thank you for the contribution - we'll try to review it as soon as possible. π
Fixes #2886