
Keep persistent outputs protected by default? #6562

Open
menegais1 opened this issue Sep 8, 2021 · 7 comments
Labels
A: pipelines (Related to the pipelines feature), feature request (Requesting a new feature), p2-medium (Medium priority, should be done, but less important), performance (improvement over resource / time consuming tasks), research

Comments

@menegais1

Due to memory concerns, some of my pipeline stages have persistent outputs that I handle in Python scripts. The stage receives a JSON file as input and outputs a folder of image files. Since there is no need to rewrite those files on every execution, my script checks whether a (read-only) file is already present in the folder and skips reprocessing it. As the dataset grows, space consumption becomes a real problem: every dvc repro has to unprotect all the files in the folder, copying them from the cache into the workspace. If an output could be marked as safe, so that it only undergoes append/remove operations, the unprotect step could be avoided, reducing space usage.
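The skip-if-present check described above might look like this in Python (a minimal sketch; the manifest format, file naming, and the render step are hypothetical stand-ins, not part of any DVC API):

```python
import json
from pathlib import Path


def render(entry: dict, target: Path) -> None:
    # Stand-in for the real image-producing step (hypothetical).
    target.write_bytes(b"image-data")


def process_images(manifest_path: str, out_dir: str) -> int:
    """Process only entries whose output file does not exist yet, so a
    re-run appends new files instead of rewriting the whole folder.
    Returns the number of files actually produced."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    entries = json.loads(Path(manifest_path).read_text())
    produced = 0
    for entry in entries:
        target = out / f"{entry['id']}.png"
        if target.exists():  # already produced in a previous run; skip
            continue
        render(entry, target)
        produced += 1
    return produced
```

On a second run over the same manifest, no file is rewritten, which is exactly why the unprotect/reprotect cycle is pure overhead here.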

@menegais1 menegais1 changed the title Flag a output as "Safe" when there is the persistent tag to avoid dvc unprotect from being run Flag an output as "Safe" when there is the persistent tag to avoid dvc unprotect from being run in pipeline stages Sep 8, 2021
@pared pared added the feature request Requesting a new feature label Sep 8, 2021
@shcheklein
Member

Probably we can avoid running unprotect here at all? Or unprotect only the directory itself, i.e. make it writable. Seems like there should be a config option for this.

@shcheklein shcheklein added the performance improvement over resource / time consuming tasks label Sep 9, 2021
@shcheklein shcheklein changed the title Flag an output as "Safe" when there is the persistent tag to avoid dvc unprotect from being run in pipeline stages Keep persistent outputs protected by default Sep 9, 2021
@shcheklein
Member

Thinking even more about this, I've renamed the ticket a bit.

To be honest, it's not clear why we do this extra step (unprotect) in the first place. In a lot of cases (the majority that come to mind, to be honest) we need append semantics: the script doesn't overwrite existing files. It can be very expensive to unprotect things and protect them back every time.

I'm not sure about checkpoints, though, but we already use a different flag for them for other reasons?

cc @dberenbaum @pmrowla

@shcheklein shcheklein changed the title Keep persistent outputs protected by default Keep persistent outputs protected by default? Sep 9, 2021
@pared pared added the research label Sep 9, 2021
@dberenbaum
Collaborator

For checkpoints, I think it will be necessary to unprotect because it's about loading state from the existing model file and writing an updated model file at the end. However, configuring a flag for dvc repro and dvc exp run generally to avoid needing to unprotect is probably fine.

@daavoo daavoo added the A: pipelines Related to the pipelines feature label Oct 20, 2021
rgferrari added a commit to rgferrari/dvc that referenced this issue Dec 1, 2021
Add the append_only flag to the pipeline to avoid unprotecting files when the persist flag is true.

Fixes iterative#6562.
@efiop
Contributor

efiop commented Dec 6, 2021

@dberenbaum Just to confirm, you meant that we would need a dvc.yaml option for each such output, right? Similar to what is implemented in #7072 by @rgferrari ?

@dberenbaum
Collaborator

Yes, that was a mistake on my part. I was thinking of a dvc.yaml option, since it seems more useful to specify the behavior permanently than to have to include the option on every reproduction of the pipeline.
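Such a per-output option might look like this in dvc.yaml (a sketch only: persist is the existing output option, while the append_only name is borrowed from the commit referencing this issue and is not an implemented DVC field; the stage name and paths are hypothetical):

```yaml
stages:
  prep_images:
    cmd: python prep_images.py manifest.json images/
    deps:
      - manifest.json
    outs:
      - images/:
          persist: true       # existing option: keep the output across runs
          append_only: true   # proposed flag: skip dvc unprotect for this output
```

Attaching the flag to the output keeps the behavior declared in one place, instead of requiring a command-line option on every dvc repro.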

@uditrana

uditrana commented May 30, 2023

Hi all, I am here to re-inject some interest in this issue (relevant discord thread).

I am working with a pretty common workflow where there is a "prep_data.py" script that creates cleaned data for training and other downstream stages.

The problem is that my scenario involves a shared cache (multiple users on the same machine with a shared data drive). Any pipeline operation (e.g. dvc repro) requires every file created by the pipeline to be unprotected, which is extremely expensive. The prep_data.py script always deletes and recreates files when it detects changes, so I am not worried about corruption (that much).

I think an option to skip unprotecting, and/or a way to programmatically control protection (via a Python SDK), would be useful.
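To make the cost concrete, here is a rough sketch of the protect/unprotect semantics being discussed (an illustration of the idea only, not DVC's actual implementation; DVC's real behavior depends on the cache.type link strategy):

```python
import stat
from pathlib import Path


def protect(path: Path) -> None:
    # Roughly what "protect" means: drop all write permissions so the
    # cached copy of the data cannot be corrupted through the workspace.
    mode = path.stat().st_mode
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def unprotect(path: Path) -> None:
    # Sketch of the expensive step this issue wants to avoid: replace the
    # protected file with a private, writable copy of its data. For large
    # outputs this full copy is what dominates dvc repro time and space.
    data = path.read_bytes()
    path.unlink()
    path.write_bytes(data)  # full copy back into the workspace
    path.chmod(path.stat().st_mode | stat.S_IWUSR)
```

With many large image files, paying this copy on every run, even when the script would never write to the existing files, is the overhead the proposed flag would remove.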

@dberenbaum dberenbaum added the p2-medium Medium priority, should be done, but less important label May 31, 2023
@dberenbaum
Collaborator

> To be honest, it's not clear why we do this extra step (unprotect) in the first place. In a lot of cases (the majority that come to mind, to be honest) we need append semantics: the script doesn't overwrite existing files. It can be very expensive to unprotect things and protect them back every time.
>
> I'm not sure about checkpoints, though, but we already use a different flag for them for other reasons?

Now that we don't have checkpoints, should we consider dropping unprotect as default behavior completely? I guess we aren't in a good position to make this kind of breaking change right now, but at least if we introduce the flag soon, we could make it the default in the next major release.


7 participants