
Keep persistent outputs protected by default? #6562

Open
menegais1 opened this issue Sep 8, 2021 · 7 comments
Labels
A: pipelines (Related to the pipelines feature), feature request (Requesting a new feature), p2-medium (Medium priority, should be done, but less important), performance (improvement over resource / time consuming tasks), research

Comments

@menegais1

Due to memory concerns, some of my pipeline stages have persistent outputs that I handle in Python scripts. The stage receives a JSON file as input and outputs a folder of image files. Since there is no need to rewrite those files on every execution, my script checks whether a (read-only) file is already present in the folder and skips reprocessing it. As the dataset grows, space consumption becomes a real problem: every dvc repro has to unprotect all the files in the folder, copying them from the cache into the workspace. If an output could be marked as safe, so that it only undergoes append/remove operations, the unprotect step could be avoided, reducing space usage.
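The skip-if-present check described above might look like this in Python (a minimal sketch; the manifest format, file naming, and the render step are hypothetical stand-ins, not part of any DVC API):

```python
import json
from pathlib import Path


def render(entry: dict, target: Path) -> None:
    # Stand-in for the real image-producing step (hypothetical).
    target.write_bytes(b"image-data")


def process_images(manifest_path: str, out_dir: str) -> int:
    """Process only entries whose output file does not exist yet, so a
    re-run appends new files instead of rewriting the whole folder.
    Returns the number of files actually produced."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    entries = json.loads(Path(manifest_path).read_text())
    produced = 0
    for entry in entries:
        target = out / f"{entry['id']}.png"
        if target.exists():  # already produced in a previous run; skip
            continue
        render(entry, target)
        produced += 1
    return produced
```

On a second run over the same manifest, no file is rewritten, which is exactly why the unprotect/reprotect cycle is pure overhead here.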

@menegais1 menegais1 changed the title Flag a output as "Safe" when there is the persistent tag to avoid dvc unprotect from being run Flag an output as "Safe" when there is the persistent tag to avoid dvc unprotect from being run in pipeline stages Sep 8, 2021
@pared pared added the feature request Requesting a new feature label Sep 8, 2021
@shcheklein
Member

Probably we can avoid running unprotect here at all? Or unprotect only the directory itself, i.e. make it writable. Seems like there should be a config option for this.

@shcheklein shcheklein added the performance improvement over resource / time consuming tasks label Sep 9, 2021
@shcheklein shcheklein changed the title Flag an output as "Safe" when there is the persistent tag to avoid dvc unprotect from being run in pipeline stages Keep persistent outputs protected by default Sep 9, 2021
@shcheklein
Member

Thinking even more about this, I've renamed the ticket a bit.

To be honest, it's not clear why we do this extra step (unprotect) in the first place. In a lot of cases (the majority that come to mind, to be honest) we need append semantics: the script doesn't overwrite existing files. It can be very expensive to unprotect things and protect them back every time.

I'm not sure about checkpoints, though, but we already use a different flag for them for other reasons?

cc @dberenbaum @pmrowla

@shcheklein shcheklein changed the title Keep persistent outputs protected by default Keep persistent outputs protected by default? Sep 9, 2021
@pared pared added the research label Sep 9, 2021
@dberenbaum
Collaborator

For checkpoints, I think it will be necessary to unprotect because it's about loading state from the existing model file and writing an updated model file at the end. However, configuring a flag for dvc repro and dvc exp run generally to avoid needing to unprotect is probably fine.

@daavoo daavoo added the A: pipelines Related to the pipelines feature label Oct 20, 2021
rgferrari added a commit to rgferrari/dvc that referenced this issue Dec 1, 2021
Add the append_only flag to the pipeline to avoid unprotecting files when the persist flag is true.

Fixes iterative#6562.
@efiop
Contributor

efiop commented Dec 6, 2021

@dberenbaum Just to confirm, you meant that we would need a dvc.yaml option for each such output, right? Similar to what is implemented in #7072 by @rgferrari ?

@dberenbaum
Collaborator

Yes, that was a mistake on my part. I was thinking of a dvc.yaml option, since it seems more useful to specify the behavior permanently than to have to include the option on every reproduction of the pipeline.
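Such a per-output option might look like this in dvc.yaml (a sketch only: persist is the existing output option, while the append_only name is borrowed from the commit referencing this issue and is not an implemented DVC field; the stage name and paths are hypothetical):

```yaml
stages:
  prep_images:
    cmd: python prep_images.py manifest.json images/
    deps:
      - manifest.json
    outs:
      - images/:
          persist: true       # existing option: keep the output across runs
          append_only: true   # proposed flag: skip dvc unprotect for this output
```

Attaching the flag to the output keeps the behavior declared in one place, instead of requiring a command-line option on every dvc repro.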

@uditrana

uditrana commented May 30, 2023

Hi all, I am here to re-inject some interest in this issue (relevant discord thread).

I am working with a pretty common workflow where there is a "prep_data.py" script that creates cleaned data for training and other downstream stages.

The problem is that my scenario involves a shared cache (multiple users on the same machine with a shared data drive). Any pipeline operation (e.g. dvc repro) requires every file created by the pipeline to be unprotected, which is extremely expensive. The prep_data.py script always deletes and recreates files when it detects changes, so I am not worried about corruption (that much).

I think an option to skip unprotecting, and/or a way to programmatically control protection (via a Python SDK), would be useful.
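To make the cost concrete, here is a rough sketch of the protect/unprotect semantics being discussed (an illustration of the idea only, not DVC's actual implementation; DVC's real behavior depends on the cache.type link strategy):

```python
import stat
from pathlib import Path


def protect(path: Path) -> None:
    # Roughly what "protect" means: drop all write permissions so the
    # cached copy of the data cannot be corrupted through the workspace.
    mode = path.stat().st_mode
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def unprotect(path: Path) -> None:
    # Sketch of the expensive step this issue wants to avoid: replace the
    # protected file with a private, writable copy of its data. For large
    # outputs this full copy is what dominates dvc repro time and space.
    data = path.read_bytes()
    path.unlink()
    path.write_bytes(data)  # full copy back into the workspace
    path.chmod(path.stat().st_mode | stat.S_IWUSR)
```

With many large image files, paying this copy on every run, even when the script would never write to the existing files, is the overhead the proposed flag would remove.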

@dberenbaum dberenbaum added the p2-medium Medium priority, should be done, but less important label May 31, 2023
@dberenbaum
Collaborator

> To be honest, it's not clear why we do this extra step (unprotect) in the first place. In a lot of cases (the majority that come to mind, to be honest) we need append semantics: the script doesn't overwrite existing files. It can be very expensive to unprotect things and protect them back every time.
>
> I'm not sure about checkpoints, though, but we already use a different flag for them for other reasons?

Now that we don't have checkpoints, should we consider dropping unprotect as default behavior completely? I guess we aren't in a good position to make this kind of breaking change right now, but at least if we introduce the flag soon, we could make it the default in the next major release.


7 participants