Keep persistent outputs protected by default? #6562
Probably we can even avoid running unprotect here at all? Or unprotect the dir, i.e. make it writable. Seems like there should be a config option for this.
Thinking even more about this, I've renamed the ticket a bit. To be honest, it's not clear why we do this extra step (unprotect) in the first place. In a lot of cases (the majority that come to my mind, to be honest) we need append semantics: the script doesn't overwrite existing files. It can be very expensive to unprotect things and protect them back every time. I'm not sure about checkpoints, though, but we also use a different flag for them for other reasons?
For checkpoints, I think it will be necessary to unprotect, because it's about loading state from the existing model file and writing an updated model file at the end. However, configuring a flag for …
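To make the checkpoint case concrete, here is a minimal sketch of the read-modify-write loop described above; the path and the init_model / train_one_round helpers are hypothetical, used only to illustrate why the output file must be writable:

```python
from pathlib import Path
import pickle

MODEL_PATH = Path("model.pkl")  # hypothetical checkpoint output

# Resume from the existing checkpoint if one is present.
if MODEL_PATH.exists():
    model = pickle.loads(MODEL_PATH.read_bytes())
else:
    model = init_model()  # hypothetical: build a fresh model

model = train_one_round(model)  # hypothetical: update the state

# Overwriting the same file at the end is what requires the output to be
# unprotected (writable) in the workspace, unlike the append-only case.
MODEL_PATH.write_bytes(pickle.dumps(model))
```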
add the flag append_only to the pipeline to avoid unprotecting the files while the persist flag is true. Fixes iterative#6562.
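For reference, a minimal sketch of what such a per-output option could look like in dvc.yaml; persist is an existing per-output flag, while append_only is only the name proposed in this PR, not an implemented option, and the stage name and paths are illustrative:

```yaml
stages:
  prep_data:
    cmd: python prep_data.py
    deps:
      - data/items.json
    outs:
      - data/images:
          persist: true       # existing flag: keep the output across repro runs
          append_only: true   # proposed flag: skip the unprotect step entirely
```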
@dberenbaum Just to confirm, you meant that we would need a dvc.yaml option for each such output, right? Similar to what is implemented in #7072 by @rgferrari?
Yes, this was a mistake by me. I was thinking of a …
Hi all, I am here to re-inject some interest in this issue (relevant Discord thread). I am working with a pretty common workflow where there is a "prep_data.py" script that creates cleaned data for training and other downstream stages. The problem is that the scenario involves a shared cache (multiple users on the same machine with a shared data drive). Any sort of pipeline operation (e.g. …). I think an option to not unprotect, and/or allowing us to programmatically control protection (via the Python SDK), would be useful.
Now that we don't have checkpoints, should we consider dropping unprotect as default behavior completely? I guess we aren't in a good position to make this kind of breaking change right now, but at least if we introduce the flag soon, we could make it the default in the next major release. |
Due to memory concerns, some of my pipeline stages have persistent outputs that I handle in Python scripts. The pipeline stage receives a JSON file as input and outputs a folder containing image files. Since there is no need to rewrite those files on every execution, I check inside the folder whether a file is already present (read-only) to avoid reprocessing it. As the dataset can get quite large, the space consumption becomes worrisome: every dvc repro needs to unprotect all the files in the folder, copying them from the cache to the workspace. If an output could be marked as safe, so that it would only undergo append/remove operations, the unprotect could be avoided, reducing the space usage.
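To make the pattern above concrete, here is a minimal sketch of such an append-only stage script; the paths and the render_image helper are hypothetical, chosen only to illustrate the skip-if-present check:

```python
import json
from pathlib import Path

IN_JSON = Path("data/items.json")   # hypothetical stage input
OUT_DIR = Path("data/images")       # persistent output directory

OUT_DIR.mkdir(parents=True, exist_ok=True)
items = json.loads(IN_JSON.read_text())

for item in items:
    out_file = OUT_DIR / f"{item['id']}.png"
    if out_file.exists():
        # Produced by a previous run; the linked copy is read-only, so skip
        # it instead of rewriting it. No unprotect would be needed for it.
        continue
    render_image(item, out_file)    # hypothetical processing function
```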