Turn pushing off for intermediate results #4868

Closed
brnd42 opened this issue Nov 10, 2020 · 9 comments · Fixed by iterative/dvcyaml-schema#25
Labels
A: data-sync Related to dvc get/fetch/import/pull/push feature request Requesting a new feature

Comments

@brnd42

brnd42 commented Nov 10, 2020

My original request:
Is there a way to automatically never push certain intermediate results? For example, if I have a script extract_features.py that outputs features as HDF5 files, which train.py then consumes, I would like to never push those HDF5 files but still keep them in the local cache. We don't want to back up files that can be reproduced exactly, but I also don't want to regenerate them every time.
I could also just gc the remote from time to time, but I don't really want to delete anything there :/

ruslan wrote on Discord:
Hi @brnd42! Unfortunately there is no special feature for that right now, and as I understand it you still want to cache them locally, so --outs-no-cache doesn't work for you :slight_frown: It would be a very good feature though! To implement it, we would need to introduce something like push: false in dvc.yaml for specific outputs and then just take it into consideration when collecting cache in get_used_cache. It is pretty straightforward and simple. Could you create a feature request on our GitHub, please?
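The idea in the quoted reply can be sketched roughly as follows. This is not DVC's actual code: the `Output` class and the `outputs_to_push` helper are hypothetical stand-ins, and the file names are made up. Only the `push: false` field and the notion of honouring it while collecting what to upload come from the reply above.

```python
from dataclasses import dataclass


@dataclass
class Output:
    """Hypothetical stand-in for a stage output; not DVC's real class."""
    path: str
    cache: bool = True  # stored in the local cache
    push: bool = True   # eligible for upload to the remote


def outputs_to_push(outputs):
    """Keep only outputs that are cached and not marked push: false."""
    return [out for out in outputs if out.cache and out.push]


# Assumed example: intermediate features stay local, the model is pushed.
outs = [Output("features.h5", push=False), Output("model.pkl")]
pushed_paths = [out.path for out in outputs_to_push(outs)]
print(pushed_paths)
```

In this sketch the intermediate file stays in the local cache (its `cache` flag is still true) and is only skipped at the point where the upload list is built.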

@skshetry skshetry added the feature request Requesting a new feature label Nov 10, 2020
@karajan1001
Contributor

karajan1001 commented Nov 10, 2020

Could this share a flag with #4581?

@karajan1001
Contributor

DVC does two kinds of jobs with the data: one is backup and sharing (for security), the other is local caching (for speed). Are we currently mixing them up?

@charlesbaynham
Contributor

#4581 introduces all the machinery required: the only thing missing is a flag for the user interface. That PR has already grown well beyond its initial scope, however, so maybe the best approach is to wait until it's merged and then add a flag in a small follow-up PR.

@brnd42
Author

brnd42 commented Jan 29, 2021

@charlesbaynham Any updates on this issue? As the ticket is still open, I assume the changes have not been released yet?

@charlesbaynham
Contributor

The PR is ready, but I think @efiop wanted to refactor some of the code it alters, since parts of it were getting quite complex as more options were added. @efiop, any updates?

@efiop
Contributor

efiop commented Feb 1, 2021

@charlesbaynham @brnd42 Sorry for the delay, guys. The import + dvc.yaml move is unlikely to happen before 2.0, so we'll probably get back to @charlesbaynham's PR and merge it in its current form. But --outs-no-cache might actually work for @brnd42 for now, as it still saves the results to the run-cache. @brnd42 Could you give it a try?

@brnd42
Author

brnd42 commented Feb 23, 2021

According to the documentation here: https://dvc.org/doc/command-reference/run
--outs-no-cache would lead to regenerating the data every time, which would be suboptimal, since data generation/feature extraction usually takes several hours or days in our setup. However, the --outs-persist-no-cache <path> option is probably suitable in our case, since we don't need to track the intermediate data but still want to generate it in a DVC stage.
@charlesbaynham @efiop Sorry for the late response. Do you agree that --outs-persist-no-cache <path> would work here, or do you see any edge cases in its functionality that might be obstacles for us?

@brnd42
Author

brnd42 commented Aug 10, 2021

For anyone reading this: --outs-persist-no-cache works for my use case.
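For reference, a stage declared with --outs-persist-no-cache should end up looking roughly like the fragment below in dvc.yaml. The stage name and the features.h5 path are assumed for illustration; only extract_features.py comes from this thread.

```yaml
stages:
  extract_features:            # assumed stage name
    cmd: python extract_features.py
    deps:
      - extract_features.py
    outs:
      - features.h5:           # assumed output path
          cache: false         # not stored in the DVC cache, so nothing to push
          persist: true        # not removed before the stage is reproduced
```

Because the output is not cached, DVC does not keep a copy or version it, so this approach suits data that only needs to exist in the workspace.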

@pmrowla pmrowla self-assigned this Nov 24, 2022
@pmrowla pmrowla added this to DVC Nov 24, 2022
@pmrowla pmrowla moved this to Backlog in DVC Nov 24, 2022
@pmrowla pmrowla moved this from Backlog to Todo in DVC Nov 24, 2022
@pmrowla pmrowla moved this from Todo to Done in DVC Nov 29, 2022
@pmrowla pmrowla added the A: data-sync Related to dvc get/fetch/import/pull/push label Nov 29, 2022
@pmrowla
Contributor

pmrowla commented Nov 29, 2022

This can now be accomplished by setting push: false for outputs in dvc.yaml and in .dvc files. There is currently no CLI flag for dvc run or dvc stage add to set this field; it must be added manually to the dvc.yaml or .dvc file.
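A minimal sketch of what this might look like for the pipeline described at the top of the issue. The stage names and the features.h5 and model.pkl paths are assumptions made for illustration; the script names come from the original request.

```yaml
stages:
  extract_features:            # assumed stage name
    cmd: python extract_features.py
    deps:
      - extract_features.py
    outs:
      - features.h5:           # assumed path; cached locally as usual
          push: false          # skipped when pushing to the remote
  train:                       # assumed stage name
    cmd: python train.py
    deps:
      - train.py
      - features.h5
    outs:
      - model.pkl              # assumed path; pushed as usual
```

With this layout the intermediate features remain in the local cache, so they do not have to be regenerated on every run, while only the final model is uploaded.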

observingClouds added a commit to observingClouds/SGFF_NH that referenced this issue Mar 7, 2023

7 participants