Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

import-url: disable push by default for cloud-versioned imports #8578

Merged
merged 4 commits into from
Nov 18, 2022

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Nov 17, 2022

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Will close #8429

  • Outs now support a push: param that defaults to true
    • If push: false is used, the given file will be dvc-tracked and cached locally, but will never be pushed to a remote on dvc push
    • The output will still be included when using dvc status -c and dvc pull (i.e. we will still try to pull it from a DVC remote when possible, and will report whether or not the file is exists in a DVC remote on status -c)
  • For import-url with cloud-versioned URLs (or explicit --version-aware flag), the output will be generated with push: false
    • If the user overrides this with push: true or by removing the push: line in the .dvc file (to use the default true value) this will be preserved on dvc update
  • For import-url without cloud-versioning, the output defaults to push: true like any regular output (preserving the current behavior for non-versioned import-url)

Also, this PR does not address #4527. DVC repo imports (dvc import) are essentially always considered push: false regardless of whether or not the user explicitly tries to set push: true in the dvc file. (But in the future using explicit push: true in the .dvc file could be used to support 4527)

@pmrowla pmrowla added A: data-sync Related to dvc get/fetch/import/pull/push A: data-management Related to dvc add/checkout/commit/move/remove A: cloud-versioning Related to cloud-versioned remotes labels Nov 17, 2022
@pmrowla pmrowla self-assigned this Nov 17, 2022
@pmrowla pmrowla requested a review from dberenbaum November 17, 2022 09:08
@pmrowla
Copy link
Contributor Author

pmrowla commented Nov 17, 2022

The output will still be included when using dvc status -c and dvc pull (i.e. we will still try to pull it from a DVC remote when possible, and will report whether or not the file is exists in a DVC remote on status -c)

@dberenbaum I'm not really sure how we want to handle this for generic outs (since it works with any dvc-tracked file and not only import-url), so I went with the simplest implementation for now. Essentially, it seems to me that "I don't want to push this file to a remote" is not necessarily the same as "this file should never be found on a remote". This could maybe open up use cases like setting push: false in local branches but keeping push enabled in main.

Also not sure if we need to add dvc import-url --enable/disable-push flags to use the non-default values for import-url? Currently, using the non-defaults requires manually editing the dvc file.

@codecov
Copy link

codecov bot commented Nov 17, 2022

Codecov Report

Base: 94.33% // Head: 94.31% // Decreases project coverage by -0.02% ⚠️

Coverage data is based on head (6a4aed3) compared to base (b36861a).
Patch coverage: 66.66% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8578      +/-   ##
==========================================
- Coverage   94.33%   94.31%   -0.03%     
==========================================
  Files         430      430              
  Lines       32847    32871      +24     
  Branches     4584     4589       +5     
==========================================
+ Hits        30986    31002      +16     
- Misses       1443     1447       +4     
- Partials      418      422       +4     
Impacted Files Coverage Ξ”
dvc/repo/imp_url.py 83.33% <0.00%> (-3.63%) ⬇️
dvc/repo/index.py 95.08% <ΓΈ> (ΓΈ)
dvc/repo/push.py 30.76% <ΓΈ> (ΓΈ)
dvc/repo/status.py 87.17% <ΓΈ> (ΓΈ)
dvc/schema.py 100.00% <ΓΈ> (ΓΈ)
dvc/stage/serialize.py 92.43% <33.33%> (-1.53%) ⬇️
dvc/output.py 89.58% <60.00%> (-0.48%) ⬇️
tests/func/test_used_objs.py 100.00% <100.00%> (ΓΈ)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

β˜” View full report at Codecov.
πŸ“’ Do you have feedback about the report comment? Let us know in this issue.

@dberenbaum
Copy link
Collaborator

This looks good!

Also, this PR does not address #4527. DVC repo imports (dvc import) are essentially always considered push: false regardless of whether or not the user explicitly tries to set push: true in the dvc file. (But in the future using explicit push: true in the .dvc file could be used to support 4527)

Makes sense for now. Is your hope to migrate dvc import to this syntax in the future? Do you see any obstacles besides having to refactor the existing dvc import?

The output will still be included when using dvc status -c and dvc pull (i.e. we will still try to pull it from a DVC remote when possible, and will report whether or not the file is exists in a DVC remote on status -c)

@dberenbaum I'm not really sure how we want to handle this for generic outs (since it works with any dvc-tracked file and not only import-url), so I went with the simplest implementation for now. Essentially, it seems to me that "I don't want to push this file to a remote" is not necessarily the same as "this file should never be found on a remote". This could maybe open up use cases like setting push: false in local branches but keeping push enabled in main.

I think we should exclude it from dvc status -c (could be in a follow-up PR). dvc status should be telling users what action they need to take, and in this case we shouldn't expect that users want to push these outputs. It makes it impossible to see a clean status if there are any push: false outputs. The branching use case is a good one, but even then I think it's better to show different results for dvc status -c in each branch. If we eventually migrate dvc import to use this syntax, we wouldn't want to break dvc status -c behavior for those outputs.

Also not sure if we need to add dvc import-url --enable/disable-push flags to use the non-default values for import-url? Currently, using the non-defaults requires manually editing the dvc file.

It's fine to leave out CLI flags for now. If we were to add them, we would need them in dvc add also.


What needs to be done to support this for stage outputs? I think that would address #4868.

@dberenbaum
Copy link
Collaborator

dberenbaum commented Nov 17, 2022

The only issue I found is that push: false is ignored when pushing to a cloud-versioned remote:

$ dvc remote add -f -d s3versioned s3://dave-sandbox-versioning/import-push
$ dvc remote modify s3versioned version_aware true
$ dvc remote modify s3versioned worktree true
$ dvc import-url --version-aware s3://dave-sandbox-versioning/example-get-started-cloud-versioning/model.pkl
$ dvc push
1 file pushed

Edit: possibly related to #8315

@dberenbaum
Copy link
Collaborator

Also fixes #7594

@pmrowla
Copy link
Contributor Author

pmrowla commented Nov 18, 2022

Makes sense for now. Is your hope to migrate dvc import to this syntax in the future? Do you see any obstacles besides having to refactor the existing dvc import?

It will need a refactor of some dvc internals. Basically, for the dvc import case, we will need to be able to look for a given file in 2 different remotes on fetch/pull. First we will have to check the local repo's remote to see if a push: true import has already been pushed to our own remote, and then we will have to fall back to the original source repo's remote(s). The way DVC is written right now is that we can only handle 1 remote per object we want to fetch.

I think we should exclude it from dvc status -c (could be in a follow-up PR). dvc status should be telling users what action they need to take, and in this case we shouldn't expect that users want to push these outputs. It makes it impossible to see a clean status if there are any push: false outputs.

I'll make this change in this PR then (it's just a 1 line thing).

The branching use case is a good one, but even then I think it's better to show different results for dvc status -c in each branch. If we eventually migrate dvc import to use this syntax, we wouldn't want to break dvc status -c behavior for those outputs.

What needs to be done to support this for stage outputs? I think that would address #4868.

It should work for stage outputs already, it just requires manually editing dvc.yaml (push: can't be set in dvc run or dvc stage add)

@dberenbaum
Copy link
Collaborator

It should work for stage outputs already, it just requires manually editing dvc.yaml (push: can't be set in dvc run or dvc stage add)

I think the schema needs to be updated for it to work.

@pmrowla pmrowla merged commit 4d6e045 into iterative:main Nov 18, 2022
@pmrowla pmrowla deleted the output-param-push branch November 18, 2022 07:32
@pmrowla
Copy link
Contributor Author

pmrowla commented Nov 18, 2022

Stage outs schema issue and status -c have been updated, worktree issues will be addressed in a follow up PR.

@dberenbaum
Copy link
Collaborator

dberenbaum commented Nov 18, 2022

@pmrowla Can you also update https://github.com/iterative/dvcyaml-schema?

Edit: After that, let's close #4868.

@jorgeorpinel
Copy link
Contributor

Hi. I noticed this isn't documented yet. Besides updating the cmd refs, should we recommend this by default when building pipelines? (Let's make a docs issue?) Thanks

Context: https://discord.com/channels/485586884165107732/1047778499999576084/1048151267190509600

@pmrowla
Copy link
Contributor Author

pmrowla commented Dec 2, 2022

@jorgeorpinel this is being documented in iterative/dvc.org#4142

@jorgeorpinel
Copy link
Contributor

Thanks. I guess I meant to comment on the push: false feature in general. Should've gone to #8581 perhaps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: cloud-versioning Related to cloud-versioned remotes A: data-management Related to dvc add/checkout/commit/move/remove A: data-sync Related to dvc get/fetch/import/pull/push
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cloud versioning: don't push cloud-versioned imports
3 participants