Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

specify remote when adding file to dvc repo #9352

Open
christian-steinmeyer opened this issue Apr 20, 2023 · 19 comments
Open

specify remote when adding file to dvc repo #9352

christian-steinmeyer opened this issue Apr 20, 2023 · 19 comments
Labels
A: data-management Related to dvc add/checkout/commit/move/remove feature request Requesting a new feature

Comments

@christian-steinmeyer
Copy link

Currently, one can add a file to dvc via dvc add some.file which will create a tracking file some.file.dvc. In this tracking file, one can add remote: some_remote to the fields, to tell dvc which remote to synchronize with (cf. docs). I would like to be able to do that from the command line per some argument to dvc add. The command already has argument -r/--remote (although currently only allowed in combination with --to-remote. It'd be great if it could be used for this purpose.

What I'm suggesting in code:
dvc add test.txt --remote foo should yield a file with this content

outs:
- md5: <SOME_HASH>
  size: <SOME_SIZE>
  path: test.txt
  remote: foo
@daavoo daavoo added A: data-management Related to dvc add/checkout/commit/move/remove feature request Requesting a new feature labels Apr 20, 2023
@dberenbaum
Copy link
Collaborator

@christian-steinmeyer Do you have any interest in trying to contribute it?

@christian-steinmeyer
Copy link
Author

I've had a first look in how the arguments are parsed and where the add method of the repo is defined, but I dont know what mechanism creates the actual files and how their content is defined.

@dberenbaum
Copy link
Collaborator

I think it should work similarly to annotation fields like desc that persist unless a new value is explicitly passed, but remote should probably be separate since it's more than an annotation.

The annotation info is added here AFAIK:

out_obj.annot.update(**kwargs)

@iterative/dvc WDYT?

@daavoo
Copy link
Contributor

daavoo commented Apr 20, 2023

I think it should be better passed to:

dvc/dvc/repo/add.py

Lines 259 to 266 in 787c72e

stage = repo.stage.create(
single_stage=True,
validate=False,
fname=fname or path,
wdir=wdir,
outs=[out],
external=external,
)

And pass it all the way down to Output init class:

dvc/dvc/output.py

Lines 316 to 335 in 787c72e

def __init__( # noqa: PLR0913
self,
stage,
path,
info=None,
cache=True,
metric=False,
plot=False,
persist=False,
checkpoint=False,
desc=None,
type=None, # noqa: A002, pylint: disable=redefined-builtin
labels=None,
meta=None,
remote=None,
repo=None,
fs_config=None,
files: Optional[List[Dict[str, Any]]] = None,
push: bool = True,
):

@dberenbaum
Copy link
Collaborator

Sounds good, @daavoo. My only point about annotations was that if someone does dvc add --remote myremote data and later commits some changes with dvc add data, data.dvc should keep the remote: myremote field.

@daavoo
Copy link
Contributor

daavoo commented Apr 20, 2023

My only point about annotations was that if someone does dvc add --remote myremote data and later commits some changes with dvc add data, data.dvc should keep the remote: myremote field.

I am a little lost without actually going into the code, but shouldn't remote be also preserved in that scenario?

@daavoo
Copy link
Contributor

daavoo commented Apr 21, 2023

My only point about annotations was that if someone does dvc add --remote myremote data and later commits some changes with dvc add data, data.dvc should keep the remote: myremote field.

I am a little lost without actually going into the code, but shouldn't remote be also preserved in that scenario?

I think it is preserved and it's tested here:

dvc/tests/func/test_add.py

Lines 846 to 880 in b814531

def test_add_preserve_fields(tmp_dir, dvc):
text = textwrap.dedent(
"""\
# top comment
desc: top desc
outs:
- path: foo # out comment
desc: out desc
type: mytype
labels:
- label1
- label2
remote: testremote
meta: some metadata
"""
)
tmp_dir.gen("foo.dvc", text)
tmp_dir.dvc_gen("foo", "foo")
assert (tmp_dir / "foo.dvc").read_text() == textwrap.dedent(
"""\
# top comment
desc: top desc
outs:
- path: foo # out comment
desc: out desc
type: mytype
labels:
- label1
- label2
remote: testremote
md5: acbd18db4cc2f85cedef654fccc4a4d8
size: 3
meta: some metadata
"""
)

@kshitiz305
Copy link

Hi So shall we add this?

@christian-steinmeyer
Copy link
Author

I got as far as this 😅

Index: dvc/repo/add.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/dvc/repo/add.py b/dvc/repo/add.py
--- a/dvc/repo/add.py	(revision b814531c84071823d83194af6f91d5718fc26976)
+++ b/dvc/repo/add.py	(date 1682085738793)
@@ -72,9 +72,7 @@
             invalid_opt = "--external option"
     else:
         message = "{option} can't be used without --to-remote"
-        if kwargs.get("remote"):
-            invalid_opt = "--remote"
-        elif kwargs.get("jobs"):
+        if kwargs.get("jobs"):
             invalid_opt = "--jobs"
 
     if invalid_opt is not None:
@@ -255,6 +253,7 @@
         path, wdir, out = resolve_paths(
             repo, target, always_local=transfer and not kwargs.get("out")
         )
+        remote = kwargs.get("remote")
 
         stage = repo.stage.create(
             single_stage=True,
@@ -263,6 +262,7 @@
             wdir=wdir,
             outs=[out],
             external=external,
+            remote=remote
         )
 
         out_obj = stage.outs[0]

Not sure where it would need to go next.

@kshitiz305
Copy link

have you pushed these changes on a branch ?

@christian-steinmeyer
Copy link
Author

No, but feel free to just apply the git patch I posted

@efiop
Copy link
Contributor

efiop commented Jun 22, 2023

As noted in the original description, this might be misleading when used with --to-remote and without it. Overall -r has been a runtime option to tell us what remote to use, while adding a remote: field is more of a dvcfile modifier. I am a bit worried about overloading it (docs will be a nightmare). Maybe we need a different name here (nothing good comes to mind), or maybe a different approach completely. For example, we've thought before about introducing a command for modifying dvcfiles (similar to how one could use dvc config/remote to modify config). Considering that dvcfiles are just straight up yaml, one could just use some yaml CLI tools to achieve what is requested in the original description. E.g. one could just use yq (jq for yaml) tool or something else.

@christian-steinmeyer
Copy link
Author

It'd be great if there were some option to achieve this goal with pure dvc (i.e., without any other tool). For my personal use case, two steps (one dvc add + some other command that edits the newly created .dvc file) would work just fine.

Any idea I have to add it do dvc add would cause some kind of backward incompatibility (or documentation nightmare, as you put it). Although I do believe that if it didn't exist already, if I saw the option --remote, the behavior described in this issue is exactly, what I would expect.

I'm slightly surprised that when I use dvc add --to-remote -r some_remote, the resulting .dvc file doesn't include any info about the remote. If it did, reusing the --remote functionality for both wouldn't be that bad, I think. It would kind of do the same. In that case, I think that --remote-only might be a slightly more speaking name than --to-remote, but I understand that changing API names is never fun and easily done.

@dberenbaum
Copy link
Collaborator

I'm slightly surprised that when I use dvc add --to-remote -r some_remote, the resulting .dvc file doesn't include any info about the remote.

Agreed, and I think it has come up before in discussion. @efiop What do you think about it?

@efiop
Copy link
Contributor

efiop commented Jun 26, 2023

I agree, it makes sense with --to-remote -r some_remote to update the remote in the dvcfile. The confusion that I'm worried about is how in dvc push/pull/fetch you can specify the target remote, but here you won't be able to do so without modifying the dvc file as well. Again feels like current approach is a bit messy, where we mix up runtime options and dvc file editing options. I probably just need to think about it a bit more and digest it.

@skshetry
Copy link
Member

skshetry commented Jun 26, 2023

What do you think about a separate command?

dvc out --set-remote <remote> <out>

Or, a generic command to update dvcfiles/dvc.yaml file.

dvc meta update --remote <remote> <target>

@efiop
Copy link
Contributor

efiop commented Jun 26, 2023

@skshetry That is an option, but I don't think it is worth investing into right now. As mentioned above, it is just a yaml and one could use yq or any other tool to modify it to their liking. If this becomes a reoccuring request or we need to modify dvc files in scripts often - then yeah, some purpose-built command for modifying dvcfiles according to schema would make sense to spend time on.

@christian-steinmeyer
Copy link
Author

Slightly off-topic: Do you have usage data / case studies of how end users handle multiple remotes? We, for once, recently started a project that has some smaller data dependencies for development and a big data dependency for e.g. training a model. I.e., not everyone checking out the repo needs all the data. We "solved" this by using several dvc remotes, but as we now have to specify a remote for each file / stage or else they would get pushed to both remotes, that doesn't feel like the intended use case.

I'm asking to gain some insights, whether we are the outsiders or perhaps the norm. If the latter should be the case, perhaps it would make sense to revisit the way dvc handles multiple remotes, which I believe to be connected to @efiop's point about the push/pull/fetch commands and the target remotes. From my perspective, it would make sense to specify the remote to be used for each data (or only those exceptions from the default). One could perhaps list several in case one wants data duplication across remotes. Further, when push/pull/fetching and specifying a remote, all those data connected to this remote could be selected. In case of a specification as in dvc push <file/stage> <remote> that is conflicting the state of the <file/stage>'s configuration, the entry could be added to the file's dvc config after a confirmation (or --yes).

With our use cases in mind, this behavior would be a bit more intuitive to me. But perhaps I'm just missing other use cases.

I understand, if you find this too derailing from the original topic and too broad a change from the current behavior to address in this thread.

@dberenbaum
Copy link
Collaborator

@christian-steinmeyer It might help to know that dvc push/pull/fetch --remote were developed first and the remote: field in .dvc files was only added later, so unfortunately I think the current implementation has more to do with that history than any product or user experience motivation. IMHO the remote: field is generally the more useful approach for most circumstances I know.

There is some related discussion in #8298 in case that seems more directly relevant to your questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: data-management Related to dvc add/checkout/commit/move/remove feature request Requesting a new feature
Projects
None yet
6 participants