Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add: --to-remote needed? OR --external needed? #5445

Closed
2 tasks done
jorgeorpinel opened this issue Feb 10, 2021 · 47 comments
Closed
2 tasks done

add: --to-remote needed? OR --external needed? #5445

jorgeorpinel opened this issue Feb 10, 2021 · 47 comments
Assignees
Labels
discussion requires active participation to reach a conclusion enhancement Enhances DVC product: VSCode Integration with VSCode extension

Comments

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Feb 10, 2021

Follow up to #5198 (comment), #5301, and iterative/dvc.org#2172 (comment):

Question

add --to-remote is a bit strange because normally add doesn't move target data, rather tracks it in-place (analog to git add). But --to-remote implies that external data will be moved into the workspace at some point, which we skip for now but "pre-push" (transfer) it to remote storage (for later pull/fetch).

As of now add --to-remote has a similar result to get-url + add + push + remove, gc. So OK, maybe it's nice to have a shortcut to all that, but we already have import-url (--to-remote) to achieve the same.

The only difference vs. importing is that the data source is not recorded as a dependency in the .dvc file. So you can't update it or unfreeze+repro it. However I don't see any use cases where you would want to prevent the .dvc from having this dep, as you can simply never update or unfreeze it.

TLDR: I think import-url --to-remote is enough and what we should recommend for these situations. And add --to-remote breaks the Git analogy. Cc @dberenbaum

Improvement

  • But if we keep it, an improvement would be to NOT require the --external flag with it (cc @isidentical). This saves the user from typing a flag that is always needed, but also make sense since the data is not actually being treated as external in the sense that it won't be tracked/controlled in it's original location (requiring external cache, etc.).

@jorgeorpinel jorgeorpinel added enhancement Enhances DVC question I have a question? labels Feb 10, 2021
@jorgeorpinel jorgeorpinel changed the title add: --to-remote question or improvement add: --to-remote needed? Feb 10, 2021
@jorgeorpinel jorgeorpinel added the product: VSCode Integration with VSCode extension label Feb 10, 2021
@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 10, 2021

This probably applies to add --to-cache add --out (to-cache) too BTW. Not sure (we haven't documented that one yet).

@shcheklein
Copy link
Member

My 2cs:

  • add is more natural to find (I need to add data to DVC but it's too large to do it with a regular dvc add on this machine - etc - I'll be searching for something like "How do add large data to DVC" or something.
  • import-url has a different semantics (.dvc file is different). So, I am not sure why would keep only one, even though they are similar.
  • agreed on removing --external constraint, --to-remote should imply it. Since it's probably almost always the case that we need both.

@jorgeorpinel
Copy link
Contributor Author

add is more natural to find (I need to add data... searching for something like "How do add large data to DVC

I get the semantics part but that can probably addressed with good docs. Or maybe import could have a better name or become a flag in add (just a thought).

I still find add --to-remote unintuitive, esp. for people used to git add which never moves/copies files. Maybe if we had another way to use add that moves external data into the workspace like add --import, then this could be expressed as add --import --skip-local/--push (again rhetorical, not an actual suggestion).

Also, when you add /external/path it gives you a hint that you need --external. It could hint instead/additionally that you probably want import-url --to-remote.

.dvc file is different

When users ask, which one are we going to recommend in most cases? IMO import-url has added benefits (frozen external dependency) that even if you don't need now, can hardly get in the way, and may come handy later.

agreed on removing --external constraint, --to-remote should imply it

Cc @isidentical that's the only action point so far 😬

@dberenbaum
Copy link
Collaborator

I haven't been tracking this issue closely, so I lack some context on both the command and related commands and their intended uses, which maybe is helpful so I can provide more of an outsider perspective?

Between get, add, and import, plus get-url and import-url, it's pretty overwhelming and confusing. If we also have --external, --to-remote, and --to-cache flags for several (but not all) of these commands, as well as run --external, it makes it hard for any user to understand, so my first instinct is to minimize and consolidate the number of different commands and flags as much as we can.

Is add --external needed at all if import-url --to-cache/--to-remote exists? I see some notes above on the differences between them, so maybe that will become more obvious when I spend some time with the commands.

@dberenbaum
Copy link
Collaborator

When users ask, which one are we going to recommend in most cases? IMO import-url has added benefits (frozen external dependency) that even if you don't need now, can hardly get in the way, and may come handy later.

Agreed. Playing around with it a bit, I'm struggling to see how to meaningfully differentiate between:

  • dvc import-url --to-cache and dvc add --external (would --to-cache replace --external or is there some difference between them?)
  • dvc import-url --to-remote and dvc add [--external] --to-remote.

@isidentical
Copy link
Contributor

Cc @isidentical that's the only action point so far grimacing

I am still new to the concept of --external, though I don't see why it would be needed in case of to-cache or to-remote;

(.venv) (Python 3.10.0a4+) [ 11:16ÖS ]  [ isidentical@desktop:/tmp/test_dvc(master✗) ]
 $ dvc add /home/isidentical/t.py -o t.py
100% Add|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:00,  1.95file/s]
                                                                                                                                                                                                                                                   
To track the changes with git, run:

        git add t.py.dvc
(.venv) (Python 3.10.0a4+) [ 11:16ÖS ]  [ isidentical@desktop:/tmp/test_dvc(master✗) ]
 $ cat t.py.dvc
outs:
- md5: 239e65e4b720cc7a7f06cf129d060d53
  nfiles: 1
  path: t.py
(.venv) (Python 3.10.0a4+) [ 11:19ÖS ]  [ isidentical@desktop:/tmp/test_dvc(master✗) ]
 $ dvc add /home/isidentical/notes --to-remote -o notes
100% Add|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:00,  4.47file/s]
                                                                                                                                                                                                                                                   
To track the changes with git, run:

        git add notes.dvc
(.venv) (Python 3.10.0a4+) [ 11:19ÖS ]  [ isidentical@desktop:/tmp/test_dvc(master✗) ]
 $ dvc pull notes
A       notes                                                                                                                                                                                                                                      
1 file added and 1 file fetched                                                                                                                                                                                                                    

In case of you don't supply -o notes with --to-remote (you can't do this with to-cache, since having -o is obligatory, though you can try to set the output destination outside of repo, which you need --external only then) it will try to add /home/isidentical/notes under the control of dvc, which I thought would be the reason why added this commands in the first place (unless I am missing something, which seems probable so examples are much appreciated for me to understand this situation better).

@dberenbaum
Copy link
Collaborator

dberenbaum commented Feb 12, 2021

Examples definitely help, @isidentical!

$ dvc add /home/isidentical/notes --to-remote -o notes

Wouldn't the following be pretty similar (except for the difference in the .dvc file noted above)?

$ dvc import-url --to-remote /Users/dave/notes notes
$ cat notes.dvc
frozen: true
deps:
- path: /Users/dave/notes
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e
  nfiles: 1
  path: notes

On the other hand, I guess there's no import-url equivalent for:

$ dvc add --external --to-remote /Users/dave/notes -o /Users/dave/notes_new_location
$ cat notes_new_location.dvc
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e
  nfiles: 1
  path: /Users/dave/notes_new_location

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 13, 2021

minimize and consolidate

Agree @dberenbaum , that would be ideal. Too many ways to deal with external data IMO

Is add --external needed at all if import-url --to-cache/--to-remote exists?
still new to the concept of --external, though I don't see why it would be needed

Yeah they're different. Story: add always accepted external paths (which end up as external outputs in the .dvc file) but we started realizing over time we shouldn't encourage that use, so we started requiring the explicit --external flag as kind of a barrier (cc @isidentical). Now we also have a scary warning in https://dvc.org/doc/user-guide/managing-external-data.

add --to-remote and -o results in something completely new for add which is to move an external target to the workspace (which happens after doing that + pull).

set the output destination outside of repo, which you need --external

This is also new to me: that you can output an external file/dir to some other external location. I wonder whether there's a use case for that or if it's just a side effects of providing --out.

@shcheklein
Copy link
Member

I see that there is a lot of confusion (as usual with any data located externally) because for some reason --external got involved. My take on this - we should forget about --external completely here. They are completely different features. For some (unfortunate) reason (to my mind) we forgot to disable --external for --to-remote. But it should be disabled to keep all things related to the "external outputs and dependencies" feature separate from the regular dvc add --to-remote.

--to-remote is in fact a utility for the regular dvc add, regular dvc import-url, etc. It has nothing to my mind has to do with "external outputs and dependencies".

I get the semantics part but that can probably addressed with good docs.
IMO import-url has added benefits (frozen external dependency) that even if you don't need now, can hardly get in the way, and may come handy later.

@jorgeorpinel could you clarify this? I'm not sure I understand, to be honest, why do we compare import and add? They are two different operations to my mind. --to-remote makes sense for both of them (or neither of them if we can find an alternative).

I still find add --to-remote unintuitive

It might be not ideal, but it was the best we were able to come with. At least w/o introducing an additional top level command. We decided against it (top level command) mostly for a reason that this is expected to be an advanced utility to help people add large data files in specific cases, not an alternative to a general workflow, not a global scenario.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 14, 2021

nothing to my mind has to do with "external outputs and dependencies".

True. I'm referring to the more general concept of handling "external data" (found outside the project). But yes, other than not requiring --external with --to-cache/remote, we can forget about --external in this discussion.

IMO import-url has added benefits

could you clarify this?

Sure: my argument was that when people have a scenario for a --to-cache/remote operation, I believe I'll hardly ever have an incentive to recommend add over import-url. I expect the latter to be superior in the vast majority of cases.

I could be wrong of course, this is hypothetical. And even if I'm right the worst case scenario is that no one ends up using add --to-... which isn't really a big deal.

why do we compare import and add? --to-remote makes sense for both of them

Originally --to-remote was only going to be introduced to import-url IIRC. Then for some reason we decided to also put it in add (in the same core PR) which may be why we're still talking about them in tandem.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 14, 2021

add --to-remote and -o results in something completely new for add which is to move an external target to the workspace

On this concern (consistency) maybe if we allowed --out for any add operation then there would be no consistency issue between regular add vs. --to-... add 🙂

@isidentical
Copy link
Contributor

isidentical commented Feb 16, 2021

maybe if we allowed --out for any add operation then there would be no consistency issue

Well -o without --to-remote is basically to-cache, and already supported. (instead of transferring to remote, it transfers/saves to the local cache, just like normal add)

@isidentical
Copy link
Contributor

In #5473 I've changed the behavior to match with remote providers (s3/azure etc), so now we can use dvc add /absolute/path --to-remote directly without dvc add /absolute/path --to-remote -o path (it won't work like --external, but rather work like the second command I gave as an example). One thing to note, that PR also bans using --external, so if you do something like this dvc add /absolute/path -o /outside/repo it will fail and there won't be any way to do it (and I assume no one will actually do it).

@dberenbaum
Copy link
Collaborator

add and import-url enable files/dirs from either (1) the workspace or (2) an external location to be added to either (1) the cache or (2) a remote.

Here are the four combinations of those that need support:

  1. workspace -> cache: add myfile
  2. workspace -> remote: add myfile --to-remote
  3. external -> cache: add /external/myfile [--external] or import-url /external/myfile (also downloads to workspace)
  4. external -> remote: add /external/myfile [--external] --to-remote or import-url /external/myfile --to-remote (downloads to workspace only after dvc pull)

Scenarios 3 and 4 are why I think it's worth discussing --external, because there is overlap with import-url, and the slight differences in behavior between each (not even mentioning frozen) are not obvious to me. Questions I still have on 3 and 4:

  • Should all of the commands be supported?
  • When is --external required or even allowed?
  • Which commands are recommended?
  • How do we document?

@shcheklein
Copy link
Member

Good questions, @dberenbaum !

I think the biggest issue is that the difference between add and import-url is not clear in this scenario. The obvious one -import-url deals with an external resource. But also, it keeps that connection to the external source. .dvc file it creates is way more complicated than the regular dvc add's one. It can even act as a stage (thus frozen by default, but you can unfreeze it) that checks if the source is updated.

@isidentical
Copy link
Contributor

When is --external required or even allowed?

--external is only forbidden when you are transferring the data either into the cache or the remote storage.

Which commands are recommended?

  • dvc add s3://blah/blah --to-remote (transfers directly to the remote storage, creates blah.dvc)
  • dvc add /external/something -o something (transfers directly to cache, creates something.dvc)
  • dvc add /extneral/something --to-remote (transfers directly to the remote storage creates something.dvc)
  • If the remote location might change in the future, and you want to track its updates; dvc import-url s3://some/location/ --to-remote -r my_remote -o data, and regularly sync it with dvc update data.dvc --to-remote -r my_remote

Bad / forbidden practices:

  • dvc add /some/place -o /another/place (straight-to-cache but the output is outside of the repo)
  • dvc add /some/place --to-remote -o /another/place (straight-to-remote but the output is outside of the repo)
  • dvc add /somewhere --to-remote --external (wouldn't make sense, since the output would be in the repo ($CWD/somewhere, not /somewhere)) so it is forbidden

@isidentical
Copy link
Contributor

How do we document?

We added 2 different examples (one is merged, one is on the review) to both import-url / add docs under the "Transferring Data" sections. What might make sense is adding an dvc update --to-remote example to the import-url after its own documentation gets merged (cc: @jorgeorpinel, what do you think about this?)

@isidentical isidentical self-assigned this Feb 17, 2021
@dberenbaum
Copy link
Collaborator

dberenbaum commented Feb 17, 2021

I think the biggest issue is that the difference between add and import-url is not clear in this scenario.

Agreed.

What might make sense is adding an dvc update --to-remote example to the import-url

Great idea IMO.

Going back to the original question from @jorgeorpinel :

add --to-remote is a bit strange because normally add doesn't move target data, rather tracks it in-place (analog to git add). But --to-remote implies that external data will be moved into the workspace at some point, which we skip for now but "pre-push" (transfer) it to remote storage (for later pull/fetch).

It seems like add --to-remote keeps the target data in its external location, which further differentiates it from import-url. Try this:

$ touch test
$ mkdir repo
$ mkdir dvcremote
$ cd repo
$ git init
$ dvc init -q
$ dvc remote add -d default ../dvcremote
$ git add .
$ git commit --quiet -m "init"
$ dvc add ../test --to-remote
$ rm ../test
$ dvc pull
A       ../test
1 file added

@isidentical
Copy link
Contributor

I have to state that this behavior is going to change @dberenbaum, with the #5473 (for both import-url and add). When you run dvc add ../test --to-remote (or with -o for to-cache), it will save it like it was saving a remote location so instead of having an output pointing towards to ../test it will actually point to ./test (in the current directory). So whenever you pull, it won't touch the test in the upper directory and put it inside of your repo (probably this is what you would want in real world).

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 17, 2021

-o without --to-remote is basically to-cache
dvc add /external/something -o something (transfers directly to cache, creates something.dvc)

@isidentical So -o now implies --to-cache? But no actual output is added in the workspace. So if I add localfile -o newname I have to checkout to get ./newname? Confused haha

Bad / forbidden practices:

add ./file --to-cache is also meaningless I think, is it forbidden or does it just ignore the flag?

What might make sense is adding an dvc update --to-remote example to the import-url

Sure, it can be mentioned like in https://dvc.org/doc/command-reference/import-url#example-detecting-external-file-changes 👍

But all those Qs are sort of unrelated to the OP 🙂

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 17, 2021

Regarding the OP:

  1. workspace -> remote: add myfile --to-remote

Doesn't seem like an intended use case b/c --to-remote is meant to skip the local disk. I guess it's possible but I'd put that in the non-rec list too.

  1. external -> cache: add /external/myfile or import-url /external/myfile

Or add|import-url /external/myfile --to-cache [-o myfiletoo]

Should all of the commands be supported? // Which commands are recommended? // How do we document?

Supporting them and documenting them is doable. I think that the most helpful Q is which ones we rec, indeed.

If the remote location might change in the future, and you want to track its updates
dvc import-url s3://some/location/ --to-remote -r my_remote -o data
and regularly sync it with dvc update

@isidentical the list of recs is great! But that one I'm not sure about b/c I think external data moved into the workspace with add --to-remote can be updated from the external source too (simply repro) — will check and follow up ⌛

@isidentical
Copy link
Contributor

So -o now implies --to-cache?

There was never a --to-cache option. I guess the confusion is because we call --to-remote option to-remote so people might assume there is a separate flag for this, but no.

So if I add localfile -o newname I have to checkout to get ./newname? Confused haha

The difference with to-cache is that, it transfers it to the cache and then links the outputs from cache to workspace. So you don't have to a do a checkout implicitly.

@isidentical
Copy link
Contributor

isidentical commented Feb 17, 2021

But all those Qs are sort of unrelated to the OP

What do you mean by this?

I think external data moved into the workspace with add --to-remote can be updated from the external source too (simply repro) — will check and follow up hourglass

It would be nice if you could share a full example.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 18, 2021

.dvc file it creates is way more complicated... It can even act as a stage (thus frozen by default

As user, you can just ignore the deps: entry (since the .dvc file is frozen), or even remove it.

But that reminded me that regular .dvc files can also act as stages to some extent (@shcheklein): repro will re-add the data if it changed (rel. iterative/dvc.org/issues/2115).

However, I just tried it with external data, and repro does NOT behave that way (which may also be unexpected):

$ echo 1 > nums
$ dvc add nums  # nums is cached, nums.dvc is created
...
$ echo 2 >> nums
$ dvc repro nums.dvc  # new nums is cached, nums.dvc is updated
...
$ rm nums
$ dvc repro nums.dvc  # errors out
ERROR: failed to reproduce 'nums.dvc': missing data 'source': nums

$ dvc add /external/file -o file  # file is cached (but not linked to workspace), file.dvc is created
...
# change /external/file
$ dvc repro file.dvc  # nothing happens (since there's no ./file, but there's no error)
'file.dvc' didn't change, skipping...

$ dvc import-url /external/file -o file  # creates frozen .dvc file, transfers data
$ dvc repro file.dvc  # nothing happens unless you unfreeze first — OK
...
$ dvc update file.dvc [--to-remote]  # updates .dvc file, downloads/transfers latest data

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 18, 2021

My conclusions so far on add -o|--to-remote:

Pros

  • Natural semantics ("add this /external/file to my project")
  • Doesn't hurt to have more UI options for the user

Cons

  • I think add should never move targets (stay similar to Git)
  • I expect import-url to be technically superior in 99% of cases (delete deps: from .dvc file in the other 1%)
  • Confusing integrations, unintended side-effects (may be hard to doc and maintain)

Alternative 💡

Scratch add -o|--to-remote and introduce import*|update --sever to avoid or remove the deps: in the import .dvc file (turns it into a regular .dvc file as if added). Or maybe get* --add/--to-remote 🙂

And update the message to rec that when people try add /external/file (stop recommending --external).

@shcheklein
Copy link
Member

As user, you can just ignore the deps: entry (since the .dvc file is frozen), or even remove it.

No, you can run dvc update that is made for this. Or you can unfreeze it.

I think add should never move targets (stay similar to Git)

git add moves stuff (at least into cache). I think. Not sure also that Git is the best analogy here. We deal with data (huge) that's why certain shortcuts and utils become important.

I expect import-url to be technically superior in 99% of cases (delete deps: from .dvc file in the other 1%)

See above I think. One of the biggest points of the import-url is that dep.

Confusing integrations, unintended side-effects (may be hard to doc and maintain)

needs some clarification :)

@dberenbaum
Copy link
Collaborator

whenever you pull, it won't touch the test in the upper directory and put it inside of your repo (probably this is what you would want in real world)

Let's say users are collaborating and have data on a shared network drive or something similar that doesn't fit in their workspace/filesystem.

  1. User adds/imports data using --to-remote
  2. User runs their pipeline, creating a bunch of smaller additional dependencies and outputs that fit in the workspace
  3. User pushes to git and dvc remotes

How is another user supposed to reproduce this pipeline if their workspace can't fit the initial data? Sorry to keep getting off topic, but I want to make sure I understand the use case and what the recommended workflow would be here.

@isidentical
Copy link
Contributor

The use case seems like fit better to to-cache rather than to-remote. In such case, when the users have an external cache in a different disk and have their own workspaces in their main drives (small ssds etc), they can add it with dvc add /external/data -o data, and have dvc transfer their data to the cache directly and creating a link without actually holding it on the disk that contains the current workspace.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 18, 2021

@shcheklein re needs some clarification :) do you mean we should improve the docs? Yeah at least we should incorporate some of the knowledge in this issue to docs (explanations, recommendations, maybe examples). Please move this to dvc.org if that's the decision.

@dberenbaum to do # 2 the user needs to dvc pull first (equivalent to importing regularly and then dvc pushing). In any case there's no way other users can reproduce that pipeline if they can't dvc pull.

@isidentical That's another great rec, which makes me wonder do we need --external at all a this point...

@shcheklein
Copy link
Member

do you mean we should improve the docs?

no, I meant Confusing integrations, unintended side-effects (may be hard to doc and maintain) is not informative to my mind. This point itself can be expanded.

@shcheklein
Copy link
Member

@isidentical thanks! That's one the most powerful cases for --to-remote. I call it bootstrapping cache/remote, bootstrapping data registry. Usually users come and say something like - hey I have two disks - SSD (where my repo is located) and let's say HDD or NAS (where data is located). I don't have space enough on SSD to get the whole dataset to do dvc add, but I want to put it into DVC and then using symlinks pull it into SSD, or on a completely different machine that has enough space dvc pull.

@shcheklein
Copy link
Member

@jorgeorpinel

That's another great rec, which makes me wonder do we need --external at all a this point...

why do you have this doubt? Could you elaborate a bit please?

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 19, 2021

Confusing integrations, unintended side-effects

This point itself can be expanded.

@shcheklein by that I basically meant that the complexity of add increases significantly and I'm afraid there may be too many combinations to predict that may end up causing confused users and unexpected problems. Summary:

  • When to use get*/import* if you can add -o/--to-remote? (Users may abuse add)
  • add, then pull? Unexpected workflow (normally you add, then push like in Git)
  • add is a little overloaded? (--recursive/no-commit/glob/external)
  • What does add -o --no-commit do? What about add -R --to-remote?
  • add /ext/f --out f skips the workspace (not typical known tools like wget -o)
    but add ./f -o fcp does link ./fcp (a kind of copy)
  • Can't set --out in an external location (because combining with --external is now forbidden)
  • add ./myfile --to-remote is possible but kind of meaningless (skip cache? not very visible: may cause errors)
  • repro doesn't "reproduce" the original operation you did (transfer from some external location, now unknown).
  • Can't easily update from original source (should've used import/update but will people know in advance?).
  • People sometimes already (mistakenly) expect remote storage to reflect their workspace dir structure (like a Git remote repo). The sound of "add to remote" may reinforce that misconception.

All that said, for the record I've very much exhausted my arguments and I'm personally fine either way.

@jorgeorpinel jorgeorpinel added discussion requires active participation to reach a conclusion and removed question I have a question? labels Feb 19, 2021
@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 19, 2021

Perhaps most importantly (although unrelated to the OP):

wonder do we need --external at all

Could you elaborate a bit please?

It seems like we always discourage users from add --external nowadays. Whatever problem it originally tried to solve, may have a better solution now with to-cache/remote? I guess it can still make sense for web protocol remotes where you can't link files from an external cache to the workspace.

But external outputs are risky by nature: can mess up with other systems when targets get replaced by file links. Also can't collaborate with other DVC repos (even copies of the same project) due to overlapping outputs — incompatible with shared server scenarios.

@jorgeorpinel jorgeorpinel changed the title add: --to-remote needed? add: --to-remote needed? OR --external needed? Feb 19, 2021
@isidentical
Copy link
Contributor

When to use get*/import* if you can add -o/--to-remote? (Users may abuse add)

The simple answer is (btw we don't have a --to-remote for get-url, since we don't sync data directly, we save the data to remote in the form of cache so only add/import-url --to-remote is available) if you'd like to update your data with dvc update --to-remote, then you should use import-url. Otherwise add is enough.

add, then pull? Unexpected workflow (normally you add, then push like in Git)

If you are going to run pull right after add, then you can use import-url + push as well. The general scenario is that, you don't pull right after add, though I agree it seems a little bit unusual.

What does add -o --no-commit do? What about add -R --to-remote?

Forbidden.

dvc/dvc/repo/add.py

Lines 49 to 54 in 63f3293

if len(targets) != 1:
invalid_opt = "multiple targets"
elif no_commit:
invalid_opt = "--no-commit option"
elif recursive:
invalid_opt = "--recursive option"

add /ext/f --out f skips the workspace (not typical known tools like wget -o)

What do you mean by skips the workspace? We construct a link in the workspace pointing to your cache

Can't set --out in an external location (because combining with --external is now forbidden)

IMHO it is better in this way, though these are initial restrictions and if some use case comes up we can always remove them.

add ./myfile --to-remote is possible but kind of meaningless (skip cache? not very visible: may cause errors)

Yeah, it just automates dvc add myfile + dvc push though if it is a big dataset using --to-remote might be a bit more efficient and less memory-intensive (just assumptions).

repro doesn't "reproduce" the original operation you did (transfer from some external location, now unknown).

I don't know about repro's behavior for this case, though a full example would be great (so we can try and see).

Can't easily update from original source (should've used import/update but will people know in advance?).

We can document it I guess.

People sometimes already (mistakenly) expect remote storage to reflect their workspace dir structure (like a Git remote repo). The sound of "add to remote" may reinforce that misconception.

If we would've added this to get-url, the confusion would make sense. Though both add/import-url saves these stuff in the cache, as in the same format about how we save them into remote storage

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 20, 2021

Thanks for the detailed answers @isidentical (some of the Qs where rhetorical but still!)

Forbidden...

OK, good. But restricting -o/to-cache to a single target and no combos with any other flags reinforces my perception that we're essentially writing a different command (overloading add) — part add, part sync.

add /ext/f --out f skips the workspace (unlike tools like wget -o)

What do you mean?

The target is not linked (you don't see it with ls).

don't know about repro's behavior for this case, though a full example would be great

Example in #5445 (comment) above 🙂

@isidentical
Copy link
Contributor

The target is not linked (you don't see it with ls).

(.venv38) (Python 3.8.5+) [ 10:34ÖÖ ]  [ isidentical@desktop:/tmp/test_dvc(master✗) ]
 $ ls -altrh
total 100K
-rw-rw-r--  1 isidentical isidentical  139 Şub 22 10:32 .dvcignore
drwxrwxr-x  7 isidentical isidentical 4,0K Şub 22 10:32 .git
drwxrwxr-x  4 isidentical isidentical 4,0K Şub 22 10:34 .dvc
drwxrwxr-x  4 isidentical isidentical 4,0K Şub 22 10:34 .
drwxrwxrwt 22 root        root         80K Şub 22 10:34 ..
(.venv38) (Python 3.8.5+) [ 10:34ÖÖ ]  [ isidentical@desktop:/tmp/test_dvc(master✗) ]
 $ dvc add /tmp/foo -o foo 

100% Add|██████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:00,  4.46file/s]
                                                                                                                                    
To track the changes with git, run:

        git add foo.dvc
(.venv38) (Python 3.8.5+) [ 10:34ÖÖ ]  [ isidentical@desktop:/tmp/test_dvc(master✗) ]
 $ 
(.venv38) (Python 3.8.5+) [ 10:34ÖÖ ]  [ isidentical@desktop:/tmp/test_dvc(master✗) ]
 $ ls -altrh
total 104K
-rw-rw-r--  1 isidentical isidentical  139 Şub 22 10:32 .dvcignore
drwxrwxr-x  7 isidentical isidentical 4,0K Şub 22 10:32 .git
lrwxrwxrwx  1 isidentical isidentical   58 Şub 22 10:34 foo -> /tmp/test_dvc/.dvc/cache/1f/2051184882afa1ef8d9cb5fd3c0362
drwxrwxr-x  5 isidentical isidentical 4,0K Şub 22 10:34 .dvc
drwxrwxr-x  4 isidentical isidentical 4,0K Şub 22 10:34 .
-rw-rw-r--  1 isidentical isidentical   70 Şub 22 10:34 foo.dvc
drwxrwxrwt 22 root        root         80K Şub 22 10:34 ..

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 23, 2021

@isidentical I thought "to-cache" meant the file is moved straight to the cache and not linked locally, like with to-remote. So --out DOES link external data to the workspace... Then the only change is that the path in the .dvc file is moved to the location given to --out, right?

If so, why do we call that "straight to cache"? I don't think we need a special term for that behavior other than "out path" 🙂 Anyway, good to know (for when we doc this), thanks.

@isidentical
Copy link
Contributor

So the advantage is that, you never move the file to your workspace but move it directly to your cache (they might be in different hard drives etc), and then when we move it to your cache we just do a link (symlink preferably, if you configured in that way).

@dberenbaum
Copy link
Collaborator

After discussion, let's close this issue for now. With the 2.0 release upcoming, we will gather feedback and decide whether the existing UI makes sense based on that. This is a great issue to reference for #5392.

I want to document one last discussion point that hasn't been mentioned in the thread in case we decide to come back to it. dvc add could always require -o for external targets, regardless of whether they are going to cache or to remote. --to-remote would then be an additional flag to go straight to remote instead of cache. For example:

$ dvc add /external/file -o file # copies straight to cache and links to workspace
$ dvc add --to-remote /external/file -o file # copies straight to remote

To the user, this would give a consistent UI for adding external targets that would always make clear where those targets will be linked to in the workspace.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 24, 2021

Sounds good @dberenbaum. And I do like the idea to always specify --out (with --to-remote as an additional modifier).

Also worth mentioning we'll keep --external but plan to stop recommending it from the CLI too, right? (and docs of course — there's already issues for that). Currently a HINT to use it comes up when you try to add an external path cc @efiop

@isidentical one last Q (mostly curious):

symlink preferably, if you configured in that way

Should symlinks be preferred over reflinks for external data? (why)

Thanks

@isidentical
Copy link
Contributor

Should symlinks be preferred over reflinks for external data? (why)

CC: @efiop

@shcheklein
Copy link
Member

Should symlinks be preferred over reflinks for external data? (why)

It actually depends on where DVC cache is located.

reflinks/hardlinks are not supported across different volumes (usual case to do --to-cache and/or --to-remote, etc). If data and DVC cache are located on the same volume we can recommend to use the regular dvc add.

@jorgeorpinel
Copy link
Contributor Author

I see. So by default add /ext/path --out path will attempt to make a copy of the data in the workspace when the cache is external (if cache.type isn't configured). Maybe symlinking or not creating the local path (with a warning message) should be the default for these cases?

@dberenbaum
Copy link
Collaborator

I don't think defaults need to change since that's going to be complicated, but hinting that data could be copied to the workspace and cache and linking to https://dvc.org/doc/user-guide/large-dataset-optimization or something similar could make sense.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 24, 2021

I want to agree... But if the assumed use case for add --out external data is the "straight-to-cache" functionality, it seems to defeat the purpose to silently make a local copy (when the cache is external).

The copying may even fail due to lacking storage space and I wonder whether the rest of the process will complete/ be recoverable. Will the .dvc file still get created so you can change the linking config and dvc checkout? Will retrying skip the transfer since it's already in cache?

@isidentical
Copy link
Contributor

The copying may even fail due to lacking storage space and I wonder whether the rest of the process will complete/ be recoverable.

First we make the copy, and after everything is done we do the linking (which is actually a checkout under the hood). If the linking fails, we don't create the dvc file but the data is already in the cache.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion requires active participation to reach a conclusion enhancement Enhances DVC product: VSCode Integration with VSCode extension
Projects
None yet
Development

No branches or pull requests

4 participants