add: --to-remote needed? OR --external needed? #5445
This probably applies to …
My 2 cents: …
I get the semantics part, but that can probably be addressed with good docs. Or maybe … I still find … Also, when you …
When users ask, which one are we going to recommend in most cases? IMO …
Cc @isidentical, that's the only action point so far 😬
I haven't been tracking this issue closely, so I lack some context on the command, related commands, and their intended uses, which maybe is helpful so I can provide more of an outsider perspective? Between … Is …
Agreed. Playing around with it a bit, I'm struggling to see how to meaningfully differentiate between: …
I am still new to the concept of …
In case you don't supply …
Examples definitely help, @isidentical!
Wouldn't the following be pretty similar (except for the difference in the …)?

$ dvc import-url --to-remote /Users/dave/notes notes
$ cat notes.dvc
frozen: true
deps:
- path: /Users/dave/notes
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e
  nfiles: 1
  path: notes

On the other hand, I guess there's no …

$ dvc add --external --to-remote /Users/dave/notes -o /Users/dave/notes_new_location
$ cat notes_new_location.dvc
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e
  nfiles: 1
  path: /Users/dave/notes_new_location
Agreed @dberenbaum, that would be ideal. Too many ways to deal with external data, IMO.
Yeah, they're different. Story: …
This is also new to me: that you can output an external file/dir to some other external location. I wonder whether there's a use case for that or if it's just a side effect of providing …
I see that there is a lot of confusion (as usual with any data located externally) because for some reason …
@jorgeorpinel could you clarify this? I'm not sure I understand, to be honest, why we compare …
It might not be ideal, but it was the best we were able to come up with, at least without introducing an additional top-level command. We decided against that (a top-level command) mostly because this is expected to be an advanced utility to help people add large data files in specific cases, not an alternative to the general workflow, not a global scenario.
True. I'm referring to the more general concept of handling "external data" (found outside the project). But yes, other than not requiring …
Sure: my argument was that when people have a scenario for a … I could be wrong, of course; this is hypothetical. And even if I'm right, the worst-case scenario is that no one ends up using …
Originally …
On this concern (consistency), maybe if we allowed …
Well …
In #5473 I've changed the behavior to match the remote providers (s3/azure etc.), so now we can use …
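As a hedged illustration of that direction (the bucket path below is made up, not from the thread), the --to-remote flow with a cloud URL could look roughly like this:

$ dvc import-url --to-remote s3://example-bucket/raw/data.csv data.csv  # transfer the data straight to remote storage; only data.csv.dvc lands in the workspace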
Here are the four combinations of those that need support: …
Scenarios 3 and 4 are why I think it's worth discussing …
Good questions, @dberenbaum! I think the biggest issue is that the difference between …
Bad / forbidden practices: …
We added 2 different examples (one is merged, one is in review) to both …
Agreed.
Great idea IMO. Going back to the original question from @jorgeorpinel: …
It seems like …

$ touch test
$ mkdir repo
$ mkdir dvcremote
$ cd repo
$ git init
$ dvc init -q
$ dvc remote add -d default ../dvcremote
$ git add .
$ git commit --quiet -m "init"
$ dvc add ../test --to-remote
$ rm ../test
$ dvc pull
A ../test
1 file added
I have to state that this behavior is going to change, @dberenbaum, with #5473 (for both import-url and add). When you run …
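To make that concrete, here is a hedged sketch (the path is made up) of the import-url route as it's described later in this thread; because the .dvc file keeps the source as a dependency, it can be refreshed without re-adding:

$ dvc import-url --to-remote /mnt/share/data.csv data.csv  # track the external source, send the data to remote storage
$ dvc update --to-remote data.csv.dvc                      # later: re-transfer to remote if the source changed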
@isidentical So …
Sure, it can be mentioned like in https://dvc.org/doc/command-reference/import-url#example-detecting-external-file-changes 👍 But all those Qs are sort of unrelated to the OP 🙂
Regarding the OP:
Doesn't seem like an intended use case b/c …
Or …
Supporting them and documenting them is doable. I think the most helpful question is which ones we recommend, indeed.
@isidentical the list of recs is great! But that one I'm not sure about b/c I think external data moved into the workspace with …
There was never a …
The difference with to-cache is that it transfers the data to the cache and then links the outputs from the cache to the workspace, so you don't have to do a checkout yourself; it happens implicitly.
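A minimal sketch of that to-cache flow as described in this comment (the path is made up), using the -o form from this thread:

$ dvc add /mnt/external/data.bin -o data.bin  # hash the data straight into the cache
$ ls -l data.bin                              # a workspace entry is created from the cache (copy, reflink, or symlink depending on cache.type)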
What do you mean by this?
It would be nice if you could share a full example.
As a user, you can just ignore the …

But that reminded me that regular .dvc files can also act as stages to some extent (@shcheklein). However, I just tried it with external data, and …

$ echo 1 > nums
$ dvc add nums # nums is cached, nums.dvc is created
...
$ echo 2 >> nums
$ dvc repro nums.dvc # new nums is cached, nums.dvc is updated
...
$ rm nums
$ dvc repro nums.dvc # errors out
ERROR: failed to reproduce 'nums.dvc': missing data 'source': nums
$ dvc add /external/file -o file # file is cached (but not linked to workspace), file.dvc is created
...
# change /external/file
$ dvc repro file.dvc # nothing happens (since there's no ./file, but there's no error)
'file.dvc' didn't change, skipping...
$ dvc import-url /external/file -o file # creates frozen .dvc file, transfers data
$ dvc repro file.dvc # nothing happens unless you unfreeze first — OK
...
$ dvc update file.dvc [--to-remote] # updates .dvc file, downloads/transfers latest data
My conclusions so far on …

Pros: …

Cons: …

Alternative 💡: Scratch … and update the message to recommend … when people try …
No, you can run …
See above, I think. One of the biggest points of the …
… needs some clarification :)
Let's say users are collaborating and have data on a shared network drive or something similar that doesn't fit in their workspace/filesystem.
How is another user supposed to reproduce this pipeline if their workspace can't fit the initial data? Sorry to keep getting off topic, but I want to make sure I understand the use case and what the recommended workflow would be here.
The use case seems like a better fit for to-cache rather than to-remote. In that case, when users have an external cache on a different disk and their workspaces on their main drives (small SSDs etc.), they can add it with …
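A hedged sketch of that setup (the paths are hypothetical, not from the thread): keep the cache on the large drive so the small workspace only holds links.

$ dvc cache dir /mnt/bigdisk/dvc-cache     # point the project's cache at the large drive
$ dvc config cache.type symlink            # so the workspace gets links rather than copies
$ dvc add /mnt/shared/dataset -o dataset   # data is hashed into that cache; a link appears in the workspace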
@shcheklein re … @dberenbaum to do # 2 the user needs to … @isidentical That's another great rec, which makes me wonder: do we need …
No, I meant …
@isidentical thanks! That's one of the most powerful cases for …
Why do you have this doubt? Could you elaborate a bit, please?
@shcheklein by that I basically meant that the complexity of …
All that said, for the record I've very much exhausted my arguments and I'm personally fine either way.
Perhaps most importantly (although unrelated to the OP):
It seems like we always discourage users from … But external outputs are risky by nature: they can mess up other systems when targets get replaced by file links. You also can't collaborate with other DVC repos (even copies of the same project) due to overlapping outputs, which is incompatible with shared-server scenarios.
The simple answer is (btw, we don't have a --to-remote for get-url, since we don't sync data directly; we save the data to the remote in the form of cache, so only add/import-url --to-remote is available): if you'd like to update your data with …
If you are going to run pull right after add, then you can use …
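As a hedged sketch of that contrast (the path is made up, and whether --external is also required was part of the debate in this thread), --to-remote defers materializing the data until someone pulls it:

$ dvc add --to-remote /mnt/share/big.dat       # hash the data and transfer it to remote storage; nothing is kept locally
$ git add big.dat.dvc && git commit -m "track big.dat"
$ dvc pull big.dat.dvc                         # later: materialize the data from remote storage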
Forbidden (see lines 49 to 54 in 63f3293).
What do you mean by "skips the workspace"? We construct a link in the workspace pointing to your cache.
IMHO it is better this way, though these are initial restrictions, and if some use case comes up we can always remove them.
Yeah, it just automates …
I don't know about repro's behavior for this case, though a full example would be great (so we can try and see).
We can document it I guess.
If we would've added this to …
Thanks for the detailed answers @isidentical (some of the questions were rhetorical, but still!)
OK, good. But restricting -o/to-cache to a single target and no combos with any other flags reinforces my perception that we're essentially writing a different command (overloading add) — part add, part sync.
The target is not linked (you don't see it with …)
Example in #5445 (comment) above 🙂
@isidentical I thought "to-cache" meant the file is moved straight to the cache and not linked locally, like with to-remote. So … If so, why do we call that "straight to cache"? I don't think we need a special term for that behavior other than "out path" 🙂 Anyway, good to know (for when we doc this), thanks.
So the advantage is that you never move the file to your workspace but move it directly to your cache (they might be on different hard drives etc.), and then once it's in your cache we just create a link (a symlink, preferably, if you configured it that way).
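As a hedged note on "if you configured it that way": cache.type takes an ordered preference list, and existing outputs can be re-linked after changing it.

$ dvc config cache.type "symlink,copy"   # prefer symlinks, fall back to copying
$ dvc checkout --relink                  # re-link already checked-out outputs using the new setting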
After discussion, let's close this issue for now. With the 2.0 release upcoming, we will gather feedback and decide whether the existing UI makes sense based on that. This is a great issue to reference for #5392.

I want to document one last discussion point that hasn't been mentioned in the thread, in case we decide to come back to it:

$ dvc add /external/file -o file              # copies straight to cache and links to workspace
$ dvc add --to-remote /external/file -o file  # copies straight to remote

To the user, this would give a consistent UI for adding external targets that would always make clear where those targets will be linked to in the workspace.
Sounds good @dberenbaum. And I do like the idea to always specify …

Also worth mentioning we'll keep …

@isidentical one last Q (mostly curious): should symlinks be preferred over reflinks for external data? (Why?) Thanks
CC: @efiop
It actually depends on where the DVC cache is located.
I see. So by default …
I don't think defaults need to change, since that's going to be complicated, but hinting that data could be copied to the workspace and cache, and linking to https://dvc.org/doc/user-guide/large-dataset-optimization or something similar, could make sense.
I want to agree... But if the assumed use case for … The copying may even fail due to lack of storage space, and I wonder whether the rest of the process will complete / be recoverable. Will the .dvc file still get created so you can change the linking config and …
First we make the copy, and after everything is done we do the linking (which is actually a checkout under the hood). If the linking fails, we don't create the .dvc file, but the data is already in the cache.
Question

add --to-remote is a bit strange because normally add doesn't move target data, rather it tracks it in place (analogous to git add). But --to-remote implies that external data will be moved into the workspace at some point, which we skip for now but "pre-push" (transfer) it to remote storage (for a later pull/fetch).

As of now, add --to-remote has a similar result to get-url + add + push + remove, gc. So OK, maybe it's nice to have a shortcut to all that, but we already have import-url (--to-remote) to achieve the same.

The only difference vs. importing is that the data source is not recorded as a dependency in the .dvc file. So you can't update it or unfreeze + repro it. However, I don't see any use cases where you would want to prevent the .dvc file from having this dep, as you can simply never update or unfreeze it.

TLDR: I think import-url --to-remote is enough and what we should recommend for these situations. And add --to-remote breaks the Git analogy. Cc @dberenbaum

Improvement

Don't require the --external flag with it (cc @isidentical). This saves the user from typing a flag that is always needed, but also makes sense since the data is not actually being treated as external, in the sense that it won't be tracked/controlled in its original location (requiring an external cache, etc.).
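For illustration only (the path is made up, not from the thread), the first steps of the longer chain mentioned above might look roughly like this; the remove/gc cleanup at the end is elided since its exact form depends on the setup:

$ dvc get-url /mnt/share/big.dat big.dat   # copy the external data into the workspace
$ dvc add big.dat                          # track it: hash into the local cache, write big.dat.dvc
$ dvc push big.dat.dvc                     # upload the cached copy to remote storage
# (then remove/gc the local copies, which add --to-remote avoids creating in the first place)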