Example for dvc add --to-remote #2172

Merged
merged 5 commits into from
Feb 28, 2021
66 changes: 50 additions & 16 deletions content/docs/command-reference/add.md
@@ -45,7 +45,8 @@ other DVC commands), a few actions are taken under the hood:
3. Attempt to replace the file with a link to the cached data (more details on
file linking further down). Skipped if `--to-remote` is used.
4. Create a corresponding `.dvc` file to track the file, using its path and hash
to identify the cached data. The `.dvc` file lists the DVC-tracked file as an
to identify the cached data (with `--to-remote`/`-o`, an external path is
moved to the workspace). The `.dvc` file lists the DVC-tracked file as an
<abbr>output</abbr> (`outs` field). Unless the `--file` option is used, the
`.dvc` file name generated by default is `<file>.dvc`, where `<file>` is the
file name of the first target.
@@ -72,20 +73,6 @@ large files. DVC also supports other link types for use on file systems without
`reflink` support, but they have to be specified manually. Refer to the
`cache.type` config option in `dvc config cache` for more information.

### Transferring data directly to remote storage

When you have a very big dataset that you want to move from some external
location to [remote storage](/doc/command-reference/remote) while avoiding
storing it locally, you can use the `--to-remote` option. This will transfer a
copy of the target data directly to a remote of your choice (or the default
one). A `.dvc` file will be created normally, but the data won't be found in
your local project until you `dvc pull` it.

This option is useful when the local system can't handle the target data, but
you still want to track and store it in remote storage, so that whenever you
switch to a different system that can handle it, you can simply pull the data
and start working on it.

### Adding entire directories

A `dvc add` target can be either a file or a directory. In the latter case, a
@@ -165,7 +152,7 @@ not.
> link above for more details.

- `--to-remote` - import an external target, but don't move it into the
workspace, nor cache it. [Transfer](#example-import-straight-to-the-remote) it
workspace, nor cache it. [Transfer it](#example-transfer-to-remote-storage)
directly to remote storage (the default one, unless `-r` is specified)
instead. Use `dvc pull` to get the data locally.

@@ -344,3 +331,50 @@ $ tree .dvc/cache

Only the hash values of the `dir/` directory (with `.dir` file extension) and
`file2` have been cached.

## Example: Transfer to remote storage

When you have a large dataset in an external location, you may want to track it
as if it was in your project, but without downloading it locally (for now). The
`--to-remote` option lets you do so, while storing a copy
[remotely](/doc/command-reference/remote) so it can be
[pulled](/doc/command-reference/pull) later. Let's initialize a DVC project,
Comment on lines +335 to +341
Contributor

@jorgeorpinel jorgeorpinel Feb 22, 2021

We're still explaining this almost exactly as in https://dvc.org/doc/command-reference/import-url#example-transfer-to-remote-storage but they're supposed to be completely different use cases. Is "you may want to track it as if it was in your project" clear and different enough?

How should we best differentiate them based on the discussions in iterative/dvc/issues/5445? Cc @shcheklein

Contributor Author

Perhaps we can defer this discussion. I am awaiting this PR to be merged so that I can work on to-cache docs.

Member

Yeah, I would write a better intro in general here:

By default, when you run import-url, DVC downloads the data into the workspace so that it can be saved to the cache, and later to remote storage. Preserving it is important, since we want to keep the project reproducible. In some situations, though, you might not have enough space on the machine where you are running import-url, but you still want this data saved to remote storage and accessible through regular commands like dvc pull (e.g. to run the pipeline on another machine that has enough space in cache, or when a large shared cache is being used, etc). In those cases, to "bootstrap" the project it's handy to use --to-remote ....

@jorgeorpinel we can take it over and rewrite it a bit, but let's not block @isidentical, and if we do this, let's try to do it asap please, or even as a separate PR

Contributor

@jorgeorpinel jorgeorpinel Feb 28, 2021

> defer this discussion. I am awaiting this PR to be merged

> let's not block @isidentical

I did approve the PR along with my comment... (#2172 (review))

This comment was marked as resolved.

Contributor

p.s. Also added checkbox to #2121 for now (will come back to that one)

Contributor

p.s. guys I finally got to rewriting these examples a bit in #2302. Feel free to check it out.

and set up a remote:

```dvc
$ mkdir example # workspace
$ cd example
$ git init
$ dvc init
$ mkdir /tmp/dvc-storage
$ dvc remote add myremote /tmp/dvc-storage
```

Now let's add `data.xml` to our remote storage from its external location:

```dvc
$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml \
--to-remote -r myremote
...
```

The only difference from a regular `dvc add` is that the dataset is transferred
straight to the remote, so DVC won't manage the external location you gave; it
keeps managing your remote storage, where the data now lives. The operation
still produces a `.dvc` file:

```dvc
$ ls
data.xml.dvc
```
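
The `.dvc` file identifies the transferred data by its hash. As a rough
illustration only (not DVC's actual implementation), the Python sketch below
hashes a file and maps the hash to a path under a storage root such as
`/tmp/dvc-storage`. It assumes a plain MD5 of the file bytes and a layout where
the first two hex characters name a subdirectory; both are assumptions that may
not hold for every DVC version. The function names `md5_of_file` and
`object_path` are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def object_path(storage_root, md5_hex):
    """Map a hash to <root>/<first 2 hex chars>/<rest> (assumed layout)."""
    return PurePosixPath(storage_root) / md5_hex[:2] / md5_hex[2:]
```

Under these assumptions, comparing `object_path("/tmp/dvc-storage", md5_of_file("data.xml"))`
with the contents of the remote directory is one way to check where a pulled
file would come from.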

Whenever anyone wants to actually download the added data (for example from a
system that can handle it), they can use `dvc pull` as usual:

```dvc
$ dvc pull data.xml.dvc -r myremote

A data.xml
1 file added and 1 file fetched
```
4 changes: 2 additions & 2 deletions content/docs/command-reference/import-url.md
@@ -361,8 +361,8 @@ Running stage 'prepare' with command:
## Example: Transfer to remote storage

When you have a large dataset in an external location, you may want to import it
to you project without downloading it to the local file system (for using it
later/elsewhere). The `--to-remote` option lets you skip the download, while
to your project without downloading it to the local file system (for using it
later/elsewhere). The `--to-remote` option lets you skip the download, while
storing the imported data [remotely](/doc/command-reference/remote). Let's
initialize a DVC project, and set up a remote:
