From 7f5f54312cc07370409eedf35fe150df9b3542fc Mon Sep 17 00:00:00 2001
From: Batuhan Taskaya
Date: Tue, 9 Feb 2021 16:25:03 +0300
Subject: [PATCH 1/5] Example for dvc add --to-remote

---
 content/docs/command-reference/add.md | 47 ++++++++++++++++++++++++---
 1 file changed, 43 insertions(+), 4 deletions(-)

diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md
index 0d0a527acd..8c12e4a1ab 100644
--- a/content/docs/command-reference/add.md
+++ b/content/docs/command-reference/add.md
@@ -81,10 +81,8 @@ copy of the target data directly to a remote of your choice (or the default
 one). A `.dvc` file will be created normally, but the data won't be found in
 your local project until you `dvc pull` it.
 
-This option is useful when the local system can't handle the target data, but
-you still want to track and store it in remote storage, so that whenever you
-switch to a different system that can handle it, you can simply pull the data
-and start working on it.
+(ℹ️) See the [Transfer to remote storage](#example-transfer-to-remote-storage)
+below.
 
 ### Adding entire directories
 
@@ -344,3 +342,44 @@ $ tree .dvc/cache
 
 Only the hash values of the `dir/` directory (with `.dir` file extension) and
 `file2` have been cached.
+
+# Example: Transfer to remote storage
+
+When you want to include a remote location (like some file in a s3 bucket, or a
+Google Drive directory) to your project, but don't want DVC to control the given
+remote location rather just sync the data into your remote storage,
+`--to-remote` option can be used.
+
+```dvc
+$ mkdir example # workspace
+$ cd example
+$ git init
+$ dvc init
+$ mkdir /tmp/dvc-storage
+$ dvc remote add myremote /tmp/dvc-storage
+```
+
+Now let's add the `data.xml` to our remote storage, and create a `.dvc` file.
+
+```dvc
+$ dvc add https://data.dvc.org/get-started/data.xml --to-remote \
+ -r myremote
+```
+
+The only change in our local workspace is a newly created `.dvc`
+file:
+
+```dvc
+$ ls
+data.xml.dvc
+```
+
+Whenever anyone wants to actually have the added data (for example from a system
+with much larger space), they can use `dvc pull` as usual:
+
+```dvc
+ $ dvc pull data.xml.dvc -r tmp_remote
+
+A data.xml
+1 file added and 1 file fetched
+```

From 11a5c0ce7439b6ef1aa5224501878f3c0f5397a0 Mon Sep 17 00:00:00 2001
From: Batuhan Taskaya
Date: Wed, 10 Feb 2021 15:14:29 +0300
Subject: [PATCH 2/5] move import-url example to the add

---
 content/docs/command-reference/add.md        | 29 ++++++++++++--------
 content/docs/command-reference/import-url.md |  4 +--
 2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md
index 8c12e4a1ab..3c6635d862 100644
--- a/content/docs/command-reference/add.md
+++ b/content/docs/command-reference/add.md
@@ -343,12 +343,13 @@ $ tree .dvc/cache
 Only the hash values of the `dir/` directory (with `.dir` file extension) and
 `file2` have been cached.
 
-# Example: Transfer to remote storage
+## Example: Transfer to remote storage
 
-When you want to include a remote location (like some file in a s3 bucket, or a
-Google Drive directory) to your project, but don't want DVC to control the given
-remote location rather just sync the data into your remote storage,
-`--to-remote` option can be used.
+When you have a large dataset in an external location, you may want to add it to
+your project without downloading it to the local file system (for using it
+later/elsewhere). The `--to-remote` option let you skip the download, while
+storing the imported data [remotely](/doc/command-reference/remote). Let's
+initialize a DVC project, and setup a remote:
 
 ```dvc
 $ mkdir example # workspace
@@ -359,23 +360,27 @@ $ mkdir /tmp/dvc-storage
 $ dvc remote add myremote /tmp/dvc-storage
 ```
 
-Now let's add the `data.xml` to our remote storage, and create a `.dvc` file.
+Now let's add the `data.xml` to our remote storage from the given remote
+location.
 
 ```dvc
-$ dvc add https://data.dvc.org/get-started/data.xml --to-remote \
- -r myremote
+$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml \
+ --to-remote -r myremote
+...
 ```
 
-The only change in our local workspace is a newly created `.dvc`
-file:
+The only difference that dataset is transferred straight to remote, so DVC won't
+control the remote location you gave but rather continue managing your remote
+storage where the data is now on. The operation will still be resulted with an
+`.dvc` file:
 
 ```dvc
 $ ls
 data.xml.dvc
 ```
 
-Whenever anyone wants to actually have the added data (for example from a system
-with much larger space), they can use `dvc pull` as usual:
+Whenever anyone wants to actually download the added data (for example from a
+system that can handle it), they can use `dvc pull` as usual:
 
 ```dvc
  $ dvc pull data.xml.dvc -r tmp_remote
diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md
index c00998fabb..325b58c7e9 100644
--- a/content/docs/command-reference/import-url.md
+++ b/content/docs/command-reference/import-url.md
@@ -361,8 +361,8 @@ Running stage 'prepare' with command:
 ## Example: Transfer to remote storage
 
 When you have a large dataset in an external location, you may want to import it
-to you project without downloading it to the local file system (for using it
-later/elsewhere). The `--to-remote` option lets you skip the download, while
+to your project without downloading it to the local file system (for using it
+later/elsewhere). The `--to-remote` option let you skip the download, while
 storing the imported data [remotely](/doc/command-reference/remote). Let's
 initialize a DVC project, and setup a remote:
 

From 404b04f47fed7c362c23c7e599555e85aa32efc1 Mon Sep 17 00:00:00 2001
From: Batuhan Taskaya
Date: Fri, 19 Feb 2021 12:48:00 +0300
Subject: [PATCH 3/5] mention external path becomes a local one

---
 content/docs/command-reference/add.md | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md
index 3c6635d862..103224797d 100644
--- a/content/docs/command-reference/add.md
+++ b/content/docs/command-reference/add.md
@@ -44,11 +44,12 @@ other DVC commands), a few actions are taken under the hood:
    for more details.)
 3. Attempt to replace the file with a link to the cached data (more details on
    file linking further down). Skipped if `--to-remote` is used.
-4. Create a corresponding `.dvc` file to track the file, using its path and hash
-   to identify the cached data. The `.dvc` file lists the DVC-tracked file as an
-   output (`outs` field). Unless the `--file` option is used, the
-   `.dvc` file name generated by default is `.dvc`, where `` is the
-   file name of the first target.
+4. Create a corresponding `.dvc` file to track the file, using its path (when
+   used with `--to-remote`/`-o`, the external path becomes a local path) and
+   hash to identify the cached data. The `.dvc` file lists the DVC-tracked file
+   as an output (`outs` field). Unless the `--file` option is used,
+   the `.dvc` file name generated by default is `.dvc`, where `` is
+   the file name of the first target.
 5. Add the `targets` to `.gitignore` in order to prevent them from being
    committed to the Git repository (unless `dvc init --no-scm` was used when
    initializing the DVC project).
@@ -72,18 +73,6 @@ large files. DVC also supports other link types for use on file systems without
 `reflink` support, but they have to be specified manually. Refer to the
 `cache.type` config option in `dvc config cache` for more information.
 
-### Transferring data directly to remote storage
-
-When you have a very big dataset that you want to move from some external
-location to [remote storage](/doc/command-reference/remote) while avoiding
-storing it locally, you can use the `--to-remote` option. This will transfer a
-copy of the target data directly to a remote of your choice (or the default
-one). A `.dvc` file will be created normally, but the data won't be found in
-your local project until you `dvc pull` it.
-
-(ℹ️) See the [Transfer to remote storage](#example-transfer-to-remote-storage)
-below.
-
 ### Adding entire directories
 
 A `dvc add` target can be either a file or a directory. In the latter case, a

From 32bf48c9da014f12fb7b0b8365693c1d7bc6c517 Mon Sep 17 00:00:00 2001
From: Batuhan Taskaya
Date: Fri, 19 Feb 2021 12:48:37 +0300
Subject: [PATCH 4/5] fix dead anchor

---
 content/docs/command-reference/add.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md
index 103224797d..94bfacc34b 100644
--- a/content/docs/command-reference/add.md
+++ b/content/docs/command-reference/add.md
@@ -152,7 +152,7 @@ not.
   > link above for more details.
 
 - `--to-remote` - import an external target, but don't move it into the
-  workspace, nor cache it. [Transfer](#example-import-straight-to-the-remote) it
+  workspace, nor cache it. [Transfer it](#example-transfer-to-remote-storage) it
   directly to remote storage (the default one, unless `-r` is specified)
   instead. Use `dvc pull` to get the data locally.
 
@@ -334,11 +334,12 @@ Only the hash values of the `dir/` directory (with `.dir` file extension) and
 
 ## Example: Transfer to remote storage
 
-When you have a large dataset in an external location, you may want to add it to
-your project without downloading it to the local file system (for using it
-later/elsewhere). The `--to-remote` option let you skip the download, while
-storing the imported data [remotely](/doc/command-reference/remote). Let's
-initialize a DVC project, and setup a remote:
+When you have a large dataset in an external location, you may want to track it
+as if it was in your project, but without downloading it locally (for now). The
+`--to-remote` option lets you do so, while storing a copy
+[remotely](/doc/command-reference/remote) so it can be
+[pulled](/doc/command-reference/plots) later. Let's initialize a DVC project,
+and setup a remote:
 
 ```dvc
 $ mkdir example # workspace

From 95e0058181c34ed430c6f55fb19615956828f637 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 21 Feb 2021 22:35:36 -0600
Subject: [PATCH 5/5] Update content/docs/command-reference/add.md

---
 content/docs/command-reference/add.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md
index 94bfacc34b..8ffeb47ef3 100644
--- a/content/docs/command-reference/add.md
+++ b/content/docs/command-reference/add.md
@@ -44,12 +44,12 @@ other DVC commands), a few actions are taken under the hood:
    for more details.)
 3. Attempt to replace the file with a link to the cached data (more details on
    file linking further down). Skipped if `--to-remote` is used.
-4. Create a corresponding `.dvc` file to track the file, using its path (when
-   used with `--to-remote`/`-o`, the external path becomes a local path) and
-   hash to identify the cached data. The `.dvc` file lists the DVC-tracked file
-   as an output (`outs` field). Unless the `--file` option is used,
-   the `.dvc` file name generated by default is `.dvc`, where `` is
-   the file name of the first target.
+4. Create a corresponding `.dvc` file to track the file, using its path and hash
+   to identify the cached data (with `--to-remote`/`-o`, an external path is
+   moved to the workspace). The `.dvc` file lists the DVC-tracked file as an
+   output (`outs` field). Unless the `--file` option is used, the
+   `.dvc` file name generated by default is `.dvc`, where `` is the
+   file name of the first target.
 5. Add the `targets` to `.gitignore` in order to prevent them from being
    committed to the Git repository (unless `dvc init --no-scm` was used when
    initializing the DVC project).