Add docs regarding --to-remote option for add/import-url #2091

Merged
merged 40 commits into from
Feb 9, 2021
Commits
f64f601
Initial pre-texts regarding straight to remote
isidentical Jan 12, 2021
98cf237
Add an import-url example
isidentical Jan 12, 2021
114ba8d
More mentions to --to-remote
isidentical Jan 12, 2021
cbdf546
More description regarding --to-remote
isidentical Jan 12, 2021
820cbd6
checkout => pull
isidentical Jan 12, 2021
4c83bdf
Address some reviews
isidentical Jan 14, 2021
aaa0273
Reference to the example in the docs
isidentical Jan 14, 2021
02f9ade
remove brackets
isidentical Jan 15, 2021
66c8710
-j for import-url/add
isidentical Jan 15, 2021
c11ef07
apply suggestions from jorge
isidentical Jan 18, 2021
0b79d10
Reorder parameters according to the core
isidentical Jan 18, 2021
2ea1f22
Apply a bunch more suggestions
isidentical Jan 18, 2021
3ff3d01
Update content/docs/command-reference/add.md
jorgeorpinel Jan 19, 2021
a4cbe61
Update content/docs/command-reference/add.md
jorgeorpinel Jan 23, 2021
4fb63eb
Update content/docs/command-reference/import-url.md
jorgeorpinel Jan 23, 2021
d07166d
Update content/docs/command-reference/add.md
jorgeorpinel Jan 23, 2021
c249ee6
Update content/docs/command-reference/add.md
jorgeorpinel Jan 23, 2021
b16d407
Update content/docs/command-reference/import-url.md
jorgeorpinel Jan 23, 2021
570f38c
Update content/docs/command-reference/import-url.md
jorgeorpinel Jan 23, 2021
6c8a592
Update content/docs/command-reference/import-url.md
jorgeorpinel Jan 23, 2021
5737bd2
Restyled by prettier
restyled-commits Jan 23, 2021
96d767f
proper initalization
isidentical Feb 5, 2021
133a939
suggestions
isidentical Feb 5, 2021
6c7f65a
rebase
isidentical Feb 5, 2021
194a764
Update content/docs/command-reference/import-url.md
jorgeorpinel Feb 6, 2021
0dd63c7
Update content/docs/command-reference/import-url.md
jorgeorpinel Feb 6, 2021
8e66b2b
Update content/docs/command-reference/import-url.md
jorgeorpinel Feb 7, 2021
a473848
Update content/docs/command-reference/import-url.md
jorgeorpinel Feb 7, 2021
e5b9d4e
Update content/docs/command-reference/import-url.md
jorgeorpinel Feb 7, 2021
d7ca231
Update content/docs/command-reference/import-url.md
jorgeorpinel Feb 7, 2021
65ce340
Update content/docs/command-reference/import-url.md
jorgeorpinel Feb 7, 2021
c6351f3
Update content/docs/command-reference/import-url.md
jorgeorpinel Feb 7, 2021
ee24963
Update content/docs/command-reference/import-url.md
jorgeorpinel Feb 7, 2021
89c1bb9
changes
isidentical Feb 8, 2021
f32473e
sync with master
isidentical Feb 8, 2021
1d5ef74
Update content/docs/command-reference/import-url.md
jorgeorpinel Feb 9, 2021
25b0cdf
Update content/docs/command-reference/import-url.md
jorgeorpinel Feb 9, 2021
c036a07
Update content/docs/command-reference/add.md
jorgeorpinel Feb 9, 2021
46b5164
Update content/docs/command-reference/import-url.md
jorgeorpinel Feb 9, 2021
d58af5b
Update content/docs/command-reference/import-url.md
jorgeorpinel Feb 9, 2021
36 changes: 32 additions & 4 deletions content/docs/command-reference/add.md
@@ -7,7 +7,8 @@ file.

```usage
usage: dvc add [-h] [-q | -v] [-R] [--no-commit] [--external]
[--glob] [--file <filename>] [--desc <text>]
[--glob] [--file <filename>] [-o <path>] [--to-remote]
[-r <name>] [-j <number>] [--desc <text>]
targets [targets ...]

positional arguments:
@@ -36,12 +37,13 @@ After checking that each `target` hasn't been added before (or tracked with
other DVC commands), a few actions are taken under the hood:

1. Calculate the file hash.
2. Move the file contents to the cache (by default in `.dvc/cache`), using the
file hash to form the cached file path. (See
2. Move the file contents to the cache (by default in `.dvc/cache`), or to
remote storage if `--to-remote` is given, using the file hash to form the
cached file path. (See
[Structure of cache directory](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
for more details.)
3. Attempt to replace the file with a link to the cached data (more details on
file linking further down).
file linking further down). This step is skipped if `--to-remote` is used.
4. Create a corresponding `.dvc` file to track the file, using its path and hash
to identify the cached data. The `.dvc` file lists the DVC-tracked file as an
<abbr>output</abbr> (`outs` field). Unless the `--file` option is used, the
@@ -70,6 +72,20 @@ large files. DVC also supports other link types for use on file systems without
`reflink` support, but they have to be specified manually. Refer to the
`cache.type` config option in `dvc config cache` for more information.
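
Steps 1 and 2 of the list further up (hash the file, then use the hash to form the cached file path) can be sketched roughly as follows. This is only an illustration, not DVC's actual code: it assumes an MD5 hash and a cache layout where the first two hex characters of the hash name a subdirectory, and the function names `file_md5` and `cache_path_for` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so that large files never sit fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(file_hash: str, cache_dir: Path = Path(".dvc/cache")) -> Path:
    """Assumed layout: <cache_dir>/<first 2 hex chars>/<remaining chars>."""
    return cache_dir / file_hash[:2] / file_hash[2:]
```

Under these assumptions, two files with identical contents map to the same cached path, which is why the hash (and not the file name) identifies the data.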

### Transferring data directly to remote storage

When you have a very big dataset that you want to move from some external
location to [remote storage](/doc/command-reference/remote) while avoiding
storing it locally, you can use the `--to-remote` option. This will transfer a
copy of the target data directly to a remote of your choice (or the default
one). A `.dvc` file will be created normally, but the data won't be found in
your local project until you `dvc pull` it.

This option is useful when the local system can't handle the target data, but
you still want to track and store it in remote storage, so that whenever you
switch to a different system that can handle it, you can simply pull the data
and start working on it.
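
The idea of transferring data without storing it locally can be pictured as a chunked stream from the source to the remote, hashing along the way. The sketch below is a conceptual illustration under stated assumptions, not how DVC implements `--to-remote`: the function `stream_to_remote`, the in-memory stand-ins for the external location and the remote, and the returned record are all hypothetical (the record is not the real `.dvc` file format).

```python
import hashlib
import io
from typing import BinaryIO


def stream_to_remote(source: BinaryIO, remote: BinaryIO,
                     chunk_size: int = 64 * 1024) -> dict:
    """Copy source to remote chunk by chunk, hashing along the way.

    Only one chunk is held in memory at a time, and nothing is written
    locally apart from the small metadata record returned here.
    """
    digest = hashlib.md5()
    size = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        remote.write(chunk)
        size += len(chunk)
    return {"md5": digest.hexdigest(), "size": size}


if __name__ == "__main__":
    source = io.BytesIO(b"x" * 200_000)  # stands in for the external location
    remote = io.BytesIO()                # stands in for remote storage
    print(stream_to_remote(source, remote))
```

The point of the design is that the local machine only needs enough memory for a single chunk, which matches the use case described above: tracking data that the local system cannot hold.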

### Adding entire directories

A `dvc add` target can be either a file or a directory. In the latter case, a
@@ -148,6 +164,18 @@ not.
> Note that external outputs typically require an external cache setup. See
> link above for more details.

- `--to-remote` - import an external target, but don't move it into the
workspace, nor cache it. [Transfer](#transferring-data-directly-to-remote-storage) it
directly to remote storage (the default one, unless `-r` is specified)
instead. Use `dvc pull` to get the data locally.

- `-r <name>`, `--remote <name>` - name of the
[remote storage](/doc/command-reference/remote) to transfer the external target to
(can only be used with `--to-remote`).

- `-o <path>`, `--out <path>` - destination `path` for the transferred data (can
only be used with `--to-remote`).

- `--desc <text>` - user description of the data (optional). This doesn't affect
any DVC operations.
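
The constraint stated in the options above (`-r` and `-o` can only be used with `--to-remote`) could be enforced in a wrapper script along these lines. This is a minimal sketch: the helper `build_dvc_add` is hypothetical, the bucket URL in the usage note is a placeholder, and only the flags documented above are used.

```python
from typing import List, Optional


def build_dvc_add(target: str, to_remote: bool = False,
                  remote: Optional[str] = None,
                  out: Optional[str] = None) -> List[str]:
    """Assemble a `dvc add` argument list.

    Enforces the documented rule that -r/--remote and -o/--out are only
    valid together with --to-remote.
    """
    if (remote or out) and not to_remote:
        raise ValueError("-r/--remote and -o/--out require --to-remote")
    cmd = ["dvc", "add", target]
    if to_remote:
        cmd.append("--to-remote")
    if remote:
        cmd += ["-r", remote]
    if out:
        cmd += ["-o", out]
    return cmd
```

For example, `build_dvc_add("s3://mybucket/dataset", to_remote=True, remote="myremote", out="dataset")` returns a list that could be passed to `subprocess.run(cmd, check=True)` on a machine where DVC is installed.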

8 changes: 2 additions & 6 deletions content/docs/command-reference/get-url.md
@@ -9,7 +9,7 @@ Download a file or directory from a supported URL (for example `s3://`,
## Synopsis

```usage
usage: dvc get-url [-h] [-q | -v] [-j <number>] url [out]
usage: dvc get-url [-h] [-q | -v] url [out]
@jorgeorpinel jorgeorpinel Feb 9, 2021

Oops @isidentical I'm seeing lots of -j related changes here. Maybe this got contaminated from another one of your docs branches?

This file (content/docs/command-reference/get-url.md), content/docs/command-reference/get.md, and content/docs/command-reference/import.md to be precise.


positional arguments:
url (See supported URLs in the description.)
@@ -31,7 +31,7 @@ while `out` can be used to specify the directory and/or file name desired for
the downloaded data. If an existing directory is specified, then the file or
directory will be placed inside.

DVC supports several types of (local or) remote data sources (protocols):
DVC supports several types of (local or) remote locations (protocols):

| Type | Description | `url` format example |
| --------- | ---------------------------- | --------------------------------------------- |
@@ -72,10 +72,6 @@ $ wget https://example.com/path/to/data.csv

## Options

- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
from the source. The default value is `4 * cpu_count()`. For SSH remotes, the
default is `4`. Using more jobs may speed up the operation.

- `-h`, `--help` - prints the usage/help message, and exit.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
9 changes: 1 addition & 8 deletions content/docs/command-reference/get.md
@@ -8,8 +8,7 @@ directory.
## Synopsis

```usage
usage: dvc get [-h] [-q | -v] [-o <path>] [--rev <commit>] [-j <number>]
url path
usage: dvc get [-h] [-q | -v] [-o <path>] [--rev <commit>] url path

positional arguments:
url Location of DVC or Git repository to download from
@@ -66,12 +65,6 @@ name.
download the file or directory from. The latest commit in `master` (tip of the
default branch) is used by default when this option is not specified.

- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
from the remote. The default value is `4 * cpu_count()`. For SSH remotes, the
default is `4`. Using more jobs may speed up the operation. Note that the
default value can be set in the source repo using the `jobs` config option of
`dvc remote modify`.

- `--show-url` - instead of downloading the file or directory, just print the
storage location (URL) of the target data. If `path` is a Git-tracked file,
this option is ignored.
73 changes: 67 additions & 6 deletions content/docs/command-reference/import-url.md
@@ -1,8 +1,8 @@
# import-url

Download a file or directory from a supported URL (for example `s3://`,
`ssh://`, and other protocols) into the <abbr>workspace</abbr>, and track it (an
import `.dvc` file is created).
Track a file or directory found in an external location (`s3://`, `/local/path`,
etc.), and download it to the local project, or make a copy in
[remote storage](/doc/command-reference/remote).

> See `dvc import` to download and track data/model files or directories from
> other <abbr>DVC repositories</abbr> (e.g. hosted on GitHub).
@@ -11,7 +11,8 @@ import `.dvc` file is created).

```usage
usage: dvc import-url [-h] [-q | -v] [-j <number>] [--file <filename>]
[--no-exec] [--desc <text>]
[--no-exec] [--to-remote] [-r <name>]
[--desc <text>]
url [out]

positional arguments:
@@ -22,8 +23,9 @@ positional arguments:
## Description

In some cases it's convenient to add a data file or directory from an external
location into the workspace, such that it can be updated later, if/when the
external data source changes. Example scenarios:
location into the workspace (or to
[remote storage](/doc/command-reference/remote)), such that it can be updated
later, if/when the external data source changes. Example scenarios:

- A remote system may produce occasional data files that are used in other
projects.
@@ -37,6 +39,12 @@ external data source changes. Example scenarios:
having to manually copy files from the supported locations (listed below), which
may require installing a different tool for each type.

When you don't want to store the target data on your local system, you can still
create an import `.dvc` file while transferring a file or directory directly to
remote storage, by using the `--to-remote` option. See the
[Transfer to remote storage](#example-transfer-to-remote-storage) example for
more details.

The `url` argument specifies the external location of the data to be imported.
The imported data is <abbr>cached</abbr>, and linked (or copied) to the current
working directory with its original file name e.g. `data.txt` (or to a location
@@ -131,6 +139,15 @@ $ dvc run -n download_data \
finish the operation(s)); or if the target data already exist locally and you
want to "DVCfy" this state of the project (see also `dvc commit`).

- `--to-remote` - import an external target, but don't move it into the
workspace, nor cache it. [Transfer](#example-transfer-to-remote-storage) it
directly to remote storage (the default one, unless `-r` is specified)
instead. Use `dvc pull` to get the data locally.

- `-r <name>`, `--remote <name>` - name of the
[remote storage](/doc/command-reference/remote) (can only be used with
`--to-remote`).

- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
from the source. The default value is `4 * cpu_count()`. For SSH remotes, the
default is `4`. Using more jobs may speed up the operation.
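
The default parallelism described for `-j` can be sketched as below. This is an illustration under assumptions, not DVC's code: the names `default_jobs` and `transfer_all` are hypothetical, and a thread pool is assumed as the way jobs run in parallel.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_jobs(is_ssh: bool = False) -> int:
    """Default parallelism as documented: 4 for SSH, else 4 * CPU count."""
    if is_ssh:
        return 4
    return 4 * (os.cpu_count() or 1)  # os.cpu_count() may return None


def transfer_all(items, transfer, jobs=None):
    """Run transfer(item) for every item on a pool of `jobs` threads.

    Results come back in the same order as `items`.
    """
    with ThreadPoolExecutor(max_workers=jobs or default_jobs()) as pool:
        return list(pool.map(transfer, items))
```

Since transfers are mostly I/O-bound, more threads than CPU cores can still help, which is one plausible reason for a default above the core count.
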
@@ -340,3 +357,47 @@ $ dvc repro
Running stage 'prepare' with command:
python src/prepare.py data/data.xml
```

## Example: Transfer to remote storage

Review comment on lines +361 to +365 (Contributor): And copy over the Example as well (will need some adapting). Thanks

When you have a large dataset in an external location, you may want to import it
to your project without downloading it to the local file system (for using it
later/elsewhere). The `--to-remote` option lets you skip the download, while
storing the imported data [remotely](/doc/command-reference/remote). Let's
initialize a DVC project, and set up a remote:

```dvc
$ mkdir example # workspace
$ cd example
$ git init
$ dvc init
$ mkdir /tmp/dvc-storage
$ dvc remote add myremote /tmp/dvc-storage
```

Now let's create an import `.dvc` file without downloading the target data,
transferring it directly to remote storage instead:

```dvc
$ dvc import-url https://data.dvc.org/get-started/data.xml data.xml \
--to-remote -r myremote
...
```

The only change in our local <abbr>workspace</abbr> is a newly created import
`.dvc` file:

```dvc
$ ls
data.xml.dvc
```

Whenever anyone wants to actually download the imported data (for example from a
system that can handle it), they can use `dvc pull` as usual:

```dvc
$ dvc pull data.xml.dvc -r myremote

A data.xml
1 file added and 1 file fetched
```
10 changes: 5 additions & 5 deletions content/docs/command-reference/import.md
@@ -105,11 +105,11 @@ repo at `url`) are not supported.
finish the operation(s)); or if the target data already exist locally and you
want to "DVCfy" this state of the project (see also `dvc commit`).

- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
from the remote. The default value is `4 * cpu_count()`. For SSH remotes, the
default is `4`. Using more jobs may speed up the operation. Note that the
default value can be set in the source repo using the `jobs` config option of
`dvc remote modify`.
- `-j <number>`, `--jobs <number>` - number of threads to run simultaneously to
handle the downloading of files from the remote. The default value is
`4 * cpu_count()`. For SSH remotes, the default is just `4`. Using more jobs
may improve the total download speed if a combination of small and large files
are being fetched.

- `--desc <text>` - user description of the data (optional). This doesn't affect
any DVC operations.