Merge pull request #464 from jorgeorpinel/master
cmd ref: doc `import`, `get`, and `get-url`; other misc. updates
shcheklein authored Jul 15, 2019
2 parents 2aff0ed + 1bdd04f commit ad20d2e
Showing 57 changed files with 585 additions and 280 deletions.
6 changes: 6 additions & 0 deletions src/Documentation/sidebar.json
Expand Up @@ -92,8 +92,11 @@
"destroy.md",
"diff.md",
"fetch.md",
"get-url.md",
"get.md",
"gc.md",
"import-url.md",
"import.md",
"init.md",
"install.md",
"lock.md",
Expand Down Expand Up @@ -135,8 +138,11 @@
"destroy.md": "destroy",
"diff.md": "diff",
"fetch.md": "fetch",
"get-url.md": "get-url",
"get.md": "get",
"gc.md": "gc",
"import-url.md": "import-url",
"import.md": "import",
"init.md": "init",
"install.md": "install",
"lock.md": "lock",
12 changes: 6 additions & 6 deletions static/docs/commands-reference/add.md
Expand Up @@ -69,12 +69,12 @@ to work with directory hierarchies with `dvc add`.
the single DVC-file points to a file in the DVC cache that contains
references to the files in the added hierarchy.

In a DVC project `dvc add` can be used to version control any data artifacts -
input, intermediate, output files and directories, as well as model files. It is
useful by itself to go back and forth between different versions of datasets or
models. Usually though, it is recommended to use `dvc run` and `dvc repro`
mechanism to version control intermediate and output artifacts (like models).
This way you bring data provenance and make your project reproducible.
In a DVC project, `dvc add` can be used to version control any <abbr>data
artifact</abbr> (input, intermediate, or output files and directories, and model
files). It is useful by itself to go back and forth between different versions
of datasets or models. Usually though, it is recommended to use the `dvc run`
and `dvc repro` mechanism to version control intermediate and final results
(like models). This way you establish data provenance and make your project
reproducible.

## Options

5 changes: 2 additions & 3 deletions static/docs/commands-reference/cache.md
Expand Up @@ -21,9 +21,8 @@ default `cache` directory.

The DVC cache is where your data files, models, etc (anything you want to
version with DVC) are actually stored. The corresponding files you see in the
working directory or "workspace" simply link to the ones in cache. (See
`dvc config cache` `type` setting for more information on file links on
different platforms.)
workspace simply link to the ones in cache. (See the `type` config option in
`dvc config cache` for more information on file links on different platforms.)

> For more cache-related configuration options refer to `dvc config cache`.
15 changes: 7 additions & 8 deletions static/docs/commands-reference/cache_dir.md
@@ -1,4 +1,4 @@
# dir
# cache dir

Set/unset the cache directory location intuitively (compared to using
`dvc config cache`).
Expand All @@ -18,7 +18,7 @@ positional arguments:

Helper to set the `cache.dir` configuration option. Unlike doing so with
`dvc config cache`, this command transforms paths (`value`) that are provided
relative to the present working directory into paths **relative to the config
relative to the current working directory into paths **relative to the config
file location**. They are required in the latter form for the config file.
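
The path rewriting described above can be illustrated with a small Python sketch. It is not DVC's implementation: the function name `to_config_relative`, the `/proj` layout, and the handling of absolute paths are assumptions made for illustration, and POSIX-style paths are assumed throughout.

```python
import posixpath


def to_config_relative(value, cwd, config_dir):
    """Re-express `value` (given relative to `cwd`) relative to `config_dir`."""
    if posixpath.isabs(value):
        return value  # assumption: absolute paths are stored unchanged
    absolute = posixpath.normpath(posixpath.join(cwd, value))
    return posixpath.relpath(absolute, start=config_dir)


# Hypothetical project at /proj, with the command run from /proj/data.
print(to_config_relative("cache", "/proj/data", "/proj/.dvc"))  # ../data/cache
```

In this sketch, a `value` of `cache` typed inside `/proj/data` ends up as `../data/cache`, which is the form that resolves correctly from the directory holding the config file.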

## Options
Expand All @@ -29,12 +29,11 @@ file location**. They are required in the latter form for the config file.
- `--system` - modify a system config file (e.g. `/etc/dvc.config`) instead of
`.dvc/config`.

- `--local` - modify a local
[config file](/doc/user-guide/dvc-files-and-directories) instead of
`.dvc/config`. It is located in `.dvc/config.local` and is Git-ignored. This
is useful when you need to specify private config options in your config that
you don't want to track and share through Git (credentials, private locations,
etc).
- `--local` - modify a local [config file](/doc/commands-reference/config)
instead of `.dvc/config`. It is located in `.dvc/config.local` and is
Git-ignored. This is useful when you need to specify private config options in
your config that you don't want to track and share through Git (credentials,
private locations, etc).

- `-u`, `--unset` - remove the `cache.dir` config option from the config file.
Don't provide a `value` when using this flag.
2 changes: 1 addition & 1 deletion static/docs/commands-reference/checkout.md
Expand Up @@ -179,7 +179,7 @@ MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33
```

What if we want to rewind history, so to speak? The `git checkout` command lets
us checkout at any point in the commit history, or even check out other tags. It
us checkout at any point in the commit history, or even checkout other tags. It
automatically adjusts the files, by replacing file content and adding or
deleting files as necessary.

12 changes: 6 additions & 6 deletions static/docs/commands-reference/commit.md
Expand Up @@ -55,12 +55,12 @@ to the DVC cache as the last step. What _commit_ means is that DVC:
- Adds the file/directory to the DVC cache.

There are many cases where the last step is not desirable (usually, rapid
iteration on some experiment). For the DVC commands where it is appropriate the
`--no-commit` option prevents the last step from occurring - thus, we are saving
some time and space, by not storing all the data artifacts for all the attempts
we do. The checksum is still computed and added to the DVC-file, but the file is
not added to the cache. That's where the `dvc commit` command comes into play.
It handles that last step of adding the file to the DVC cache.
iteration on some experiment). For the DVC commands where it is available, the
`--no-commit` option prevents the last step from occurring, which saves time
and space by not storing the <abbr>data artifacts</abbr> of every command
attempt. The checksum is still computed and added to the DVC-file, but the file
is not added to the cache. That's where the `dvc commit` command comes into
play: it handles that last step of adding the file to the DVC cache.
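
The split between computing a checksum and adding the file to the cache can be sketched in Python as follows. This is only an illustration of the idea, not DVC's code: the function names (`md5_of`, `commit`, `track`) and the cache layout used here are assumptions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_of(path):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def commit(path, checksum, cache_dir):
    """Last step: copy the file into the cache under a checksum-derived name."""
    # Hypothetical layout: <cache>/<first 2 hex chars>/<remaining chars>.
    target = Path(cache_dir) / checksum[:2] / checksum[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        shutil.copyfile(path, target)
    return target


def track(path, cache_dir, no_commit=False):
    """Compute the checksum; skip the cache copy when no_commit is set."""
    checksum = md5_of(path)
    if not no_commit:
        commit(path, checksum, cache_dir)
    return checksum


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.txt"
    data.write_bytes(b"hello")
    cache = Path(tmp) / "cache"

    checksum = track(data, cache, no_commit=True)
    print(checksum, cache.exists())  # checksum is known, cache is untouched

    commit(data, checksum, cache)  # the deferred last step
    print((cache / checksum[:2] / checksum[2:]).exists())
```

With `no_commit=True` the sketch still returns the checksum but writes nothing to the cache directory; calling `commit` afterwards performs the deferred copy.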

## Options

10 changes: 5 additions & 5 deletions static/docs/commands-reference/config.md
Expand Up @@ -19,7 +19,7 @@ You can query/set/replace/unset DVC configuration options with this command. It
takes a config option `name` (a section and a key, separated by a dot) and its
`value` (any valid alpha-numeric string generally).

This command reads and overwrites the DVC config file `.dvc/config`. If
This command reads and overwrites the DVC configuration file `.dvc/config`. If
`--local` option is specified, `.dvc/config.local` is modified instead.

If the config option `value` is not provided and `--unset` option is not used,
Expand Down Expand Up @@ -95,16 +95,16 @@ details.)
config location results in `.dvc/cache`.

> See also helper command `dvc cache dir` to intuitively set this config
> option, properly transforming paths relative to the present working
> option, properly transforming paths relative to the current working
> directory into paths relative to the config file location.
- `cache.protected` - makes files in the workspace read-only. Possible values
are `true` or `false` (default). Run `dvc checkout` for the change to go into
effect. (It affects only files that are under DVC control.)

Due to the way DVC handles linking between the data files in the cache and
their counterparts in the working directory, it's easy to accidentally corrupt
the cached version of a file by editing or overwriting it. Turning this config
their counterparts in the workspace, it's easy to accidentally corrupt the
cached version of a file by editing or overwriting it. Turning this config
option on forces you to run `dvc unprotect` before updating a file, providing
an additional layer of security to your data.

Expand Down Expand Up @@ -158,7 +158,7 @@ details.)

### state

State config options. Check the
State config options. See
[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to learn
more about the state file that is used for optimization.

2 changes: 1 addition & 1 deletion static/docs/commands-reference/diff.md
Expand Up @@ -37,7 +37,7 @@ by the Git SCM, for example when `dvc init` was used with the `--no-scm` option.

- `-t TARGET`, `--target TARGET` - Source path to a data file or directory. If
not specified, compares all files and directories that are under DVC control
in the current workspace.
in the workspace.

- `-h`, `--help` - prints the usage/help message, and exits.

2 changes: 1 addition & 1 deletion static/docs/commands-reference/gc.md
Expand Up @@ -69,7 +69,7 @@ $ du -sh .dvc/cache/
```

When you run `dvc gc` it removes all objects from cache that are not referenced
in the current workspace (by collecting hash sums from the DVC-files):
in the workspace (by collecting hash sums from the DVC-files):

```dvc
$ dvc gc
158 changes: 158 additions & 0 deletions static/docs/commands-reference/get-url.md
@@ -0,0 +1,158 @@
# get-url

Download or copy a file or directory from any supported URL (for example
`s3://`, `ssh://`, and other protocols) or local directory to the local file
system.

> Unlike `dvc import-url`, this command does not track the downloaded data
> file(s) (does not create a DVC-file).

## Synopsis

```usage
usage: dvc get-url [-h] [-q | -v] url [out]

positional arguments:
  url         (See supported URLs in the description.)
  out         Destination path to put data to.
```

## Description

In some cases it's convenient to get a data file or directory from a remote
location into the current working directory, regardless of whether it's a DVC
project. The `dvc get-url` command helps the user do just that.

The `url` argument should provide the location of the data to be downloaded,
while `out` can be used to specify the (path and) file name desired for the
downloaded data file or directory.
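
As a rough illustration of how `url` and `out` relate, the sketch below derives a destination name the way the examples on this page suggest: use `out` when it is given, otherwise keep the last component of the URL path. The helper name `default_out` is hypothetical, and DVC's internal logic may differ.

```python
import posixpath
from urllib.parse import urlparse


def default_out(url, out=None):
    """Return `out` if given, else the last path component of `url`."""
    if out is not None:
        return out
    path = urlparse(url).path.rstrip("/")
    return posixpath.basename(path)


print(default_out("s3://bucket/path"))          # path
print(default_out("gs://bucket/path", "file"))  # file
print(default_out("/local/path/to/data"))       # data
```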

Note that this command doesn't require an existing DVC project to run in. It's a
single-purpose command that can be used out of the box after installing DVC.

> See `dvc get` to download data or model files or directories from other DVC
> repositories (e.g. GitHub URLs).

DVC supports several types of (local or) remote locations (protocols):

| Type | Discussion | URL format |
| ------- | ------------------------------------------------------- | ------------------------------------------ |
| `local` | Local path | `/path/to/local/file` |
| `s3` | Amazon S3 | `s3://mybucket/data.csv` |
| `gs` | Google Storage | `gs://mybucket/data.csv` |
| `ssh` | SSH server | `ssh://user@example.com:/path/to/data.csv` |
| `hdfs` | HDFS | `hdfs://user@example.com/path/to/data.csv` |
| `http` | HTTP to file with _strong ETag_ (see explanation below) | `https://example.com/path/to/data.csv` |
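
To make the table concrete, here is a minimal Python sketch that maps a URL to one of the types listed above by looking at its scheme. The `remote_type` function and the `SCHEME_TO_TYPE` mapping are hypothetical names; the sketch only mirrors the table and is not how DVC itself resolves locations.

```python
from urllib.parse import urlparse

# Schemes taken from the table above; HTTPS is grouped under the `http` type.
SCHEME_TO_TYPE = {
    "s3": "s3",
    "gs": "gs",
    "ssh": "ssh",
    "hdfs": "hdfs",
    "http": "http",
    "https": "http",
}


def remote_type(url):
    """Classify `url` by scheme; a URL without a scheme is a local path."""
    scheme = urlparse(url).scheme
    if not scheme:
        return "local"
    if scheme not in SCHEME_TO_TYPE:
        raise ValueError(f"unsupported URL scheme: {scheme!r}")
    return SCHEME_TO_TYPE[scheme]


print(remote_type("s3://mybucket/data.csv"))                # s3
print(remote_type("/path/to/local/file"))                   # local
print(remote_type("https://example.com/path/to/data.csv"))  # http
```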

> Depending on the type of remote location you plan to download data from, you
> might need to specify one of the optional dependencies: `[s3]`, `[ssh]`,
> `[gs]`, `[azure]`, or `[oss]` (or `[all]` to include them all) when
> [installing DVC](/doc/get-started/install) with `pip`.

Another way to understand the `dvc get-url` command is as a tool for downloading
data files.

On GNU/Linux systems, for example, it's possible to use the following instead of
`dvc get-url` with HTTP(S):

```dvc
$ wget https://example.com/path/to/data.csv
```

## Options

- `-h`, `--help` - prints the usage/help message, and exits.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.

- `-v`, `--verbose` - displays detailed tracing information.

## Examples

<details>

### Click and expand for a local example

```dvc
$ dvc get-url /local/path/to/data
```

The above command will copy the `/local/path/to/data` file or directory into the
current working directory, keeping its name (`./data`).

</details>

<details>

### Click for AWS S3 example

This command will copy an S3 object into the current working directory with the
same file name:

```dvc
$ dvc get-url s3://bucket/path
```

By default, DVC expects that your AWS CLI is already
[configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html).
DVC will use the default AWS credentials file to access S3. To override some of
these settings, you could use the options described in `dvc remote modify`.

> We use the `boto3` library to communicate with AWS S3. The following API
> methods may be performed:
>
> - `head_object`
> - `download_file`
>
> So make sure you have the `s3:GetObject` permission enabled.

</details>

<details>

### Click for Google Cloud Storage example

```dvc
$ dvc get-url gs://bucket/path file
```

The above command downloads the `/path` file (or directory) into `./file`.

</details>

<details>

### Click for SSH example

```dvc
$ dvc get-url ssh://user@example.com/path/to/data
```

Using default SSH credentials, the above command gets the `data` file (or
directory).

</details>

<details>

### Click for HDFS example

```dvc
$ dvc get-url hdfs://user@example.com/path/to/data
```

</details>

<details>

### Click for HTTP example

> Both HTTP and HTTPS protocols are supported.

```dvc
$ dvc get-url https://example.com/path/to/data
```

</details>

<details>
47 changes: 47 additions & 0 deletions static/docs/commands-reference/get.md
@@ -0,0 +1,47 @@
# get

Download or copy a file or directory from another DVC repository (on a Git
server such as GitHub) into the local file system.

> Unlike `dvc import`, this command does not track the downloaded data file(s)
> (does not create a DVC-file).

## Synopsis

```usage
usage: dvc get [-h] [-q | -v] [-o [OUT]] [--rev [REV]] url path

positional arguments:
  url         URL of Git repository with DVC project to download from.
  path        Path to data within DVC repository.
```

## Description

DVC provides an easy way to reuse datasets, intermediate results, ML models, or
other files and directories tracked in another DVC repository by downloading
them into the current working directory, regardless of whether it's a DVC
project. The `dvc get` command downloads such a <abbr>data artifact</abbr>.

The `url` argument specifies the external DVC project's Git repository URL (both
HTTP and SSH protocols are supported, e.g. `[user@]server:project.git`), while
`path` specifies the path within the repo to the data to be downloaded.

Note that this command doesn't require an existing DVC project to run in. It's a
single-purpose command that can be used out of the box after installing DVC.

> See `dvc get-url` to download data from other supported URLs.

After running this command successfully, the data found at `path` in the `url`
repository is created in the current working directory with its original file
name.

## Options

- `-h`, `--help` - prints the usage/help message, and exits.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.

- `-v`, `--verbose` - displays detailed tracing information.

<!-- ## Example -->