diff --git a/content/docs/command-reference/commit.md b/content/docs/command-reference/commit.md index 97cb30c2a3..10604461dd 100644 --- a/content/docs/command-reference/commit.md +++ b/content/docs/command-reference/commit.md @@ -249,7 +249,6 @@ $ git status -s M src/train.py $ dvc status - train.dvc: changed deps: modified: src/train.py @@ -275,7 +274,6 @@ dependencies ['src/train.py'] of 'train.dvc' changed. Are you sure you commit it? [y/n] y $ dvc status - Data and pipelines are up to date. ``` diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index 7e6c843796..8a3bbd22cf 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -154,8 +154,8 @@ into our local cache. ```dvc $ dvc status --cloud ... - deleted: data/features/train.pkl - deleted: model.pkl + deleted: data/features/train.pkl + deleted: model.pkl $ dvc fetch diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md index 21e682703e..8bc4c2e8b3 100644 --- a/content/docs/command-reference/get.md +++ b/content/docs/command-reference/get.md @@ -31,20 +31,19 @@ directory. (Analogous to `wget`, but for repos.) > directories to download. The `url` argument specifies the address of the DVC or Git repository containing -the data source. Both HTTP and SSH protocols are supported for online repos -(e.g. `[user@]server:project.git`). `url` can also be a local file system path -to an "offline" repo (if it's a DVC repo without a default remote, instead of -downloading, DVC will try to copy the target data from its cache). +the data source. Both HTTP and SSH protocols are supported (e.g. +`[user@]server:project.git`). `url` can also be a local file system path. The `path` argument is used to specify the location of the target to download within the source repository at `url`. 
`path` can specify any file or directory -in the source repo, either tracked by DVC (including paths inside tracked -directories) or by Git. Note that DVC-tracked targets must be found in a -`dvc.yaml` or `.dvc` file of the repo. - -⚠️ The project should have a default -[DVC remote](/doc/command-reference/remote), containing the actual data for this -command to work. +tracked by either Git or DVC (including paths inside tracked directories). Note +that DVC-tracked targets must be found in a `dvc.yaml` or `.dvc` file of the +repo. + +⚠️ DVC repos should have a default [DVC remote](/doc/command-reference/remote) +containing the target data for this command to work. The only exception is for +local repos, where DVC will try to copy the data from its cache +first. > See `dvc get-url` to download data from other supported locations such as S3, > SSH, HTTP, etc. diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index d4b283be2d..6eca56502b 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -109,8 +109,12 @@ $ dvc run -n download_data \ wget https://data.dvc.org/get-started/data.xml -O data.xml ``` -`dvc import-url` generates an import stage `.dvc` file and `dvc run` a regular -stage (in `dvc.yaml`). +`dvc import-url` generates an import stage `.dvc` file and +`dvc run` a regular stage (in `dvc.yaml`). + +⚠️ DVC won't push or pull imported data to/from +[remote storage](/doc/command-reference/remote); it will rely on its original +source. ## Options diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index 341d4fd587..3f9d9f7a6b 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -34,21 +34,19 @@ updating the import later, if it has changed in its data source. (See > directories to import. 
The `url` argument specifies the address of the DVC or Git repository containing -the data source. Both HTTP and SSH protocols are supported for online repos -(e.g. `[user@]server:project.git`). `url` can also be a local file system path -to an "offline" repo (if it's a DVC repo without a default remote, instead of -downloading, DVC will try to copy the target data from its cache). +the data source. Both HTTP and SSH protocols are supported (e.g. +`[user@]server:project.git`). `url` can also be a local file system path. The `path` argument is used to specify the location of the target to download within the source repository at `url`. `path` can specify any file or directory -in the source repo, either tracked by DVC (including paths inside tracked -directories) or by Git. Note that DVC-tracked targets must be found in a -`dvc.yaml` or `.dvc` file of the repo. Chained imports (importing data that was -imported into the source repo at `url`) are not supported, however. +tracked by either Git or DVC (including paths inside tracked directories). Note +that DVC-tracked targets must be found in a `dvc.yaml` or `.dvc` file of the +repo. -⚠️ The project should have a default -[DVC remote](/doc/command-reference/remote), containing the actual data for this -command to work. +⚠️ DVC repos should have a default [DVC remote](/doc/command-reference/remote) +containing the target data for this command to work. The only exception is for +local repos, where DVC will try to copy the data from its cache +first. > See `dvc import-url` to download and track data from other supported locations > such as S3, SSH, HTTP, etc. @@ -66,6 +64,10 @@ path in the workspace. It records enough metadata about the imported data to enable DVC efficiently determining whether the local copy is out of date. +⚠️ DVC won't push or pull imported data to/from +[remote storage](/doc/command-reference/remote); it will rely on its original +source. 
+ To actually [version the data](/doc/tutorials/get-started/data-versioning), `git add` (and `git commit`) the import stage. @@ -74,6 +76,9 @@ Note that import stages are considered always they won't be updated. Use `dvc update` to update the downloaded data artifact from the source repo. +Also note that chained imports (importing data that was imported into the source +repo at `url`) are not supported. + ## Options - `-o `, `--out ` - specify a path to the desired location in the @@ -112,9 +117,10 @@ Importing 'data/data.xml (git@github.com:iterative/example-get-started)' ``` In contrast with `dvc get`, this command doesn't just download the data file, -but it also creates an import stage (`.dvc` file) with a link to the data source -(as explained in the description above). (This import stage can later be used to -[update](/doc/command-reference/update) the import.) Check `data.xml.dvc`: +but it also creates an import stage (`.dvc` file) with a link to +the data source (as explained in the description above). (This import stage can +later be used to [update](/doc/command-reference/update) the import.) Check +`data.xml.dvc`: ```yaml md5: 7de90e7de7b432ad972095bc1f2ec0f8 diff --git a/content/docs/command-reference/install.md b/content/docs/command-reference/install.md index 35461f2ed5..24dac7b7d1 100644 --- a/content/docs/command-reference/install.md +++ b/content/docs/command-reference/install.md @@ -247,7 +247,6 @@ M model.pkl M data/features/ $ dvc status - Data and pipelines are up to date. ``` diff --git a/content/docs/command-reference/list.md b/content/docs/command-reference/list.md index 7347303170..651c5537ff 100644 --- a/content/docs/command-reference/list.md +++ b/content/docs/command-reference/list.md @@ -21,7 +21,7 @@ DVC, by effectively replacing data files, models, directories with `.dvc` files files when you browse a DVC repository on Git hosting (e.g. GitHub), you just see the `dvc.yaml` and `.dvc` files. 
This makes it hard to navigate the project to find data artifacts for use with `dvc get`, -`dvc import`, or `dvc.api`. +`dvc import`, or `dvc.api` functions. `dvc list` prints a virtual view of a DVC repository, as if files and directories tracked by DVC were found directly in the remote Git repo. Only the @@ -36,10 +36,9 @@ $ dvc pull $ ls ``` -The `url` argument specifies the address of the Git repository containing the -data source. Both HTTP and SSH protocols are supported for online repos (e.g. -`[user@]server:project.git`). `url` can also be a local file system path to an -"offline" Git repo. +The `url` argument specifies the address of the DVC or Git repository containing +the data source. Both HTTP and SSH protocols are supported (e.g. +`[user@]server:project.git`). `url` can also be a local file system path. The optional `path` argument is used to specify a directory to list within the source repository at `url` (including paths inside tracked directories). It's diff --git a/content/docs/command-reference/metrics/diff.md b/content/docs/command-reference/metrics/diff.md index daec243ab7..dd381aecc3 100644 --- a/content/docs/command-reference/metrics/diff.md +++ b/content/docs/command-reference/metrics/diff.md @@ -41,7 +41,7 @@ lists all the current metrics without comparisons. ## Options -- `--targets ` - limit command scope to these metric files. Using -R, +- `--targets ` - limit command scope to these metric files. Using `-R`, directories to search metric files in can also be given. When specifying arguments for `--targets` before `revisions`, you should use `--` after this option's arguments, e.g.: diff --git a/content/docs/command-reference/move.md b/content/docs/command-reference/move.md index f2bbe24df5..e564b45653 100644 --- a/content/docs/command-reference/move.md +++ b/content/docs/command-reference/move.md @@ -109,7 +109,7 @@ $ dvc commit -f - `-v`, `--verbose` - displays detailed tracing information. 
-## Example: change the file name +## Example: Change the file name We first use `dvc add` to track file with DVC. Then, we change its name using `dvc move`. @@ -130,7 +130,7 @@ $ tree └── other.csv.dvc ``` -## Example: change the location +## Example: Change a file location We use `dvc add` to track a file with DVC, then we use `dvc move` to change its location. If the target path is a directory and already exists, the data file is @@ -166,7 +166,7 @@ $ tree └── foo.dvc ``` -## Example: change an imported directory name and location +## Example: Move a directory Let's try the same with an entire directory imported from an external DVC repository with `dvc import`. Note that, as in the previous cases, the diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index 1bcc61e8cc..4e5b65f640 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -192,6 +192,7 @@ such that the data in some of these stages should be updated in the ```dvc $ dvc status -c +... deleted: data/features/test.pkl deleted: data/features/train.pkl deleted: model.pkl diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index e2151ccfcd..09def79221 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -149,9 +149,10 @@ Imagine the project has been modified such that the ```dvc $ dvc status --cloud - new: data/model.p - new: data/matrix-test.p - new: data/matrix-train.p +... + new: data/model.p + new: data/matrix-test.p + new: data/matrix-train.p ``` One could do a simple `dvc push` to share all the data, but what if you only @@ -258,7 +259,6 @@ $ tree ~/vault/recursive 10 directories, 10 files $ dvc status --cloud - Data and pipelines are up to date. 
``` diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index f9a5766a0f..c45e63edf8 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -160,11 +160,11 @@ bar.dvc: modified: bar changed outs: not in cache: foo -foo.dvc +foo.dvc: changed outs: deleted: foo changed checksum -prepare.dvc +prepare.dvc: changed outs: new: bar always changed @@ -180,11 +180,11 @@ This shows that for stage `bar.dvc`, the dependency `foo` and the ```dvc $ dvc status foo.dvc dobar -foo.dvc +foo.dvc: changed outs: deleted: foo changed checksum -dobar +dobar: changed deps: modified: bar changed outs: @@ -220,7 +220,7 @@ $ dvc status model.p Data and pipelines are up to date. $ dvc status model.p --with-deps -matrix-train.p +matrix-train.p: changed deps: modified: code/featurization.py ``` @@ -243,10 +243,11 @@ remote yet: ```dvc $ dvc status --remote storage -new: data/model.p -new: data/eval.txt -new: data/matrix-train.p -new: data/matrix-test.p +... + new: data/model.p + new: data/eval.txt + new: data/matrix-train.p + new: data/matrix-test.p ``` The output shows where the location of the remote storage is, as well as any diff --git a/content/docs/user-guide/dvcignore.md b/content/docs/user-guide/dvcignore.md index cc0b0578b7..d826859e03 100644 --- a/content/docs/user-guide/dvcignore.md +++ b/content/docs/user-guide/dvcignore.md @@ -149,12 +149,10 @@ adding new file: ```dvc $ dvc status - Data and pipelines are up to date. 
$ mv data/data1 data/data3 $ dvc status - data.dvc: changed outs: modified: data diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 2a3846b616..e89283bb47 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -146,27 +146,39 @@ $ dvc run -n download_file \ -## Example: DVC remote aliases +## Example: Using DVC remote aliases -If instead of a URL you'd like to use an alias that can be managed -independently, or if the external dependency location requires access -credentials, you may use `dvc remote add` to define this location as a DVC -Remote, and then use a special URL with format `remote://{remote_name}/{path}` -to define an external dependency. +You may want to encapsulate external locations as configurable entities that can +be managed independently. This is useful if multiple dependencies (or stages) +reuse the same location, or if it's likely to change in the future. And if the +location requires authentication, you need a way to configure it in order to +connect. -For example, for an HTTPs remote/dependency: +[DVC remotes](/doc/command-reference/remote) can do just this. You may use +`dvc remote add` to define them, and then use a special URL with format +`remote://{remote_name}/{path}` (remote alias) to define the external +dependency. + +Let's see an example using SSH. First, register and configure the remote: + +```dvc +$ dvc remote add myssh ssh://myserver.com +$ dvc remote modify --local myssh user myuser +$ dvc remote modify --local myssh password mypassword +``` + +> Please refer to `dvc remote add` for more details like setting up access +> credentials for the different remote types. 
+ +Now, use an alias to this remote when defining the stage: ```dvc -$ dvc remote add example https://example.com $ dvc run -n download_file \ - -d remote://example/data.txt \ + -d remote://myssh/path/to/data.txt \ -o data.txt \ wget https://example.com/data.txt -O data.txt ``` -Please refer to `dvc remote add` for more details like setting up access -credentials for the different remotes. - ## Example: `import-url` command In the previous examples, special downloading tools were used: `scp`, @@ -205,11 +217,11 @@ determine whether the source has changed and we need to download the file again. -## Example: Using import +## Example: Imports `dvc import` can download a data artifact from any DVC -project or Git repository. It also creates an external dependency in its -import `.dvc` file. +project, or any file from a Git repository. It also creates an external +dependency in its import `.dvc` file. ```dvc $ dvc import git@github.com:iterative/example-get-started model.pkl diff --git a/content/docs/user-guide/merge-conflicts.md b/content/docs/user-guide/merge-conflicts.md index 33d77948a6..231cd3d9a2 100644 --- a/content/docs/user-guide/merge-conflicts.md +++ b/content/docs/user-guide/merge-conflicts.md @@ -103,11 +103,6 @@ To resolve conflicted `.dvc` files generated by `dvc import` or `dvc import-url`, remove the conflicted hashes altogether: ```yaml -< < < < < < < HEAD -md5: 263395583f35403c8e0b1b94b30bea32 -======= -md5: 520d2602f440d13372435d91d3bfa176 -> > > > > > > branch frozen: true deps: - path: get-started/data.xml @@ -115,15 +110,15 @@ deps: url: https://github.com/iterative/dataset-registry < < < < < < < HEAD rev_lock: f31f5c4cdae787b4bdeb97a717687d44667d9e62 -======= += = = = = = = rev_lock: 06be1104741f8a7c65449322a1fcc8c5f1070a1e ->>>>>>> branch +> > > > > > > branch outs: < < < < < < < HEAD - md5: a304afb96060aad90176268345e10355 -======= += = = = = = = - md5: 35dd1fda9cfb4b645ae431f4621fa324 -> > > > > > > +> > > > > > > branch path: data.xml ``` @@ 
-139,4 +134,8 @@ outs: - path: data.xml ``` -And then `dvc update` the `.dvc` file. +And then `dvc update` the `.dvc` file to download the latest data from its +original source. + +> Note that updating will bring in the latest version of the data found in its +> source, which may not correspond with any of the hashes that were removed. diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md index 18f86f0acb..045de098a8 100644 --- a/content/docs/user-guide/what-is-dvc.md +++ b/content/docs/user-guide/what-is-dvc.md @@ -1,6 +1,6 @@ # What Is DVC? -**Data Version Control** is a new type of data versioning, workflow and +**Data Version Control** is a new type of data versioning, workflow, and experiment management software, that builds upon [Git](https://git-scm.com/) (although it can work stand-alone). DVC reduces the gap between established engineering tool sets and data science needs, allowing users to take advantage