diff --git a/static/docs/command-reference/add.md b/static/docs/command-reference/add.md index 434bf85969..0b55806f35 100644 --- a/static/docs/command-reference/add.md +++ b/static/docs/command-reference/add.md @@ -216,6 +216,12 @@ outs: wdir: . ``` +The cache file with `.dir` extension is a special text file that stores the +mapping of files in the `pics/` directory (as a JSON array), along with their +checksums. (Refer to +[Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) +for an example.) + If instead you use the `--recursive` option, the output looks as so: ```dvc diff --git a/static/docs/command-reference/remote/index.md b/static/docs/command-reference/remote/index.md index cb3f38df2f..5afd084928 100644 --- a/static/docs/command-reference/remote/index.md +++ b/static/docs/command-reference/remote/index.md @@ -67,9 +67,7 @@ For the typical process to share the project via remote, see - `-v`, `--verbose` - displays detailed tracing information. -## Examples - -1. Let's for simplicity add a default local remote: +## Example: Add a default local remote
@@ -85,11 +83,10 @@ project/repository itself.
 
 ```dvc
 $ dvc remote add -d myremote /path/to/remote
 $ dvc remote list
-
 myremote /path/to/remote
 ```
 
-The project's config file would look like:
+The project's config file should now look like this:
 
 ```ini
 ['remote "myremote"']
@@ -98,10 +95,9 @@ url = /path/to/remote
 remote = myremote
 ```
 
-2. Add Amazon S3 remote and modify its region:
+## Example: Add Amazon S3 remote and modify its region
 
-> **Note!** Before adding a new remote be sure to login into AWS and follow
-> instructions at
+> **Note!** Before adding a new remote be sure to follow the instructions at
 > [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html).
 
 ```dvc
@@ -109,7 +105,7 @@ $ dvc remote add mynewremote s3://mybucket/myproject
 $ dvc remote modify mynewremote region us-east-2
 ```
 
-3. Remove remote:
+## Example: Remove a remote
 
 ```dvc
 $ dvc remote remove mynewremote
diff --git a/static/docs/tutorials/deep/define-ml-pipeline.md b/static/docs/tutorials/deep/define-ml-pipeline.md
index fe3a98af16..dd09ebad97 100644
--- a/static/docs/tutorials/deep/define-ml-pipeline.md
+++ b/static/docs/tutorials/deep/define-ml-pipeline.md
@@ -138,28 +138,28 @@ train ML models out of the data files.
 
 DVC helps you to define [stages](/doc/command-reference/run) of your ML process
 and easily connect them into a ML [pipeline](/doc/command-reference/pipeline).
 
-`dvc run` executes any command that you pass into it as a list of parameters.
+`dvc run` executes any command that you pass it as a list of parameters.
 However, the command to run alone is not as interesting as its role within a
-pipeline, so we'll need to specify its dependencies and output files. We call
-this a pipeline stage. Dependencies may include input files and directories, and
-the actual command to run. Outputs are files written to by the command, if any.
+larger data pipeline, so we'll need to specify its dependencies and output
+files. We call all of this a pipeline _stage_. 
Dependencies may include input files
+or directories, and the actual command to run. Outputs are files written to by
+the command, if any.
 
-1. Option `-d file.tsv` should be used to specify a dependency file or
-   directory. The dependency can be a regular file from a repository or a data
-   file.
+- Option `-d file.tsv` should be used to specify a dependency file or directory.
+  The dependency can be a regular file from a repository or a data file.
 
-2. `-o file.tsv` (lower case o) specifies output data file. DVC will track this
-   data file by creating a corresponding
-   [DVC-file](/doc/user-guide/dvc-file-format) (as if running `dvc add file.tsv`
-   after `dvc run` instead).
+- `-o file.tsv` (lower case o) specifies an output data file. DVC will track
+  this data file by creating a corresponding
+  [DVC-file](/doc/user-guide/dvc-file-format) (as if running `dvc add file.tsv`
+  after `dvc run` instead).
 
-3. `-O file.tsv` (upper case O) specifies a regular output file (not to be added
-   to DVC).
+- `-O file.tsv` (upper case O) specifies a regular output file (not to be added
+  to DVC).
 
-It is important to specify the dependencies and the outputs of the command to
-run before the command to run itself.
+It's important to specify dependencies and outputs before the command to run
+itself.
 
-Let's see how an extract command `unzip` works under DVC:
+Let's see how an extraction command `unzip` works under DVC, for example:
 
 ```dvc
 $ dvc run -d data/Posts.xml.zip -o data/Posts.xml \
@@ -191,9 +191,9 @@ The `unzip` command extracts data file `data/Posts.xml.zip` to a regular file
 `data/Posts.xml`. It knows nothing about data files or DVC. DVC executes the
 command and does some additional work if the command was successful:
 
-1. DVC transforms all the output files (`-o` option) into data files. It's like
-   applying `dvc add` for each of the outputs. As a result, all the actual data
-   files content goes to the cache directory `.dvc/cache` and each
+1. 
DVC transforms all the output files (`-o` option) into tracked data files
+   (similar to using `dvc add` for each of them). As a result, all the actual
+   data content goes to the cache directory `.dvc/cache`, and each
    of the file names will be added to `.gitignore`.
 
 2. For reproducibility purposes, `dvc run` creates the `Posts.xml.dvc` stage
diff --git a/static/docs/understanding-dvc/collaboration-issues.md b/static/docs/understanding-dvc/collaboration-issues.md
index c20b32428c..651e6cbfc1 100644
--- a/static/docs/understanding-dvc/collaboration-issues.md
+++ b/static/docs/understanding-dvc/collaboration-issues.md
@@ -10,14 +10,16 @@ manage.
 
 To make progress on this challenge, many areas of the ML experimentation
 process need to be formalized. Many common questions need to be answered in an
-unified, principled way:
+unified, principled way.
 
-1. Source code and data versioning.
+## Questions
+
+### Source code and data versioning
 
 - How do you avoid any discrepancies between versions of the source code and
   versions of the data files when the data cannot fit into a repository?
 
-2. Experiment time log.
+### Experiment time log
 
 - How do you track which of the
   [hyperparameter]()
@@ -25,7 +27,7 @@ principled way:
   [metric](/doc/command-reference/metrics)? How do you monitor the extent of
   each change?
 
-3. Navigating through experiments.
+### Navigating through experiments
 
 - How do you recover a model from last week without wasting time waiting for
   the model to retrain?
@@ -33,12 +35,12 @@ principled way:
 - How do you quickly switch between the large dataset and a small subset
   without modifying source code?
 
-4. Reproducibility.
+### Reproducibility
 
 - How do you run a model's evaluation again without retraining the model and
   preprocessing a raw dataset?
 
-5. Managing and sharing large data files.
+### Managing and sharing large data files
 
 - How do you share models trained in a GPU environment with colleagues who
   don't have access to a GPU? 
diff --git a/static/docs/understanding-dvc/core-features.md b/static/docs/understanding-dvc/core-features.md
index c614ba2c90..6faf9694d2 100644
--- a/static/docs/understanding-dvc/core-features.md
+++ b/static/docs/understanding-dvc/core-features.md
@@ -1,20 +1,19 @@
 # Core Features
 
-1. DVC works **on top of Git repositories** and has a similar command line
-   interface and Git workflow.
+- DVC works **on top of Git repositories** and has a similar command line
+  interface and Git workflow.
 
-2. It makes data science projects **reproducible** by creating lightweight
-   [pipelines](/doc/command-reference/pipeline) using implicit dependency
-   graphs.
+- It makes data science projects **reproducible** by creating lightweight
+  [pipelines](/doc/command-reference/pipeline) using implicit dependency graphs.
 
-3. **Large data file versioning** works by creating pointers in your Git
-   repository to the cache, typically stored on a local hard drive.
+- **Large data file versioning** works by creating pointers in your Git
+  repository to the cache, typically stored on a local hard drive.
 
-4. **Programming language agnostic**: Python, R, Julia, shell scripts, etc. ML
-   library agnostic: Keras, Tensorflow, PyTorch, scipy, etc.
+- **Programming language agnostic**: Python, R, Julia, shell scripts, etc. ML
+  library agnostic: Keras, Tensorflow, PyTorch, scipy, etc.
 
-5. **Open-sourced** and **Self-served**: DVC is free and doesn't require any
-   additional services.
+- **Open-sourced** and **Self-served**: DVC is free and doesn't require any
+  additional services.
 
-6. DVC supports cloud storage (Amazon S3, Azure Blob Storage, and Google Cloud
-   Storage) for **data sources and pre-trained models sharing**.
+- DVC supports cloud storage (Amazon S3, Azure Blob Storage, and Google Cloud
+  Storage) for **sharing data sources and pre-trained models**. 
diff --git a/static/docs/understanding-dvc/how-it-works.md b/static/docs/understanding-dvc/how-it-works.md index 05cf87b8f2..8149b7fdca 100644 --- a/static/docs/understanding-dvc/how-it-works.md +++ b/static/docs/understanding-dvc/how-it-works.md @@ -1,93 +1,93 @@ # How It Works -1. DVC is a command line tool that works on top of Git: - - ```dvc - $ cd my_git_repo - $ dvc init - ``` - - > See [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) - -2. DVC helps define command pipelines, and keeps each command - [stage](/doc/command-reference/run) and dependencies in a Git repository: - - ```dvc - $ dvc run -d input.csv -o model.pkl -o results.csv \ - python cnn_train.py --seed 20180227 --epoch 20 \ - input.csv model.pkl results.csv - $ git add model.pkl.dvc - $ git commit -m "Train CNN. 20 epochs." - ``` - -3. DVC is programming language agnostic. R command example: - - ```dvc - $ dvc run -d result.csv -o plots.jpg \ - Rscript plot.R result.csv plots.jpg - $ git add plots.jpg.dvc - $ git commit -m "CNN plots" - ``` - -4. DVC can reproduce a pipeline with respect to its dependencies: - - ```dvc - # The input dataset was changed - $ dvc repro plots.jpg.dvc - - Reproducing 'model.pkl': - python cnn_train.py --seed 20180227 --epoch 20 \ - input.csv model.pkl results.csv - Reproducing 'plots.jpg': - Rscript plot.R result.csv plots.jpg - ``` - -5. DVC introduces the concept of data files to Git repositories. DVC keeps data - files outside of the repository but saves special - [DVC-files](/doc/user-guide/dvc-file-format) in Git: - - ```dvc - $ git checkout a03_normbatch_vgg16 # checkout code and DVC-files - $ dvc checkout # checkout data files from the cache - $ ls -l data/ # These LARGE files came from the cache, not from Git - - total 1017488 - -r-------- 2 501 staff 273M Jan 27 03:48 Posts-test.tsv - -r-------- 2 501 staff 12G Jan 27 03:48 Posts-train.tsv - ``` - -6. DVC makes repositories reproducible. 
DVC-files can be easily shared through - any Git server, and allows for experiments to be easily reproduced: - - ```dvc - $ git clone https://github.com/dataversioncontrol/myrepo.git - $ cd myrepo - # Reproduce data files - $ dvc repro - - Reproducing 'output.p': - python cnn_train.py --seed 20180227 --epoch 20 \ - input.csv model.pkl results.csv - Reproducing 'plots.jpg': - Rscript plot.R result.csv plots.jpg - ``` - -7. The cache of a DVC project can be shared with colleagues through Amazon S3, - Azure Blob Storage, Google Cloud Storage, among others: - - ```dvc - $ git push - $ dvc push # push from the cache to remote storage - - # On a colleague's machine: - $ git clone https://github.com/dataversioncontrol/myrepo.git - $ cd myrepo - $ git pull # download tracked data from remote storage - $ dvc checkout # checkout data files - $ ls -l data/ # You just got gigabytes of data through Git and DVC: - - total 1017488 - -r-------- 2 501 staff 273M Jan 27 03:48 Posts-test.tsv - ``` - -8. DVC works on Mac, Linux, and Windows. +- DVC is a command line tool that works on top of Git: + + ```dvc + $ cd my_git_repo + $ dvc init + ``` + + > See [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) + +- DVC helps define command pipelines, and keeps each command + [stage](/doc/command-reference/run) and dependencies in a Git repository: + + ```dvc + $ dvc run -d input.csv -o model.pkl -o results.csv \ + python cnn_train.py --seed 20180227 --epoch 20 \ + input.csv model.pkl results.csv + $ git add model.pkl.dvc + $ git commit -m "Train CNN. 20 epochs." + ``` + +- DVC is programming language agnostic. 
R command example: + + ```dvc + $ dvc run -d result.csv -o plots.jpg \ + Rscript plot.R result.csv plots.jpg + $ git add plots.jpg.dvc + $ git commit -m "CNN plots" + ``` + +- DVC can reproduce a pipeline with respect to its dependencies: + + ```dvc + # The input dataset was changed + $ dvc repro plots.jpg.dvc + + Reproducing 'model.pkl': + python cnn_train.py --seed 20180227 --epoch 20 \ + input.csv model.pkl results.csv + Reproducing 'plots.jpg': + Rscript plot.R result.csv plots.jpg + ``` + +- DVC introduces the concept of data files to Git repositories. DVC keeps data + files outside of the repository but saves special + [DVC-files](/doc/user-guide/dvc-file-format) in Git: + + ```dvc + $ git checkout a03_normbatch_vgg16 # checkout code and DVC-files + $ dvc checkout # checkout data files from the cache + $ ls -l data/ # These LARGE files came from the cache, not from Git + + total 1017488 + -r-------- 2 501 staff 273M Jan 27 03:48 Posts-test.tsv + -r-------- 2 501 staff 12G Jan 27 03:48 Posts-train.tsv + ``` + +- DVC makes repositories reproducible. 
DVC-files can be easily shared through
+  any Git server, and allow experiments to be easily reproduced:
+
+  ```dvc
+  $ git clone https://github.com/dataversioncontrol/myrepo.git
+  $ cd myrepo
+  # Reproduce data files
+  $ dvc repro
+
+  Reproducing 'output.p':
+    python cnn_train.py --seed 20180227 --epoch 20 \
+      input.csv model.pkl results.csv
+  Reproducing 'plots.jpg':
+    Rscript plot.R result.csv plots.jpg
+  ```
+
+- The cache of a DVC project can be shared with colleagues through Amazon S3,
+  Azure Blob Storage, Google Cloud Storage, among others:
+
+  ```dvc
+  $ git push
+  $ dvc push # push from the cache to remote storage
+
+  # On a colleague's machine:
+  $ git clone https://github.com/dataversioncontrol/myrepo.git
+  $ cd myrepo
+  $ dvc pull # download tracked data from remote storage
+  $ dvc checkout # checkout data files
+  $ ls -l data/ # You just got gigabytes of data through Git and DVC:
+
+  total 1017488
+  -r-------- 2 501 staff 273M Jan 27 03:48 Posts-test.tsv
+  ```
+
+- DVC works on Mac, Linux, and Windows.
diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md
index 9cd764dd1e..7747bd0fed 100644
--- a/static/docs/understanding-dvc/related-technologies.md
+++ b/static/docs/understanding-dvc/related-technologies.md
@@ -1,22 +1,24 @@
 # Comparison to Existing Technologies
 
-Due to the the novelty of this approach, DVC can be better understood in
-comparison to existing technologies and ideas.
+Due to the novelty of our approach, it may be easier to understand DVC in
+comparison to existing technologies and tools.
 
-DVC combines a number of existing technologies and ideas into a single product
-with the goal of bringing the best engineering practices into the data science
-process.
+DVC combines a number of existing ideas into a single product, with the goal of
+bringing best practices from software engineering into the data science field.
 
-1. **Git**. 
The difference is:
+## Differences from related tools
+
+### Git
 
 - DVC extends Git by introducing the concept of _data files_ – large files that
   should NOT be stored in a Git repository but still need to be tracked and
   versioned.
 
-2. **Workflow management tools** ([pipelines](/doc/command-reference/pipeline)
-   and dependency graphs
-   ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph))): Airflow,
-   Luigi, etc. The differences are:
+### Workflow management tools
+
+Tools that manage pipelines and dependency graphs
+([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)), such as Airflow,
+Luigi, etc.
 
 - DVC is focused on data science and modeling. As a result, DVC pipelines are
   lightweight, easy to create and modify. However, DVC lacks pipeline execution
@@ -26,9 +28,10 @@ process.
   doesn't run any daemons or servers. Nevertheless, DVC can generate images with
   pipeline and experiment workflow visualization.
 
-3. **Experiment management software** today is mostly designed for enterprise
-   usage. An open-sourced experimentation tool example: http://studio.ml/. The
-   differences are:
+### Experiment management software
+
+Mostly designed for enterprise usage, but with open-sourced options such as
+http://studio.ml/.
 
 - DVC uses Git as the underlying platform for experiment tracking instead of a
   web application.
@@ -41,8 +44,7 @@ process.
   (including the cache directory) have a human-readable format and can be easily
   reused by external tools.
 
-4. **Git workflows** and Git usage methodologies such as Gitflow. The
-   differences are:
+### Git workflows/methodologies such as Gitflow
 
 - DVC supports a new experimentation methodology that integrates easily with a
   Git workflow. A separate branch should be created for each experiment, with a
@@ -51,8 +53,9 @@ process.
 - DVC innovates by giving experimenters the ability to easily navigate through
   past experiments without recomputing them.
 
-5. **[Make](https://www.gnu.org/software/make/)** (and other build automation
-   tools). 
The differences are:
+### Build automation tools
+
+[Make](https://www.gnu.org/software/make/) and others.
 
 - DVC utilizes a
   [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph)
@@ -82,7 +85,7 @@ process.
   avoid recomputing all dependency files checksum, which would be highly
   problematic when working with large files (10 GB+).
 
-6. **Git-annex**. The differences are:
+### Git-annex
 
 - DVC uses the idea of storing the content of large files (that you don't want
   to see in your Git repository) in a local key-value store and use file
@@ -105,7 +108,7 @@ process.
 - DVC is not fundamentally bound to Git, having the option of changing the
   repository format.
 
-7. **Git-LFS** (Large File Storage). The differences are:
+### Git-LFS (Large File Storage)
 
 - DVC does not require special Git servers like Git-LFS demands. Any cloud
   storage like S3, GCS, or on-premises SSH server can be used as a backend for
diff --git a/static/docs/user-guide/contributing-docs.md b/static/docs/user-guide/contributing-docs.md
index f705635301..9cebb9128c 100644
--- a/static/docs/user-guide/contributing-docs.md
+++ b/static/docs/user-guide/contributing-docs.md
@@ -101,8 +101,8 @@ pre-commit hook that is integrated when `yarn` installs the project dependencies
   manually before committing changes if you prefer. More
   [advanced usage](https://prettier.io/docs/en/cli.html) of Prettier is
   available through `yarn`, for example
-  `yarn prettier --write '{pages,src}/**/*.{js,jsx}` formats all the JavaScript
-  files.
+  `yarn prettier --write '{pages,src}/**/*.{js,jsx,md}'` formats all the
+  JavaScript and Markdown files.
 
 - Using `dvc ` in the Markdown files, the docs engine will create a link to
   that command automatically. 
(No need to use `[]()` explicitly to
diff --git a/static/docs/user-guide/dvc-files-and-directories.md b/static/docs/user-guide/dvc-files-and-directories.md
index 1a027c3e94..be43dfb1f1 100644
--- a/static/docs/user-guide/dvc-files-and-directories.md
+++ b/static/docs/user-guide/dvc-files-and-directories.md
@@ -93,24 +93,17 @@ $ tree .dvc/cache
 │   └── 1d8cd98f00b204e9800998ecf8427e
 └── 20
     └── 0b40427ee0998e9802335d98f08cd98f
-$ cat .dvc/cache/19/6a322c107c2572335158503c64bfba.dir
-[{"md5": "d41d8cd98f00b204e9800998ecf8427e", "relpath": "cat.jpeg"},
- {"md5": "200b40427ee0998e9802335d98f08cd98f", "relpath": "index.jpeg"}]
 ```
 
-Like the previous case, the first two digits of the checksum are used to name
-the directory and rest 30 characters are used in naming the cache file. The
-cache file with `.dir` extension stores the mapping of files in the data
-directory and their checksum as an array. The other two cache file names are
-checksums of the files stored inside data directory. A typical `.dir` cache file
-looks like this:
+The cache file with `.dir` extension is a special text file that stores the
+mapping of files in the `data/` directory (as a JSON array), along with their
+checksums. The other two cache files correspond to the files stored inside
+`data/`. A typical `.dir` cache file looks like this:
 
 ```dvc
 $ cat .dvc/cache/19/6a322c107c2572335158503c64bfba.dir
-[
-  {"md5": "dff70c0392d7d386c39a23c64fcc0376", "relpath": "cat.jpeg"},
-  {"md5": "29a6c8271c0c8fbf75d3b97aecee589f", "relpath": "index.jpeg"}
-]
+[{"md5": "dff70c0392d7d386c39a23c64fcc0376", "relpath": "cat.jpeg"},
+{"md5": "29a6c8271c0c8fbf75d3b97aecee589f", "relpath": "index.jpeg"}]
 ```
 
 See also `dvc cache dir` to set the location of the cache directory.
diff --git a/static/docs/user-guide/external-dependencies.md b/static/docs/user-guide/external-dependencies.md
index 3210ee68b4..b0f6e5911d 100644
--- a/static/docs/user-guide/external-dependencies.md
+++ b/static/docs/user-guide/external-dependencies.md
@@ -12,17 +12,18 @@ DVC to control data externally.
 
 With DVC you can specify external files as dependencies for your pipeline
 stages. DVC will track changes in those files and will reflect that in your
-pipeline state. Currently, the following types of external dependencies
-(protocols) are supported:
+pipeline state. Currently, the following types (protocols) of external
+dependencies are supported:
 
-1. Local files and directories outside of your dvc repository;
-2. Amazon S3;
-3. Google Cloud Storage;
-4. SSH;
-5. HDFS;
-6. HTTP;
+- Local files and directories outside of your workspace;
+- SSH;
+- Amazon S3;
+- Google Cloud Storage;
+- HDFS;
+- HTTP.
 
-> Note that these match with the remote storage types supported by `dvc remote`.
+> Note that these are a subset of the remote storage types supported by
+> `dvc remote`. 
diff --git a/static/docs/user-guide/external-dependencies.md b/static/docs/user-guide/external-dependencies.md index 3210ee68b4..b0f6e5911d 100644 --- a/static/docs/user-guide/external-dependencies.md +++ b/static/docs/user-guide/external-dependencies.md @@ -12,17 +12,18 @@ DVC to control data externally. With DVC you can specify external files as dependencies for your pipeline stages. DVC will track changes in those files and will reflect that in your -pipeline state. Currently, the following types of external dependencies -(protocols) are supported: +pipeline state. Currently, the following types (protocols) of external +dependencies are supported: -1. Local files and directories outside of your dvc repository; -2. Amazon S3; -3. Google Cloud Storage; -4. SSH; -5. HDFS; -6. HTTP; +- Local files and directories outside of your workspace; +- SSH; +- Amazon S3; +- Google Cloud Storage; +- HDFS; +- HTTP -> Note that these match with the remote storage types supported by `dvc remote`. +> Note that these are a subset of the remote storage types supported by +> `dvc remote`. 
In order to specify an external dependency for your stage, use the usual '-d' option in `dvc run` with the external path or URL pointing to your desired file @@ -45,6 +46,14 @@ $ dvc run -d /home/shared/data.txt \ cp /home/shared/data.txt data.txt ``` +### SSH + +```dvc +$ dvc run -d ssh://user@example.com:/home/shared/data.txt \ + -o data.txt \ + scp user@example.com:/home/shared/data.txt data.txt +``` + ### Amazon S3 ```dvc @@ -61,14 +70,6 @@ $ dvc run -d gs://mybucket/data.txt \ gsutil cp gs://mybucket/data.txt data.txt ``` -### SSH - -```dvc -$ dvc run -d ssh://user@example.com:/home/shared/data.txt \ - -o data.txt \ - scp user@example.com:/home/shared/data.txt data.txt -``` - ### HDFS ```dvc diff --git a/static/docs/user-guide/managing-external-data.md b/static/docs/user-guide/managing-external-data.md index dae3250bca..b2c7c91fa9 100644 --- a/static/docs/user-guide/managing-external-data.md +++ b/static/docs/user-guide/managing-external-data.md @@ -14,14 +14,14 @@ You can take under DVC control files on an external storage with `dvc add` or specify external files as outputs for [DVC-files](/doc/user-guide/dvc-file-format) created by `dvc run` (stage files) DVC will track changes in those files and will reflect so in your pipeline -[status](/doc/command-reference/status). Currently, the following types of -external outputs (protocols) are supported: +[status](/doc/command-reference/status). Currently, the following types +(protocols) of external outputs (and cache) are supported: -1. Local files and directories outside of your dvc repository; -2. Amazon S3; -3. Google Cloud Storage; -4. SSH; -5. HDFS; +- Local files and directories outside of your workspace; +- SSH; +- Amazon S3; +- Google Cloud Storage; +- HDFS > Note that these are a subset of the remote storage types supported by > `dvc remote`. 
@@ -62,10 +62,28 @@ $ dvc run -d data.txt \ cp data.txt /home/shared/data.txt ``` +### SSH + +```dvc +# Add SSH remote to be used as cache location for SSH files +$ dvc remote add sshcache ssh://user@example.com:/cache + +# Tell dvc to use the 'sshcache' remote as SSH cache location +$ dvc config cache.ssh sshcache + +# Add data on SSH directly +$ dvc add ssh://user@example.com:/mydata + +# Create the stage with external SSH output +$ dvc run -d data.txt \ + -o ssh://user@example.com:/home/shared/data.txt \ + scp data.txt user@example.com:/home/shared/data.txt +``` + ### Amazon S3 ```dvc -# Add S3 remote to be uses as cache location for S3 files +# Add S3 remote to be used as cache location for S3 files $ dvc remote add s3cache s3://mybucket/cache # Tell dvc to use the 's3cache' remote as S3 cache location @@ -98,24 +116,6 @@ $ dvc run -d data.txt \ gsutil cp data.txt gs://mybucket/data.txt ``` -### SSH - -```dvc -# Add SSH remote to be used as cache location for SSH files -$ dvc remote add sshcache ssh://user@example.com:/cache - -# Tell dvc to use the 'sshcache' remote as SSH cache location -$ dvc config cache.ssh sshcache - -# Add data on SSH directly -$ dvc add ssh://user@example.com:/mydata - -# Create the stage with external SSH output -$ dvc run -d data.txt \ - -o ssh://user@example.com:/home/shared/data.txt \ - scp data.txt user@example.com:/home/shared/data.txt -``` - ### HDFS ```dvc