More regular updates (Oct) #680

Merged
merged 5 commits into from
Oct 10, 2019
Changes from 2 commits
14 changes: 5 additions & 9 deletions static/docs/command-reference/remote/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -67,9 +67,7 @@ For the typical process to share the <abbr>project</abbr> via remote, see

- `-v`, `--verbose` - displays detailed tracing information.

## Examples

1. Let's for simplicity add a default local remote:
## Example: Add a default local remote:

<details>

Expand All @@ -85,11 +83,10 @@ project/repository itself.
```dvc
$ dvc remote add -d myremote /path/to/remote
$ dvc remote list

myremote /path/to/remote
```

The <abbr>project</abbr>'s config file would look like:
The <abbr>project</abbr>'s config file should now look like this:

```ini
['remote "myremote"']
Expand All @@ -98,18 +95,17 @@ url = /path/to/remote
remote = myremote
```
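As a rough illustration of how the default remote is recorded, the sketch below reads a config with this layout using Python's standard `configparser`. The `[core]` section name and the sample text are assumptions made for the example, and DVC itself may parse the file differently.

```python
import configparser

# Sample config text, assumed to mirror the layout shown above.
CONFIG_TEXT = """\
['remote "myremote"']
url = /path/to/remote
[core]
remote = myremote
"""


def default_remote_url(text):
    """Return (name, url) of the default remote, or None if none is set."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    name = parser.get("core", "remote", fallback=None)
    if name is None:
        return None
    # configparser keeps the single quotes as part of the section name.
    section = f"'remote \"{name}\"'"
    return name, parser.get(section, "url")


print(default_remote_url(CONFIG_TEXT))
```

The lookup goes through the default remote's name first, which is why `-d` only needs to store a name rather than a second copy of the URL.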

2. Add Amazon S3 remote and modify its region:
## Example: Add Amazon S3 remote and modify its region:

> **Note!** Before adding a new remote be sure to login into AWS and follow
> instructions at
> **Note!** Before adding a new remote be sure to follow the instructions at
> [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html).

```dvc
$ dvc remote add mynewremote s3://mybucket/myproject
$ dvc remote modify mynewremote region us-east-2
```
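To picture what these two commands leave behind in the config file, here is a minimal sketch that builds an equivalent section with Python's `configparser`. The helper names are hypothetical, and the exact layout DVC writes is an assumption based on the local remote example.

```python
import configparser
import io


def section_name(remote):
    """Section header text for a remote, quoted as in the config shown earlier."""
    return f"'remote \"{remote}\"'"


def add_remote(parser, remote, url):
    """Mimic `dvc remote add`: create the section and store the URL."""
    parser.add_section(section_name(remote))
    parser.set(section_name(remote), "url", url)


def modify_remote(parser, remote, option, value):
    """Mimic `dvc remote modify`: set one option on an existing remote."""
    parser.set(section_name(remote), option, value)


def render(parser):
    """Return the config as text."""
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


config = configparser.ConfigParser()
add_remote(config, "mynewremote", "s3://mybucket/myproject")
modify_remote(config, "mynewremote", "region", "us-east-2")
print(render(config))
```

No `[core]` entry appears here because the remote was added without `-d`, so it is not the default.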

3. Remove remote:
## Example: Remove remote:

```dvc
$ dvc remote remove mynewremote
Expand Down
38 changes: 19 additions & 19 deletions static/docs/tutorials/deep/define-ml-pipeline.md
Original file line number Diff line number Diff line change
Expand Up @@ -138,28 +138,28 @@ train ML models out of the data files. DVC helps you to define
[stages](/doc/command-reference/run) of your ML process and easily connect them
into a ML [pipeline](/doc/command-reference/pipeline).

`dvc run` executes any command that you pass into it as a list of parameters.
`dvc run` executes any command that you pass it as a list of parameters.
However, the command to run alone is not as interesting as its role within a
pipeline, so we'll need to specify its dependencies and output files. We call
this a pipeline stage. Dependencies may include input files and directories, and
the actual command to run. Outputs are files written to by the command, if any.
larger data pipeline, so we'll need to specify its dependencies and output
files. We call all this a pipeline _stage_. Dependencies may include input files
or directories, and the actual command to run. Outputs are files written to by
the command, if any.

1. Option `-d file.tsv` should be used to specify a dependency file or
directory. The dependency can be a regular file from a repository or a data
file.
- Option `-d file.tsv` should be used to specify a dependency file or directory.
The dependency can be a regular file from a repository or a data file.

2. `-o file.tsv` (lower case o) specifies output data file. DVC will track this
data file by creating a corresponding
[DVC-file](/doc/user-guide/dvc-file-format) (as if running `dvc add file.tsv`
after `dvc run` instead).
- `-o file.tsv` (lower case o) specifies an output data file. DVC will track this
data file by creating a corresponding
[DVC-file](/doc/user-guide/dvc-file-format) (as if running `dvc add file.tsv`
after `dvc run` instead).

3. `-O file.tsv` (upper case O) specifies a regular output file (not to be added
to DVC).
- `-O file.tsv` (upper case O) specifies a regular output file (not to be added
to DVC).

It is important to specify the dependencies and the outputs of the command to
run before the command to run itself.
It's important to specify dependencies and outputs before the command to run
itself.
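The ordering rule can be sketched as a small helper that assembles the argument list, placing every `-d`, `-o`, and `-O` option ahead of the command. The function name and the file names in the usage are hypothetical and only serve the illustration.

```python
import shlex


def build_dvc_run(command, deps=(), outs=(), plain_outs=()):
    """Return a `dvc run` argument list with all options before the command."""
    args = ["dvc", "run"]
    for dep in deps:
        args += ["-d", dep]    # dependency file or directory
    for out in outs:
        args += ["-o", out]    # output tracked by DVC
    for out in plain_outs:
        args += ["-O", out]    # regular output, not added to DVC
    return args + list(command)


args = build_dvc_run(
    ["python", "train.py", "data.tsv", "model.pkl"],  # hypothetical command
    deps=["train.py", "data.tsv"],
    outs=["model.pkl"],
    plain_outs=["train.log"],
)
print(shlex.join(args))
```

Keeping the command last means everything after the final option is passed through untouched, so the command's own flags cannot be mistaken for `dvc run` options.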

Let's see how an extract command `unzip` works under DVC:
For example, let's see how the extraction command `unzip` works under DVC:

```dvc
$ dvc run -d data/Posts.xml.zip -o data/Posts.xml \
Expand Down Expand Up @@ -191,9 +191,9 @@ The `unzip` command extracts data file `data/Posts.xml.zip` to a regular file
`data/Posts.xml`. It knows nothing about data files or DVC. DVC executes the
command and does some additional work if the command was successful:

1. DVC transforms all the output files (`-o` option) into data files. It's like
applying `dvc add` for each of the outputs. As a result, all the actual data
files content goes to the <abbr>cache</abbr> directory `.dvc/cache` and each
1. DVC transforms all the output files (`-o` option) into tracked data files
(similar to using `dvc add` for each of them). As a result, all the actual
data content goes to the <abbr>cache</abbr> directory `.dvc/cache`, and each
of the file names will be added to `.gitignore`.

2. For reproducibility purposes, `dvc run` creates the `Posts.xml.dvc` stage
Expand Down
14 changes: 8 additions & 6 deletions static/docs/understanding-dvc/collaboration-issues.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,35 +10,37 @@ manage.

To make progress on this challenge, many areas of the ML experimentation process
need to be formalized. Many common questions need to be answered in a unified,
principled way:
principled way.

1. Source code and data versioning.
## Questions

### Source code and data versioning

- How do you avoid any discrepancies between versions of the source code and
versions of the data files when the data cannot fit into a repository?

2. Experiment time log.
### Experiment time log

- How do you track which of the
[hyperparameter](<https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)>)
changes contributed the most to producing your target
[metric](/doc/command-reference/metrics)? How do you monitor the extent of
each change?

3. Navigating through experiments.
### Navigating through experiments

- How do you recover a model from last week without wasting time waiting for the
model to retrain?

- How do you quickly switch between the large dataset and a small subset without
modifying source code?

4. Reproducibility.
### Reproducibility

- How do you run a model's evaluation again without retraining the model and
preprocessing a raw dataset?

5. Managing and sharing large data files.
### Managing and sharing large data files

- How do you share models trained in a GPU environment with colleagues who don't
have access to a GPU?
Expand Down
25 changes: 12 additions & 13 deletions static/docs/understanding-dvc/core-features.md
Original file line number Diff line number Diff line change
@@ -1,20 +1,19 @@
# Core Features

1. DVC works **on top of Git repositories** and has a similar command line
interface and Git workflow.
- DVC works **on top of Git repositories** and has a similar command line
interface and Git workflow.

2. It makes data science projects **reproducible** by creating lightweight
[pipelines](/doc/command-reference/pipeline) using implicit dependency
graphs.
- It makes data science projects **reproducible** by creating lightweight
[pipelines](/doc/command-reference/pipeline) using implicit dependency graphs.

3. **Large data file versioning** works by creating pointers in your Git
repository to the <abbr>cache</abbr>, typically stored on a local hard drive.
- **Large data file versioning** works by creating pointers in your Git
repository to the <abbr>cache</abbr>, typically stored on a local hard drive.

4. **Programming language agnostic**: Python, R, Julia, shell scripts, etc. ML
library agnostic: Keras, Tensorflow, PyTorch, scipy, etc.
- **Programming language agnostic**: Python, R, Julia, shell scripts, etc. ML
library agnostic: Keras, TensorFlow, PyTorch, SciPy, etc.

5. **Open-sourced** and **Self-served**: DVC is free and doesn't require any
additional services.
- **Open-sourced** and **Self-served**: DVC is free and doesn't require any
additional services.

6. DVC supports cloud storage (Amazon S3, Azure Blob Storage, and Google Cloud
Storage) for **data sources and pre-trained models sharing**.
- DVC supports cloud storage (Amazon S3, Azure Blob Storage, and Google Cloud
Storage) for **sharing data sources and pre-trained models**.
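The pointer idea behind large file versioning can be sketched in a few lines: the data file is copied into a content-addressed cache keyed by its MD5 checksum, and only a small pointer file would be committed to Git. The JSON pointer format and the function names below are assumptions for illustration, not DVC's actual file format or API.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def add_to_cache(data_file, cache_dir):
    """Copy `data_file` into the cache and write a small pointer file next to it."""
    data_file, cache_dir = Path(data_file), Path(cache_dir)
    # Reads the whole file into memory; fine for a sketch, not for huge files.
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copyfile(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"md5": digest, "path": data_file.name}))
    return pointer


def checkout(pointer, cache_dir):
    """Restore the data file named by `pointer` from the cache."""
    pointer = Path(pointer)
    info = json.loads(pointer.read_text())
    cached = Path(cache_dir) / info["md5"][:2] / info["md5"][2:]
    target = pointer.with_name(info["path"])
    shutil.copyfile(cached, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    data = workspace / "Posts.tsv"
    data.write_bytes(b"id\ttitle\n1\thello\n")
    pointer = add_to_cache(data, workspace / "cache")
    data.unlink()  # simulate a fresh clone that only has the pointer
    restored = checkout(pointer, workspace / "cache")
    print(restored.read_bytes() == b"id\ttitle\n1\thello\n")
```

Because the cache path is derived from the content, two versions of a file never overwrite each other, and identical files are stored once.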
182 changes: 91 additions & 91 deletions static/docs/understanding-dvc/how-it-works.md
Original file line number Diff line number Diff line change
@@ -1,93 +1,93 @@
# How It Works

1. DVC is a command line tool that works on top of Git:

```dvc
$ cd my_git_repo
$ dvc init
```

> See [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories)

2. DVC helps define command pipelines, and keeps each command
[stage](/doc/command-reference/run) and dependencies in a Git repository:

```dvc
$ dvc run -d input.csv -o model.pkl -o results.csv \
python cnn_train.py --seed 20180227 --epoch 20 \
input.csv model.pkl results.csv
$ git add model.pkl.dvc
$ git commit -m "Train CNN. 20 epochs."
```

3. DVC is programming language agnostic. R command example:

```dvc
$ dvc run -d result.csv -o plots.jpg \
Rscript plot.R result.csv plots.jpg
$ git add plots.jpg.dvc
$ git commit -m "CNN plots"
```

4. DVC can reproduce a pipeline with respect to its dependencies:

```dvc
# The input dataset was changed
$ dvc repro plots.jpg.dvc

Reproducing 'model.pkl':
python cnn_train.py --seed 20180227 --epoch 20 \
input.csv model.pkl results.csv
Reproducing 'plots.jpg':
Rscript plot.R result.csv plots.jpg
```

5. DVC introduces the concept of data files to Git repositories. DVC keeps data
files outside of the repository but saves special
[DVC-files](/doc/user-guide/dvc-file-format) in Git:

```dvc
$ git checkout a03_normbatch_vgg16 # checkout code and DVC-files
$ dvc checkout # checkout data files from the cache
$ ls -l data/ # These LARGE files came from the cache, not from Git

total 1017488
-r-------- 2 501 staff 273M Jan 27 03:48 Posts-test.tsv
-r-------- 2 501 staff 12G Jan 27 03:48 Posts-train.tsv
```

6. DVC makes repositories reproducible. DVC-files can be easily shared through
any Git server, and allows for experiments to be easily reproduced:

```dvc
$ git clone https://github.com/dataversioncontrol/myrepo.git
$ cd myrepo
# Reproduce data files
$ dvc repro

Reproducing 'output.p':
python cnn_train.py --seed 20180227 --epoch 20 \
input.csv model.pkl results.csv
Reproducing 'plots.jpg':
Rscript plot.R result.csv plots.jpg
```

7. The cache of a DVC project can be shared with colleagues through Amazon S3,
Azure Blob Storage, Google Cloud Storage, among others:

```dvc
$ git push
$ dvc push # push from the cache to remote storage

# On a colleague's machine:
$ git clone https://github.com/dataversioncontrol/myrepo.git
$ cd myrepo
$ git pull # download tracked data from remote storage
$ dvc checkout # checkout data files
$ ls -l data/ # You just got gigabytes of data through Git and DVC:

total 1017488
-r-------- 2 501 staff 273M Jan 27 03:48 Posts-test.tsv
```

8. DVC works on Mac, Linux, and Windows.
- DVC is a command line tool that works on top of Git:

```dvc
$ cd my_git_repo
$ dvc init
```

> See [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories)

- DVC helps define command pipelines, and keeps each command
[stage](/doc/command-reference/run) and dependencies in a Git repository:

```dvc
$ dvc run -d input.csv -o model.pkl -o results.csv \
python cnn_train.py --seed 20180227 --epoch 20 \
input.csv model.pkl results.csv
$ git add model.pkl.dvc
$ git commit -m "Train CNN. 20 epochs."
```

- DVC is programming language agnostic. R command example:

```dvc
$ dvc run -d result.csv -o plots.jpg \
Rscript plot.R result.csv plots.jpg
$ git add plots.jpg.dvc
$ git commit -m "CNN plots"
```

- DVC can reproduce a pipeline with respect to its dependencies:

```dvc
# The input dataset was changed
$ dvc repro plots.jpg.dvc

Reproducing 'model.pkl':
python cnn_train.py --seed 20180227 --epoch 20 \
input.csv model.pkl results.csv
Reproducing 'plots.jpg':
Rscript plot.R result.csv plots.jpg
```

- DVC introduces the concept of data files to Git repositories. DVC keeps data
files outside of the repository but saves special
[DVC-files](/doc/user-guide/dvc-file-format) in Git:

```dvc
$ git checkout a03_normbatch_vgg16 # checkout code and DVC-files
$ dvc checkout # checkout data files from the cache
$ ls -l data/ # These LARGE files came from the cache, not from Git

total 1017488
-r-------- 2 501 staff 273M Jan 27 03:48 Posts-test.tsv
-r-------- 2 501 staff 12G Jan 27 03:48 Posts-train.tsv
```

- DVC makes repositories reproducible. DVC-files can be easily shared through
any Git server, and allow experiments to be easily reproduced:

```dvc
$ git clone https://github.com/dataversioncontrol/myrepo.git
$ cd myrepo
# Reproduce data files
$ dvc repro

Reproducing 'output.p':
python cnn_train.py --seed 20180227 --epoch 20 \
input.csv model.pkl results.csv
Reproducing 'plots.jpg':
Rscript plot.R result.csv plots.jpg
```

- The cache of a DVC project can be shared with colleagues through Amazon S3,
Azure Blob Storage, Google Cloud Storage, among others:

```dvc
$ git push
$ dvc push # push from the cache to remote storage

# On a colleague's machine:
$ git clone https://github.com/dataversioncontrol/myrepo.git
$ cd myrepo
$ dvc pull # download tracked data from remote storage
$ dvc checkout # checkout data files
$ ls -l data/ # You just got gigabytes of data through Git and DVC:

total 1017488
-r-------- 2 501 staff 273M Jan 27 03:48 Posts-test.tsv
```

- DVC works on Mac, Linux, and Windows.
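The reproduction step in the list above can be pictured as a walk over the dependency graph: a stage reruns if one of its dependencies changed or if the stage producing that dependency reruns. The sketch below is a simplified model, not DVC's implementation. It assumes the plotting stage depends on `results.csv` and that the graph has no cycles.

```python
# Each stage lists its dependencies and outputs, as passed to `dvc run`.
STAGES = {
    "model.pkl.dvc": {"deps": ["input.csv"], "outs": ["model.pkl", "results.csv"]},
    "plots.jpg.dvc": {"deps": ["results.csv"], "outs": ["plots.jpg"]},
}


def stages_to_rerun(stages, changed_files, target):
    """Return the stages needed to rebuild `target`, upstream stages first."""
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    stale = {}   # memo: stage name -> whether it must rerun
    order = []   # stale stages in execution order

    def is_stale(name):
        if name in stale:
            return stale[name]
        result = False
        for dep in stages[name]["deps"]:
            if dep in changed_files:
                result = True
            # Visit the upstream stage first so it lands earlier in `order`.
            if dep in producer and is_stale(producer[dep]):
                result = True
        stale[name] = result
        if result:
            order.append(name)
        return result

    is_stale(target)
    return order


print(stages_to_rerun(STAGES, {"input.csv"}, "plots.jpg.dvc"))
```

With nothing changed the function returns an empty list, which corresponds to a reproduction run that has no stage to execute.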