docs: misc content & guide updates #2359

Merged
merged 29 commits into from
May 12, 2021
Commits
be7407f
docs: misc content & guide updates
casperdcl Apr 6, 2021
c67bcee
docs: more get started tweaks & fixes
casperdcl Apr 6, 2021
f1648fc
basic-concepts: fix & cleanup matches
casperdcl Apr 7, 2021
f699438
Apply suggestions from code review
casperdcl Apr 8, 2021
6c9fb03
docs: GS: revert tile prefixes
casperdcl Apr 8, 2021
8350d91
respond to misc review comments
casperdcl Apr 9, 2021
79d3910
Update content/docs/start/data-and-model-access.md
jorgeorpinel Apr 14, 2021
c087649
Apply suggestions from code review
casperdcl Apr 14, 2021
17edc41
more code review updates
casperdcl Apr 14, 2021
d111851
Do You Feel Entitled?
casperdcl Apr 14, 2021
bace74b
Update content/docs/start/data-and-model-versioning.md
jorgeorpinel Apr 22, 2021
e6b5545
typo
jorgeorpinel Apr 22, 2021
adb0283
Update content/docs/start/data-pipelines.md
jorgeorpinel Apr 22, 2021
56f952b
which->that
jorgeorpinel Apr 22, 2021
25d2933
Update content/docs/start/data-pipelines.md
jorgeorpinel Apr 22, 2021
1a96bbd
misc review responses
casperdcl Apr 22, 2021
1f240d8
purge the emojis!
casperdcl Apr 22, 2021
9f7e835
minor simplification
casperdcl Apr 22, 2021
28bbdfc
revert some <abbr>s
casperdcl Apr 22, 2021
3aa0163
Update content/docs/user-guide/basic-concepts/parameter.md
jorgeorpinel May 6, 2021
e22f272
Apply suggestions from code review
jorgeorpinel May 6, 2021
57befbf
restore link
casperdcl May 6, 2021
5a27352
misc corrections
casperdcl May 6, 2021
e7db7cc
Update content/docs/start/metrics-parameters-plots.md
casperdcl May 6, 2021
5c2d69b
minor review feedback
casperdcl May 10, 2021
58f9ac9
term: which -> that
jorgeorpinel May 10, 2021
2bf1935
more review tweaks
casperdcl May 10, 2021
6315fd8
Update content/docs/start/experiments.md
jorgeorpinel May 10, 2021
fab5421
final review comments
casperdcl May 12, 2021
36 changes: 18 additions & 18 deletions content/docs/start/data-and-model-access.md
@@ -4,27 +4,27 @@ title: 'Get Started: Data and Model Access'

# Get Started: Data and Model Access

Okay, we've learned how to _track_ data and models with DVC, and how to commit
their versions to Git. The next questions are: How can we _use_ these artifacts
outside of the project? How do I download a model to deploy it? How to download
We've learned how to _track_ data and models with DVC, and how to commit their
versions to Git. The next questions are: How can we _use_ these artifacts
outside of the project? How do we download a model to deploy it? How to download
a specific version of a model? Or reuse datasets across different projects?

> These questions tend to come up when you browse the files that DVC saves to
> remote storage, e.g.
> remote storage (e.g.
> `s3://dvc-public/remote/get-started/fb/89904ef053f04d64eafcc3d70db673` 😱
> instead of the original files, name such as `model.pkl` or `data.xml`.
> instead of the original file name such as `model.pkl` or `data.xml`).

Read on or watch our video to see how to find and access models and datasets
with DVC.

https://youtu.be/EE7Gk84OZY8

Remember those `.dvc` files `dvc add` generates? Those files (and `dvc.lock`
that we'll cover later), have their history in Git, DVC remote storage config
saved in Git contain all the information needed to access and download any
version of datasets, files, and models. It means that a Git repository with
<abbr>DVC files</abbr> becomes an entry point, and can be used instead of
accessing files directly.
Remember those `.dvc` files `dvc add` generates? Those files (and `dvc.lock`,
which we'll cover later) have their history in Git. DVC's remote storage config
is also saved in Git, and contains all the information needed to access and
download any version of datasets, files, and models. It means that a Git
repository with <abbr>DVC files</abbr> becomes an entry point, and can be used
instead of accessing files directly.
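
As a rough illustration of why remote storage shows hashed paths rather than names like `model.pkl`, the sketch below rebuilds the example path quoted above from a content hash. It assumes the layout implied by that path (the first two hex characters of the hash become a directory and the rest the file name); `remote_object_url` is a hypothetical helper, not part of DVC.

```python
def remote_object_url(remote_url: str, content_hash: str) -> str:
    """Map a content hash to an object location, assuming a '<2 chars>/<rest>' layout."""
    prefix, rest = content_hash[:2], content_hash[2:]
    return f"{remote_url.rstrip('/')}/{prefix}/{rest}"


# Hash reassembled from the example path in the text (illustrative only).
url = remote_object_url(
    "s3://dvc-public/remote/get-started",
    "fb89904ef053f04d64eafcc3d70db673",
)
print(url)
```

Because the location depends only on the hash, the original file name has to come from the <abbr>DVC files</abbr> stored in the Git repository.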

## Find a file or directory

@@ -62,7 +62,7 @@ the data came from or whether new versions are available.
## Import file or directory

`dvc import` also downloads any file or directory, while also creating a `.dvc`
file that can be saved in the project:
file (which can be saved in the project):

```dvc
$ dvc import https://github.com/iterative/dataset-registry \
@@ -71,7 +71,7 @@ $ dvc import https://github.com/iterative/dataset-registry \

This is similar to `dvc get` + `dvc add`, but the resulting `.dvc` file
includes metadata to track changes in the source repository. This allows you to
bring in changes from the data source later, using `dvc update`.
bring in changes from the data source later using `dvc update`.

<details>

@@ -83,7 +83,7 @@ bring in changes from the data source later, using `dvc update`.
> `dvc import` downloads from [remote storage](/doc/command-reference/remote).

`.dvc` files created by `dvc import` have special fields, such as the data
source `repo`, and `path` (under `deps`):
source `repo` and `path` (under `deps`):

```git
+deps:
@@ -111,8 +111,8 @@ directly from within an application at runtime. For example:
import dvc.api

with dvc.api.open(
        'get-started/data.xml',
        repo='https://github.com/iterative/dataset-registry'
        ) as fd:
    # ... fd is a file descriptor that can be processed normally.
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry'
) as fd:
    # fd is a file descriptor which can be processed normally
```
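
A slightly fuller sketch of the same idea: the helper below counts the lines in a tracked file. `count_lines` and its `opener` parameter are hypothetical names introduced here so the function can be exercised without network access; by default it falls back to `dvc.api.open` with the path and `repo` shown above.

```python
import io
from contextlib import contextmanager


def count_lines(path, repo, opener=None):
    """Count lines in a DVC-tracked file opened via `opener` (default: dvc.api.open)."""
    if opener is None:
        import dvc.api  # imported lazily so the sketch runs without DVC installed
        opener = dvc.api.open
    with opener(path, repo=repo) as fd:
        return sum(1 for _ in fd)


@contextmanager
def fake_open(path, repo=None):
    """Offline stand-in for dvc.api.open, used only for demonstration."""
    yield io.StringIO("<row/>\n<row/>\n")


# With the stand-in opener nothing is downloaded; drop `opener` to use DVC itself.
demo_count = count_lines("get-started/data.xml", repo="unused", opener=fake_open)
```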
78 changes: 36 additions & 42 deletions content/docs/start/data-and-model-versioning.md
@@ -13,9 +13,9 @@ and seeing data files and machine learning models in the workspace. Or switching
to a different version of a 100Gb file in less than a second with a
`git checkout`.

The foundation of DVC consists of a few commands that you can run along with
`git` to track large files, directories, or ML model files. Think "Git for
data". Read on or watch our video to learn about versioning data with DVC!
The foundation of DVC consists of a few commands you can run along with `git` to
track large files, directories, or ML model files. Think "Git for data". Read on
or watch our video to learn about versioning data with DVC!

https://youtu.be/kLKBcPonMYw

@@ -25,16 +25,16 @@ To start tracking a file or directory, use `dvc add`:

### ⚙️ Expand to get an example dataset.

Having initialized a project in the previous section, get the data file we will
be using later like this:
Having initialized a project in the previous section, we can get the data file
(which we'll be using later) like this:

```dvc
$ dvc get https://github.com/iterative/dataset-registry \
get-started/data.xml -o data/data.xml
```

We use the fancy `dvc get` command to jump ahead a bit and show how Git repo
becomes a source for datasets or models - what we call "data/model registry".
We use the fancy `dvc get` command to jump ahead a bit and show how a Git repo
becomes a source for datasets or models, what we call a "data/model registry".
`dvc get` can download any file or directory tracked in a <abbr>DVC
repository</abbr>. It's like `wget`, but for DVC or Git repos. In this case we
download the latest version of the `data.xml` file from the
@@ -48,22 +48,24 @@ $ dvc add data/data.xml
```

DVC stores information about the added file (or a directory) in a special `.dvc`
file named `data/data.xml.dvc`, a small text file with a human-readable
[format](/doc/user-guide/project-structure/dvc-files). This file can be easily
versioned like source code with Git, as a placeholder for the original data
(which gets listed in `.gitignore`):
file named `data/data.xml.dvc`, a small text file with a human-readable
[format](/doc/user-guide/project-structure/dvc-files). This metadata file is a
placeholder for the original data, and can be easily versioned like source code
with Git:

```dvc
$ git add data/data.xml.dvc data/.gitignore
$ git commit -m "Add raw data"
```

The original data, meanwhile, is listed in `.gitignore`.

<details>

### 💡 Expand to see what happens under the hood.

`dvc add` moved the data to the project's <abbr>cache</abbr>, and linked\* it
back to the <abbr>workspace</abbr>.
`dvc add` moved the data to the project's <abbr>cache</abbr>, and
<abbr>linked</abbr> it back to the <abbr>workspace</abbr>.

```dvc
$ tree .dvc/cache
@@ -82,35 +84,31 @@ outs:
path: data.xml
```
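
To make the cache layout above more concrete, here is a toy sketch of content-addressed storage. It assumes an MD5 digest split into a two-character directory plus the remaining characters, and it copies the file rather than linking it; `toy_add` is a hypothetical helper, not DVC's implementation.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def toy_add(data_file: Path, cache_dir: Path) -> dict:
    """Store a copy of data_file in cache_dir under its MD5 and return pointer metadata."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, target)  # real DVC may link instead of copying
    return {"outs": [{"md5": digest, "path": data_file.name}]}


# Demonstration in a throwaway directory.
workdir = Path(tempfile.mkdtemp())
sample = workdir / "data.xml"
sample.write_bytes(b"<rows/>\n")
pointer = toy_add(sample, workdir / "cache")
md5 = pointer["outs"][0]["md5"]
cached = workdir / "cache" / md5[:2] / md5[2:]
```

The returned dictionary plays the role of the small `.dvc` metadata file: it is tiny, so it can be versioned like source code, while the bulky content lives in the cache.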

> \* See
> [Large Dataset Optimization](/doc/user-guide/large-dataset-optimization) and
> `dvc config cache` for more info. on file linking.

</details>

## Storing and sharing

You can upload DVC-tracked data or model files with `dvc push`, so they're
safely stored [remotely](/doc/command-reference/remote). This also means they
can be retrieved on other environments later with `dvc pull`. First, we need to
setup a storage:
setup a remote storage location:

```dvc
$ dvc remote add -d storage s3://mybucket/dvcstore
$ git add .dvc/config
$ git commit -m "Configure remote storage"
```

> DVC supports the following remote storage types: Google Drive, Amazon S3,
> Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, and HTTP.
> Please refer to `dvc remote add` for more details and examples.
> DVC supports many remote storage types, including Amazon S3, SSH, Google
> Drive, Azure Blob Storage, and HDFS. See `dvc remote add` for more details and
> examples.

<details>

### ⚙️ Set up a remote storage
### ⚙️ Expand to set up remote storage.

DVC remotes let you store a copy of the data tracked by DVC outside of the local
cache, usually a cloud storage service. For simplicity, let's set up a _local
cache (usually a cloud storage service). For simplicity, let's set up a _local
remote_:

```dvc
@@ -121,7 +119,7 @@ $ git commit .dvc/config -m "Configure local remote"

> While the term "local remote" may seem contradictory, it doesn't have to be.
> The "local" part refers to the type of location: another directory in the file
> system. "Remote" is how we call storage for <abbr>DVC projects</abbr>. It's
> system. "Remote" is what we call storage for <abbr>DVC projects</abbr>. It's
> essentially a local data backup.

</details>
@@ -160,7 +158,7 @@ run it after `git clone` and `git pull`.

<details>

### ⚙️ Expand to explode the project 💣
### ⚙️ Expand to delete locally cached data.

If you've run `dvc push`, you can delete the cache (`.dvc/cache`) and
`data/data.xml` to experiment with `dvc pull`:
@@ -189,8 +187,8 @@ latest version:

### ⚙️ Expand to make some changes.

For the sake of simplicity let's just double the dataset artificially (and
pretend that we got more data from some external source):
Let's say we obtained more data from some external source. We can pretend this
is the case by doubling the dataset:

```dvc
$ cp data/data.xml /tmp/data.xml
@@ -212,9 +210,8 @@ $ dvc push

## Switching between versions

The regular workflow is to use `git checkout` first to switch a branch, checkout
a commit, or a revision of a `.dvc` file, and then run `dvc checkout` to sync
data:
The regular workflow is to use `git checkout` first (to switch a branch or
checkout a `.dvc` file version) and then run `dvc checkout` to sync data:

```dvc
$ git checkout <...>
@@ -225,41 +222,38 @@ $ dvc checkout

### ⚙️ Expand to get the previous version of the dataset.

Let's cleanup the previous artificial changes we made and get the previous :
Let's go back to the original version of the data:

```dvc
$ git checkout HEAD^1 data/data.xml.dvc
$ git checkout HEAD~1 data/data.xml.dvc
$ dvc checkout
```

Let's commit it (no need to do `dvc push` this time since the previous version
of this dataset was saved before):
Let's commit it (no need to do `dvc push` this time since this original version
of the dataset was already saved):

```dvc
$ git commit data/data.xml.dvc -m "Revert dataset updates"
```

</details>

Yes, DVC is technically not even a version control system! `.dvc` files content
defines data file versions. Git itself provides the version control. DVC in turn
Yes, DVC is technically not even a version control system! `.dvc` file contents
define data file versions. Git itself provides the version control. DVC in turn
creates these `.dvc` files, updates them, and synchronizes DVC-tracked data in
the <abbr>workspace</abbr> efficiently to match them.
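
The division of labour described here can be sketched with a toy model: a small pointer (standing in for a `.dvc` file) records a content hash, and "checkout" just looks that hash up in a cache. Everything below (`track`, `checkout`, the in-memory `cache`) is hypothetical and only illustrates the idea; in a real project Git would version the pointers.

```python
import hashlib

cache = {}  # content hash -> bytes (stand-in for the project's cache)


def track(data: bytes, path: str = "data.xml") -> dict:
    """Save data in the cache and return a small pointer describing this version."""
    digest = hashlib.md5(data).hexdigest()
    cache[digest] = data
    return {"outs": [{"md5": digest, "path": path}]}


def checkout(pointer: dict) -> bytes:
    """Return the data that a pointer refers to, as a checkout would restore it."""
    return cache[pointer["outs"][0]["md5"]]


original = b"<rows>1</rows>\n"
v1 = track(original)      # first version of the dataset
v2 = track(original * 2)  # "doubled" dataset, as in the walkthrough above

# Switching versions means switching pointers, then syncing data to match.
restored = checkout(v1)
```

Both versions stay in the cache, so going back to `v1` needs no new upload, which mirrors why no `dvc push` was needed after reverting.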

## Large datasets versioning

In cases where you process very large datasets, you need an efficient mechanism
(in terms of space and performance) to share a lot of data, including different
versions of itself. Do you use a network attached storage? Or a large external
volume?

While these cases are not covered in the Get Started, we recommend reading the
following sections next to learn more about advanced workflows:
versions. Do you use network attached storage (NAS)? Or a large external volume?
You can learn more about advanced workflows using these links:

- A shared [external cache](/doc/use-cases/shared-development-server) can be set
up to store, version and access a lot of data on a large shared volume
efficiently.
- A quite advanced scenario is to track and version data directly on the remote
storage (e.g. S3). Check out
storage (e.g. S3). See
[Managing External Data](https://dvc.org/doc/user-guide/managing-external-data)
to learn more.