From be7407f9d9f13c0b57c6aee421ec4af154d5c8cf Mon Sep 17 00:00:00 2001 From: Casper da Costa-Luis Date: Tue, 6 Apr 2021 23:08:15 +0100 Subject: [PATCH 01/29] docs: misc content & guide updates --- content/docs/start/data-and-model-access.md | 42 +++++----- .../docs/start/data-and-model-versioning.md | 75 ++++++++--------- content/docs/start/data-pipelines.md | 82 ++++++++++--------- content/docs/start/experiments.md | 2 +- content/docs/start/index.md | 2 +- .../docs/start/metrics-parameters-plots.md | 6 +- .../user-guide/basic-concepts/dependency.md | 2 +- .../user-guide/basic-concepts/dvc-cache.md | 4 +- .../user-guide/basic-concepts/file-link.md | 10 +++ .../user-guide/basic-concepts/pipeline.md | 7 ++ content/docs/user-guide/contributing/docs.md | 2 +- 11 files changed, 123 insertions(+), 111 deletions(-) create mode 100644 content/docs/user-guide/basic-concepts/file-link.md create mode 100644 content/docs/user-guide/basic-concepts/pipeline.md diff --git a/content/docs/start/data-and-model-access.md b/content/docs/start/data-and-model-access.md index eb3782db30..d95310ac80 100644 --- a/content/docs/start/data-and-model-access.md +++ b/content/docs/start/data-and-model-access.md @@ -1,30 +1,30 @@ --- -title: 'Get Started: Data and Model Access' +title: Data and Model Access --- -# Get Started: Data and Model Access +# Data and Model Access -Okay, we've learned how to _track_ data and models with DVC, and how to commit -their versions to Git. The next questions are: How can we _use_ these artifacts -outside of the project? How do I download a model to deploy it? How to download +We've learned how to _track_ data and models with DVC, and how to commit their +versions to Git. The next questions are: How can we _use_ these artifacts +outside of the project? How do we download a model to deploy it? How to download a specific version of a model? Or reuse datasets across different projects? 
-> These questions tend to come up when you browse the files that DVC saves to
-> remote storage, e.g.
+> These questions tend to come up when you browse the files which DVC saves to
+> remote storage (e.g.
> `s3://dvc-public/remote/get-started/fb/89904ef053f04d64eafcc3d70db673` 😱
-> instead of the original files, name such as `model.pkl` or `data.xml`.
+> instead of the original filenames such as `model.pkl` or `data.xml`).

Read on or watch our video to see how to find and access models and datasets
with DVC.

https://youtu.be/EE7Gk84OZY8

-Remember those `.dvc` files `dvc add` generates? Those files (and `dvc.lock`
-that we'll cover later), have their history in Git, DVC remote storage config
-saved in Git contain all the information needed to access and download any
-version of datasets, files, and models. It means that a Git repository with
-DVC files becomes an entry point, and can be used instead of
-accessing files directly.
+Remember those `.dvc` files `dvc add` generates? Those files (and `dvc.lock`,
+which we'll cover later) have their history in Git. DVC's remote storage config
+is also saved in Git, and contains all the information needed to access and
+download any version of datasets, files, and models. It means that a Git
+repository with DVC files becomes an entry point, and can be used
+instead of accessing files directly.

## Find a file or directory

@@ -62,7 +62,7 @@ the data came from or whether new versions are available.

## Import file or directory

`dvc import` also downloads any file or directory, while also creating a `.dvc`
-file that can be saved in the project:
+file which can be saved in the project:

```dvc
$ dvc import https://github.com/iterative/dataset-registry \
             get-started/data.xml -o data/data.xml
```

This is similar to `dvc get` + `dvc add`, but the resulting `.dvc` file
includes metadata to track changes in the source repository. 
This allows you to -bring in changes from the data source later, using `dvc update`. +bring in changes from the data source later using `dvc update`.
@@ -82,7 +82,7 @@ bring in changes from the data source later, using `dvc update`. > doesn't actually contain a `get-started/data.xml` file. Like `dvc get`, > `dvc import` downloads from [remote storage](/doc/command-reference/remote). -`.dvc` files created by `dvc import` have special fields, such as the data +`.dvc` files created by `dvc import` have special fields β€” such as the data source `repo`, and `path` (under `deps`): ```git @@ -111,8 +111,8 @@ directly from within an application at runtime. For example: import dvc.api with dvc.api.open( - 'get-started/data.xml', - repo='https://github.com/iterative/dataset-registry' - ) as fd: - # ... fd is a file descriptor that can be processed normally. + 'get-started/data.xml', + repo='https://github.com/iterative/dataset-registry' +) as fd: + # fd is a file descriptor which can be used here ``` diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md index 9f61fc1e05..8a2edd4128 100644 --- a/content/docs/start/data-and-model-versioning.md +++ b/content/docs/start/data-and-model-versioning.md @@ -5,7 +5,7 @@ regular Git workflow for datasets and ML models, without storing large files in Git.' --- -# Get Started: Data Versioning +# Data Versioning How cool would it be to make Git handle arbitrary large files and directories with the same performance as with small code files? Imagine doing a `git clone` @@ -13,7 +13,7 @@ and seeing data files and machine learning models in the workspace. Or switching to a different version of a 100Gb file in less than a second with a `git checkout`. -The foundation of DVC consists of a few commands that you can run along with +The foundation of DVC consists of a few commands which you can run along with `git` to track large files, directories, or ML model files. Think "Git for data". Read on or watch our video to learn about versioning data with DVC! 
@@ -25,16 +25,16 @@ To start tracking a file or directory, use `dvc add`: ### βš™οΈ Expand to get an example dataset. -Having initialized a project in the previous section, get the data file we will -be using later like this: +Having initialized a project in the previous section, get the data file which we +will be using later like this: ```dvc $ dvc get https://github.com/iterative/dataset-registry \ get-started/data.xml -o data/data.xml ``` -We use the fancy `dvc get` command to jump ahead a bit and show how Git repo -becomes a source for datasets or models - what we call "data/model registry". +We use the fancy `dvc get` command to jump ahead a bit and show how a Git repo +becomes a source for datasets or models β€” what we call a "data/model registry". `dvc get` can download any file or directory tracked in a DVC repository. It's like `wget`, but for DVC or Git repos. In this case we download the latest version of the `data.xml` file from the @@ -48,10 +48,10 @@ $ dvc add data/data.xml ``` DVC stores information about the added file (or a directory) in a special `.dvc` -file named `data/data.xml.dvc`, a small text file with a human-readable -[format](/doc/user-guide/project-structure/dvc-files). This file can be easily -versioned like source code with Git, as a placeholder for the original data -(which gets listed in `.gitignore`): +file named `data/data.xml.dvc` - a small text file with a human-readable +[format](/doc/user-guide/project-structure/dvc-files). This metadata file can be +easily versioned like source code with Git. The original data, meanwhile, is +listed in `.gitignore`: ```dvc $ git add data/data.xml.dvc data/.gitignore @@ -62,8 +62,8 @@ $ git commit -m "Add raw data" ### πŸ’‘ Expand to see what happens under the hood. -`dvc add` moved the data to the project's cache, and linked\* it -back to the workspace. +`dvc add` moved the data to the project's cache, and +linked it back to the workspace. 
```dvc $ tree .dvc/cache @@ -82,10 +82,6 @@ outs: path: data.xml ``` -> \* See -> [Large Dataset Optimization](/doc/user-guide/large-dataset-optimization) and -> `dvc config cache` for more info. on file linking. -
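To make the content-addressed layout above concrete, here is a rough sketch (illustrative only, not DVC's actual code) of how a cache path can be derived from a file's MD5 hash, with the first two hex characters forming a directory and the rest the file name:

```python
import hashlib

def cache_path(content: bytes) -> str:
    # Illustrative DVC-style content addressing: the first two hex
    # characters of the MD5 digest become a directory, the rest the
    # file name (e.g. a3/04afb96060aad90176268345e10355).
    md5 = hashlib.md5(content).hexdigest()
    return f".dvc/cache/{md5[:2]}/{md5[2:]}"

print(cache_path(b"example data"))
```

Because the path is derived from the content itself, identical files are stored only once, and any version can be located from the `md5` recorded in its `.dvc` file.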

## Storing and sharing

You can upload DVC-tracked data or model files with `dvc push`, so they're
safely stored [remotely](/doc/command-reference/remote). This also means they
can be retrieved on other environments later with `dvc pull`. First, we need to
-setup a storage:
+set up a storage provider:

```dvc
$ dvc remote add -d storage s3://mybucket/dvcstore
$ git add .dvc/config
$ git commit -m "Configure remote storage"
```

-> DVC supports the following remote storage types: Google Drive, Amazon S3,
-> Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, and HTTP.
-> Please refer to `dvc remote add` for more details and examples.
+> DVC supports many remote storage types, including: Google Drive, Amazon S3,
+> Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, and HTTP. See
+> `dvc remote add` for more details and examples.
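For reference, after the `dvc remote add -d storage s3://mybucket/dvcstore` command above, the committed `.dvc/config` should look roughly like this (a sketch; exact formatting may vary between DVC versions):

```ini
[core]
    remote = storage
['remote "storage"']
    url = s3://mybucket/dvcstore
```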
-### βš™οΈ Set up a remote storage +### βš™οΈ Expand to set up a remote storage provider. DVC remotes let you store a copy of the data tracked by DVC outside of the local -cache, usually a cloud storage service. For simplicity, let's set up a _local +cache (usually a cloud storage service). For simplicity, let's set up a _local remote_: ```dvc @@ -121,7 +117,7 @@ $ git commit .dvc/config -m "Configure local remote" > While the term "local remote" may seem contradictory, it doesn't have to be. > The "local" part refers to the type of location: another directory in the file -> system. "Remote" is how we call storage for DVC projects. It's +> system. "Remote" is what we call storage for DVC projects. It's > essentially a local data backup.
@@ -160,7 +156,7 @@ run it after `git clone` and `git pull`.
-### βš™οΈ Expand to explode the project πŸ’£ +### βš™οΈ Expand to refresh the project ⟳ If you've run `dvc push`, you can delete the cache (`.dvc/cache`) and `data/data.xml` to experiment with `dvc pull`: @@ -177,7 +173,7 @@ $ dvc pull ``` > πŸ“– See also -> [Sharing Data and Model Files](/doc/use-cases/sharing-data-and-model-files) +> [sharing data and model files](/doc/use-cases/sharing-data-and-model-files) > for more on basic collaboration workflows. ## Making changes @@ -189,8 +185,8 @@ latest version: ### βš™οΈ Expand to make some changes. -For the sake of simplicity let's just double the dataset artificially (and -pretend that we got more data from some external source): +Let's say we obtained more data from some external source. We can pretend this +is the case by doubling the dataset: ```dvc $ cp data/data.xml /tmp/data.xml @@ -212,8 +208,8 @@ $ dvc push ## Switching between versions -The regular workflow is to use `git checkout` first to switch a branch, checkout -a commit, or a revision of a `.dvc` file, and then run `dvc checkout` to sync +The regular workflow is to use `git checkout` first (to switch a branch or +checkout a commit/revision of a `.dvc` file) and then run `dvc checkout` to sync data: ```dvc @@ -225,15 +221,15 @@ $ dvc checkout ### βš™οΈ Expand to get the previous version of the dataset. -Let's cleanup the previous artificial changes we made and get the previous : +Let's go back to the original version of the data: ```dvc -$ git checkout HEAD^1 data/data.xml.dvc +$ git checkout HEAD~1 data/data.xml.dvc $ dvc checkout ``` -Let's commit it (no need to do `dvc push` this time since the previous version -of this dataset was saved before): +Let's commit it (no need to do `dvc push` this time since this original version +of the dataset was already saved): ```dvc $ git commit data/data.xml.dvc -m "Revert dataset updates" @@ -241,7 +237,7 @@ $ git commit data/data.xml.dvc -m "Revert dataset updates"
-Yes, DVC is technically not even a version control system! `.dvc` files content +Yes, DVC is technically not even a version control system! `.dvc` files' content defines data file versions. Git itself provides the version control. DVC in turn creates these `.dvc` files, updates them, and synchronizes DVC-tracked data in the workspace efficiently to match them. @@ -250,16 +246,13 @@ the workspace efficiently to match them. In cases where you process very large datasets, you need an efficient mechanism (in terms of space and performance) to share a lot of data, including different -versions of itself. Do you use a network attached storage? Or a large external -volume? - -While these cases are not covered in the Get Started, we recommend reading the -following sections next to learn more about advanced workflows: +versions. Do you use network attached storage (NAS)? Or a large external volume? +You can learn more about advanced workflows using these links: - A shared [external cache](/doc/use-cases/shared-development-server) can be set up to store, version and access a lot of data on a large shared volume efficiently. - A quite advanced scenario is to track and version data directly on the remote - storage (e.g. S3). Check out - [Managing External Data](https://dvc.org/doc/user-guide/managing-external-data) + storage (e.g. S3). See + [managing external data](https://dvc.org/doc/user-guide/managing-external-data) to learn more. diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md index 6923dc58b7..5d59149a18 100644 --- a/content/docs/start/data-pipelines.md +++ b/content/docs/start/data-pipelines.md @@ -4,12 +4,12 @@ description: 'Learn how to build and use DVC pipelines to capture, organize, version, and reproduce your data science and machine learning workflows.' --- -# Get Started: Data Pipelines +# Data Pipelines Versioning large data files and directories for data science is great, but not enough. 
How is data filtered, transformed, or used to train ML models? DVC -introduces a mechanism to capture _data pipelines_ β€” series of data processes -that produce a final result. +introduces a mechanism to capture data pipelines β€” a series of data +processes which produce a final result. DVC pipelines and their data can also be easily versioned (using Git). This allows you to better organize projects, and reproduce your workflow and results @@ -23,9 +23,10 @@ https://youtu.be/71IGzyH95UY ## Pipeline stages -Use `dvc run` to create _stages_. These represent processes (source code tracked -with Git) that form the steps of a pipeline. Stages also connect code to its -data input and output. Let's transform a Python script into a +Use `dvc run` to create stages. These represent processes (source +code tracked with Git) which form the steps of a pipeline. Stages +also connect code to its corresponding data input and +output. Let's transform a Python script into a [stage](/doc/command-reference/run):
@@ -84,9 +85,9 @@ The command options used above mean the following: - `-n prepare` specifies a name for the stage. If you open the `dvc.yaml` file you will see a section named `prepare`. -- `-p prepare.seed,prepare.split` defines special types of dependencies - +- `-p prepare.seed,prepare.split` defines special types of dependencies β€” [parameters](/doc/command-reference/params). We'll get to them later in the - [Metrics, Parameters, and Plots](/doc/start/metrics-parameters-plots) page, + [metrics, parameters, and plots](/doc/start/metrics-parameters-plots) page, but the idea is that the stage can depend on field values from a parameters file (`params.yaml` by default): @@ -104,7 +105,7 @@ prepare: - `-o data/prepared` specifies an output directory for this script, which writes two files in it. This is how the workspace should look like now: - ```git + ```diff . β”œβ”€β”€ data β”‚ β”œβ”€β”€ data.xml @@ -119,8 +120,8 @@ prepare: β”œβ”€β”€ ... ``` -- The last line, `python src/prepare.py ...`, is the command to run in this - stage, and it's saved to `dvc.yaml`, as shown below. +- The last line, `python src/prepare.py data/data.xml` is the command to run in + this stage, and it's saved to `dvc.yaml`, as shown below. The resulting `prepare` stage contains all of the information above: @@ -150,8 +151,9 @@ in this case); `dvc run` already took care of this. You only need to run By using `dvc run` multiple times, and specifying outputs of a stage as dependencies of another one, we can describe a sequence of -commands that gets to a desired result. This is what we call a _data pipeline_ -or [_dependency graph_](https://en.wikipedia.org/wiki/Directed_acyclic_graph). +commands which gets to a desired result. This is what we call a data +pipeline or +[dependency graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph). 
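The execution order implied by such a graph can be sketched with Python's standard `graphlib` (an illustration of the DAG idea, not DVC's actual scheduler):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each stage maps to the set of stages whose outputs it consumes,
# mirroring the deps/outs wiring recorded in dvc.yaml.
pipeline = {
    "prepare": set(),
    "featurize": {"prepare"},
    "train": {"featurize"},
}

order = list(TopologicalSorter(pipeline).static_order())
print(order)  # ['prepare', 'featurize', 'train']
```

A topological sort guarantees every stage runs only after the stages it depends on, which is exactly the property a data pipeline needs.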
Let's create a second stage chained to the outputs of `prepare`, to perform feature extraction: @@ -202,8 +204,8 @@ The changes to the `dvc.yaml` should look like this: ### βš™οΈ Expand to add more stages. -Let's add the training itself. Nothing new this time, the same `dvc run` command -with the same set of options: +Let's add the training itself. Nothing new this time; just the same `dvc run` +command with the same set of options: ```dvc $ dvc run -n train \ @@ -217,13 +219,13 @@ Please check the `dvc.yaml` again, it should have one more stage now.
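The new `train` entry should look roughly like this sketch (the `params` list here is assumed from the `train` section of `params.yaml`; your file may differ):

```yaml
train:
  cmd: python src/train.py data/features model.pkl
  deps:
    - data/features
    - src/train.py
  params:
    - train.min_split
    - train.n_est
    - train.seed
  outs:
    - model.pkl
```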
-This should be a good point to commit the changes with Git. These include +This should be a good time to commit the changes with Git. These include `.gitignore`, `dvc.lock`, and `dvc.yaml` β€” which describe our pipeline. ## Reproduce -The whole point of creating this `dvc.yaml` pipelines file is an ability to -reproduce the pipeline: +The whole point of creating this `dvc.yaml` file is the ability to easily +reproduce a pipeline: ```dvc $ dvc repro @@ -231,16 +233,15 @@ $ dvc repro
-### βš™οΈ Expand to have some fun with it +### βš™οΈ Expand to have some fun with it. Let's try to play a little bit with it. First, let's try to change one of the parameters for the training stage: -```dvc -$ vim params.yaml -``` +- Open `params.yaml` and change `n_est` to `100`, and +- (re)run `dvc repro`. -Change `n_est` to `100` and run `dvc repro`, you should see: +You should see: ```dvc $ dvc repro @@ -260,9 +261,10 @@ Stage 'prepare' didn't change, skipping Stage 'featurize' didn't change, skipping ``` -Same as before, no need to run `prepare`, `featurize`, etc ... but, it doesn't -run even `train` again this time either! It cached the previous run with the -same set of inputs (parameters + data) and reused it. +This is the same result as before (no need to rerun `prepare`, `featurize`, +etc.) but it also doesn't run `train` again this time either! DVC maintains a +run-cache where it saved the previous run with the same set of +inputs (parameters + data) for later reuse.
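The run-cache idea can be sketched in a few lines of Python (a toy model, not DVC's implementation): key each run by a hash of all of its inputs, and reuse the stored result whenever the same key comes up again:

```python
import hashlib
import json

run_cache = {}   # input-hash -> cached result
train_calls = 0  # counts how often the "expensive" stage really runs

def train(params, data):
    global train_calls
    train_calls += 1
    return f"model(n_est={params['n_est']}, data={data})"

def run_stage(params, data):
    # Hash every input; unchanged inputs always produce the same key.
    key = hashlib.md5(
        (json.dumps(params, sort_keys=True) + data).encode()
    ).hexdigest()
    if key not in run_cache:          # only run on a cache miss
        run_cache[key] = train(params, data)
    return run_cache[key]

run_stage({"n_est": 50}, "data-v1")
run_stage({"n_est": 100}, "data-v1")  # new params: runs train() again
run_stage({"n_est": 50}, "data-v1")   # same inputs as the first run: reused
print(train_calls)  # 2
```

Hashing the serialized parameters together with the data is what lets the cache distinguish "same code, new inputs" from "nothing changed at all".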
@@ -270,12 +272,12 @@ same set of inputs (parameters + data) and reused it. ### πŸ’‘ Expand to see what happens under the hood. -`dvc repro` relies on the DAG definition that it reads from `dvc.yaml`, and uses -`dvc.lock` to determine what exactly needs to be run. +`dvc repro` relies on the DAG definition which it reads from +`dvc.yaml`, and uses `dvc.lock` to determine what exactly needs to be run. -`dvc.lock` file is similar to `.dvc` files and captures hashes (in most cases -`md5`s) of the dependencies, values of the parameters that were used, it can be -considered a _state_ of the pipeline: +Meanwhile, the `dvc.lock` file is similar to a `.dvc` file β€” it captures hashes +(in most cases `md5`s) of the dependencies and values of the parameters which +were used. It can be considered a _state_ of the pipeline: ```yaml schema: '2.0' @@ -304,23 +306,23 @@ stages: DVC pipelines (`dvc.yaml` file, `dvc run`, and `dvc repro` commands) solve a few important problems: -- _Automation_ - run a sequence of steps in a "smart" way that makes iterating +- _Automation_: run a sequence of steps in a "smart" way which makes iterating on your project faster. DVC automatically determines which parts of a project - need to be run, and it caches "runs" and their results, to avoid unnecessary - re-runs. -- _Reproducibility_ - `dvc.yaml` and `dvc.lock` files describe what data to use + need to be run, and it caches "runs" and their results (in the aptly-named + run-cache) to avoid unnecessary reruns. +- _Reproducibility_: `dvc.yaml` and `dvc.lock` files describe what data to use and which commands will generate the pipeline results (such as an ML model). Storing these files in Git makes it easy to version and share. -- _Continuous Delivery and Continuous Integration (CI/CD) for ML_ - describing - projects in way that it can be reproduced (built) is the first necessary step - before introducing CI/CD systems. 
See our sister project, +- _Continuous Delivery and Continuous Integration (CI/CD) for ML_: describing + projects in way which it can be reproduced (built) is the first necessary step + before introducing CI/CD systems. See our sister project [CML](https://cml.dev/) for some examples. ## Visualize Having built our pipeline, we need a good way to understand its structure. -Seeing a graph of connected stages would help. DVC lets you do just that, -without leaving the terminal! +Seeing a graph of connected stages would help. DVC lets you do just that without +leaving the terminal! ```dvc $ dvc dag diff --git a/content/docs/start/experiments.md b/content/docs/start/experiments.md index 84831ccc64..5648c8ec2d 100644 --- a/content/docs/start/experiments.md +++ b/content/docs/start/experiments.md @@ -2,7 +2,7 @@ title: 'Get Started: Experiments' --- -# Get Started: Experiments +# Experiments ⚠️ This feature is only available in DVC 2.0 ⚠️ diff --git a/content/docs/start/index.md b/content/docs/start/index.md index 34c77a99a8..ecb46dceb8 100644 --- a/content/docs/start/index.md +++ b/content/docs/start/index.md @@ -15,7 +15,7 @@ running `dvc init` inside a Git project: In expandable sections that start with the βš™οΈ emoji, we'll be providing more information for those trying to run the commands. It's up to you to pick the -best way to read the material - read the text (skip sections like this, and it +best way to read the material β€” read the text (skip sections like this, and it should be enough to understand the idea of DVC), or try to run them and get the first hand experience. 
diff --git a/content/docs/start/metrics-parameters-plots.md b/content/docs/start/metrics-parameters-plots.md index 8654ae7f75..b01cd49b4d 100644 --- a/content/docs/start/metrics-parameters-plots.md +++ b/content/docs/start/metrics-parameters-plots.md @@ -2,7 +2,7 @@ title: 'Get Started: Metrics, Parameters, and Plots' --- -# Get Started: Metrics, Parameters, and Plots +# Metrics, Parameters, and Plots DVC makes it easy to track [metrics](/doc/command-reference/metrics), update [parameters](/doc/command-reference/params), and visualize performance with @@ -95,8 +95,8 @@ Similarly, it writes arrays for the into `roc.json` for an additional plot. > DVC doesn't force you to use any specific file names, or even format or -> structure of a metrics or plots file - it's pretty much user and case defined. -> Please refer to `dvc metrics` and `dvc plots` for more details. +> structure of a metrics or plots file β€” it's pretty much user- and +> case-defined. Please refer to `dvc metrics` and `dvc plots` for more details. You can view tracked metrics and plots with DVC. Let's start with the metrics: diff --git a/content/docs/user-guide/basic-concepts/dependency.md b/content/docs/user-guide/basic-concepts/dependency.md index 256e77f183..72ed629911 100644 --- a/content/docs/user-guide/basic-concepts/dependency.md +++ b/content/docs/user-guide/basic-concepts/dependency.md @@ -1,6 +1,6 @@ --- name: Dependency -match: [dependency, dependencies, depends] +match: [dependency, dependencies, depends, input, inputs] tooltip: >- A file or directory (possibly tracked by DVC) recorded in the `deps` section of a stage (in `dvc.yaml`) or `.dvc` file file. See `dvc run`. 
Stages are diff --git a/content/docs/user-guide/basic-concepts/dvc-cache.md b/content/docs/user-guide/basic-concepts/dvc-cache.md index 383b1411d2..c809b4c9fc 100644 --- a/content/docs/user-guide/basic-concepts/dvc-cache.md +++ b/content/docs/user-guide/basic-concepts/dvc-cache.md @@ -3,7 +3,7 @@ name: 'DVC Cache' match: ['DVC cache', cache, caches, cached, 'cache directory'] tooltip: >- The DVC cache is a hidden storage (by default in `.dvc/cache`) for files and - directories tracked by DVC, and their different versions. Learn more about - it's structure + directories tracked by DVC, and their different versions. Learn more about its + structure [here](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory). --- diff --git a/content/docs/user-guide/basic-concepts/file-link.md b/content/docs/user-guide/basic-concepts/file-link.md new file mode 100644 index 0000000000..7f0ecc3d6d --- /dev/null +++ b/content/docs/user-guide/basic-concepts/file-link.md @@ -0,0 +1,10 @@ +--- +name: File Linking +match: + [link, symlink, hardlink, reflink, linked, symlinked, hardlinked, reflinked] +tooltip: >- + A way to have a file appear in multiple different folders without occupying + more physical space on the storage disk. This is both fast and economical. See + [large dataset optimization](/doc/user-guide/large-dataset-optimization) and + `dvc config cache` for more on file linking. +--- diff --git a/content/docs/user-guide/basic-concepts/pipeline.md b/content/docs/user-guide/basic-concepts/pipeline.md new file mode 100644 index 0000000000..7ffd3b446e --- /dev/null +++ b/content/docs/user-guide/basic-concepts/pipeline.md @@ -0,0 +1,7 @@ +--- +name: Pipeline (DAG) +match: [DAG, pipeline, 'data pipeline'] +tooltip: >- + A set of inter-dependent stages. This is also called a + [dependency graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph). 
+--- diff --git a/content/docs/user-guide/contributing/docs.md b/content/docs/user-guide/contributing/docs.md index fee50e0fbf..9fcf9afe3c 100644 --- a/content/docs/user-guide/contributing/docs.md +++ b/content/docs/user-guide/contributing/docs.md @@ -16,7 +16,7 @@ To contribute documentation, these are the relevant locations: - [Content](https://github.com/iterative/dvc.org/tree/master/content/docs) (`content/docs/`): [Markdown](https://guides.github.com/features/mastering-markdown/) files. One - file - one page of the documentation. + file β€” one page of the documentation. - [Images](https://github.com/iterative/dvc.org/tree/master/static/img) (`static/img/`): Add new images (`.png`, `.svg`, etc.) here. Use them in Markdown files like this: `![](/img/.gif)`. From c67bcee70d8c5524bd3d9c188c5b6caae1bf9227 Mon Sep 17 00:00:00 2001 From: Casper da Costa-Luis Date: Wed, 7 Apr 2021 00:48:18 +0100 Subject: [PATCH 02/29] docs: more get started tweaks & fixes --- content/docs/start/data-and-model-access.md | 2 +- .../docs/start/data-and-model-versioning.md | 4 +- content/docs/start/data-pipelines.md | 4 +- content/docs/start/experiments.md | 57 +++++++------- .../docs/start/metrics-parameters-plots.md | 74 ++++++++++--------- .../user-guide/basic-concepts/experiment.md | 4 +- .../user-guide/basic-concepts/pipeline.md | 4 +- .../user-guide/basic-concepts/run-cache.md | 2 +- 8 files changed, 77 insertions(+), 74 deletions(-) diff --git a/content/docs/start/data-and-model-access.md b/content/docs/start/data-and-model-access.md index d95310ac80..91345115d5 100644 --- a/content/docs/start/data-and-model-access.md +++ b/content/docs/start/data-and-model-access.md @@ -1,5 +1,5 @@ --- -title: Data and Model Access +title: 'Get Started: Data and Model Access' --- # Data and Model Access diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md index 8a2edd4128..b9206fd0f9 100644 --- 
a/content/docs/start/data-and-model-versioning.md +++ b/content/docs/start/data-and-model-versioning.md @@ -103,7 +103,7 @@ $ git commit -m "Configure remote storage"
-### βš™οΈ Expand to set up a remote storage provider. +### βš™οΈ Expand to set up a remote storage provider ☁ DVC remotes let you store a copy of the data tracked by DVC outside of the local cache (usually a cloud storage service). For simplicity, let's set up a _local @@ -219,7 +219,7 @@ $ dvc checkout
-### βš™οΈ Expand to get the previous version of the dataset. +### βš™οΈ Expand to get the previous version of the dataset πŸ•‘ Let's go back to the original version of the data: diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md index 5d59149a18..4d571498d5 100644 --- a/content/docs/start/data-pipelines.md +++ b/content/docs/start/data-pipelines.md @@ -202,7 +202,7 @@ The changes to the `dvc.yaml` should look like this:
-### βš™οΈ Expand to add more stages. +### βš™οΈ Expand to add more stages ☰ Let's add the training itself. Nothing new this time; just the same `dvc run` command with the same set of options: @@ -233,7 +233,7 @@ $ dvc repro
-### βš™οΈ Expand to have some fun with it. +### βš™οΈ Expand to have some fun with it πŸ’ƒ Let's try to play a little bit with it. First, let's try to change one of the parameters for the training stage: diff --git a/content/docs/start/experiments.md b/content/docs/start/experiments.md index 5648c8ec2d..8f8fbd618a 100644 --- a/content/docs/start/experiments.md +++ b/content/docs/start/experiments.md @@ -8,18 +8,19 @@ title: 'Get Started: Experiments' Experiments proliferate quickly in ML projects where there are many parameters to tune or other permutations of the code. We can organize such -projects and only keep what we ultimately need with `dvc experiments`. DVC can +projects and keep only what we ultimately need with `dvc experiments`. DVC can track experiments for you so there's no need to commit each one to Git. This way your repo doesn't become polluted with all of them. You can discard experiments once they're no longer needed. -> πŸ“– See [Experiment Management](/doc/user-guide/experiment-management) for more +> πŸ“– See [experiment management](/doc/user-guide/experiment-management) for more > information on DVC's approach. ## Running experiments -In the previous page, we learned how to tune -[ML pipelines](/doc/start/data-pipelines) and compare the changes. Let's further +Previously, we learned how to tune ML data +[pipelines](/doc/start/data-pipelines) and +[compare the changes](/doc/start/metrics-parameters-plots). Let's further increase the number of features in the `featurize` stage to see how it compares. `dvc exp run` makes it easy to change hyperparameters and run a new @@ -31,16 +32,15 @@ $ dvc exp run --set-param featurize.max_features=3000
-### πŸ’‘ Expand to see what this command does. +### πŸ’‘ Expand to see what happens under the hood. `dvc exp run` is similar to `dvc repro` but with some added conveniences for running experiments. The `--set-param` (or `-S`) flag sets the values for -[parameters](/doc/command-reference/params) as a shortcut to editing -`params.yaml`. +parameters as a shortcut for editing `params.yaml`. Check that the `featurize.max_features` value has been updated in `params.yaml`: -```git +```diff featurize: - max_features: 1500 + max_features: 3000 @@ -66,10 +66,10 @@ params.yaml featurize.max_features 3000 1500 ## Queueing experiments So far, we have been tuning the `featurize` stage, but there are also parameters -for the `train` stage, which trains a -[random forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). +for the `train` stage (which trains a +[random forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)). -These are the `train` parameters in `params.yaml`: +These are the `train` parameters from `params.yaml`: ```yaml train: @@ -78,9 +78,9 @@ train: min_split: 2 ``` -Let's setup experiments with different hyperparameters. We can define all the -combinations we want to try without executing anything, by using the `--queue` -flag: +Let's setup experiments with different hyperparameters. We can use the `--queue` +flag to define all the combinations we want to try without executing anything +(yet): ```dvc $ dvc exp run --queue -S train.min_split=8 @@ -95,8 +95,7 @@ $ dvc exp run --queue -S train.min_split=64 -S train.n_est=100 Queued experiment '0cdee86' for future execution. 
``` -Next, run all queued experiments using `--run-all` (and in parallel with -`--jobs`): +Next, run all (`--run-all`) queued experiments in parallel (`--jobs`): ```dvc $ dvc exp run --run-all --jobs 2 @@ -108,7 +107,7 @@ To compare all of these experiments, we need more than `diff`. `dvc exp show` compares any number of experiments in one table: ```dvc -$ dvc exp show --no-timestamp +$ dvc exp show --no-timestamp \ --include-params train.n_est,train.min_split ┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ Experiment ┃ avg_prec ┃ roc_auc ┃ train.n_est┃ train.min_split ┃ @@ -146,11 +145,11 @@ Changes for experiment 'exp-98a96' have been applied to your workspace.
-### πŸ’‘ Expand to see what this command does. +### πŸ’‘ Expand to see what happens under the hood. -`dvc exp apply` is similar to `dvc checkout` but it works with experiments. DVC -tracks everything in the pipeline for each experiment (parameters, metrics, -dependencies, and outputs) and can later retrieve it as needed. +`dvc exp apply` is similar to `dvc checkout`, but works with experiments +instead. DVC tracks everything in the pipeline for each experiment (parameters, +metrics, dependencies, and outputs) and can later retrieve them as needed. Check that `scores.json` reflects the metrics in the table above: @@ -194,7 +193,7 @@ Storage, HTTP, HDFS, etc.). The Git remote is often a central Git server
-Experiments that have not been made persistent will not be stored or shared +Experiments which have not been made persistent will not be stored or shared remotely through `dvc push` or `git push`. `dvc exp push` enables storing and sharing any experiment remotely. @@ -204,7 +203,7 @@ $ dvc exp push gitremote exp-bfe64 Pushed experiment 'exp-bfe64' to Git remote 'gitremote'. ``` -`dvc exp list` shows all experiments that have been saved. +`dvc exp list` shows all experiments which have been saved. ```dvc $ dvc exp list gitremote --all @@ -219,7 +218,7 @@ $ dvc exp pull gitremote exp-bfe64 Pulled experiment 'exp-bfe64' from Git remote 'gitremote'. ``` -> All these commands take a Git remote as an argument. A default DVC remote is +> All these commands take a Git remote as an argument. A `dvc remote default` is > also required to share the experiment data. ## Cleaning up @@ -227,7 +226,7 @@ Pulled experiment 'exp-bfe64' from Git remote 'gitremote'. Let's take another look at the experiments table: ```dvc -$ dvc exp show --no-timestamp +$ dvc exp show --no-timestamp \ --include-params train.n_est,train.min_split ┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ Experiment ┃ avg_prec ┃ roc_auc ┃ train.n_est┃ train.min_split ┃ @@ -243,7 +242,7 @@ experiments since the last commit, but don't worry. The experiments remain experiments from the previous _n_ commits: ```dvc -$ dvc exp show -n 2 --no-timestamp +$ dvc exp show -n 2 --no-timestamp \ --include-params train.n_est,train.min_split ┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ Experiment ┃ avg_prec ┃ roc_auc ┃ train.n_est┃ train.min_split ┃ @@ -266,7 +265,7 @@ Eventually, old experiments may clutter the experiments table. 
```dvc
$ dvc exp gc --workspace
-$ dvc exp show -n 2 --no-timestamp
+$ dvc exp show -n 2 --no-timestamp \
    --include-params train.n_est,train.min_split
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Experiment ┃ avg_prec ┃ roc_auc ┃ train.n_est┃ train.min_split ┃
@@ -277,5 +276,5 @@ $ dvc exp show -n 2 --no-timestamp
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

-> `dvc exp gc` only removes references to the experiments, not the cached
-> objects associated to them. To clean up the cache, use `dvc gc`.
+> `dvc exp gc` only removes references to the experiments, not the cached
+> objects associated with them. To clean up the object cache, use `dvc gc`.
diff --git a/content/docs/start/metrics-parameters-plots.md b/content/docs/start/metrics-parameters-plots.md
index b01cd49b4d..272c8c7606 100644
--- a/content/docs/start/metrics-parameters-plots.md
+++ b/content/docs/start/metrics-parameters-plots.md
@@ -5,10 +5,11 @@ title: 'Get Started: Metrics, Parameters, and Plots'
# Metrics, Parameters, and Plots

DVC makes it easy to track [metrics](/doc/command-reference/metrics), update
-[parameters](/doc/command-reference/params), and visualize performance with
-[plots](/doc/command-reference/plots). These concepts are introduced below, and
-[Experiments](/doc/start/experiments) shows how to combine them to run and
-compare many iterations of your ML project.
+parameters, and visualize performance with
+[plots](/doc/command-reference/plots). These concepts are introduced below.
+
+> All of the above can be combined into experiments to run and
+> compare many iterations of your ML project.

Read on to see how it's done!

@@ -16,7 +17,7 @@
First, let's see what is the mechanism to capture values for these ML
attributes.
Let's add a final evaluation stage to our -[pipeline](/doc/start/data-pipelines): +[pipeline from before](/doc/start/data-pipelines): ```dvc $ dvc run -n evaluate \ @@ -33,7 +34,7 @@ $ dvc run -n evaluate \ ### πŸ’‘ Expand to see what happens under the hood. The `-M` option here specifies a metrics file, while `--plots-no-cache` -specifies a plots file produced by this stage that will not be +specifies a plots file produced by this stage which will not be cached by DVC. `dvc run` generates a new stage in the `dvc.yaml` file: @@ -57,7 +58,7 @@ evaluate: The biggest difference to previous stages in our pipeline is in two new sections: `metrics` and `plots`. These are used to mark certain files containing ML "telemetry". Metrics files contain scalar values (e.g. `AUC`) and plots files -contain matrices and data series (e.g. `ROC curves` or model loss plots) that +contain matrices and data series (e.g. `ROC curves` or model loss plots) which are meant to be visualized and compared. > With `cache: false`, DVC skips caching the output, as we want `scores.json`, @@ -70,15 +71,17 @@ writes the model's [ROC-AUC](https://scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characteristic-roc) and [average precision](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-and-f-measures) -to `scores.json`, which is marked as a metrics file with `-M`: +to `scores.json`, which in turn is marked as a `metrics` file with `-M`. 
Its
+contents are:

```json
{ "avg_prec": 0.5204838673030754, "roc_auc": 0.9032012604172255 }
```

-It also writes `precision`, `recall`, and `thresholds` arrays (obtained using
+`evaluate.py` also writes `precision`, `recall`, and `thresholds` arrays
+(obtained using
[`precision_recall_curve`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html))
-into plots file `prc.json`:
+into a `plots` file `prc.json`:

```json
{
@@ -94,9 +97,9 @@ Similarly, it writes arrays for the
[roc_curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html)
into `roc.json` for an additional plot.

-> DVC doesn't force you to use any specific file names, or even format or
-> structure of a metrics or plots file β€” it's pretty much user- and
-> case-defined. Please refer to `dvc metrics` and `dvc plots` for more details.
+> DVC doesn't force you to use any specific file names, nor does it enforce a
+> format or structure of a metrics or plots file. It's completely user- and
+> case-defined. Refer to `dvc metrics` and `dvc plots` for more details.

You can view tracked metrics and plots with DVC. Let's start with the metrics:

@@ -133,16 +136,18 @@ $ git add scores.json prc.json roc.json
$ git commit -a -m "Create evaluation stage"
```

-Later we will see how these and other can be used to compare and visualize
-different pipeline iterations. For now, let's see how can we capture another
-important piece of information that will be useful for comparison: parameters.
+Later we will see how to
+[compare and visualize different pipeline iterations](#comparing-iterations).
+For now, let's see how we can capture another important piece of information
+which will be useful for comparison: parameters.

## Defining stage parameters

It's pretty common for data science pipelines to include configuration files
-that define adjustable parameters to train a model, do pre-processing, etc. 
DVC -provides a mechanism for stages to depend on the values of specific sections of -such a config file (YAML, JSON, TOML, and Python formats are supported). +which define adjustable parameters to train a model, do +pre-processing, etc. DVC provides a mechanism for stages to depend on the values +of specific sections of such a config file (YAML, JSON, TOML, and Python formats +are supported). Luckily, we should already have a stage with [parameters](/doc/command-reference/params) in `dvc.yaml`: @@ -162,7 +167,7 @@ featurize:
-### πŸ’‘ Expand to recall how it was generated. +### πŸ’‘ Expand to recall how it was generated πŸ€” The `featurize` stage [was created](/doc/start/data-pipelines#dependency-graphs-dags) with this @@ -179,13 +184,12 @@ $ dvc run -n featurize \
-The `params` section defines the [parameter](/doc/command-reference/params)
-dependencies of the `featurize` stage. By default DVC reads those values
-(`featurize.max_features` and `featurize.ngrams`) from a `params.yaml` file. But
-as with metrics and plots, parameter file names and structure can also be user
-and case defined.
+The `params` section defines the parameter dependencies of the `featurize`
+stage. By default, DVC reads those values (`featurize.max_features` and
+`featurize.ngrams`) from a `params.yaml` file. But as with metrics and plots,
+parameter file names and structure can also be user- and case-defined.

-This is how our `params.yaml` file looks like:
+Here's the contents of our `params.yaml` file:

```yaml
prepare:
@@ -215,24 +219,24 @@ We are definitely not happy with the AUC value we got so far! Let's edit the
+  ngrams: 2
```

-And the beauty of `dvc.yaml` is that all you need to do now is to run:
+The beauty of `dvc.yaml` is that all you need to do now is run:

```dvc
$ dvc repro
```

-It'll analyze the changes, use existing cache of previous runs, and execute only
-the commands that are needed to get the new results (model, metrics, plots).
+It'll analyze the changes, use existing run-caches, and execute
+only the commands which are needed to produce new results (model, metrics,
+plots).

The same logic applies to other possible adjustments β€” edit source code, update
-datasets β€” you do the changes, use `dvc repro`, and DVC runs what needs to be
-run.
+datasets β€” you make the changes, use `dvc repro`, and DVC runs what needs to be run.

## Comparing iterations

Finally, let's see how the updates improved performance. DVC has a few commands
-to see metrics and parameter changes, and to visualize plots, for one or more
-pipeline iterations. Let's compare the current "bigrams" run with the last
+to see metrics & parameter changes and to visualize plots (for one or more
+pipeline iterations). 
Let's compare the current "bigrams" run with the last committed "baseline" iteration: ```dvc @@ -271,5 +275,5 @@ file:///Users/dvc/example-get-started/plots.html > [Git revisions](https://git-scm.com/docs/gitrevisions) (commits, tags, branch > names) to compare. -In the next page, learn advanced ways to track, organize, and compare more -experiment iterations. +On the next page, you can learn advanced ways to track, organize, and compare +more experiment iterations. diff --git a/content/docs/user-guide/basic-concepts/experiment.md b/content/docs/user-guide/basic-concepts/experiment.md index e4130a9683..dc66a0c360 100644 --- a/content/docs/user-guide/basic-concepts/experiment.md +++ b/content/docs/user-guide/basic-concepts/experiment.md @@ -4,8 +4,8 @@ match: [experiment, experiments] tooltip: >- An attempt to reach desired/better/interesting results during data pipelining or ML model development. DVC is designed to help [manage - experiments](/doc/user-guide/experiment-management), having built-in - mechanisms like the + experiments](/doc/start/experiments), having [built-in + mechanisms](/doc/user-guide/experiment-management) like the [run-cache](/doc/user-guide/project-structure/internal-files#run-cache) and the `dvc experiments` commands (coming in DVC 2.0). --- diff --git a/content/docs/user-guide/basic-concepts/pipeline.md b/content/docs/user-guide/basic-concepts/pipeline.md index 7ffd3b446e..16817ffa97 100644 --- a/content/docs/user-guide/basic-concepts/pipeline.md +++ b/content/docs/user-guide/basic-concepts/pipeline.md @@ -2,6 +2,6 @@ name: Pipeline (DAG) match: [DAG, pipeline, 'data pipeline'] tooltip: >- - A set of inter-dependent stages. This is also called a - [dependency graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph). + A set of inter-dependent stages. This is also called a [dependency + graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph). 
---
diff --git a/content/docs/user-guide/basic-concepts/run-cache.md b/content/docs/user-guide/basic-concepts/run-cache.md
index 148e1a0378..e3d7f72660 100644
--- a/content/docs/user-guide/basic-concepts/run-cache.md
+++ b/content/docs/user-guide/basic-concepts/run-cache.md
@@ -1,6 +1,6 @@
---
name: 'Run-cache'
-match: ['run-cache']
+match: ['run-cache', 'run-caches']
tooltip: >-
  A log of stages that have been run in the project. It's comprised of
  `dvc.lock` file backups, identified as combinations of dependencies, commands,

From f1648fcd9f33b0d7998ca4778454f33e0370a6a2 Mon Sep 17 00:00:00 2001
From: Casper da Costa-Luis
Date: Wed, 7 Apr 2021 18:11:47 +0100
Subject: [PATCH 03/29] basic-concepts: fix & cleanup matches

---
 content/docs/user-guide/basic-concepts/dependency.md | 2 +-
 content/docs/user-guide/basic-concepts/file-link.md  | 3 +--
 content/docs/user-guide/basic-concepts/pipeline.md   | 2 +-
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/content/docs/user-guide/basic-concepts/dependency.md b/content/docs/user-guide/basic-concepts/dependency.md
index 72ed629911..453fda173f 100644
--- a/content/docs/user-guide/basic-concepts/dependency.md
+++ b/content/docs/user-guide/basic-concepts/dependency.md
@@ -1,6 +1,6 @@
---
name: Dependency
-match: [dependency, dependencies, depends, input, inputs]
+match: [dependency, dependencies, depends, input]
tooltip: >-
  A file or directory (possibly tracked by DVC) recorded in the `deps` section
  of a stage (in `dvc.yaml`) or `.dvc` file. See `dvc run`. 
Stages are diff --git a/content/docs/user-guide/basic-concepts/file-link.md b/content/docs/user-guide/basic-concepts/file-link.md index 7f0ecc3d6d..17f48ea150 100644 --- a/content/docs/user-guide/basic-concepts/file-link.md +++ b/content/docs/user-guide/basic-concepts/file-link.md @@ -1,7 +1,6 @@ --- name: File Linking -match: - [link, symlink, hardlink, reflink, linked, symlinked, hardlinked, reflinked] +match: [linked] tooltip: >- A way to have a file appear in multiple different folders without occupying more physical space on the storage disk. This is both fast and economical. See diff --git a/content/docs/user-guide/basic-concepts/pipeline.md b/content/docs/user-guide/basic-concepts/pipeline.md index 16817ffa97..f66383479f 100644 --- a/content/docs/user-guide/basic-concepts/pipeline.md +++ b/content/docs/user-guide/basic-concepts/pipeline.md @@ -1,6 +1,6 @@ --- name: Pipeline (DAG) -match: [DAG, pipeline, 'data pipeline'] +match: [DAG, pipeline, 'data pipeline', 'data pipelines'] tooltip: >- A set of inter-dependent stages. This is also called a [dependency graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph). From f699438a228e5f300cc3899f7ac4938e75276ccf Mon Sep 17 00:00:00 2001 From: Casper da Costa-Luis Date: Thu, 8 Apr 2021 11:44:27 +0100 Subject: [PATCH 04/29] Apply suggestions from code review Co-authored-by: Jorge Orpinel --- content/docs/start/data-and-model-access.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/start/data-and-model-access.md b/content/docs/start/data-and-model-access.md index 91345115d5..991d290724 100644 --- a/content/docs/start/data-and-model-access.md +++ b/content/docs/start/data-and-model-access.md @@ -20,7 +20,7 @@ with DVC. https://youtu.be/EE7Gk84OZY8 Remember those `.dvc` files `dvc add` generates? Those files (and `dvc.lock`, -which we'll cover later), have their history in Git. DVC's remote storage config +which we'll cover later) have their history in Git. 
DVC's remote storage config is also saved in Git, and contains all the information needed to access and download any version of datasets, files, and models. It means that a Git repository with DVC files becomes an entry point, and can be used @@ -62,7 +62,7 @@ the data came from or whether new versions are available. ## Import file or directory `dvc import` also downloads any file or directory, while also creating a `.dvc` -file which can be saved in the project: +file (which can be saved in the project): ```dvc $ dvc import https://github.com/iterative/dataset-registry \ From 6c9fb03247ebd5953a7b8f8acf13555321b45bd9 Mon Sep 17 00:00:00 2001 From: Casper da Costa-Luis Date: Thu, 8 Apr 2021 21:13:00 +0100 Subject: [PATCH 05/29] docs: GS: revert tile prefixes - partially reverts da7d0ea - vis. https://github.com/iterative/dvc.org/pull/2359#discussion_r609189403 --- content/docs/start/data-and-model-access.md | 2 +- content/docs/start/data-and-model-versioning.md | 2 +- content/docs/start/data-pipelines.md | 2 +- content/docs/start/experiments.md | 2 +- content/docs/start/metrics-parameters-plots.md | 2 +- 5 files changed, 5 insertions(+), 5 deletions(-) diff --git a/content/docs/start/data-and-model-access.md b/content/docs/start/data-and-model-access.md index 991d290724..5bee10b590 100644 --- a/content/docs/start/data-and-model-access.md +++ b/content/docs/start/data-and-model-access.md @@ -2,7 +2,7 @@ title: 'Get Started: Data and Model Access' --- -# Data and Model Access +# Get Started: Data and Model Access We've learned how to _track_ data and models with DVC, and how to commit their versions to Git. 
The next questions are: How can we _use_ these artifacts diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md index b9206fd0f9..6096e14fe5 100644 --- a/content/docs/start/data-and-model-versioning.md +++ b/content/docs/start/data-and-model-versioning.md @@ -5,7 +5,7 @@ regular Git workflow for datasets and ML models, without storing large files in Git.' --- -# Data Versioning +# Get Started: Data Versioning How cool would it be to make Git handle arbitrary large files and directories with the same performance as with small code files? Imagine doing a `git clone` diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md index 4d571498d5..0a4ebc2c3b 100644 --- a/content/docs/start/data-pipelines.md +++ b/content/docs/start/data-pipelines.md @@ -4,7 +4,7 @@ description: 'Learn how to build and use DVC pipelines to capture, organize, version, and reproduce your data science and machine learning workflows.' --- -# Data Pipelines +# Get Started: Data Pipelines Versioning large data files and directories for data science is great, but not enough. How is data filtered, transformed, or used to train ML models? 
DVC diff --git a/content/docs/start/experiments.md b/content/docs/start/experiments.md index 8f8fbd618a..73b5df00b1 100644 --- a/content/docs/start/experiments.md +++ b/content/docs/start/experiments.md @@ -2,7 +2,7 @@ title: 'Get Started: Experiments' --- -# Experiments +# Get Started: Experiments ⚠️ This feature is only available in DVC 2.0 ⚠️ diff --git a/content/docs/start/metrics-parameters-plots.md b/content/docs/start/metrics-parameters-plots.md index 272c8c7606..cae1538e59 100644 --- a/content/docs/start/metrics-parameters-plots.md +++ b/content/docs/start/metrics-parameters-plots.md @@ -2,7 +2,7 @@ title: 'Get Started: Metrics, Parameters, and Plots' --- -# Metrics, Parameters, and Plots +# Get Started: Metrics, Parameters, and Plots DVC makes it easy to track [metrics](/doc/command-reference/metrics), update parameters, and visualize performance with From 8350d914347352567cf3bb78b108bf486916f912 Mon Sep 17 00:00:00 2001 From: Casper da Costa-Luis Date: Fri, 9 Apr 2021 13:53:22 +0100 Subject: [PATCH 06/29] respond to misc review comments --- content/docs/start/data-and-model-access.md | 2 +- .../docs/start/data-and-model-versioning.md | 24 ++++++++++--------- 2 files changed, 14 insertions(+), 12 deletions(-) diff --git a/content/docs/start/data-and-model-access.md b/content/docs/start/data-and-model-access.md index 5bee10b590..bdc36ea536 100644 --- a/content/docs/start/data-and-model-access.md +++ b/content/docs/start/data-and-model-access.md @@ -114,5 +114,5 @@ with dvc.api.open( 'get-started/data.xml', repo='https://github.com/iterative/dataset-registry' ) as fd: - # fd is a file descriptor which can be used here + # fd is a file descriptor which can be processed normally ``` diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md index 6096e14fe5..faddf580cd 100644 --- a/content/docs/start/data-and-model-versioning.md +++ b/content/docs/start/data-and-model-versioning.md @@ -25,8 +25,8 @@ To start 
tracking a file or directory, use `dvc add`: ### βš™οΈ Expand to get an example dataset. -Having initialized a project in the previous section, get the data file which we -will be using later like this: +Having initialized a project in the previous section, we can get the data file +(which we'll be using later) like this: ```dvc $ dvc get https://github.com/iterative/dataset-registry \ @@ -48,16 +48,18 @@ $ dvc add data/data.xml ``` DVC stores information about the added file (or a directory) in a special `.dvc` -file named `data/data.xml.dvc` - a small text file with a human-readable -[format](/doc/user-guide/project-structure/dvc-files). This metadata file can be -easily versioned like source code with Git. The original data, meanwhile, is -listed in `.gitignore`: +file named `data/data.xml.dvc` β€” a small text file with a human-readable +[format](/doc/user-guide/project-structure/dvc-files). This metadata file is a +placeholder for the original data, and can be easily versioned like source code +with Git: ```dvc $ git add data/data.xml.dvc data/.gitignore $ git commit -m "Add raw data" ``` +The original data, meanwhile, is listed in `.gitignore`. +
### πŸ’‘ Expand to see what happens under the hood. @@ -89,7 +91,7 @@ outs: You can upload DVC-tracked data or model files with `dvc push`, so they're safely stored [remotely](/doc/command-reference/remote). This also means they can be retrieved on other environments later with `dvc pull`. First, we need to -setup a storage provider: +setup a storage location: ```dvc $ dvc remote add -d storage s3://mybucket/dvcstore @@ -103,7 +105,7 @@ $ git commit -m "Configure remote storage"
-### βš™οΈ Expand to set up a remote storage provider ☁ +### βš™οΈ Expand to set up a remote storage location. DVC remotes let you store a copy of the data tracked by DVC outside of the local cache (usually a cloud storage service). For simplicity, let's set up a _local @@ -156,7 +158,7 @@ run it after `git clone` and `git pull`.
-### βš™οΈ Expand to refresh the project ⟳ +### βš™οΈ Expand to delete locally cached data. If you've run `dvc push`, you can delete the cache (`.dvc/cache`) and `data/data.xml` to experiment with `dvc pull`: @@ -237,8 +239,8 @@ $ git commit data/data.xml.dvc -m "Revert dataset updates"
-Yes, DVC is technically not even a version control system! `.dvc` files' content -defines data file versions. Git itself provides the version control. DVC in turn +Yes, DVC is technically not even a version control system! `.dvc` file contents +define data file versions. Git itself provides the version control. DVC in turn creates these `.dvc` files, updates them, and synchronizes DVC-tracked data in the workspace efficiently to match them. From 79d39100d11b7ee098c0f2cf06dea5ebe7f47a7b Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 14 Apr 2021 00:51:47 -0500 Subject: [PATCH 07/29] Update content/docs/start/data-and-model-access.md --- content/docs/start/data-and-model-access.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/start/data-and-model-access.md b/content/docs/start/data-and-model-access.md index bdc36ea536..ad56a3b25e 100644 --- a/content/docs/start/data-and-model-access.md +++ b/content/docs/start/data-and-model-access.md @@ -12,7 +12,7 @@ a specific version of a model? Or reuse datasets across different projects? > These questions tend to come up when you browse the files which DVC saves to > remote storage (e.g. > `s3://dvc-public/remote/get-started/fb/89904ef053f04d64eafcc3d70db673` 😱 -> instead of the original filename such as `model.pkl` or `data.xml`). +> instead of the original file name such as `model.pkl` or `data.xml`). Read on or watch our video to see how to find and access models and datasets with DVC. 
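(Aside: the opaque remote path quoted in the hunk above is a consequence of content
addressing. The sketch below is illustrative only β€” `remote_path` is a made-up
helper, and DVC's actual cache layout and hash choice may differ between versions β€”
but it shows why a file's MD5 digest, rather than its name, determines where the
file lands in storage.)

```python
import hashlib

def remote_path(data: bytes, remote_root: str) -> str:
    """Content-addressing sketch: the MD5 digest of the file's contents
    becomes its storage path (first two hex chars as a directory), so the
    original name (model.pkl, data.xml) never appears in the remote."""
    digest = hashlib.md5(data).hexdigest()
    return f"{remote_root}/{digest[:2]}/{digest[2:]}"

# Any contents map to a path of the same shape as the one quoted above:
print(remote_path(b"example file contents", "s3://mybucket/dvcstore"))
```

The `.dvc` file committed to Git records the digest, which is how the original
file name maps back to the hashed path.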
From c087649023ea43c3c15903940364eeb6522dee03 Mon Sep 17 00:00:00 2001
From: Casper da Costa-Luis
Date: Thu, 15 Apr 2021 00:31:45 +0100
Subject: [PATCH 08/29] Apply suggestions from code review

Co-authored-by: Jorge Orpinel
---
 content/docs/start/data-and-model-versioning.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md
index faddf580cd..1354e19646 100644
--- a/content/docs/start/data-and-model-versioning.md
+++ b/content/docs/start/data-and-model-versioning.md
@@ -91,7 +91,7 @@ outs:
You can upload DVC-tracked data or model files with `dvc push`, so they're
safely stored [remotely](/doc/command-reference/remote). This also means they
can be retrieved on other environments later with `dvc pull`. First, we need to
-setup a storage location:
+set up a remote storage location:

```dvc
$ dvc remote add -d storage s3://mybucket/dvcstore
@@ -105,7 +105,7 @@ $ git commit -m "Configure remote storage"
-### βš™οΈ Expand to set up a remote storage location. +### βš™οΈ Expand to set up remote storage. DVC remotes let you store a copy of the data tracked by DVC outside of the local cache (usually a cloud storage service). For simplicity, let's set up a _local From 17edc41f86df1effa6bdb6705b79f815dca3dfb4 Mon Sep 17 00:00:00 2001 From: Casper da Costa-Luis Date: Thu, 15 Apr 2021 00:33:37 +0100 Subject: [PATCH 09/29] more code review updates - vis https://github.com/iterative/dvc.org/pull/2359/files#r612875491 --- content/docs/start/data-and-model-versioning.md | 6 +++--- content/docs/start/data-pipelines.md | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md index 1354e19646..63a956f0a5 100644 --- a/content/docs/start/data-and-model-versioning.md +++ b/content/docs/start/data-and-model-versioning.md @@ -99,9 +99,9 @@ $ git add .dvc/config $ git commit -m "Configure remote storage" ``` -> DVC supports many remote storage types, including: Google Drive, Amazon S3, -> Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, and HTTP. See -> `dvc remote add` for more details and examples. +> DVC supports many remote storage types, including Amazon S3, SSH, Google +> Drive, Azure Blob Storage, and HDFS. See `dvc remote add` for more details and +> examples.

diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md
index 0a4ebc2c3b..0b12d14ca2 100644
--- a/content/docs/start/data-pipelines.md
+++ b/content/docs/start/data-pipelines.md
@@ -314,7 +314,7 @@ important problems:
   and which commands will generate the pipeline results (such as an ML model).
   Storing these files in Git makes it easy to version and share.
- _Continuous Delivery and Continuous Integration (CI/CD) for ML_: describing
-  projects in way which it can be reproduced (built) is the first necessary step
+  projects in a way that can be reproduced (built) is the first necessary step
   before introducing CI/CD systems. See our sister project
   [CML](https://cml.dev/) for some examples.

From d111851c6710404a7f60680803ed64a2b2008a7d Mon Sep 17 00:00:00 2001
From: Casper da Costa-Luis
Date: Thu, 15 Apr 2021 00:40:18 +0100
Subject: [PATCH 10/29] Do You Feel Entitled?

---
 content/docs/start/data-and-model-versioning.md | 4 ++--
 content/docs/start/data-pipelines.md            | 2 +-
 content/docs/start/experiments.md               | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md
index 63a956f0a5..1180a05f21 100644
--- a/content/docs/start/data-and-model-versioning.md
+++ b/content/docs/start/data-and-model-versioning.md
@@ -175,7 +175,7 @@ $ dvc pull
 ```

 > πŸ“– See also
-> [sharing data and model files](/doc/use-cases/sharing-data-and-model-files)
+> [Sharing Data and Model Files](/doc/use-cases/sharing-data-and-model-files)
 > for more on basic collaboration workflows.

 ## Making changes
@@ -256,5 +256,5 @@ You can learn more about advanced workflows using these links:
   efficiently.
 - A quite advanced scenario is to track and version data directly on the remote
   storage (e.g. S3). 
See - [managing external data](https://dvc.org/doc/user-guide/managing-external-data) + [Managing External Data](https://dvc.org/doc/user-guide/managing-external-data) to learn more. diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md index 0b12d14ca2..eb55d974b3 100644 --- a/content/docs/start/data-pipelines.md +++ b/content/docs/start/data-pipelines.md @@ -87,7 +87,7 @@ The command options used above mean the following: - `-p prepare.seed,prepare.split` defines special types of dependencies β€” [parameters](/doc/command-reference/params). We'll get to them later in the - [metrics, parameters, and plots](/doc/start/metrics-parameters-plots) page, + [Metrics, Parameters, and Plots](/doc/start/metrics-parameters-plots) page, but the idea is that the stage can depend on field values from a parameters file (`params.yaml` by default): diff --git a/content/docs/start/experiments.md b/content/docs/start/experiments.md index 73b5df00b1..b50c46fe16 100644 --- a/content/docs/start/experiments.md +++ b/content/docs/start/experiments.md @@ -13,7 +13,7 @@ track experiments for you so there's no need to commit each one to Git. This way your repo doesn't become polluted with all of them. You can discard experiments once they're no longer needed. -> πŸ“– See [experiment management](/doc/user-guide/experiment-management) for more +> πŸ“– See [Experiment Management](/doc/user-guide/experiment-management) for more > information on DVC's approach. 
## Running experiments From bace74bfd487f1822ad2a3bcda4f735414b403c0 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 22 Apr 2021 00:24:58 -0500 Subject: [PATCH 11/29] Update content/docs/start/data-and-model-versioning.md --- content/docs/start/data-and-model-versioning.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md index 1180a05f21..51d58d38c9 100644 --- a/content/docs/start/data-and-model-versioning.md +++ b/content/docs/start/data-and-model-versioning.md @@ -13,7 +13,7 @@ and seeing data files and machine learning models in the workspace. Or switching to a different version of a 100Gb file in less than a second with a `git checkout`. -The foundation of DVC consists of a few commands which you can run along with +The foundation of DVC consists of a few commands you can run along with `git` to track large files, directories, or ML model files. Think "Git for data". Read on or watch our video to learn about versioning data with DVC! From e6b5545eda1fac00488177e8e009c2777859c892 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 22 Apr 2021 01:34:25 -0500 Subject: [PATCH 12/29] typo per https://github.com/iterative/dvc.org/pull/2359#discussion_r613348919 --- content/docs/start/data-pipelines.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md index eb55d974b3..9bf7797097 100644 --- a/content/docs/start/data-pipelines.md +++ b/content/docs/start/data-pipelines.md @@ -8,7 +8,7 @@ version, and reproduce your data science and machine learning workflows.' Versioning large data files and directories for data science is great, but not enough. How is data filtered, transformed, or used to train ML models? 
DVC
-introduces a mechanism to capture data pipelines — a series of data
+introduces a mechanism to capture data pipelines — series of data
 processes which produce a final result.

 DVC pipelines and their data can also be easily versioned (using Git). This

From adb0283d6da9fd3b82e2ae3d43b3b57c2e57ad44 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Thu, 22 Apr 2021 01:39:28 -0500
Subject: [PATCH 13/29] Update content/docs/start/data-pipelines.md

---
 content/docs/start/data-pipelines.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md
index 9bf7797097..7f1af6878c 100644
--- a/content/docs/start/data-pipelines.md
+++ b/content/docs/start/data-pipelines.md
@@ -9,7 +9,7 @@ version, and reproduce your data science and machine learning workflows.'
 Versioning large data files and directories for data science is great, but not
 enough. How is data filtered, transformed, or used to train ML models? DVC
 introduces a mechanism to capture data pipelines — series of data
-processes which produce a final result.
+processes that produce a final result.

 DVC pipelines and their data can also be easily versioned (using Git). This
 allows you to better organize projects, and reproduce your workflow and results

From 56f952b30272a772160576b6ba2ce6bece0ad37e Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Thu, 22 Apr 2021 01:48:02 -0500
Subject: [PATCH 14/29] which->that

---
 content/docs/start/data-and-model-access.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/start/data-and-model-access.md b/content/docs/start/data-and-model-access.md
index ad56a3b25e..986c92dd3b 100644
--- a/content/docs/start/data-and-model-access.md
+++ b/content/docs/start/data-and-model-access.md
@@ -9,7 +9,7 @@ versions to Git. The next questions are: How can we _use_ these artifacts
 outside of the project? How do we download a model to deploy it?
How to download
a specific version of a model? Or reuse datasets across different projects?

-> These questions tend to come up when you browse the files which DVC saves to
+> These questions tend to come up when you browse the files that DVC saves to
 > remote storage (e.g.
 > `s3://dvc-public/remote/get-started/fb/89904ef053f04d64eafcc3d70db673` 😱
 > instead of the original file name such as `model.pkl` or `data.xml`).

From 25d2933c356ce67752b2d188fa92b452cf060b84 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Thu, 22 Apr 2021 02:49:23 -0500
Subject: [PATCH 15/29] Update content/docs/start/data-pipelines.md

Co-authored-by: Casper da Costa-Luis

---
 content/docs/start/data-pipelines.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md
index 7f1af6878c..5f0a794c31 100644
--- a/content/docs/start/data-pipelines.md
+++ b/content/docs/start/data-pipelines.md
@@ -308,8 +308,7 @@ important problems:

 - _Automation_: run a sequence of steps in a "smart" way which makes iterating
   on your project faster. DVC automatically determines which parts of a project
-  need to be run, and it caches "runs" and their results (in the aptly-named
-  run-cache) to avoid unnecessary reruns.
+  need to be run, and it caches "runs" and their results to avoid unnecessary reruns.
 - _Reproducibility_: `dvc.yaml` and `dvc.lock` files describe what data to use
   and which commands will generate the pipeline results (such as an ML model).
   Storing these files in Git makes it easy to version and share.
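For orientation, the `dvc.yaml` these hunks discuss has roughly the following shape in the example project (stage name, command, and paths are quoted from the surrounding guide as an illustration; they are not part of this patch):

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
      - src/prepare.py
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
```

Versioning this file together with `dvc.lock` in Git is what the _Reproducibility_ point above refers to.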
From 1a96bbdae1709338cef590c25e881baad47b4670 Mon Sep 17 00:00:00 2001
From: Casper da Costa-Luis
Date: Thu, 22 Apr 2021 13:14:27 +0100
Subject: [PATCH 16/29] misc review responses

---
 content/docs/start/data-and-model-access.md |  4 ++--
 content/docs/start/data-pipelines.md        | 18 +++++++++---------
 content/docs/start/experiments.md           |  2 +-
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/content/docs/start/data-and-model-access.md b/content/docs/start/data-and-model-access.md
index 986c92dd3b..1fe2092183 100644
--- a/content/docs/start/data-and-model-access.md
+++ b/content/docs/start/data-and-model-access.md
@@ -82,8 +82,8 @@ bring in changes from the data source later using `dvc update`.
 > doesn't actually contain a `get-started/data.xml` file. Like `dvc get`,
 > `dvc import` downloads from [remote storage](/doc/command-reference/remote).

-`.dvc` files created by `dvc import` have special fields — such as the data
-source `repo`, and `path` (under `deps`):
+`.dvc` files created by `dvc import` have special fields, such as the data
+source `repo` and `path` (under `deps`):

 ```git
 +deps:

diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md
index 5f0a794c31..2505b2bc4c 100644
--- a/content/docs/start/data-pipelines.md
+++ b/content/docs/start/data-pipelines.md
@@ -105,7 +105,7 @@ prepare:
 - `-o data/prepared` specifies an output directory for this script, which
   writes two files in it. This is how the workspace should look like now:

-  ```diff
+  ```git
   .
   ├── data
   │   ├── data.xml
@@ -238,8 +238,8 @@ $ dvc repro

 Let's try to play a little bit with it. First, let's try to change one of the
 parameters for the training stage:

-- Open `params.yaml` and change `n_est` to `100`, and
-- (re)run `dvc repro`.
+1. Open `params.yaml` and change `n_est` to `100`, and
+2. (re)run `dvc repro`.
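The edit described in step 1 amounts to a one-line change in `params.yaml`; it would look roughly like this (the `50` baseline value is an assumption about the example project, not part of this patch):

```git
 train:
-  n_est: 50
+  n_est: 100
```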
You should see: @@ -261,10 +261,9 @@ Stage 'prepare' didn't change, skipping Stage 'featurize' didn't change, skipping ``` -This is the same result as before (no need to rerun `prepare`, `featurize`, -etc.) but it also doesn't run `train` again this time either! DVC maintains a -run-cache where it saved the previous run with the same set of -inputs (parameters + data) for later reuse. +As before, there was no need to rerun `prepare`, `featurize`, etc. But this time +it also doesn't rerun `train`! The previous run with the same set of inputs +(parameters & data) was saved in DVC's run-cache, and reused here.
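To illustrate the run-cache reuse described in this hunk: reverting the parameter and reproducing again restores everything from the cache instead of re-executing. A session sketch (commands and output abridged and illustrative):

```dvc
$ git checkout -- params.yaml   # revert n_est to its previous value
$ dvc repro                     # nothing is re-executed; outputs come from the run-cache
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
```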
@@ -308,7 +307,8 @@ important problems: - _Automation_: run a sequence of steps in a "smart" way which makes iterating on your project faster. DVC automatically determines which parts of a project - need to be run, and it caches "runs" and their results to avoid unnecessary reruns. + need to be run, and it caches "runs" and their results to avoid unnecessary + reruns. - _Reproducibility_: `dvc.yaml` and `dvc.lock` files describe what data to use and which commands will generate the pipeline results (such as an ML model). Storing these files in Git makes it easy to version and share. @@ -320,7 +320,7 @@ important problems: ## Visualize Having built our pipeline, we need a good way to understand its structure. -Seeing a graph of connected stages would help. DVC lets you do just that without +Seeing a graph of connected stages would help. DVC lets you do so without leaving the terminal! ```dvc diff --git a/content/docs/start/experiments.md b/content/docs/start/experiments.md index b50c46fe16..828e740323 100644 --- a/content/docs/start/experiments.md +++ b/content/docs/start/experiments.md @@ -40,7 +40,7 @@ running experiments. The `--set-param` (or `-S`) flag sets the values for Check that the `featurize.max_features` value has been updated in `params.yaml`: -```diff +```git featurize: - max_features: 1500 + max_features: 3000 From 1f240d8a402a95a5943b5bcfdb495a45b9e9e43f Mon Sep 17 00:00:00 2001 From: Casper da Costa-Luis Date: Thu, 22 Apr 2021 13:15:04 +0100 Subject: [PATCH 17/29] purge the emojis! 
--- content/docs/start/data-and-model-versioning.md | 8 ++++---- content/docs/start/data-pipelines.md | 4 ++-- content/docs/start/metrics-parameters-plots.md | 2 +- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md index 51d58d38c9..d720c7780a 100644 --- a/content/docs/start/data-and-model-versioning.md +++ b/content/docs/start/data-and-model-versioning.md @@ -13,9 +13,9 @@ and seeing data files and machine learning models in the workspace. Or switching to a different version of a 100Gb file in less than a second with a `git checkout`. -The foundation of DVC consists of a few commands you can run along with -`git` to track large files, directories, or ML model files. Think "Git for -data". Read on or watch our video to learn about versioning data with DVC! +The foundation of DVC consists of a few commands you can run along with `git` to +track large files, directories, or ML model files. Think "Git for data". Read on +or watch our video to learn about versioning data with DVC! https://youtu.be/kLKBcPonMYw @@ -221,7 +221,7 @@ $ dvc checkout
-### βš™οΈ Expand to get the previous version of the dataset πŸ•‘ +### βš™οΈ Expand to get the previous version of the dataset. Let's go back to the original version of the data: diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md index 2505b2bc4c..d226812ac7 100644 --- a/content/docs/start/data-pipelines.md +++ b/content/docs/start/data-pipelines.md @@ -202,7 +202,7 @@ The changes to the `dvc.yaml` should look like this:
-### βš™οΈ Expand to add more stages ☰ +### βš™οΈ Expand to add more stages. Let's add the training itself. Nothing new this time; just the same `dvc run` command with the same set of options: @@ -233,7 +233,7 @@ $ dvc repro
-### βš™οΈ Expand to have some fun with it πŸ’ƒ +### βš™οΈ Expand to have some fun with it. Let's try to play a little bit with it. First, let's try to change one of the parameters for the training stage: diff --git a/content/docs/start/metrics-parameters-plots.md b/content/docs/start/metrics-parameters-plots.md index cae1538e59..8b64aebe1d 100644 --- a/content/docs/start/metrics-parameters-plots.md +++ b/content/docs/start/metrics-parameters-plots.md @@ -167,7 +167,7 @@ featurize:
-### 💡 Expand to recall how it was generated 🤔
+### ⚙️ Expand to recall how it was generated.

 The `featurize` stage
 [was created](/doc/start/data-pipelines#dependency-graphs-dags) with this

From 9f7e83579bc1e133dac5aa3953cfe707f94590c5 Mon Sep 17 00:00:00 2001
From: Casper da Costa-Luis
Date: Thu, 22 Apr 2021 13:20:37 +0100
Subject: [PATCH 18/29] minor simplification

Co-authored-by: Jorge Orpinel

---
 content/docs/start/data-and-model-versioning.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md
index d720c7780a..4601e25dba 100644
--- a/content/docs/start/data-and-model-versioning.md
+++ b/content/docs/start/data-and-model-versioning.md
@@ -211,7 +211,7 @@ $ dvc push

 ## Switching between versions

 The regular workflow is to use `git checkout` first (to switch a branch or
-checkout a commit/revision of a `.dvc` file) and then run `dvc checkout` to sync
+checkout a `.dvc` file version) and then run `dvc checkout` to sync
 data:

 ```dvc

From 28bbdfc45ef375005428c627847dd1d440824234 Mon Sep 17 00:00:00 2001
From: Casper da Costa-Luis
Date: Thu, 22 Apr 2021 18:50:58 +0100
Subject: [PATCH 19/29] revert some s

---
 content/docs/start/data-and-model-versioning.md |  3 +--
 content/docs/start/data-pipelines.md            | 16 +++++++--------
 content/docs/start/metrics-parameters-plots.md  |  7 +++----
 .../docs/user-guide/basic-concepts/parameter.md |  2 +-
 4 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md
index 4601e25dba..67f7eea98f 100644
--- a/content/docs/start/data-and-model-versioning.md
+++ b/content/docs/start/data-and-model-versioning.md
@@ -211,8 +211,7 @@ $ dvc push

 ## Switching between versions

 The regular workflow is to use `git checkout` first (to switch a branch or
-checkout a `.dvc` file version) and then run `dvc checkout` to sync
-data:
+checkout a `.dvc` file version) and then run `dvc checkout` to sync data:

 ```dvc
 $ git checkout <...>

diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md
index d226812ac7..bc76784ce5 100644
--- a/content/docs/start/data-pipelines.md
+++ b/content/docs/start/data-pipelines.md
@@ -8,8 +8,8 @@ version, and reproduce your data science and machine learning workflows.'

 Versioning large data files and directories for data science is great, but not
 enough. How is data filtered, transformed, or used to train ML models? DVC
-introduces a mechanism to capture data pipelines — series of data
-processes that produce a final result.
+introduces a mechanism to capture _data pipelines_ — series of data processes
+that produce a final result.

 DVC pipelines and their data can also be easily versioned (using Git). This
 allows you to better organize projects, and reproduce your workflow and results
@@ -23,10 +23,9 @@ https://youtu.be/71IGzyH95UY

 ## Pipeline stages

-Use `dvc run` to create stages. These represent processes (source
-code tracked with Git) which form the steps of a pipeline. Stages
-also connect code to its corresponding data input and
-output. Let's transform a Python script into a
+Use `dvc run` to create _stages_. These represent processes (source code tracked
+with Git) which form the steps of a _pipeline_. Stages also connect code to its
+corresponding data _input_ and _output_. Let's transform a Python script into a
 [stage](/doc/command-reference/run):
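For context, the `dvc run` invocation that this passage leads into looks like the following in the surrounding guide (options and paths quoted for orientation; they are not part of this patch):

```dvc
$ dvc run -n prepare \
          -p prepare.seed,prepare.split \
          -d src/prepare.py -d data/data.xml \
          -o data/prepared \
          python src/prepare.py data/data.xml
```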
@@ -151,9 +150,8 @@ in this case); `dvc run` already took care of this. You only need to run By using `dvc run` multiple times, and specifying outputs of a stage as dependencies of another one, we can describe a sequence of -commands which gets to a desired result. This is what we call a data -pipeline or -[dependency graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph). +commands which gets to a desired result. This is what we call a _data pipeline_ +or [_dependency graph_](https://en.wikipedia.org/wiki/Directed_acyclic_graph). Let's create a second stage chained to the outputs of `prepare`, to perform feature extraction: diff --git a/content/docs/start/metrics-parameters-plots.md b/content/docs/start/metrics-parameters-plots.md index 8b64aebe1d..7148527e94 100644 --- a/content/docs/start/metrics-parameters-plots.md +++ b/content/docs/start/metrics-parameters-plots.md @@ -144,10 +144,9 @@ which will be useful for comparison: parameters. ## Defining stage parameters It's pretty common for data science pipelines to include configuration files -which define adjustable parameters to train a model, do -pre-processing, etc. DVC provides a mechanism for stages to depend on the values -of specific sections of such a config file (YAML, JSON, TOML, and Python formats -are supported). +which define adjustable parameters to train a model, do pre-processing, etc. DVC +provides a mechanism for stages to depend on the values of specific sections of +such a config file (YAML, JSON, TOML, and Python formats are supported). 
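As a concrete example of such a config file, the example project's `params.yaml` contains sections like these (values are illustrative, not part of this patch):

```yaml
prepare:
  split: 0.20
  seed: 20170428

train:
  seed: 20170428
  n_est: 100
  min_split: 2
```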
Luckily, we should already have a stage with [parameters](/doc/command-reference/params) in `dvc.yaml`: diff --git a/content/docs/user-guide/basic-concepts/parameter.md b/content/docs/user-guide/basic-concepts/parameter.md index a48c4793e0..7cc47a9cf3 100644 --- a/content/docs/user-guide/basic-concepts/parameter.md +++ b/content/docs/user-guide/basic-concepts/parameter.md @@ -5,5 +5,5 @@ tooltip: >- Pipeline stages (defined in `dvc.yaml`) can depend on specific values inside an arbitrary YAML, JSON, TOML, or Python file (`params.yaml` by default). Stages are invalid (considered outdated) when any of their parameter values - change. See `dvc param`. + change. See [`dvc param`](/doc/command-reference/params). --- From 3aa0163e340f2219b68a7cc2f7c7d5ef6e868b0b Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 5 May 2021 23:38:33 -0500 Subject: [PATCH 20/29] Update content/docs/user-guide/basic-concepts/parameter.md --- content/docs/user-guide/basic-concepts/parameter.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/user-guide/basic-concepts/parameter.md b/content/docs/user-guide/basic-concepts/parameter.md index 7cc47a9cf3..ead9a4d26a 100644 --- a/content/docs/user-guide/basic-concepts/parameter.md +++ b/content/docs/user-guide/basic-concepts/parameter.md @@ -5,5 +5,5 @@ tooltip: >- Pipeline stages (defined in `dvc.yaml`) can depend on specific values inside an arbitrary YAML, JSON, TOML, or Python file (`params.yaml` by default). Stages are invalid (considered outdated) when any of their parameter values - change. See [`dvc param`](/doc/command-reference/params). + change. See `dvc params`. 
---

From e22f2724cbfbc1c7346f2be16898020891c7ad74 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Thu, 6 May 2021 00:13:33 -0500
Subject: [PATCH 21/29] Apply suggestions from code review

---
 content/docs/start/data-pipelines.md           |  8 ++++----
 content/docs/start/metrics-parameters-plots.md | 10 +++++-----
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md
index bc76784ce5..c630e1b2af 100644
--- a/content/docs/start/data-pipelines.md
+++ b/content/docs/start/data-pipelines.md
@@ -269,12 +269,12 @@ it also doesn't rerun `train`! The previous run with the same set of inputs

 ### 💡 Expand to see what happens under the hood.

-`dvc repro` relies on the DAG definition which it reads from
+`dvc repro` relies on the DAG definition that is read from
 `dvc.yaml`, and uses `dvc.lock` to determine what exactly needs to be run.

-Meanwhile, the `dvc.lock` file is similar to a `.dvc` file — it captures hashes
-(in most cases `md5`s) of the dependencies and values of the parameters which
-were used. It can be considered a _state_ of the pipeline:
+The `dvc.lock` file is similar to a `.dvc` file — it captures hashes (in most
+cases `md5`s) of the dependencies and values of the parameters that were used.
+It can be considered a _state_ of the pipeline:

 ```yaml
 schema: '2.0'

diff --git a/content/docs/start/metrics-parameters-plots.md b/content/docs/start/metrics-parameters-plots.md
index 8b64aebe1d..7148527e94 100644
--- a/content/docs/start/metrics-parameters-plots.md
+++ b/content/docs/start/metrics-parameters-plots.md
@@ -58,8 +58,8 @@ evaluate:

 The biggest difference to previous stages in our pipeline is in two new
 sections: `metrics` and `plots`. These are used to mark certain files containing
 ML "telemetry". Metrics files contain scalar values (e.g. `AUC`) and plots files
-contain matrices and data series (e.g.
`ROC curves` or model loss plots) which -are meant to be visualized and compared. +contain matrices and data series (e.g. `ROC curves` or model loss plots) meant +to be visualized and compared. > With `cache: false`, DVC skips caching the output, as we want `scores.json`, > `prc.json`, and `roc.json` to be versioned by Git. @@ -81,7 +81,7 @@ contents are: `evaluate.py` also writes `precision`, `recall`, and `thresholds` arrays (obtained using [`precision_recall_curve`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html)) -into a `plots` file `prc.json`: +into plots file `prc.json`: ```json { @@ -144,7 +144,7 @@ which will be useful for comparison: parameters. ## Defining stage parameters It's pretty common for data science pipelines to include configuration files -which define adjustable parameters to train a model, do pre-processing, etc. DVC +to define adjustable parameters to train a model, do pre-processing, etc. DVC provides a mechanism for stages to depend on the values of specific sections of such a config file (YAML, JSON, TOML, and Python formats are supported). @@ -225,7 +225,7 @@ $ dvc repro ``` It'll analyze the changes, use existing run-caches, and execute -only the commands which are needed to produce new results (model, metrics, +only the commands that are needed to produce new results (model, metrics, plots). 
The same logic applies to other possible adjustments — edit source code, update

From 57befbf61dbc5a26b9ea51d0d2e1a1b59a108caa Mon Sep 17 00:00:00 2001
From: Casper da Costa-Luis
Date: Thu, 6 May 2021 13:50:40 +0100
Subject: [PATCH 22/29] restore link

Partially reverts 7840a466f83dbba1ac4704935a45a8acd90b2092

---
 content/docs/user-guide/basic-concepts/parameter.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/user-guide/basic-concepts/parameter.md b/content/docs/user-guide/basic-concepts/parameter.md
index ead9a4d26a..bd1da45ce5 100644
--- a/content/docs/user-guide/basic-concepts/parameter.md
+++ b/content/docs/user-guide/basic-concepts/parameter.md
@@ -5,5 +5,5 @@ tooltip: >-
   Pipeline stages (defined in `dvc.yaml`) can depend on specific values inside
   an arbitrary YAML, JSON, TOML, or Python file (`params.yaml` by default).
   Stages are invalid (considered outdated) when any of their parameter values
-  change. See `dvc params`.
+  change. See [`dvc params`](/doc/command-reference/params).
 ---

From 5a27352fef4f353783718c7412e4f3923426240b Mon Sep 17 00:00:00 2001
From: Casper da Costa-Luis
Date: Thu, 6 May 2021 14:34:56 +0100
Subject: [PATCH 23/29] misc corrections

---
 content/docs/start/data-pipelines.md           | 4 ++--
 content/docs/start/metrics-parameters-plots.md | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md
index c630e1b2af..7849a6cb48 100644
--- a/content/docs/start/data-pipelines.md
+++ b/content/docs/start/data-pipelines.md
@@ -269,8 +269,8 @@ it also doesn't rerun `train`! The previous run with the same set of inputs

 ### 💡 Expand to see what happens under the hood.

-`dvc repro` relies on the DAG definition that is read from
-`dvc.yaml`, and uses `dvc.lock` to determine what exactly needs to be run.
+`dvc repro` relies on the DAG definition from `dvc.yaml`, and uses
+`dvc.lock` to determine what exactly needs to be run.
The `dvc.lock` file is similar to a `.dvc` file — it captures hashes (in most
cases `md5`s) of the dependencies and values of the parameters that were used.
It can be considered a _state_ of the pipeline:

diff --git a/content/docs/start/metrics-parameters-plots.md b/content/docs/start/metrics-parameters-plots.md
index 4cf2943597..f05bdf6aca 100644
--- a/content/docs/start/metrics-parameters-plots.md
+++ b/content/docs/start/metrics-parameters-plots.md
@@ -81,7 +81,7 @@ contents are:

 `evaluate.py` also writes `precision`, `recall`, and `thresholds` arrays
 (obtained using
 [`precision_recall_curve`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html))
-into plots file `prc.json`:
+into the plots file `prc.json`:

 ```json
 {
@@ -143,8 +143,8 @@ which will be useful for comparison: parameters.

 ## Defining stage parameters

-It's pretty common for data science pipelines to include configuration files
-to define adjustable parameters to train a model, do pre-processing, etc. DVC
+It's pretty common for data science pipelines to include configuration files to
+define adjustable parameters to train a model, do pre-processing, etc. DVC
 provides a mechanism for stages to depend on the values of specific sections of
 such a config file (YAML, JSON, TOML, and Python formats are supported).

From e7db7cca294e01f715c11a4b3089ea65394d6573 Mon Sep 17 00:00:00 2001
From: Casper da Costa-Luis
Date: Thu, 6 May 2021 14:37:25 +0100
Subject: [PATCH 24/29] Update content/docs/start/metrics-parameters-plots.md

Co-authored-by: Jorge Orpinel

---
 content/docs/start/metrics-parameters-plots.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/start/metrics-parameters-plots.md b/content/docs/start/metrics-parameters-plots.md
index f05bdf6aca..15e2a9a621 100644
--- a/content/docs/start/metrics-parameters-plots.md
+++ b/content/docs/start/metrics-parameters-plots.md
@@ -34,7 +34,7 @@ $ dvc run -n evaluate \
The `-M` option here specifies a metrics file, while `--plots-no-cache` -specifies a plots file produced by this stage which will not be +specifies a plots file (produced by this stage) which will not be cached by DVC. `dvc run` generates a new stage in the `dvc.yaml` file: From 5c2d69bbec312fc61d7a6aa4b4f5329feb622844 Mon Sep 17 00:00:00 2001 From: Casper da Costa-Luis Date: Mon, 10 May 2021 19:56:29 +0100 Subject: [PATCH 25/29] minor review feedback --- content/docs/start/metrics-parameters-plots.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/content/docs/start/metrics-parameters-plots.md b/content/docs/start/metrics-parameters-plots.md index 15e2a9a621..f9257229f4 100644 --- a/content/docs/start/metrics-parameters-plots.md +++ b/content/docs/start/metrics-parameters-plots.md @@ -11,8 +11,6 @@ DVC makes it easy to track [metrics](/doc/command-reference/metrics), update > All of the above can be combined into experiments to run and > compare many iterations of your ML project. -Read on to see how it's done! - ## Collecting metrics First, let's see what is the mechanism to capture values for these ML @@ -98,8 +96,8 @@ Similarly, it writes arrays for the into `roc.json` for an additional plot. > DVC doesn't force you to use any specific file names, nor does it enforce a -> format or structure of a metrics or plots file. It's completely user- and -> case-defined. R to `dvc metrics` and `dvc plots` for more details. +> format or structure of a metrics or plots file. It's completely +> user/case-defined. R to `dvc metrics` and `dvc plots` for more details. You can view tracked metrics and plots with DVC. 
Let's start with the metrics: From 58f9ac9358854112577a4155c8446852dca30192 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 10 May 2021 14:15:14 -0500 Subject: [PATCH 26/29] term: which -> that 2 last reverts for #2359 --- content/docs/start/experiments.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/start/experiments.md b/content/docs/start/experiments.md index 828e740323..cc1d572602 100644 --- a/content/docs/start/experiments.md +++ b/content/docs/start/experiments.md @@ -193,7 +193,7 @@ Storage, HTTP, HDFS, etc.). The Git remote is often a central Git server
-Experiments which have not been made persistent will not be stored or shared +Experiments that have not been made persistent will not be stored or shared remotely through `dvc push` or `git push`. `dvc exp push` enables storing and sharing any experiment remotely. @@ -203,7 +203,7 @@ $ dvc exp push gitremote exp-bfe64 Pushed experiment 'exp-bfe64' to Git remote 'gitremote'. ``` -`dvc exp list` shows all experiments which have been saved. +`dvc exp list` shows all experiments that have been saved. ```dvc $ dvc exp list gitremote --all From 2bf19354b4399c850714b3afa58fd2bd451a6ed8 Mon Sep 17 00:00:00 2001 From: Casper da Costa-Luis Date: Mon, 10 May 2021 20:25:46 +0100 Subject: [PATCH 27/29] more review tweaks --- content/docs/start/experiments.md | 3 ++- content/docs/start/metrics-parameters-plots.md | 10 +++++----- content/docs/user-guide/basic-concepts/run-cache.md | 2 +- 3 files changed, 8 insertions(+), 7 deletions(-) diff --git a/content/docs/start/experiments.md b/content/docs/start/experiments.md index cc1d572602..0a240528f7 100644 --- a/content/docs/start/experiments.md +++ b/content/docs/start/experiments.md @@ -277,4 +277,5 @@ $ dvc exp show -n 2 --no-timestamp \ ``` > `dvc exp gc` only removes references to the experiments; not the cached -> objects associated with them. To clean up the object cache, use `dvc gc`. +> objects associated with them. To clean up the cache, use +> `dvc gc`. 
diff --git a/content/docs/start/metrics-parameters-plots.md b/content/docs/start/metrics-parameters-plots.md
index f9257229f4..5bd39f924d 100644
--- a/content/docs/start/metrics-parameters-plots.md
+++ b/content/docs/start/metrics-parameters-plots.md
@@ -222,8 +222,8 @@ The beauty of `dvc.yaml` is that all you need to do now is run:
 $ dvc repro
 ```

-It'll analyze the changes, use existing run-caches, and execute
-only the commands that are needed to produce new results (model, metrics,
+It'll analyze the changes, use existing results from the run-cache,
+and execute only the commands needed to produce new results (model, metrics,
 plots).

 The same logic applies to other possible adjustments — edit source code, update
 datasets — you do the changes, use `dvc repro`, and DVC runs what needs to be.

 ## Comparing iterations

 Finally, let's see how the updates improved performance. DVC has a few commands
-to see metrics & parameter changes and to visualize plots (for one or more
-pipeline iterations). Let's compare the current "bigrams" run with the last
-committed "baseline" iteration:
+to see changes in and visualize metrics, parameters, and plots. These commands
+can work for one or across multiple pipeline iteration(s). Let's compare the
+current "bigrams" run with the last committed "baseline" iteration:

 ```dvc
 $ dvc params diff

diff --git a/content/docs/user-guide/basic-concepts/run-cache.md b/content/docs/user-guide/basic-concepts/run-cache.md
index e3d7f72660..148e1a0378 100644
--- a/content/docs/user-guide/basic-concepts/run-cache.md
+++ b/content/docs/user-guide/basic-concepts/run-cache.md
@@ -1,6 +1,6 @@
 ---
 name: 'Run-cache'
-match: ['run-cache', 'run-caches']
+match: ['run-cache']
 tooltip: >-
   A log of stages that have been run in the project.
It's comprised of `dvc.lock` file backups, identified as combinations of dependencies, commands, From 6315fd82536b4b2728215949d69d84b040ed003e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 10 May 2021 16:23:33 -0500 Subject: [PATCH 28/29] Update content/docs/start/experiments.md Co-authored-by: Casper da Costa-Luis --- content/docs/start/experiments.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/start/experiments.md b/content/docs/start/experiments.md index 0a240528f7..afdb7471a9 100644 --- a/content/docs/start/experiments.md +++ b/content/docs/start/experiments.md @@ -149,7 +149,7 @@ Changes for experiment 'exp-98a96' have been applied to your workspace. `dvc exp apply` is similar to `dvc checkout`, but works with experiments instead. DVC tracks everything in the pipeline for each experiment (parameters, -metrics, dependencies, and outputs) and can later retrieve them as needed. +metrics, dependencies, and outputs), retrieving things later as needed. Check that `scores.json` reflects the metrics in the table above: From fab5421502abc074dc13ba4d86020ef9f3a59bee Mon Sep 17 00:00:00 2001 From: Casper da Costa-Luis Date: Wed, 12 May 2021 07:34:29 +0100 Subject: [PATCH 29/29] final review comments --- content/docs/start/experiments.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/content/docs/start/experiments.md b/content/docs/start/experiments.md index afdb7471a9..7c0b1a2e58 100644 --- a/content/docs/start/experiments.md +++ b/content/docs/start/experiments.md @@ -18,8 +18,7 @@ once they're no longer needed. ## Running experiments -Previously, we learned how to tune ML data -[pipelines](/doc/start/data-pipelines) and +Previously, we learned how to tune [ML pipelines](/doc/start/data-pipelines) and [compare the changes](/doc/start/metrics-parameters-plots). Let's further increase the number of features in the `featurize` stage to see how it compares. 
@@ -95,7 +94,7 @@ $ dvc exp run --queue -S train.min_split=64 -S train.n_est=100 Queued experiment '0cdee86' for future execution. ``` -Next, run all (`--run-all`) queued experiments in parallel (`--jobs`): +Next, run all (`--run-all`) queued experiments in parallel (using `--jobs`): ```dvc $ dvc exp run --run-all --jobs 2