From ec369a45c0bf8b95fb51e1be28489d2ee7be3f4b Mon Sep 17 00:00:00 2001 From: Naba7 Date: Mon, 22 Jul 2019 23:56:40 +0530 Subject: [PATCH 1/8] bulleted points end with : or . --- static/docs/get-started/agenda.md | 8 +++--- static/docs/understanding-dvc/how-it-works.md | 2 +- static/docs/user-guide/dvc-file-format.md | 28 +++++++++---------- 3 files changed, 19 insertions(+), 19 deletions(-) diff --git a/static/docs/get-started/agenda.md b/static/docs/get-started/agenda.md index ea411e5157..c4b41c27d0 100644 --- a/static/docs/get-started/agenda.md +++ b/static/docs/get-started/agenda.md @@ -27,11 +27,11 @@ If you have data files or data sets and/or you produce other data files, models, data sets and you want to: - capture and save those data artifacts the same way we capture - code, -- track and switch between different versions of the data easily, + code. +- track and switch between different versions of the data easily. - being able to answer the question of how data artifacts (e.g. ML models) were - built in the first place, -- being able to compare them, + built in the first place. +- being able to compare them. - bring best practices to your team and get everyone on the same page. Then you are in a good place! Click the `Next` button below to start ↘. diff --git a/static/docs/understanding-dvc/how-it-works.md b/static/docs/understanding-dvc/how-it-works.md index 6e27d394e0..fd889b0620 100644 --- a/static/docs/understanding-dvc/how-it-works.md +++ b/static/docs/understanding-dvc/how-it-works.md @@ -90,4 +90,4 @@ -r-------- 2 501 staff 273M Jan 27 03:48 Posts-test.tsv ``` -8. DVC works on Mac, Linux ,and Windows. +8. DVC works on Mac, Linux and Windows. 
diff --git a/static/docs/user-guide/dvc-file-format.md b/static/docs/user-guide/dvc-file-format.md index 89f3b81d84..75cfb7d60c 100644 --- a/static/docs/user-guide/dvc-file-format.md +++ b/static/docs/user-guide/dvc-file-format.md @@ -45,30 +45,30 @@ locked: True On the top level, `.dvc` file consists of such fields: -- `cmd`: a command that is being run in this stage; -- `deps`: a list of dependencies for this stage; -- `outs`: a list of outputs for this stage; -- `md5`: md5 checksum for this DVC-file; -- `locked`: whether or not this stage is locked from reproduction; -- `wdir`: a directory to run command in (default `.`); +- `cmd`: a command that is being run in this stage. +- `deps`: a list of dependencies for this stage. +- `outs`: a list of outputs for this stage. +- `md5`: md5 checksum for this DVC-file. +- `locked`: whether or not this stage is locked from reproduction. +- `wdir`: a directory to run command in (default `.`). A dependency entry consists of such fields: -- `path`: path to the dependency, relative to the `wdir` path; -- `md5`: md5 checksum for the dependency; +- `path`: path to the dependency, relative to the `wdir` path. +- `md5`: md5 checksum for the dependency. An output entry consists of such fields: -- `path`: path to the output, relative to the `wdir` path; -- `md5`: md5 checksum for the output; -- `cache`: whether or not dvc should cache the output; -- `metric`: whether or not this file is a metric file; +- `path`: path to the output, relative to the `wdir` path. +- `md5`: md5 checksum for the output. +- `cache`: whether or not dvc should cache the output. +- `metric`: whether or not this file is a metric file. A metric entry consists of such fields: -- `type`: type of the metrics file (e.g. raw/json/tsv/htsv/csv/hcsv); +- `type`: type of the metrics file (e.g. raw/json/tsv/htsv/csv/hcsv). - `xpath`: path within the metrics file to the metrics data(e.g. 
`AUC.value` for - `{"AUC": {"value": 0.624321}}`); + `{"AUC": {"value": 0.624321}}`). A `meta` entry consists of `key: value` pairs such as `name: John`. A meta entry can have any valid YAML structure containing any number of attributes. From c75c3cd4b71d674910b69a69c9191ee94ceb26ee Mon Sep 17 00:00:00 2001 From: Naba7 Date: Tue, 23 Jul 2019 18:54:03 +0530 Subject: [PATCH 2/8] editted --- static/docs/get-started/agenda.md | 8 ++---- static/docs/tutorial/define-ml-pipeline.md | 6 ++-- static/docs/understanding-dvc/how-it-works.md | 2 +- static/docs/user-guide/autocomplete.md | 8 +++--- static/docs/user-guide/dvc-file-format.md | 28 +++++++++---------- 5 files changed, 25 insertions(+), 27 deletions(-) diff --git a/static/docs/get-started/agenda.md b/static/docs/get-started/agenda.md index c4b41c27d0..e2ad9bbe13 100644 --- a/static/docs/get-started/agenda.md +++ b/static/docs/get-started/agenda.md @@ -24,14 +24,12 @@ Let the NLP nature of the example not to discourage you from using DVC in other Data Science areas. There was no strong reason behind picking the NLP area. On contrary, DVC is designed to be pretty agnostic of frameworks, languages, etc. If you have data files or data sets and/or you produce other data files, models, -data sets and you want to: +datasets, etc., then you may want to: - capture and save those data artifacts the same way we capture code. - track and switch between different versions of the data easily. -- being able to answer the question of how data artifacts (e.g. ML models) were +- be able to answer the question of how data artifacts (e.g. ML models) were built in the first place. -- being able to compare them. +- be able to compare them. - bring best practices to your team and get everyone on the same page. - -Then you are in a good place! Click the `Next` button below to start ↘. 
diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md index b4ac31df58..31388da8a7 100644 --- a/static/docs/tutorial/define-ml-pipeline.md +++ b/static/docs/tutorial/define-ml-pipeline.md @@ -213,11 +213,11 @@ outs: Sections of the file above include: -- `cmd` — the command to run. +- `cmd` — the command to run -- `deps` — dependencies with md5 checksums. +- `deps` — dependencies with md5 checksums -- `outs` — outputs with md5 checksums. +- `outs` — outputs with md5 checksums And (as with the `dvc add` command) the `data/.gitignore` file was modified. Now it includes the unarchived command output file `Posts.xml`. diff --git a/static/docs/understanding-dvc/how-it-works.md b/static/docs/understanding-dvc/how-it-works.md index fd889b0620..a5f6dc6980 100644 --- a/static/docs/understanding-dvc/how-it-works.md +++ b/static/docs/understanding-dvc/how-it-works.md @@ -90,4 +90,4 @@ -r-------- 2 501 staff 273M Jan 27 03:48 Posts-test.tsv ``` -8. DVC works on Mac, Linux and Windows. +8. DVC works on Mac, Linux, and Windows. diff --git a/static/docs/user-guide/autocomplete.md b/static/docs/user-guide/autocomplete.md index a374f1da62..18fc699d7d 100644 --- a/static/docs/user-guide/autocomplete.md +++ b/static/docs/user-guide/autocomplete.md @@ -18,15 +18,15 @@ run -- Generate a stage file from a command and execute the command. Depending on what you typed on the command line so far, it completes: -- Available DVC commands. +- Available DVC commands -- Options that are available for a particular command. +- Options that are available for a particular command - File names that make sense in a given context, such as using them as a target - for some commands. + for some commands - Arguments for selected options. For example, `dvc repro` completes with stage - files to reproduce. 
+ files to reproduce Depending upon your preference and the availability of both Bash and Zsh on your system, follow the steps given below to Configure Bash and/or Zsh. diff --git a/static/docs/user-guide/dvc-file-format.md b/static/docs/user-guide/dvc-file-format.md index 75cfb7d60c..b16c2ada5f 100644 --- a/static/docs/user-guide/dvc-file-format.md +++ b/static/docs/user-guide/dvc-file-format.md @@ -45,30 +45,30 @@ locked: True On the top level, `.dvc` file consists of such fields: -- `cmd`: a command that is being run in this stage. -- `deps`: a list of dependencies for this stage. -- `outs`: a list of outputs for this stage. -- `md5`: md5 checksum for this DVC-file. -- `locked`: whether or not this stage is locked from reproduction. -- `wdir`: a directory to run command in (default `.`). +- `cmd`: a command that is being run in this stage +- `deps`: a list of dependencies for this stage +- `outs`: a list of outputs for this stage +- `md5`: md5 checksum for this DVC-file +- `locked`: whether or not this stage is locked from reproduction +- `wdir`: a directory to run command in (default `.`) A dependency entry consists of such fields: -- `path`: path to the dependency, relative to the `wdir` path. -- `md5`: md5 checksum for the dependency. +- `path`: path to the dependency, relative to the `wdir` path +- `md5`: md5 checksum for the dependency An output entry consists of such fields: -- `path`: path to the output, relative to the `wdir` path. -- `md5`: md5 checksum for the output. -- `cache`: whether or not dvc should cache the output. -- `metric`: whether or not this file is a metric file. +- `path`: path to the output, relative to the `wdir` path +- `md5`: md5 checksum for the output +- `cache`: whether or not dvc should cache the output +- `metric`: whether or not this file is a metric file A metric entry consists of such fields: -- `type`: type of the metrics file (e.g. raw/json/tsv/htsv/csv/hcsv). +- `type`: type of the metrics file (e.g. 
raw/json/tsv/htsv/csv/hcsv) - `xpath`: path within the metrics file to the metrics data(e.g. `AUC.value` for - `{"AUC": {"value": 0.624321}}`). + `{"AUC": {"value": 0.624321}}`) A `meta` entry consists of `key: value` pairs such as `name: John`. A meta entry can have any valid YAML structure containing any number of attributes. From 065cae3f33ef5bef6dde4303dc661166ce219123 Mon Sep 17 00:00:00 2001 From: Naba7 Date: Fri, 26 Jul 2019 17:01:37 +0530 Subject: [PATCH 3/8] modified --- static/docs/get-started/example-pipeline.md | 30 ++++++------ .../understanding-dvc/related-technologies.md | 8 ++-- static/docs/user-guide/analytics.md | 16 +++---- static/docs/user-guide/autocomplete.md | 12 ++--- static/docs/user-guide/contributing.md | 2 +- .../user-guide/dvc-files-and-directories.md | 48 ++++++++++--------- 6 files changed, 59 insertions(+), 57 deletions(-) diff --git a/static/docs/get-started/example-pipeline.md b/static/docs/get-started/example-pipeline.md index cdf6d027e8..cb4692cce9 100644 --- a/static/docs/get-started/example-pipeline.md +++ b/static/docs/get-started/example-pipeline.md @@ -69,15 +69,15 @@ that are described in earlier [get started](/doc/get-started) chapters. > will be determined by the interdependencies between DVC-files, mentioned > below. -- Initialize DVC repository (run it inside your Git repository): +Initialize DVC repository (run it inside your Git repository): ```dvc $ dvc init $ git commit -m "initialize DVC" ``` -- Download an input data set to the `data` directory and take it under DVC - control: +Download an input data set to the `data` directory and take it under DVC +control: ```dvc $ mkdir data @@ -134,7 +134,7 @@ described by providing a command to run, input data it takes and a list of output files. DVC is not Python or any other language specific and can wrap any command runnable via CLI. -- The first stage is to extract XML from the archive. Note that we don't need to + The first stage is to extract XML from the archive. 
Note that we don't need to run `dvc add` on `Posts.xml` below, `dvc run` saves the data automatically (commits into the cache, takes the file under DVC control): @@ -188,7 +188,7 @@ data files. -- Next stage: let's convert XML into TSV to make feature extraction easier: +Next stage: let's convert XML into TSV to make feature extraction easier: ```dvc $ dvc run -d code/xml_to_tsv.py -d data/Posts.xml \ @@ -197,8 +197,8 @@ $ dvc run -d code/xml_to_tsv.py -d data/Posts.xml \ python code/xml_to_tsv.py data/Posts.xml data/Posts.tsv ``` -- Split training and test data sets. Here `0.2` is a test dataset split ratio, - `20170426` is a seed for randomization. There are two output files: +Split training and test data sets. Here `0.2` is a test dataset split ratio, +`20170426` is a seed for randomization. There are two output files: ```dvc $ dvc run -d code/split_train_test.py -d data/Posts.tsv \ @@ -208,8 +208,8 @@ $ dvc run -d code/split_train_test.py -d data/Posts.tsv \ data/Posts-train.tsv data/Posts-test.tsv ``` -- Extract features and labels from the data. Two TSV as inputs with two pickle - matrices as outputs: +Extract features and labels from the data. Two TSV as inputs with two pickle +matrices as outputs: ```dvc $ dvc run -d code/featurization.py -d data/Posts-train.tsv -d data/Posts-test.tsv \ @@ -219,7 +219,7 @@ $ dvc run -d code/featurization.py -d data/Posts-train.tsv -d data/Posts-test.ts data/matrix-train.pkl data/matrix-test.pkl ``` -- Train ML model on the training data set. 20170426 is a seed value here: +Train ML model on the training data set. 
20170426 is a seed value here: ```dvc $ dvc run -d code/train_model.py -d data/matrix-train.pkl \ @@ -228,7 +228,7 @@ $ dvc run -d code/train_model.py -d data/matrix-train.pkl \ python code/train_model.py data/matrix-train.pkl 20170426 data/model.pkl ``` -- Finally, evaluate the model on the test data set and get the metrics file: +Finally, evaluate the model on the test data set and get the metrics file: ```dvc $ dvc run -d code/evaluate.py -d data/model.pkl -d data/matrix-test.pkl \ @@ -300,7 +300,7 @@ $ dvc pipeline show --ascii evaluate.dvc > simpler to run this pipeline, exact metric number may vary sufficiently > depending on Python version you are using and other environment parameters. -- An easy way to see metrics across different branches: +An easy way to see metrics across different branches: ```dvc $ dvc metrics show @@ -322,7 +322,7 @@ $ git commit -am "create pipeline" All stages could be automatically and efficiently reproduced even if some source files have been modified. For example: -- Let's improve the feature extraction algorithm by making some modification to + Let's improve the feature extraction algorithm by making some modification to the `code/featurization.py`: ```dvc @@ -337,7 +337,7 @@ bag_of_words = CountVectorizer(stop_words='english', ngram_range=(1, 2)) ``` -- Reproduce all required stages to get our target metrics file: +Reproduce all required stages to get our target metrics file: ```dvc $ dvc repro evaluate.dvc @@ -347,7 +347,7 @@ $ dvc repro evaluate.dvc > to run this pipeline, exact metric numbers may vary significantly depending on > the Python version you are using and other environment parameters. 
-- Take a look at the target metric improvement: +Take a look at the target metric improvement: ```dvc $ dvc metrics show -a diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md index 0ca460b1d8..bd431be574 100644 --- a/static/docs/understanding-dvc/related-technologies.md +++ b/static/docs/understanding-dvc/related-technologies.md @@ -9,9 +9,9 @@ process. 1. **Git**. The difference is: - - DVC extends Git by introducing the concept of _data files_ - large files - that should NOT be stored in a Git repository but still need to be tracked - and versioned. + - DVC extends Git by introducing the concept of _data files_ which are large + files that should NOT be stored in a Git repository but still needs to be + tracked and versioned. 2. **Workflow management tools** (pipelines and DAGs): Airflow, Luigi, etc. The differences are: @@ -35,7 +35,7 @@ process. - DVC doesn't need to run any services. No graphical user interface as a result, but we expect some GUI services will be created on top of DVC. - - DVC has transparent design: + - DVC has transparent design which [meta files and directories](/doc/user-guide/dvc-files-and-directories) (including the data cache) have a human-readable format and can be easily reused by external tools. diff --git a/static/docs/user-guide/analytics.md b/static/docs/user-guide/analytics.md index 0bc1a27583..b35a09ff10 100644 --- a/static/docs/user-guide/analytics.md +++ b/static/docs/user-guide/analytics.md @@ -25,14 +25,14 @@ User and event data have a 14 month retention period. DVC's analytics record the following information per event: -- The DVC version, e.g. `0.22.0` -- The operating system information, e.g. `linux`, `ubuntu`, `14.04`, etc -- The underlying version control system, e.g. `git` -- Command type, e.g. `CmdDataPull` -- Command return code, e.g. `1` -- Way the DVC was installed, e.g. `binary` -- A DVC analytics user ID, e.g. 
`8ca59a29-ddd9-4247-992a-9b4775732aad`. This is - generated by [`uuid`](https://docs.python.org/3/library/uuid.html). +- the DVC version, e.g. `0.22.0` +- the operating system information, e.g. `linux`, `ubuntu`, `14.04`, etc +- the underlying version control system, e.g. `git` +- command type, e.g. `CmdDataPull` +- command return code, e.g. `1` +- way the DVC was installed, e.g. `binary` +- a DVC analytics user ID, e.g. `8ca59a29-ddd9-4247-992a-9b4775732aad` which +is generated by [`uuid`](https://docs.python.org/3/library/uuid.html) This _does not allow us to track individual users_ but does enable us to accurately measure user counts vs. event counts. diff --git a/static/docs/user-guide/autocomplete.md b/static/docs/user-guide/autocomplete.md index 18fc699d7d..caabf91e00 100644 --- a/static/docs/user-guide/autocomplete.md +++ b/static/docs/user-guide/autocomplete.md @@ -18,14 +18,14 @@ run -- Generate a stage file from a command and execute the command. Depending on what you typed on the command line so far, it completes: -- Available DVC commands +- available DVC commands -- Options that are available for a particular command +- options that are available for a particular command -- File names that make sense in a given context, such as using them as a target +- file names that make sense in a given context, such as using them as a target for some commands -- Arguments for selected options. For example, `dvc repro` completes with stage +- arguments for selected options. For example, `dvc repro` completes with stage files to reproduce Depending upon your preference and the availability of both Bash and Zsh on your @@ -46,10 +46,10 @@ In this case, follow the steps to configure Bash as it is your active shell. First, make sure Bash completion support is installed: -- On a current Linux OS (in a non-minimal installation), bash completion should +- on a current Linux OS (in a non-minimal installation), bash completion should be available. 
-- On a Mac, install with `brew install bash-completion`. +- on a Mac, install with `brew install bash-completion`. The DVC specific completion script is located in this path of our main repository: diff --git a/static/docs/user-guide/contributing.md b/static/docs/user-guide/contributing.md index f52fc4ecca..6c2effdc15 100644 --- a/static/docs/user-guide/contributing.md +++ b/static/docs/user-guide/contributing.md @@ -31,7 +31,7 @@ contributing! ## Development environment -- Get the latest development version. Fork and clone the repo: +- Get the latest development version. Fork and clone the repo. ```dvc $ git clone git@github.com:/dvc.git ``` diff --git a/static/docs/user-guide/dvc-files-and-directories.md b/static/docs/user-guide/dvc-files-and-directories.md index d7d22b2f22..10ee2fedaa 100644 --- a/static/docs/user-guide/dvc-files-and-directories.md +++ b/static/docs/user-guide/dvc-files-and-directories.md @@ -1,43 +1,45 @@ # DVC Files and Directories Once initialized in a project, DVC populates its installation directory -(`.dvc/`) with special DVC internal files and directories: +(`.dvc/`) with special DVC internal files and directories. -- `.dvc/config` - this is a configuration file. The config file can be edited by - hand or with a special command: `dvc config`. +### Special DVC internal files and directories -- `.dvc/config.local` - this is a local configuration file, that will overwrite - options in `.dvc/config`. This is useful when you need to specify private - options in your config that you don't want to track and share through Git - (credentials, private locations, etc). The local config file can be edited by - hand or with a special command: `dvc config --local`. +`.dvc/config` - this is a configuration file. The config file can be edited by +hand or with a special command: `dvc config`. -- `.dvc/cache` - the [cache directory](#structure-of-cache-directory) will - contain your data files. 
(The data directories of DVC repositories will only - contain links to the data files in the cache, refer to - [Large Dataset Optimization](/docs/user-guide/large-dataset-optimization).) +`.dvc/config.local` - this is a local configuration file, that will overwrite +options in `.dvc/config`. This is useful when you need to specify private +options in your config that you don't want to track and share through Git +(credentials, private locations, etc). The local config file can be edited by +hand or with a special command: `dvc config --local`. + +`.dvc/cache` - the [cache directory](#structure-of-cache-directory) will +contain your data files. (The data directories of DVC repositories will only +contain links to the data files in the cache, refer to +[Large Dataset Optimization](/docs/user-guide/large-dataset-optimization).) > Note that DVC includes the cache directory in `.gitignore` during the > initialization. No data files (with actual content) will ever be pushed to > the Git repository, only [DVC-files](/doc/user-guide/dvc-file-format) that > are needed to reproduce them. -- `.dvc/state` - this file is used for optimization. It is a SQLite db, that - contains checksums for files in a project with respective timestamps and - inodes to avoid unnecessary checksum computations. It also contains a list of - links (from cache to workspace) created by dvc and is used to cleanup your - workspace when calling `dvc checkout`. +`.dvc/state` - this file is used for optimization. It is a SQLite db, that +contains checksums for files in a project with respective timestamps and +inodes to avoid unnecessary checksum computations. It also contains a list of +links (from cache to workspace) created by dvc and is used to cleanup your +workspace when calling `dvc checkout`. 
-- `.dvc/state-journal` - temporary file for SQLite operations +`.dvc/state-journal` - temporary file for SQLite operations -- `.dvc/state-wal` - another SQLite temporary file +`.dvc/state-wal` - another SQLite temporary file -- `.dvc/updater` - this file is used store latest available version of dvc, - which is used to remind user to upgrade. +`.dvc/updater` - this file is used store latest available version of dvc, +which is used to remind user to upgrade. -- `.dvc/updater.lock` - a lock file for `.dvc/updater`. +`.dvc/updater.lock` - a lock file for `.dvc/updater`. -- `.dvc/lock` - a lock file for the whole dvc project. +`.dvc/lock` - a lock file for the whole dvc project. ## Structure of cache directory From ac6bbe546a413bd35f3047baba070c4bddd36da7 Mon Sep 17 00:00:00 2001 From: Naba7 Date: Mon, 5 Aug 2019 21:54:03 +0530 Subject: [PATCH 4/8] resolve conflict --- static/docs/get-started/agenda.md | 20 ----------- static/docs/tutorial/define-ml-pipeline.md | 14 -------- static/docs/user-guide/analytics.md | 20 ----------- static/docs/user-guide/autocomplete.md | 34 ------------------- static/docs/user-guide/contributing.md | 6 ---- .../user-guide/dvc-files-and-directories.md | 18 ---------- 6 files changed, 112 deletions(-) diff --git a/static/docs/get-started/agenda.md b/static/docs/get-started/agenda.md index 8e25123b05..b8742c46cc 100644 --- a/static/docs/get-started/agenda.md +++ b/static/docs/get-started/agenda.md @@ -26,25 +26,6 @@ contrary, DVC is designed to be pretty agnostic of frameworks, languages, etc. If you have data files or data sets and/or you produce other data files, models, datasets, etc., then you may want to: -<<<<<<< HEAD -- capture and save those data artifacts the same way we capture - code. -- track and switch between different versions of the data easily. -- be able to answer the question of how data artifacts (e.g. ML models) were - built in the first place. -- be able to compare them. 
-- bring best practices to your team and get everyone on the same page. -||||||| merged common ancestors -- capture and save those data artifacts the same way we capture - code, -- track and switch between different versions of the data easily, -- being able to answer the question of how data artifacts (e.g. ML models) were - built in the first place, -- being able to compare them, -- bring best practices to your team and get everyone on the same page. - -Then you are in a good place! Click the `Next` button below to start ↘. -======= - Capture and save those data artifacts the same way we capture code - Track and switch between different versions of the data easily @@ -54,4 +35,3 @@ Then you are in a good place! Click the `Next` button below to start ↘. - Bring best practices to your team and get everyone on the same page Then you are in a good place! Click the `Next` button below to start ↘ ->>>>>>> 88fdf845e2173c49aec0b867db81dc311f20b304 diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md index b617125dae..1b689ebe09 100644 --- a/static/docs/tutorial/define-ml-pipeline.md +++ b/static/docs/tutorial/define-ml-pipeline.md @@ -213,23 +213,9 @@ outs: Sections of the file above include: -<<<<<<< HEAD -- `cmd` — the command to run - -- `deps` — dependencies with md5 checksums - -- `outs` — outputs with md5 checksums -||||||| merged common ancestors -- `cmd` — the command to run. - -- `deps` — dependencies with md5 checksums. - -- `outs` — outputs with md5 checksums. -======= - `cmd` — the command to run - `deps` — dependencies with md5 checksums - `outs` — outputs with md5 checksums ->>>>>>> 88fdf845e2173c49aec0b867db81dc311f20b304 And (as with the `dvc add` command) the `data/.gitignore` file was modified. Now it includes the unarchived command output file `Posts.xml`. 
diff --git a/static/docs/user-guide/analytics.md b/static/docs/user-guide/analytics.md index 465bb4ee5c..cd06d47dfc 100644 --- a/static/docs/user-guide/analytics.md +++ b/static/docs/user-guide/analytics.md @@ -25,25 +25,6 @@ User and event data have a 14 month retention period. DVC's analytics record the following information per event: -<<<<<<< HEAD -- the DVC version, e.g. `0.22.0` -- the operating system information, e.g. `linux`, `ubuntu`, `14.04`, etc -- the underlying version control system, e.g. `git` -- command type, e.g. `CmdDataPull` -- command return code, e.g. `1` -- way the DVC was installed, e.g. `binary` -- a DVC analytics user ID, e.g. `8ca59a29-ddd9-4247-992a-9b4775732aad` which -is generated by [`uuid`](https://docs.python.org/3/library/uuid.html) -||||||| merged common ancestors -- The DVC version, e.g. `0.22.0` -- The operating system information, e.g. `linux`, `ubuntu`, `14.04`, etc -- The underlying version control system, e.g. `git` -- Command type, e.g. `CmdDataPull` -- Command return code, e.g. `1` -- Way the DVC was installed, e.g. `binary` -- A DVC analytics user ID, e.g. `8ca59a29-ddd9-4247-992a-9b4775732aad`. This is - generated by [`uuid`](https://docs.python.org/3/library/uuid.html). -======= - The DVC version, e.g. `0.22.0` - The operating system information, e.g. `linux`, `ubuntu`, `14.04`, etc - The underlying version control system, e.g. `git` @@ -52,7 +33,6 @@ is generated by [`uuid`](https://docs.python.org/3/library/uuid.html) - Way the DVC was installed, e.g. `binary` - A DVC analytics user ID (e.g. `8ca59a29-ddd9-4247-992a-9b4775732aad`), generated by [`uuid`](https://docs.python.org/3/library/uuid.html) ->>>>>>> 88fdf845e2173c49aec0b867db81dc311f20b304 This _does not allow us to track individual users_ but does enable us to accurately measure user counts vs. event counts. 
diff --git a/static/docs/user-guide/autocomplete.md b/static/docs/user-guide/autocomplete.md index cda25a79f5..e49dc5c4b6 100644 --- a/static/docs/user-guide/autocomplete.md +++ b/static/docs/user-guide/autocomplete.md @@ -18,34 +18,12 @@ run -- Generate a stage file from a command and execute the command. Depending on what you typed on the command line so far, it completes: -<<<<<<< HEAD -- available DVC commands - -- options that are available for a particular command - -- file names that make sense in a given context, such as using them as a target - for some commands - -- arguments for selected options. For example, `dvc repro` completes with stage - files to reproduce -||||||| merged common ancestors -- Available DVC commands. - -- Options that are available for a particular command. - -- File names that make sense in a given context, such as using them as a target - for some commands. - -- Arguments for selected options. For example, `dvc repro` completes with stage - files to reproduce. -======= - Available DVC commands - Options that are available for a particular command - File names that make sense in a given context, such as using them as a target for some commands - Arguments for selected options. For example, `dvc repro` completes with stage files to reproduce ->>>>>>> 88fdf845e2173c49aec0b867db81dc311f20b304 Depending upon your preference and the availability of both Bash and Zsh on your system, follow the steps given below to Configure Bash and/or Zsh. @@ -65,21 +43,9 @@ In this case, follow the steps to configure Bash as it is your active shell. First, make sure Bash completion support is installed: -<<<<<<< HEAD -- on a current Linux OS (in a non-minimal installation), bash completion should - be available. - -- on a Mac, install with `brew install bash-completion`. -||||||| merged common ancestors -- On a current Linux OS (in a non-minimal installation), bash completion should - be available. 
- -- On a Mac, install with `brew install bash-completion`. -======= - On a current Linux OS (in a non-minimal installation), bash completion should be available; - On a Mac, install with `brew install bash-completion`. ->>>>>>> 88fdf845e2173c49aec0b867db81dc311f20b304 The DVC specific completion script is located in this path of our main repository: diff --git a/static/docs/user-guide/contributing.md b/static/docs/user-guide/contributing.md index c5042a959e..5f5225d842 100644 --- a/static/docs/user-guide/contributing.md +++ b/static/docs/user-guide/contributing.md @@ -31,14 +31,8 @@ contributing! ## Development environment -<<<<<<< HEAD -- Get the latest development version. Fork and clone the repo. -||||||| merged common ancestors -- Get the latest development version. Fork and clone the repo: -======= - Get the latest development version. Fork and clone the repo: ->>>>>>> 88fdf845e2173c49aec0b867db81dc311f20b304 ```dvc $ git clone git@github.com:/dvc.git ``` diff --git a/static/docs/user-guide/dvc-files-and-directories.md b/static/docs/user-guide/dvc-files-and-directories.md index e15b8c2848..0523949afb 100644 --- a/static/docs/user-guide/dvc-files-and-directories.md +++ b/static/docs/user-guide/dvc-files-and-directories.md @@ -8,29 +8,11 @@ Once initialized in a project, DVC populates its installation directory `.dvc/config` - this is a configuration file. The config file can be edited by hand or with a special command: `dvc config`. -<<<<<<< HEAD -`.dvc/config.local` - this is a local configuration file, that will overwrite -options in `.dvc/config`. This is useful when you need to specify private -options in your config that you don't want to track and share through Git -(credentials, private locations, etc). The local config file can be edited by -hand or with a special command: `dvc config --local`. - -`.dvc/cache` - the [cache directory](#structure-of-cache-directory) will -contain your data files. 
(The data directories of DVC repositories will only -contain links to the data files in the cache, refer to -[Large Dataset Optimization](/docs/user-guide/large-dataset-optimization).) -||||||| merged common ancestors -- `.dvc/cache` - the [cache directory](#structure-of-cache-directory) will - contain your data files. (The data directories of DVC repositories will only - contain links to the data files in the cache, refer to - [Large Dataset Optimization](/docs/user-guide/large-dataset-optimization).) -======= - `.dvc/cache` - the [cache directory](#structure-of-cache-directory) will contain your data files. (The data directories of DVC repositories will only contain links to the data files in the cache, refer to [Large Dataset Optimization](/docs/user-guide/large-dataset-optimization).) See `dvc config cache` for related configuration options. ->>>>>>> 88fdf845e2173c49aec0b867db81dc311f20b304 > Note that DVC includes the cache directory in `.gitignore` during the > initialization. No data files (with actual content) will ever be pushed to From a8f3e126c1774d1874be47b61de80920604999af Mon Sep 17 00:00:00 2001 From: Naba7 Date: Wed, 7 Aug 2019 10:49:41 +0530 Subject: [PATCH 5/8] hypen change, needs->need --- static/docs/understanding-dvc/related-technologies.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md index bd431be574..a77e5b6ee8 100644 --- a/static/docs/understanding-dvc/related-technologies.md +++ b/static/docs/understanding-dvc/related-technologies.md @@ -10,7 +10,7 @@ process. 1. **Git**. The difference is: - DVC extends Git by introducing the concept of _data files_ which are large - files that should NOT be stored in a Git repository but still needs to be + –files that should NOT be stored in a Git repository but still need to be tracked and versioned. 2. 
**Workflow management tools** (pipelines and DAGs): Airflow, Luigi, etc. The @@ -35,9 +35,9 @@ process. - DVC doesn't need to run any services. No graphical user interface as a result, but we expect some GUI services will be created on top of DVC. - - DVC has transparent design which + - DVC has transparent design of [meta files and directories](/doc/user-guide/dvc-files-and-directories) - (including the data cache) have a human-readable format and can be easily + (including the data cache), have a human-readable format and can be easily reused by external tools. 4. **Git workflows** and Git usage methodologies such as Gitflow. The From 36d4d03a5999ff8c929142a493deef018b911a69 Mon Sep 17 00:00:00 2001 From: Naba7 Date: Wed, 7 Aug 2019 11:58:43 +0530 Subject: [PATCH 6/8] changed ; to . --- static/docs/commands-reference/fetch.md | 2 +- static/docs/commands-reference/import-url.md | 8 ++--- static/docs/commands-reference/index.md | 12 +++---- static/docs/commands-reference/install.md | 4 +-- static/docs/commands-reference/pull.md | 1 - static/docs/commands-reference/push.md | 1 - static/docs/commands-reference/run.md | 2 +- static/docs/commands-reference/unprotect.md | 6 ++-- static/docs/commands-reference/version.md | 1 + static/docs/get-started/agenda.md | 4 +-- .../user-guide/dvc-files-and-directories.md | 32 +++++++++++-------- 11 files changed, 39 insertions(+), 34 deletions(-) diff --git a/static/docs/commands-reference/fetch.md b/static/docs/commands-reference/fetch.md index f4dff8707c..d0e3d65c90 100644 --- a/static/docs/commands-reference/fetch.md +++ b/static/docs/commands-reference/fetch.md @@ -100,7 +100,7 @@ specified in DVC-files currently in the workspace are considered by `dvc fetch` of a DVC-file ([experiments](/doc/get-started/experiments)), not just the current one. -- `-T`, `--all-tags` - fetch cache for all tags. Similar to `-a` above +- `-T`, `--all-tags` - fetch cache for all tags. Similar to `-a` above. 
- `--show-checksums` - show checksums instead of file names when printing the download progress. diff --git a/static/docs/commands-reference/import-url.md b/static/docs/commands-reference/import-url.md index 4530273202..264f289145 100644 --- a/static/docs/commands-reference/import-url.md +++ b/static/docs/commands-reference/import-url.md @@ -23,10 +23,10 @@ In some cases it's convenient to add a data file or directory from a remote location into the workspace, such that it will be automatically updated (by `dvc repro`) when the external data source changes. Examples: -- a remote system may produce occasional data files that are used in other - projects; -- a batch process running regularly updates a data file to import; and -- a shared dataset on a remote storage that is managed and updated outside DVC. +- A remote system may produce occasional data files that are used in other + projects. +- A batch process running regularly updates a data file to import. +- A shared dataset on a remote storage that is managed and updated outside DVC. The `dvc import-url` command helps the user create such an external data dependency. The `url` argument specifies the external location of the data to be diff --git a/static/docs/commands-reference/index.md b/static/docs/commands-reference/index.md index 87f5243bbd..3ae493b0cb 100644 --- a/static/docs/commands-reference/index.md +++ b/static/docs/commands-reference/index.md @@ -1,17 +1,17 @@ # Using DVC Commands -DVC is a command-line tool. The typical use case for DVC goes as follows +DVC is a command-line tool. The typical use case for DVC goes as follows: -- In an existing Git repository, initialize a DVC repository with `dvc init`, +- In an existing Git repository, initialize a DVC repository with `dvc init`. - Copy source code files for modeling into the repository and convert the files - into DVC data files with `dvc add` command; + into DVC data files with `dvc add` command. 
- Process raw data files through your data processing and modeling code using - the `dvc run` command; + the `dvc run` command. - Use `--outs` option to specify `dvc run` command outputs which will be - converted to DVC data files after the code runs; + converted to DVC data files after the code runs. - Clone a git repo with the code of your ML application pipeline. However, this will not copy your DVC cache. Use [data remotes](/doc/commands-reference/remote) and `dvc push` to share the - cache (data); + cache (data). - Use `dvc repro` to quickly reproduce your pipeline on a new iteration, after your data item files or source code of your ML application are modified. diff --git a/static/docs/commands-reference/install.md b/static/docs/commands-reference/install.md index 6e684d9771..9923ae68d9 100644 --- a/static/docs/commands-reference/install.md +++ b/static/docs/commands-reference/install.md @@ -46,9 +46,9 @@ The installed Git hook automates executing `dvc push`. ## Installed Git hooks - Git `pre-commit` hook executes `dvc status` before `git commit` to inform the - user about the workspace status; + user about the workspace status. - Git `post-checkout` hook executes `dvc checkout` after `git checkout` to - automatically synchronize the data files with the new workspace state; + automatically synchronize the data files with the new workspace state. - Git `pre-push` hook executes `dvc push` before `git push` to upload files and directories under DVC control to remote. diff --git a/static/docs/commands-reference/pull.md b/static/docs/commands-reference/pull.md index 0e6f244666..7f668957a7 100644 --- a/static/docs/commands-reference/pull.md +++ b/static/docs/commands-reference/pull.md @@ -200,4 +200,3 @@ the `model.p.dvc` stage occurs later, its data was not pulled. Then we ran `dvc pull` specifying the last stage, `model.p.dvc`, and its data was downloaded. 
Finally, we ran `dvc pull` with no options to make sure that all data was already pulled with the previous commands. - diff --git a/static/docs/commands-reference/push.md b/static/docs/commands-reference/push.md index 9fcaa14c42..8153cae3ca 100644 --- a/static/docs/commands-reference/push.md +++ b/static/docs/commands-reference/push.md @@ -339,4 +339,3 @@ Pipelines are up to date. Nothing to reproduce. And running `dvc status --cloud` verifies that indeed there are no more files to upload to the remote cache. - diff --git a/static/docs/commands-reference/run.md b/static/docs/commands-reference/run.md index 1895f865e9..724fcfa047 100644 --- a/static/docs/commands-reference/run.md +++ b/static/docs/commands-reference/run.md @@ -58,7 +58,7 @@ pipeline. dependencies can be specified like this: `-d data.csv -d process.py`. Usually, each dependency is a file or a directory with data, or a code file, or a configuration file. DVC also supports certain - [external dependencies](/doc/user-guide/external-dependencies) + [external dependencies](/doc/user-guide/external-dependencies). DVC builds a computation graph and this list of dependencies is a way to connect different stages with each other. When you run `dvc repro` to diff --git a/static/docs/commands-reference/unprotect.md b/static/docs/commands-reference/unprotect.md index a292e7d210..c8f8262d36 100644 --- a/static/docs/commands-reference/unprotect.md +++ b/static/docs/commands-reference/unprotect.md @@ -29,10 +29,10 @@ on this process. `dvc unprotect` can be an expensive operation (involves copying data), check first whether your task matches one of the cases that are considered safe, even -when cache protected mode is enabled: +when cache protected mode is enabled by: -- Adding more files to a directory input data set (say, images or videos) -- Deleting files from a directory data set +- Adding more files to a directory input data set (say, images or videos). +- Deleting files from a directory data set. 
## Options diff --git a/static/docs/commands-reference/version.md b/static/docs/commands-reference/version.md index 8baa7e1598..bfeb7ea2f2 100644 --- a/static/docs/commands-reference/version.md +++ b/static/docs/commands-reference/version.md @@ -125,3 +125,4 @@ Platform: Linux-4.15.0-50-generic-x86_64-with-debian-buster-sid Binary: False Filesystem type (workspace): ('ext4', '/dev/sdb3') ``` + diff --git a/static/docs/get-started/agenda.md b/static/docs/get-started/agenda.md index b8742c46cc..11a7c75bce 100644 --- a/static/docs/get-started/agenda.md +++ b/static/docs/get-started/agenda.md @@ -29,9 +29,9 @@ datasets, etc., then you may want to: - Capture and save those data artifacts the same way we capture code - Track and switch between different versions of the data easily -- Being able to answer the question of how data artifacts (e.g. ML models) were +- Be able to answer the question of how data artifacts (e.g. ML models) were built in the first place -- Being able to compare them +- Be able to compare them - Bring best practices to your team and get everyone on the same page Then you are in a good place! Click the `Next` button below to start ↘ diff --git a/static/docs/user-guide/dvc-files-and-directories.md b/static/docs/user-guide/dvc-files-and-directories.md index 0523949afb..842c709dbf 100644 --- a/static/docs/user-guide/dvc-files-and-directories.md +++ b/static/docs/user-guide/dvc-files-and-directories.md @@ -5,8 +5,14 @@ Once initialized in a project, DVC populates its installation directory ### Special DVC internal files and directories -`.dvc/config` - this is a configuration file. The config file can be edited by -hand or with a special command: `dvc config`. +- `.dvc/config` - this is a configuration file. The config file can be edited by + hand or with a special command: `dvc config`. + +- `.dvc/config.local` - this is a local configuration file, that will overwrite + options in `.dvc/config`. 
This is useful when you need to specify private + options in your config that you don't want to track and share through Git + (credentials, private locations, etc). The local config file can be edited by + hand or with a special command: `dvc config --local`. - `.dvc/cache` - the [cache directory](#structure-of-cache-directory) will contain your data files. (The data directories of DVC repositories will only @@ -19,22 +25,22 @@ hand or with a special command: `dvc config`. > the Git repository, only [DVC-files](/doc/user-guide/dvc-file-format) that > are needed to reproduce them. -`.dvc/state` - this file is used for optimization. It is a SQLite db, that -contains checksums for files in a project with respective timestamps and -inodes to avoid unnecessary checksum computations. It also contains a list of -links (from cache to workspace) created by dvc and is used to cleanup your -workspace when calling `dvc checkout`. +- `.dvc/state` - this file is used for optimization. It is a SQLite db, that + contains checksums for files in a project with respective timestamps and + inodes to avoid unnecessary checksum computations. It also contains a list of + links (from cache to workspace) created by dvc and is used to cleanup your + workspace when calling `dvc checkout`. -`.dvc/state-journal` - temporary file for SQLite operations +- `.dvc/state-journal` - temporary file for SQLite operations -`.dvc/state-wal` - another SQLite temporary file +- `.dvc/state-wal` - another SQLite temporary file -`.dvc/updater` - this file is used store latest available version of dvc, -which is used to remind user to upgrade. +- `.dvc/updater` - this file is used store latest available version of dvc, + which is used to remind user to upgrade. -`.dvc/updater.lock` - a lock file for `.dvc/updater`. +- `.dvc/updater.lock` - a lock file for `.dvc/updater`. -`.dvc/lock` - a lock file for the whole dvc project. +- `.dvc/lock` - a lock file for the whole dvc project. 
## Structure of cache directory From eab74525af5004fb4c08afe95677f3c67ba7e671 Mon Sep 17 00:00:00 2001 From: Naba7 Date: Sat, 10 Aug 2019 15:29:02 +0530 Subject: [PATCH 7/8] all bullets --- static/docs/commands-reference/status.md | 8 ++++---- static/docs/commands-reference/version.md | 12 +++++------ static/docs/get-started/agenda.md | 4 ++-- .../understanding-dvc/related-technologies.md | 4 ++-- static/docs/understanding-dvc/resources.md | 20 +++++++++---------- static/docs/user-guide/analytics.md | 4 ++-- .../user-guide/contributing-documentation.md | 4 ++-- static/docs/user-guide/contributing.md | 18 ++++++++--------- 8 files changed, 37 insertions(+), 37 deletions(-) diff --git a/static/docs/commands-reference/status.md b/static/docs/commands-reference/status.md index 99a0950093..d4e1f6f0aa 100644 --- a/static/docs/commands-reference/status.md +++ b/static/docs/commands-reference/status.md @@ -71,13 +71,13 @@ outputs described in it. commands like `dvc commit` or `dvc repro`, `dvc run` should be run to update the file. Possible states are: - - _new_: output exists in workspace, but there is no corresponding checksum + - _new_: Output exists in workspace, but there is no corresponding checksum calculated and saved in the DVC-file for this output yet. - - _modified_: output or dependency exists in workspace, but the corresponding + - _modified_: Output or dependency exists in workspace, but the corresponding checksum in the DVC-file is not up to date. - - _deleted_: output or dependency does not exist in workspace, but still + - _deleted_: Output or dependency does not exist in workspace, but still referred in the DVC-file. - - _not in cache_: output exists in workspace and the corresponding checksum in + - _not in cache_: Output exists in workspace and the corresponding checksum in the DVC-file is up to date, but there is no corresponding cache entry. 
diff --git a/static/docs/commands-reference/version.md b/static/docs/commands-reference/version.md index bfeb7ea2f2..16a364329e 100644 --- a/static/docs/commands-reference/version.md +++ b/static/docs/commands-reference/version.md @@ -61,11 +61,11 @@ The detail of `Binary` depends on the way DVC was downloading and - **`Binary: True`** - displayed when DVC is downloaded/installed as one of: - Debian package (`.deb`) - file used to install packages in several Linux - distributions, like Ubuntu. + distributions, like Ubuntu - Red Hat package (`.rpm`) - file used to install packages in some Linux based distributions, such as Fedora, CentOS, etc. - - PKG file (`.pkg`) - file used to install apps on macOS. - - Windows executable (`.exe`) - file used to install applications on Windows. + - PKG file (`.pkg`) - file used to install apps on macOS + - Windows executable (`.exe`) - file used to install applications on Windows These downloads are available from our [home page](/). They ultimately contain a binary bundle, which is the executable version of a software program, @@ -76,11 +76,11 @@ The detail of `Binary` depends on the way DVC was downloading and - **`Binary: False`** - shown when DVC is downloaded and installed from: - [DVC's GitHub repository](https://github.com/iterative/dvc) - where core - source code is hosted. + source code is hosted - [The Python Package Index (PyPI)](https://pypi.org/project/dvc/) - source - code is stored as a Python package. + code is stored as a Python package - [Homebrew package manager](https://github.com/iterative/homebrew-dvc) (for - macOS systems) - source code is stored as Python package. 
+ macOS systems) - source code is stored as Python package This method of installation involves downloading DVC source code, and following certain setup instructions (See the diff --git a/static/docs/get-started/agenda.md b/static/docs/get-started/agenda.md index 647654a5f6..f86f8a17c9 100644 --- a/static/docs/get-started/agenda.md +++ b/static/docs/get-started/agenda.md @@ -29,9 +29,9 @@ datasets and you want to: - Capture and save those data artifacts the same way we capture code - Track and switch between different versions of the data easily -- Being able to answer the question of how data artifacts (e.g. ML models) were +- Be able to answer the question of how data artifacts (e.g. ML models) were built in the first place -- Being able to compare them +- Be able to compare them - Bring best practices to your team and get everyone on the same page Then you are in a good place! Click the `Next` button below to start ↘ diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md index a77e5b6ee8..8d5bfe83e5 100644 --- a/static/docs/understanding-dvc/related-technologies.md +++ b/static/docs/understanding-dvc/related-technologies.md @@ -35,9 +35,9 @@ process. - DVC doesn't need to run any services. No graphical user interface as a result, but we expect some GUI services will be created on top of DVC. - - DVC has transparent design of + - DVC has transparent design: [meta files and directories](/doc/user-guide/dvc-files-and-directories) - (including the data cache), have a human-readable format and can be easily + (including the data cache) have a human-readable format and can be easily reused by external tools. 4. **Git workflows** and Git usage methodologies such as Gitflow. 
The
diff --git a/static/docs/understanding-dvc/resources.md b/static/docs/understanding-dvc/resources.md
index d63ce08f8a..80f21ae405 100644
--- a/static/docs/understanding-dvc/resources.md
+++ b/static/docs/understanding-dvc/resources.md
@@ -27,16 +27,16 @@
 
 ## Articles
 
-- [Using DVC to create an efficient version control system for data projects](https://medium.com/qonto-engineering/using-dvc-to-create-an-efficient-version-control-system-for-data-projects-96efd94355fe);
-- [Our Machine Learning Workflow: DVC, MLFlow and Training in Docker Containers](https://medium.com/ixorthink/our-machine-learning-workflow-dvc-mlflow-and-training-in-docker-containers-5b9c80cdf804);
-- [Principled Machine Learning: Practices and Tools for Efficient Collaboration](https://dev.to/robogeek/principled-machine-learning-4eho);
-- [Data version control with DVC. What do the authors have to say?](https://towardsdatascience.com/data-version-control-with-dvc-what-do-the-authors-have-to-say-3c3b10f27ee);
-- [Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis](https://towardsdatascience.com/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8);
-- [My First Try at DVC](https://stdiff.net/MB2019051301.html);
-- [Machine Learning Reproducibility crisis](https://petewarden.com/2018/03/19/the-machine-learning-reproducibility-crisis/);
-- [Data Science Workflow](http://fouryears.eu/2018/11/29/the-data-science-workflow/);
-- [The Data Science Workflow](https://towardsdatascience.com/the-data-science-workflow-43859db0415);
-- [Data Versioning Notebook](https://www.kaggle.com/rtatman/kerneld4769833fe);
+- [Using DVC to create an efficient version control system for data projects](https://medium.com/qonto-engineering/using-dvc-to-create-an-efficient-version-control-system-for-data-projects-96efd94355fe)
+- [Our Machine Learning Workflow: DVC, MLFlow and Training in Docker
Containers](https://medium.com/ixorthink/our-machine-learning-workflow-dvc-mlflow-and-training-in-docker-containers-5b9c80cdf804)
+- [Principled Machine Learning: Practices and Tools for Efficient Collaboration](https://dev.to/robogeek/principled-machine-learning-4eho)
+- [Data version control with DVC. What do the authors have to say?](https://towardsdatascience.com/data-version-control-with-dvc-what-do-the-authors-have-to-say-3c3b10f27ee)
+- [Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis](https://towardsdatascience.com/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8)
+- [My First Try at DVC](https://stdiff.net/MB2019051301.html)
+- [Machine Learning Reproducibility crisis](https://petewarden.com/2018/03/19/the-machine-learning-reproducibility-crisis/)
+- [Data Science Workflow](http://fouryears.eu/2018/11/29/the-data-science-workflow/)
+- [The Data Science Workflow](https://towardsdatascience.com/the-data-science-workflow-43859db0415)
+- [Data Versioning Notebook](https://www.kaggle.com/rtatman/kerneld4769833fe)
 - [First Impressions of Data Science Version Control](https://medium.com/@christopher.samiullah/first-impressions-of-data-science-version-control-dvc-fe96ab29cdda?sk=05e1f1d1ba16c9037046f3568956f16c)
 
 ## Slides
diff --git a/static/docs/user-guide/analytics.md b/static/docs/user-guide/analytics.md
index cd06d47dfc..3b154a4b66 100644
--- a/static/docs/user-guide/analytics.md
+++ b/static/docs/user-guide/analytics.md
@@ -11,9 +11,9 @@ current work.
 Anonymous aggregate user analytics allow us to prioritize fixes and features
 based on how, where and when people use DVC. For example:
 
 - If reflinks (depends on a file system type) are supported for most users, we
-  can keep cache protected mode off by default (see `dvc unprotect`);
+  can keep cache protected mode off by default (see `dvc unprotect`).
 - Collecting the OS version and the way DVC was installed allows us to decide
-  what versions of OS to prioritize and support;
+  what versions of OS to prioritize and support.
 - If usage of some command is negligible small it makes us think about issues
   with a command or documentation.
diff --git a/static/docs/user-guide/contributing-documentation.md b/static/docs/user-guide/contributing-documentation.md
index 01c854acc6..78dad9c614 100644
--- a/static/docs/user-guide/contributing-documentation.md
+++ b/static/docs/user-guide/contributing-documentation.md
@@ -12,10 +12,10 @@ To contribute documentation you need to know these locations:
 - [Content](https://github.com/iterative/dvc.org/tree/master/static/docs)
   (`/static/docs`) -
   [Markdown](https://guides.github.com/features/mastering-markdown/) files of
-  the different pages to render dynamically in the browser;
+  the different pages to render dynamically in the browser
 - [Images](https://github.com/iterative/dvc.org/tree/master/static/img)
   (`/static/img`) - add new images, gif, svg, etc here. Reference them from the
-  Markdown files like this: `![](/static/img/reproducibility.png)`;
+  Markdown files like this: `![](/static/img/reproducibility.png)`
 - [Sections](https://github.com/iterative/dvc.org/tree/master/src/Documentation/sidebar.json)
   (`.../sidebar.json`) - edit it to register a new section for the navigation
   menu.
diff --git a/static/docs/user-guide/contributing.md b/static/docs/user-guide/contributing.md
index f95b845ec0..d81782d4a9 100644
--- a/static/docs/user-guide/contributing.md
+++ b/static/docs/user-guide/contributing.md
@@ -17,13 +17,13 @@ to learn how to submit your changes.
 ## Submitting changes
 
 - Open a new issue in the
-  [issue tracker](https://github.com/iterative/dvc/issues);
+  [issue tracker](https://github.com/iterative/dvc/issues).
 - Setup the [development environment](#development-environment) if you need to
-  run tests or [run](#running-development-version) the DVC with your changes;
+  run tests or [run](#running-development-version) the DVC with your changes.
 - Fork [DVC](https://github.com/iterative/dvc.git) and prepare necessary
-  changes;
-- Add tests for your changes to `tests/test_*.py`;
-- [Run tests](#running-tests) and make sure all of them pass;
+  changes.
+- Add tests for your changes to `tests/test_*.py`.
+- [Run tests](#running-tests) and make sure all of them pass.
 - Submit a pull request, referencing any issues it addresses.
 
 We will review your pull request as soon as possible. Thank you for
@@ -288,11 +288,11 @@ Fixes #(Github issue id).
 Message types:
 
 - *component* - name of a component that this patch is affecting. Use `dvc` in a
-  general case;
-- _short description_ - short description of the patch;
+  general case.
+- _short description_ - short description of the patch.
 - _long description_ - If needed, longer message describing the patch in more
-  details;
-- _github issue id_ - An id of the Github issue that this patch is addressing
+  details.
+- _github issue id_ - An id of the Github issue that this patch is addressing.
Example: From 29f0aa555bf36b7d78b344c9439b373857b2005a Mon Sep 17 00:00:00 2001 From: Naba7 Date: Mon, 12 Aug 2019 10:49:57 +0530 Subject: [PATCH 8/8] modified --- .../understanding-dvc/related-technologies.md | 146 +++++++++--------- static/docs/user-guide/autocomplete.md | 2 +- .../user-guide/contributing-documentation.md | 10 +- static/docs/user-guide/contributing.md | 10 +- static/docs/user-guide/dvc-file-format.md | 28 ++-- .../user-guide/dvc-files-and-directories.md | 6 +- 6 files changed, 101 insertions(+), 101 deletions(-) diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md index 8d5bfe83e5..506131e762 100644 --- a/static/docs/understanding-dvc/related-technologies.md +++ b/static/docs/understanding-dvc/related-technologies.md @@ -9,119 +9,119 @@ process. 1. **Git**. The difference is: - - DVC extends Git by introducing the concept of _data files_ which are large - –files that should NOT be stored in a Git repository but still need to be - tracked and versioned. + - DVC extends Git by introducing the concept of _data files_ – large files + that should NOT be stored in a Git repository but still need to be tracked + and versioned. 2. **Workflow management tools** (pipelines and DAGs): Airflow, Luigi, etc. The differences are: - - DVC is focused on data science and modeling. As a result, DVC pipelines are - lightweight, easy to create and modify. However, DVC lacks pipeline - execution features like execution monitoring, execution error handling, and - recovering. + - DVC is focused on data science and modeling. As a result, DVC pipelines are + lightweight, easy to create and modify. However, DVC lacks pipeline + execution features like execution monitoring, execution error handling, and + recovering. - - DVC is purely a command line tool without a graphical user interface (GUI) - and doesn't run any daemons or servers. 
Nevertheless, DVC can generate - images with pipeline and experiment workflow visualization. + - DVC is purely a command line tool without a graphical user interface (GUI) + and doesn't run any daemons or servers. Nevertheless, DVC can generate + images with pipeline and experiment workflow visualization. 3. **Experiment management software** today is mostly designed for enterprise usage. An open-sourced experimentation tool example: http://studio.ml/. The differences are: - - DVC uses Git as the underlying platform for experiment tracking instead of - a web application. + - DVC uses Git as the underlying platform for experiment tracking instead of + a web application. - - DVC doesn't need to run any services. No graphical user interface as a - result, but we expect some GUI services will be created on top of DVC. + - DVC doesn't need to run any services. No graphical user interface as a + result, but we expect some GUI services will be created on top of DVC. - - DVC has transparent design: - [meta files and directories](/doc/user-guide/dvc-files-and-directories) - (including the data cache) have a human-readable format and can be easily - reused by external tools. + - DVC has transparent design: + [meta files and directories](/doc/user-guide/dvc-files-and-directories) + (including the data cache) have a human-readable format and can be easily + reused by external tools. 4. **Git workflows** and Git usage methodologies such as Gitflow. The differences are: - - DVC supports a new experimentation methodology that integrates easily with - a Git workflow. A separate branch should be created for each experiment, - with a subsequent merge of this branch if it was successful. + - DVC supports a new experimentation methodology that integrates easily with + a Git workflow. A separate branch should be created for each experiment, + with a subsequent merge of this branch if it was successful. 
- - DVC innovates by giving experimenters the ability to easily navigate - through past experiments without recomputing them. + - DVC innovates by giving experimenters the ability to easily navigate + through past experiments without recomputing them. 5) **Makefile** (and it's analogues). The differences are: - - DVC utilizes a DAG: + - DVC utilizes a DAG: - - The DAG is defined by [DVC-files](/doc/user-guide/dvc-file-format) (with - file names `.dvc` or `Dvcfile`). + - The DAG is defined by [DVC-files](/doc/user-guide/dvc-file-format) (with + file names `.dvc` or `Dvcfile`). - - One DVC-file defines one node in the DAG. All DVC-files in a repository - make up a single pipeline (think a single Makefile). All DVC-files (and - corresponding pipeline commands) are implicitly combined through their - inputs and outputs, to simplify conflict resolving during merges. + - One DVC-file defines one node in the DAG. All DVC-files in a repository + make up a single pipeline (think a single Makefile). All DVC-files (and + corresponding pipeline commands) are implicitly combined through their + inputs and outputs, to simplify conflict resolving during merges. - - DVC provides a simple command `dvc run CMD` to generate a DVC-file - automatically based on the provided command, dependencies, and outputs. + - DVC provides a simple command `dvc run CMD` to generate a DVC-file + automatically based on the provided command, dependencies, and outputs. - - File tracking: + - File tracking: - - DVC tracks files based on checksum (md5) instead of file timestamps. This - helps avoid running into heavy processes like model re-training when you - checkout a previous, trained version of a modeling code (Makefile will - retrain the model). + - DVC tracks files based on checksum (md5) instead of file timestamps. This + helps avoid running into heavy processes like model re-training when you + checkout a previous, trained version of a modeling code (Makefile will + retrain the model). 
- - DVC uses file timestamps and inodes for optimization. This allows DVC to - avoid recomputing all dependency files checksum, which would be highly - problematic when working with large files (10 GB+). + - DVC uses file timestamps and inodes for optimization. This allows DVC to + avoid recomputing all dependency files checksum, which would be highly + problematic when working with large files (10 GB+). 6. **Git-annex**. The differences are: - - DVC uses the idea of storing the content of large files (that you don't - want to see in your Git repository) in a local key-value store and use file - symlinks instead of the actual files. + - DVC uses the idea of storing the content of large files (that you don't + want to see in your Git repository) in a local key-value store and use file + symlinks instead of the actual files. - - DVC can use reflinks\* or hardlinks (depending on the system) instead of - symlinks to improve performance and make the user experience better. + - DVC can use reflinks\* or hardlinks (depending on the system) instead of + symlinks to improve performance and make the user experience better. - - DVC optimizes checksum calculation. + - DVC optimizes checksum calculation. - - Git-annex is a datafile-centric system whereas DVC is focused on providing - a workflow for machine learning and reproducible experiments. When a DVC or - Git-annex repository is cloned via git clone, data files won't be copied to - the local machine as file content is stored in separate data remotes. - However, [DVC-files](/doc/user-guide/dvc-file-format) (which provide the - reproducible workflow) are always included in the cloned Git repository and - hence can be recreated locally with minimal effort. + - Git-annex is a datafile-centric system whereas DVC is focused on providing + a workflow for machine learning and reproducible experiments. 
When a DVC or + Git-annex repository is cloned via git clone, data files won't be copied to + the local machine as file content is stored in separate data remotes. + However, [DVC-files](/doc/user-guide/dvc-file-format) (which provide the + reproducible workflow) are always included in the cloned Git repository and + hence can be recreated locally with minimal effort. - - DVC is not fundamentally bound to Git, having the option of changing the - repository format. + - DVC is not fundamentally bound to Git, having the option of changing the + repository format. 7) **Git-LFS** (Large File Storage). The differences are: - - DVC does not require special Git servers like Git-LFS demands. Any cloud - storage like S3, GCS, or on-premises SSH server can be used as a backend - for datasets and models, no additional databases, servers or infrastructure - are required. + - DVC does not require special Git servers like Git-LFS demands. Any cloud + storage like S3, GCS, or on-premises SSH server can be used as a backend + for datasets and models, no additional databases, servers or infrastructure + are required. - - DVC is not fundamentally bound to Git, having the option of changing the - repository format. + - DVC is not fundamentally bound to Git, having the option of changing the + repository format. - - DVC does not add any hooks to Git by default. To checkout data files, the - `dvc checkout` command has to be run after each `git checkout` and - `git clone` command. It gives more granularity on managing data and code - separately. Hooks could be configured to make workflow simpler. + - DVC does not add any hooks to Git by default. To checkout data files, the + `dvc checkout` command has to be run after each `git checkout` and + `git clone` command. It gives more granularity on managing data and code + separately. Hooks could be configured to make workflow simpler. 
- - DVC attempts to use reflinks\* and has other - [file linking options](/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache). - This way the `dvc checkout` command does not actually copy data files from - cache to the workspace, as copying files is a heavy operation for large - files (30 GB+). + - DVC attempts to use reflinks\* and has other + [file linking options](/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache). + This way the `dvc checkout` command does not actually copy data files from + cache to the workspace, as copying files is a heavy operation for large + files (30 GB+). - - `git-lfs` was not made with data science scenarios in mind, so it does not - provide related features (e.g. pipelines, metrics), and thus Github has a - limit of 2 GB per repository. + - `git-lfs` was not made with data science scenarios in mind, so it does not + provide related features (e.g. pipelines, metrics), and thus Github has a + limit of 2 GB per repository. --- diff --git a/static/docs/user-guide/autocomplete.md b/static/docs/user-guide/autocomplete.md index e49dc5c4b6..ee36122961 100644 --- a/static/docs/user-guide/autocomplete.md +++ b/static/docs/user-guide/autocomplete.md @@ -44,7 +44,7 @@ In this case, follow the steps to configure Bash as it is your active shell. First, make sure Bash completion support is installed: - On a current Linux OS (in a non-minimal installation), bash completion should - be available; + be available. - On a Mac, install with `brew install bash-completion`. The DVC specific completion script is located in this path of our main diff --git a/static/docs/user-guide/contributing-documentation.md b/static/docs/user-guide/contributing-documentation.md index 78dad9c614..a7b2db3087 100644 --- a/static/docs/user-guide/contributing-documentation.md +++ b/static/docs/user-guide/contributing-documentation.md @@ -10,14 +10,14 @@ run the website. 
To contribute documentation you need to know these locations: - [Content](https://github.com/iterative/dvc.org/tree/master/static/docs) - (`/static/docs`) - + (`/static/docs`): [Markdown](https://guides.github.com/features/mastering-markdown/) files of - the different pages to render dynamically in the browser + the different pages to render dynamically in the browser. - [Images](https://github.com/iterative/dvc.org/tree/master/static/img) - (`/static/img`) - add new images, gif, svg, etc here. Reference them from the - Markdown files like this: `![](/static/img/reproducibility.png)` + (`/static/img`): Add new images, gif, svg, etc here. Reference them from the + Markdown files like this: `![](/static/img/reproducibility.png)`. - [Sections](https://github.com/iterative/dvc.org/tree/master/src/Documentation/sidebar.json) - (`.../sidebar.json`) - edit it to register a new section for the navigation + (`.../sidebar.json`): Edit it to register a new section for the navigation menu. Merging the appropriate changes to these files into the master branch is enough diff --git a/static/docs/user-guide/contributing.md b/static/docs/user-guide/contributing.md index d81782d4a9..67e02fad8d 100644 --- a/static/docs/user-guide/contributing.md +++ b/static/docs/user-guide/contributing.md @@ -288,11 +288,11 @@ Fixes #(Github issue id). Message types: - *component* - name of a component that this patch is affecting. Use `dvc` in a - general case. -- _short description_ - short description of the patch. -- _long description_ - If needed, longer message describing the patch in more - details. -- _github issue id_ - An id of the Github issue that this patch is addressing. 
+ general case +- _short description_ - short description of the patch +- _long description_ - if needed, longer message describing the patch in more + details +- _github issue id_ - id of the GitHub issue that this patch is addressing Example: diff --git a/static/docs/user-guide/dvc-file-format.md b/static/docs/user-guide/dvc-file-format.md index 57ad6bcc6b..dc6e1794fb 100644 --- a/static/docs/user-guide/dvc-file-format.md +++ b/static/docs/user-guide/dvc-file-format.md @@ -45,25 +45,25 @@ outs: On the top level, `.dvc` file consists of these fields: -- `cmd`: a command that is being run in this stage -- `deps`: a list of dependencies for this stage -- `outs`: a list of outputs for this stage +- `cmd`: Command that is being run in this stage +- `deps`: List of dependencies for this stage +- `outs`: List of outputs for this stage - `md5`: md5 checksum for this DVC-file -- `locked`: whether or not this stage is locked from reproduction -- `wdir`: directory to run command in (default `.`) +- `locked`: Whether or not this stage is locked from reproduction +- `wdir`: Directory to run command in (default `.`) A dependency entry consists of a pair of fields: -- `path`: path to the dependency, relative to the `wdir` path (always present) +- `path`: Path to the dependency, relative to the `wdir` path (always present) - `md5`: md5 checksum for the dependency (most [stages](/doc/commands-reference/run)) -- `etag`: strong ETag response header (only HTTP external +- `etag`: Strong ETag response header (only HTTP external dependencies created with `dvc import-url`) -- `repo`: this entry is only for DVC repository external dependencies created +- `repo`: This entry is only for DVC repository external dependencies created with `dvc import`, and in itself contains the following fields: - `url`: URL of Git repository with source DVC project - - `rev_lock`: revision or version (Git commit hash) of the DVC repo at the + - `rev_lock`: Revision or version (Git commit hash) of the DVC 
repo at the time of importing the dependency > See the examples in @@ -72,15 +72,15 @@ A dependency entry consists of a pair of fields: An output entry consists of these fields: -- `path`: path to the output, relative to the `wdir` path +- `path`: Path to the output, relative to the `wdir` path - `md5`: md5 checksum for the output -- `cache`: whether or not dvc should cache the output -- `metric`: whether or not this file is a metric file +- `cache`: Whether or not dvc should cache the output +- `metric`: Whether or not this file is a metric file A metric entry consists of these fields: -- `type`: type of the metrics file (e.g. raw/json/tsv/htsv/csv/hcsv) -- `xpath`: path within the metrics file to the metrics data(e.g. `AUC.value` for +- `type`: Type of the metrics file (e.g. raw/json/tsv/htsv/csv/hcsv) +- `xpath`: Path within the metrics file to the metrics data(e.g. `AUC.value` for `{"AUC": {"value": 0.624321}}`) A `meta` entry consists of `key: value` pairs such as `name: John`. A meta entry diff --git a/static/docs/user-guide/dvc-files-and-directories.md b/static/docs/user-guide/dvc-files-and-directories.md index 842c709dbf..d5f69652ad 100644 --- a/static/docs/user-guide/dvc-files-and-directories.md +++ b/static/docs/user-guide/dvc-files-and-directories.md @@ -1,7 +1,7 @@ # DVC Files and Directories Once initialized in a project, DVC populates its installation directory -(`.dvc/`) with special DVC internal files and directories. +(`.dvc/`) with special DVC internal files and directories: ### Special DVC internal files and directories @@ -38,9 +38,9 @@ Once initialized in a project, DVC populates its installation directory - `.dvc/updater` - this file is used store latest available version of dvc, which is used to remind user to upgrade. -- `.dvc/updater.lock` - a lock file for `.dvc/updater`. +- `.dvc/updater.lock` - lock file for `.dvc/updater` -- `.dvc/lock` - a lock file for the whole dvc project. 
+- `.dvc/lock` - lock file for the whole dvc project ## Structure of cache directory