cmd: update commit-related info. #1989

Merged 23 commits on Dec 13, 2020. Changes shown from 15 commits.

Commits
5f5c1ff
cmd: update commit-related info
jorgeorpinel Dec 1, 2020
68c9c1a
Merge branch 'master' into cmd/commit
jorgeorpinel Dec 8, 2020
94eb1f8
cmd: improve commit intro
jorgeorpinel Dec 8, 2020
9c22f7e
cmd: update commit description
jorgeorpinel Dec 8, 2020
45da240
cmd: shorten commit intro
jorgeorpinel Dec 9, 2020
9b06323
cmd: mention that commit is an alternative to add
jorgeorpinel Dec 9, 2020
6ba7ea0
cmd: generalize use case of commit (not just about stages)
jorgeorpinel Dec 9, 2020
98c464f
cmd: separate add from repro cases of commit
jorgeorpinel Dec 10, 2020
a0cb751
cmd: term: don't say "under development"
jorgeorpinel Dec 10, 2020
ba6e109
cmd: clarify commit scenarios
jorgeorpinel Dec 10, 2020
3711dc8
cmd: clarify diffs among -no-cache options in run, repro
jorgeorpinel Dec 10, 2020
ce8ccad
cmd: update import/run --no-exec regarding caching
jorgeorpinel Dec 10, 2020
1c7d4a1
cmd: reinstate note on caching in import refs.
jorgeorpinel Dec 10, 2020
0be1fe0
cmd: rephrase first p in commit
jorgeorpinel Dec 10, 2020
f63a664
cmd: simplify main scenario in commit desc.
jorgeorpinel Dec 10, 2020
12a1a8e
Merge branch 'master' into cmd/commit
jorgeorpinel Dec 12, 2020
97be1b9
cmd: more uses for run -O
jorgeorpinel Dec 12, 2020
144749f
cmd: mention import --no-exec in commit
jorgeorpinel Dec 12, 2020
859e874
cmd: restructure commit desc
jorgeorpinel Dec 12, 2020
cf34cf6
cmd: impro/add motivation to run/repro/import --no-commit/exec
jorgeorpinel Dec 12, 2020
11c2768
cmd: update motivation for --no-exec
jorgeorpinel Dec 13, 2020
36997ed
cmd: Other->Specifically in secondary commit scenarios
jorgeorpinel Dec 13, 2020
b45e324
cmd: simplify import* --no-exec
jorgeorpinel Dec 13, 2020
9 changes: 4 additions & 5 deletions content/docs/command-reference/add.md
@@ -76,7 +76,7 @@ A `dvc add` target can be either a file or a directory. In the latter case, a
`.dvc` file is created for the top of the hierarchy (with default name
`<dir_name>.dvc`).

Every file inside is added to the cache (unless the `--no-commit` option is
Every file inside is stored in the cache (unless the `--no-commit` option is
used), but DVC does not produce individual `.dvc` files for each file in the
entire tree. Instead, the single `.dvc` file references a special JSON file in
the cache (with `.dir` extension), that in turn points to the added files.
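
The directory handling described in this hunk can be pictured with a small stand-alone sketch. It uses only plain shell tools, not DVC itself. The sample files, the `manifest.json` name, and the exact JSON and `.dvc` field layout are assumptions made for illustration:

```shell
set -eu
work=$(mktemp -d)
mkdir -p "$work/images/train"
printf 'a' > "$work/images/train/cat.txt"
printf 'b' > "$work/images/dog.txt"

# One JSON entry per file: its hash and its path relative to the top directory.
manifest="$work/manifest.json"   # hypothetical stand-in for the cached .dir file
(
    cd "$work/images"
    printf '[\n'
    find . -type f | sort | sed 's|^\./||' | while IFS= read -r rel; do
        digest=$(md5sum "$rel" | cut -d' ' -f1)
        printf '  {"md5": "%s", "relpath": "%s"}\n' "$digest" "$rel"
    done | sed '$!s/$/,/'        # comma after every entry except the last
    printf ']\n'
) > "$manifest"

# A single metadata file for the whole tree points at the manifest by hash.
top_digest=$(md5sum "$manifest" | cut -d' ' -f1)
printf 'outs:\n- md5: %s.dir\n  path: images\n' "$top_digest" > "$work/images.dvc"
cat "$manifest"
```

Only one metadata file is written no matter how many files the directory holds, which is the point the paragraph above makes.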
@@ -128,10 +128,9 @@ not.
among the `targets`, this option is ignored. For each file found, a new `.dvc`
file is created using the process described in this command's description.

- `--no-commit` - do not save outputs to cache. A `.dvc` file is created, while
nothing is added to the cache. (`dvc status` will report that the file is
`not in cache`.) Use `dvc commit` when ready to commit outputs with DVC. This
is analogous to using `git add` before `git commit`.
- `--no-commit` - do not store `targets` in the cache (the `.dvc` file is still
created). Use `dvc commit` to finish the operation (similar to `git commit`
after `git add`).

- `--file <filename>` - specify name of the `.dvc` file it generates. This
option works only if there is a single target. By default the name of the
121 changes: 56 additions & 65 deletions content/docs/command-reference/commit.md
@@ -1,7 +1,6 @@
# commit

Record changes to DVC-tracked files in the <abbr>project</abbr>, by saving them
to the <abbr>cache</abbr> and updating the `dvc.lock` or `.dvc` files.
Record changes to files or directories tracked by DVC.

## Synopsis

@@ -17,65 +16,56 @@ positional arguments:

## Description

The `dvc commit` command is useful for several scenarios, when data already
tracked by DVC changes: when a [stage](/doc/command-reference/run) or
[pipeline](/doc/command-reference/dag) is in development/experimentation; to
force-update the `dvc.lock` or `.dvc` files without reproducing stages or
pipelines; or to mark existing files/dirs as stage <abbr>outputs</abbr>. These
scenarios are further detailed below.

- Code or data for a stage is under active development, with multiple iterations
(experiments) in code, configuration, or data. Use the `--no-commit` option of
DVC commands (`dvc add`, `dvc run`, `dvc repro`) to avoid caching unnecessary
data repeatedly. Use `dvc commit` when the DVC-tracked data is final.

💡 For convenience, a pre-commit Git hook is available to remind you to
`dvc commit` when needed. See `dvc install` for more details.

- Sometimes we want to edit source code, config, or data files in a way that
doesn't cause changes in the results of their data pipeline. We might write
add code comments, change indentation, remove some debugging printouts, or any
other change that doesn't cause changed stage outputs. However, DVC will
notice that some <abbr>dependencies</abbr> have changed, and expect you to
reproduce the whole pipeline. If you're sure no pipeline results would change,
use `dvc commit` to force update the `dvc.lock` or `.dvc` files and cache.

- In some cases, we have previously executed a stage, and later notice that some
of the files/directories used by the stage as dependencies or created as
outputs are missing from `dvc.yaml`. It is possible to
[add missing data to an existing stage](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage),
and then `dvc commit` can be used to save outputs to the cache (and update
`dvc.lock`)

- It's always possible to manually execute the command or source code used in a
stage without DVC (outputs must be unprotected or removed first in certain
cases, see `dvc unprotect`). Once the desired result is reached, use
`dvc commit` to update the `dvc.lock` file(s) and store changed data to the
cache.

Let's take a look at what is happening in the first scenario closely. Normally
DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the
<abbr>cache</abbr> after creating or updating a `dvc.lock` or `.dvc` file. What
_commit_ means is that DVC:

- Computes a hash for the file/directory.
- Enters the hash value and file name in the `dvc.lock` or `.dvc` file.
- Tells Git to ignore the file/directory (adding them to `.gitignore`). (Note
that if the <abbr>project</abbr> was initialized with no Git support
(`dvc init --no-scm`), this does not happen.)
- Adds the file(s) in question to the cache.

There are many cases where the last step is not desirable (for example rapid
iterations on an experiment). The `--no-commit` option prevents it (on the
commands where it's available). The file hash is still computed and added to the
`dvc.lock` or `.dvc` file, but the actual data is not cached. And this is where
the `dvc commit` command comes into play: It performs that last step when
Stores the current contents of files and directories tracked by DVC in the
<abbr>cache</abbr>, and updates `dvc.lock` or `.dvc` files as needed. This
forces DVC to accept the contents of tracked data currently in the
<abbr>workspace</abbr>, even if they have changed. We explore the scenarios in
which this can be useful next.

DVC commands that track data (`dvc add`, `dvc repro`, `dvc run`) do the
following for each file or directory in question:

- Save the hash value of the file(s) in the `dvc.lock` or `.dvc` file.
- Store the file contents in the cache.

The last step can be skipped with the `--no-commit` option of those commands,
for example when testing/experimenting with data or
[stages](/doc/command-reference/run). This avoids caching unfinished data. And
that's where `dvc commit` comes into play: It performs that last step when
needed.
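
As a rough mental model of the two steps above, the following stand-alone sketch mimics them with plain shell tools. It is not DVC's implementation: the `.meta` file, the cache layout, and the function names are hypothetical.

```shell
set -eu
work=$(mktemp -d)
cache="$work/cache"
printf 'col_a,col_b\n1,2\n' > "$work/data.csv"

# Step 1: save the hash of the file in a metadata file (always done).
record_hash() {
    md5sum "$1" | cut -d' ' -f1 > "$1.meta"
}

# Step 2: store the contents in a content-addressed cache.
# This is the step that --no-commit skips and `dvc commit` performs later.
store_in_cache() {
    digest=$(cat "$1.meta")
    dir="$cache/$(printf '%s' "$digest" | cut -c1-2)"
    mkdir -p "$dir"
    cp "$1" "$dir/$(printf '%s' "$digest" | cut -c3-)"
}

count_cached() {
    find "$work" -path "$cache/*" -type f | wc -l | tr -d ' '
}

record_hash "$work/data.csv"        # comparable to `dvc add --no-commit`
cached_before=$(count_cached)
store_in_cache "$work/data.csv"     # comparable to `dvc commit`
cached_after=$(count_cached)
echo "cached files before: $cached_before, after: $cached_after"
```

Separating the two functions mirrors why `--no-commit` is cheap during experimentation: hashing is always done, while the copy into the cache can wait until the data is final.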

Note that it's best to avoid the last three scenarios. They essentially
force-update the `dvc.lock` or `.dvc` files and save data to cache. They are
still useful, but keep in mind that DVC can't guarantee reproducibility in those
cases.
💡 For convenience, a pre-commit Git hook is available to remind you to
`dvc commit` when needed. See `dvc install` for more info.

Other scenarios include:
Member

Not sure Other applies here? it's more like a detailed explanation?

Contributor Author

@jorgeorpinel jorgeorpinel Dec 12, 2020

Hmmm they're meant as different scenarios. The part before this is about add/run/repro --no-commit + dvc commit — main use case. The bullets in this list are

  • dvc add *
  • force-accepting cosmetic changes to dependencies
  • adding missing deps/outs
  • executing commands manually (this one I guess is pretty similar to the main case)

Member

some of them (all of them?) are part of the main use case in this terminology.

add/run/repro --no-commit is not even a use case by itself, right? it doesn't explain a specific need.

Contributor Author

I thought the main reason commit exists as a stand-alone command is to complement the --no-commit options of add/run/repro?

In terms of the "story", it's explained as "forces DVC to accept the contents of tracked data currently in the workspace" in the first p of the description. So do you mean they're all different flavors of that explanation and that there should be a single bullet list (including add/run/repro --no-commit)?

Member

> I thought the main reason commit exists as a stand-alone command is to complement the --no-commit options of add/run/repro?

probably initially it was the case, now we reference it in other places? (like --no-exec?) So we might need to revisit, generalize? (not 100% sure, just asking)

> So do you mean they're all different flavors of that explanation

it seems so (not sure about all)

> that there should be a single bullet list

ah, not necessarily. Just removing Other might help? or rephrasing it a bit if the first paragraph is already general.

again, not very constructive feedback here - just highlighting stuff as I read it, a thing that seemed strange

Contributor Author

Gotcha. I iterated on this to make clear that the first p is a generalization, then a main scenario is explained, and finally a list with other scenarios. PTAL.


- As an alternative to `dvc add` for data that's already tracked. For example,
you can "`dvc add`" all the changed files or directories already tracked by
DVC without having to name each `target`.

- Often we edit source code, configuration, or other files that are specified as
<abbr>dependencies</abbr> in `dvc.yaml` (`deps` field) in a way that doesn't
cause any changes to stage <abbr>outputs</abbr>. For example: reformatting
input data, adding code comments, etc. However, DVC notices all changes to
dependencies and expects you to reproduce the corresponding pipeline
(`dvc repro`). You can use `dvc commit` instead to force accepting these new
versions without having to execute stage commands.

- Sometimes after executing a stage, we realize that not all of its dependencies
or outputs are defined in `dvc.yaml`. It is possible to
[add the missing deps/outs](/docs/user-guide/how-to/add-deps-or-outs-to-a-stage),
and `dvc commit` may be needed to finalize the remedy (see link).

- It's also possible to execute stage commands by hand (without `dvc repro`), or
to manually modify their output files or directories. Use `dvc commit` to
register the changes with DVC once you're done.

> Note that `dvc unprotect` (or removing the outputs) is usually required
> before rewriting files/dirs tracked by DVC.

Note that it's best to try to avoid these scenarios, where the
<abbr>cache</abbr>, `dvc.lock`, and `.dvc` files are force-updated. DVC can't
guarantee reproducibility in those cases.
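
The scenario of force-accepting a cosmetic change to a dependency, described in the list above, can be illustrated with a stand-alone sketch. It uses plain shell tools rather than DVC, and the `lock.txt` file and function names are hypothetical stand-ins:

```shell
set -eu
work=$(mktemp -d)
printf 'print("train")\n' > "$work/train.py"
lock="$work/lock.txt"            # hypothetical stand-in for dvc.lock

current_digest() {
    md5sum "$1" | cut -d' ' -f1
}

# Report whether the recorded hash still matches the file in the workspace.
check_status() {
    if [ "$(cat "$lock")" = "$(current_digest "$1")" ]; then
        echo "up to date"
    else
        echo "changed"
    fi
}

current_digest "$work/train.py" > "$lock"          # state after a normal run
status_initial=$(check_status "$work/train.py")

printf '# cosmetic comment\n' >> "$work/train.py"  # edit that leaves outputs unchanged
status_edited=$(check_status "$work/train.py")

current_digest "$work/train.py" > "$lock"          # accept the new version without re-running
status_committed=$(check_status "$work/train.py")

printf '%s -> %s -> %s\n' "$status_initial" "$status_edited" "$status_committed"
```

The last step rewrites the recorded hash without executing anything, which is why reproducibility can't be guaranteed when this shortcut is used.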

## Options

@@ -228,20 +218,21 @@ ba000ba83b341a423a81eed8ff9238
We've verified that `dvc commit` has saved the changes into the cache, and that
the new instance of `model.pkl` is there.

## Example: Running commands without DVC
## Example: Executing stage commands without DVC

It is also possible to execute the commands that are executed by `dvc repro` by
hand. You won't have DVC helping you, but you have the freedom to run any
command you like, even ones not defined in `dvc.yaml` stages. For example:
Sometimes you may want to execute stage commands manually (instead of using
`dvc repro`). You won't have DVC helping you, but you'll have the freedom to run
any command, even ones not defined in `dvc.yaml`. For example:

```dvc
$ python src/featurization.py data/prepared data/features
$ python src/train.py data/features model.pkl
$ python src/evaluate.py model.pkl data/features auc.metric
```

As before, `dvc status` will show which files have changed, and when your work
is finalized `dvc commit` will commit everything to the <abbr>cache</abbr>.
As before, `dvc status` will show which tracked files/dirs have changed, and
when your work is finalized, `dvc commit` will save the outputs to the
<abbr>cache</abbr>.

## Example: Updating dependencies

11 changes: 5 additions & 6 deletions content/docs/command-reference/import-url.md
@@ -125,12 +125,11 @@ source.
default file name: `<file>.dvc`, where `<file>` is the desired file name of
the imported data (`out`).

- `--no-exec` - create `.dvc` file without actually downloading `url`. E.g. if
the file or directory already exists, this can be used to skip the download.
The data hash is not calculated when this option is used, only the import
metadata is saved to the `.dvc` file. `dvc commit <out>.dvc` can be used if
the data hashes are needed in the `.dvc` file, and to save existing data to
the cache.
- `--no-exec` - create `.dvc` file without actually downloading `url`. The data
hash is not calculated when this option is used, only the import metadata is
saved to the `.dvc` file. It can be useful to skip the download if the file or
directory already exists locally, for example, along with `dvc commit` to
store it in the cache and record its hash value in the `.dvc` file.

- `--desc <text>` - user description of the data (optional). This doesn't
affect any DVC operations.
10 changes: 5 additions & 5 deletions content/docs/command-reference/import.md
@@ -104,11 +104,11 @@ repo at `url`) are not supported.
> example below).

- `--no-exec` - create the import `.dvc` file without actually downloading the
file or directory. E.g. if the file or directory already exists, this can be
used to skip the download. The data hash is not calculated when this option is
used, only the import metadata is saved to the `.dvc` file.
`dvc commit <out>.dvc` can be used if the data hashes are needed in the `.dvc`
file, and to save existing data to the cache.
file or directory. The data hash is not calculated when this option is used,
only the import metadata is saved to the `.dvc` file. It can be useful to skip
the download if the file or directory already exists locally, for example,
along with `dvc commit` to store it in the cache and record its hash value in
the `.dvc` file.

- `--desc <text>` - user description of the data (optional). This doesn't affect
any DVC operations.
30 changes: 14 additions & 16 deletions content/docs/command-reference/repro.md
@@ -26,10 +26,8 @@ implicitly defined by the stages listed in `dvc.yaml`. The commands defined in
these stages can then be executed in the correct order, reproducing pipeline
results.

> Pipeline stages are defined in a
> [`dvc.yaml` file](/doc/user-guide/dvc-files-and-directories#dvcyaml-file)
> (either manually or by using `dvc run`) while initial data dependencies can be
> registered with `dvc add`.
> Pipeline stages are defined in a `dvc.yaml` file (either manually or by using
> `dvc run`) while initial data dependencies can be registered with `dvc add`.

This command is similar to [Make](https://www.gnu.org/software/make/) in
software build automation, but DVC captures build requirements
@@ -54,9 +52,10 @@ options.
> Note that stages without dependencies are considered _always changed_, so
> `dvc repro` always executes them.

It saves all the data files, intermediate or final results into the <abbr>DVC
cache</abbr> (unless the `--no-commit` option is used), and updates the hash
values of changed dependencies and outputs in the `dvc.lock` and `.dvc` files.
It stores all the data files, intermediate or final results in the
<abbr>cache</abbr> (unless the `--no-commit` option is used), and updates the
hash values of changed dependencies and outputs in the `dvc.lock` and `.dvc`
files.

### Parallel stage execution

@@ -105,11 +104,10 @@ up-to-date and only execute the final stage.
target directory and its subdirectories for stages (in `dvc.yaml`) to inspect.
If there are no directories among the targets, this option is ignored.

- `--no-commit` - do not save outputs to cache. A DVC-file is created, while
nothing is added to the cache. (`dvc status` will report that the file is
`not in cache`.) Use `dvc commit` when ready to commit outputs with DVC.
Useful to avoid caching unnecessary data repeatedly when running multiple
experiments.
- `--no-commit` - do not store the outputs of this execution in the cache
(`dvc.yaml` and `dvc.lock` are still created or updated); useful to avoid
caching unnecessary data when executing tests or experiments. Use `dvc commit`
to finish the operation.

- `-m`, `--metrics` - show metrics after reproduction. The target pipelines must
have at least one metrics file defined either with the `dvc metrics` command,
@@ -141,10 +139,10 @@ up-to-date and only execute the final stage.
stages (`A` and below) depend on `requirements.txt`, we can specify it in `A`,
and omit it in `B` and `C`.

Like with the same option on `dvc run`, this is a way to force-execute stages
without changes. This can also be useful for pipelines containing stages that
produce non-deterministic (semi-random) outputs, where outputs can vary on
each execution, meaning the cache cannot be trusted for such stages.
Like with the `--force` option on `dvc run`, this is a way to force-execute
stages without changes. This can also be useful for pipelines containing
stages that produce non-deterministic (semi-random) outputs, where outputs can
vary on each execution, meaning the cache cannot be trusted for such stages.

- `--downstream` - only execute the stages after the given `targets` in their
corresponding pipelines, including the target stages themselves. This option
46 changes: 21 additions & 25 deletions content/docs/command-reference/run.md
@@ -178,15 +178,16 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'
`dvc add`).

- `-O <path>`, `--outs-no-cache <path>` - the same as `-o` except that outputs
are not tracked by DVC. It means that they are not cached, and it's up to a
user to manage them separately. This is useful if the outputs are small enough
to be tracked by Git directly, or if these files are not of future interest.
are not tracked by DVC. This means that they are never cached, so it's up to
the user to manage them separately. This is useful if the outputs are small
enough to be tracked by Git directly, or if these files are not of future
interest.

- `--outs-persist <path>` - declare output file or directory that will not be
removed upon `dvc repro`.

- `--outs-persist-no-cache <path>` - the same as `--outs-persist` except that
outputs are not tracked by DVC.
outputs are not tracked by DVC (same as with `-O` above).

- `-p [<path>:]<params_list>`, `--params [<path>:]<params_list>` - specify a set
of [parameter dependencies](/doc/command-reference/params) the stage depends
@@ -204,10 +205,10 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'
more about _metrics_.

- `-M <path>`, `--metrics-no-cache <path>` - the same as `-m` except that DVC
does not track the metrics file. This means that the file is not cached, so
it's up to the user to manage them separately. This is typically desirable
with _metrics_ because they are small enough to be tracked with Git directly.
See also the difference between `-o` and `-O`.
does not track the metrics file (same as with `-O` above). This means that
it is never cached, so it's up to the user to manage it separately. This is
typically desirable with _metrics_ because they are small enough to be
tracked with Git directly.

- `--plots <path>` - specify a plot metrics file produced by this stage. This
option behaves like `-o` but registers the file in a `plots` field inside the
@@ -217,24 +218,23 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'
plots.

- `--plots-no-cache <path>` - the same as `--plots` except that DVC does not
track the plots metrics file. This means that the file is not cached, so it's
up to the user to manage them separately. See also the difference between `-o`
and `-O`.
track the plots file (same as with `-O` and `-M` above). This may be desirable
with _plots_, if they are small enough to be tracked with Git directly.

- `-w <path>`, `--wdir <path>` - specifies a working directory for the `command`
to run in (uses the `wdir` field in `dvc.yaml`). Dependency and output files
(including metrics and plots) should be specified relative to this directory.
It's used by `dvc repro` to change the working directory before executing the
`command`.

- `--no-exec` - create a stage file, but do not execute the `command` defined in
it, nor cache dependencies or outputs (like with `--no-commit`, explained
below). DVC will also add your outputs to `.gitignore`, same as it would do
without `--no-exec`. Use `dvc commit` to force committing existing output file
versions to cache.
- `--no-exec` - write the stage to `dvc.yaml`, but do not execute its `command`.
Any dependencies and outputs will be entered in `.gitignore`, but won't be
cached (like with `--no-commit` below) or recorded in `dvc.lock`. You can use
`dvc commit` to save any existing dep/out files to the cache and record their
hashes to the lock file.

This is useful if, for example, you need to build a pipeline quickly first,
and run it all at once later.
and run it all at once later (with `dvc repro`).

- `-f`, `--force` - overwrite an existing stage in `dvc.yaml` file without
asking for confirmation.
@@ -244,14 +244,10 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'
command's code is non-deterministic
([not recommended](#avoiding-unexpected-behavior)).

- `--no-commit` - do not save outputs to cache. A stage created, while nothing
is added to the cache. In the stage file, the file hash values will be empty;
They will be populated the next time this stage is actually executed, or
`dvc commit` can be used to force committing existing output file versions to
cache.

This is useful to avoid caching unnecessary data repeatedly when running
multiple experiments.
- `--no-commit` - do not store the outputs of this execution in the cache
(`dvc.yaml` and `dvc.lock` are still created or updated); useful to avoid
caching unnecessary data when executing tests or experiments. Use `dvc commit`
to finish the operation.

- `--always-changed` - always consider this stage as changed (uses the
`always_changed` field in `dvc.yaml`). As a result `dvc status` will report it
3 changes: 2 additions & 1 deletion content/docs/user-guide/dvc-files-and-directories.md
@@ -162,7 +162,8 @@ the possible following fields:
- `outs`: List of <abbr>output</abbr> file or directory paths of this stage
(relative to `wdir` which defaults to the file's location), and optionally,
whether or not this file or directory is <abbr>cached</abbr> (`true` by
default, if not present). See the `--no-commit` option of `dvc run`.
default, if not present). See the `--no-commit` option of `dvc run` and
`dvc repro`.
- `metrics`: List of [metrics files](/doc/command-reference/metrics), and
optionally, whether or not this metrics file is <abbr>cached</abbr> (`true` by
default, if not present). See the `--metrics-no-cache` (`-M`) option of