docs: clarifications around external outputs info. #2154

Merged 4 commits on Mar 14, 2021
6 changes: 4 additions & 2 deletions content/docs/command-reference/add.md
@@ -148,8 +148,10 @@ not.
 - `--external` - allow `targets` that are outside of the DVC repository. See
 [Managing External Data](/doc/user-guide/managing-external-data).

-> Note that external outputs typically require an external cache setup. See
-> link above for more details.
+> ⚠️ Note that this is an advanced feature for very specific situations and
+> not recommended except if there's absolutely no other alternative.
+> Additionally, this typically requires an external cache setup (see link
+> above).

 - `-o <path>`, `--out <path>` - destination `path` to make a local target copy,
 or to [transfer](#example-transfer-to-cache) an external target into the cache
6 changes: 3 additions & 3 deletions content/docs/command-reference/config.md
@@ -208,10 +208,10 @@ settings, and configuring a remote is the way that can be done.
 - `cache.webhdfs` - name of an HDFS remote with WebHDFS enabled to use as
 external cache.

-> Avoid using the same [DVC remote](/doc/command-reference/remote) (used for
-> `dvc push`, `dvc pull`, etc.) as external cache, because it may cause file
+> ⚠️ Avoid using the same [remote storage](/doc/command-reference/remote) used
+> for `dvc push` and `dvc pull` as external cache, because it may cause file
 > hash overlaps: the hash of an external <abbr>output</abbr> could collide with
-> a hash generated locally for another file with different content.
+> that of a local file with different content.

 ### state

14 changes: 7 additions & 7 deletions content/docs/user-guide/external-dependencies.md
@@ -1,8 +1,8 @@
 # External Dependencies

 There are cases when data is so large, or its processing is organized in such a
-way, that its preferable to avoid moving it from its original location. For
-example data on a network attached storage (NAS), processing data on HDFS,
+way, that its preferable to avoid moving it from its current external location.
+For example data on a network attached storage (NAS), processing data on HDFS,
 running [Dask](https://dask.org/) via SSH, or for a script that streams data
 from S3 to process it.

@@ -12,14 +12,14 @@ and version data outside of the <abbr>project</abbr>.

 ## How external dependencies work

-External <abbr>dependencies</abbr> are considered part of the (extended) DVC
-project: DVC will track them, detecting when they change (triggering stage
-executions on `dvc repro`, for example).
+External <abbr>dependencies</abbr> will be tracked by DVC, detecting when they
+change (triggering stage executions on `dvc repro`, for example).

 To define files or directories in an external location as
-[stage](/doc/command-reference/run) dependencies, put their remote URLs or
+[stage](/doc/command-reference/run) dependencies, specify their remote URLs or
 external paths in `dvc.yaml` (`deps` field). Use the same format as the `url` of
-certain `dvc remote` types. Currently, the following protocols are supported:
+certain `dvc remote` types. Currently, the following supported `dvc remote`
+types/protocols:

 - Amazon S3
 - Microsoft Azure Blob Storage
53 changes: 26 additions & 27 deletions content/docs/user-guide/managing-external-data.md
@@ -1,52 +1,51 @@
-# Managing External Data
+# External Outputs

-> ⚠️ This is an advanced feature that we don't recommend using unless you really
-> know what you are doing. Artifacts added with --external are not affected by
-> `dvc push/pull/status -c`. You are likely looking for straight
+> ⚠️ This is an advanced feature for very specific situations and not
+> recommended except if there's absolutely no other alternative. In most cases
+> alternatives like the
 > [to-cache](/doc/command-reference/add#example-transfer-to-the-cache) or
 > [to-remote](/doc/command-reference/add#example-transfer-to-remote-storage)
-> transfers, or `dvc import-url`).
+> strategies of `dvc add` and `dvc import-url` are more convenient. **Note**
+> that external outputs are not pushed or pulled from/to
+> [remote storage](/doc/command-reference/remote).

 There are cases when data is so large, or its processing is organized in such a
-way, that its preferable to avoid moving it from its original location. For
-example data on a network attached storage (NAS), processing data on HDFS,
-running [Dask](https://dask.org/) via SSH, or for a script that streams data
-from S3 to process it.
+way, that its impossible to handle it in the local machine disk. For example
+versioning existing data on a network attached storage (NAS), processing data on
+HDFS, running [Dask](https://dask.org/) via SSH, or any code that generates
+massive files directly to the cloud.

-External outputs and
-[external dependencies](/doc/user-guide/external-dependencies) provide ways to
+External outputs (and
+[external dependencies](/doc/user-guide/external-dependencies)) provide ways to
 track and version data outside of the <abbr>project</abbr>.

 ## How external outputs work

-External <abbr>outputs</abbr> are considered part of the (extended) DVC project:
-DVC will track them for
+External <abbr>outputs</abbr> are considered part of the (extended)
+<abbr>workspace</abbr>: DVC will track them for
 [versioning](/doc/use-cases/versioning-data-and-model-files), detecting when
 they change (reported by `dvc status`, for example).

-To use existing files or directories in an external location as
-[stage](/doc/command-reference/run) outputs, give their remote URLs or external
-paths to `dvc add`, or put them in `dvc.yaml` (`deps` field). Use the same
-format as the `url` of certain `dvc remote` types. Currently, the following
-protocols are supported:
+To use existing files or directories in an external location as outputs, give
+their remote URLs or external paths to `dvc add`, or put them in `dvc.yaml`
+(`deps` field). Use the same format as the `url` of the following supported
+`dvc remote` types/protocols:

 - Amazon S3
 - SSH
 - HDFS
-- Local files and directories outside the <abbr>workspace</abbr>
+- Local files and directories outside the workspace

-External outputs require an
+⚠️ External outputs require an
 [external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
 in the same external/remote file.

-> Note that [remote storage](/doc/command-reference/remote) is a different
-> feature, and that external outputs are not pushed or pulled from/to DVC
-> remotes.
+> Avoid using the same DVC remote used for `dvc push`, `dvc pull`, etc. as
+> external cache, because it may cause data collisions: the hash of an external
+> output could collide with that of a local file with different content.

-> ⚠️ Avoid using the same DVC remote used for `dvc push`, `dvc pull`, etc. for
-> external outputs, because it may cause data collisions: the hash of an
-> external output could collide with that of a local file with different
-> content.
+> Note that [remote storage](/doc/command-reference/remote) is a different
+> feature.

 ## Examples

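
Reviewer-style illustration (not part of the changed files): the updated text in `managing-external-data.md` lists four supported kinds of external output location and says targets use the same format as a `dvc remote` `url`. A minimal Python sketch of that mapping is given here, assuming the usual `s3://`, `ssh://` and `hdfs://` URL schemes and treating anything without a scheme as a filesystem path. The function name, the lookup table and the example targets are hypothetical helpers for illustration; they are not DVC code and DVC may validate targets differently.

```python
from urllib.parse import urlparse

# Hypothetical lookup table: the labels mirror the list in the docs diff above.
# This is an illustration, not DVC's internal logic.
SCHEME_TO_LOCATION_TYPE = {
    "s3": "Amazon S3",
    "ssh": "SSH",
    "hdfs": "HDFS",
}

LOCAL_LABEL = "Local files and directories outside the workspace"


def classify_external_output(target: str) -> str:
    """Return the documented location type for an external output target."""
    scheme = urlparse(target).scheme
    if not scheme:
        # No URL scheme: treat the target as a plain filesystem path.
        return LOCAL_LABEL
    if scheme in SCHEME_TO_LOCATION_TYPE:
        return SCHEME_TO_LOCATION_TYPE[scheme]
    # Anything else is outside the list quoted in the diff.
    raise ValueError(f"scheme not in the documented list: {scheme!r}")


if __name__ == "__main__":
    # Example targets are made up for this sketch.
    for target in ("s3://mybucket/existing-data", "/home/shared/existing-data"):
        print(target, "->", classify_external_output(target))
```

The sketch only handles POSIX-style local paths; a Windows path such as `C:\data` would be parsed as having a scheme and would need separate handling.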
11 changes: 5 additions & 6 deletions content/docs/user-guide/project-structure/dvc-files.md
@@ -1,13 +1,12 @@
 # `.dvc` Files

 You can use `dvc add` to track data files or directories located in your current
-<abbr>workspace</abbr>, or in supported
-[external locations](/doc/user-guide/managing-external-data). Additionally,
-`dvc import` and `dvc import-url` let you bring data from external locations to
-your project, and start tracking it locally.
+<abbr>workspace</abbr>\*. Additionally, `dvc import` and `dvc import-url` let
+you bring data from external locations to your project, and start tracking it
+locally. See [Data Versioning](/doc/start/data-versioning) for more info.

-> See [Data Versioning](/doc/start/data-versioning) and
-> [Data Access](/doc/start/data-access) for more info.
+> \* Certain [external locations](/doc/user-guide/managing-external-data) are
+> also supported.

 Files ending with the `.dvc` extension ("dot DVC file") are created by these
 commands as data placeholders that can be versioned with Git. They contain the