diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md
index 1d0ef04ad9..d59ded757c 100644
--- a/content/docs/command-reference/add.md
+++ b/content/docs/command-reference/add.md
@@ -148,8 +148,10 @@ not.
 - `--external` - allow `targets` that are outside of the DVC repository. See
   [Managing External Data](/doc/user-guide/managing-external-data).

-  > Note that external outputs typically require an external cache setup. See
-  > link above for more details.
+  > ⚠️ Note that this is an advanced feature for very specific situations and
+  > not recommended except if there's absolutely no other alternative.
+  > Additionally, this typically requires an external cache setup (see link
+  > above).

 - `-o <path>`, `--out <path>` - destination `path` to make a local target copy,
   or to [transfer](#example-transfer-to-cache) an external target into the cache
diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md
index 98831f1420..5893f42fe6 100644
--- a/content/docs/command-reference/config.md
+++ b/content/docs/command-reference/config.md
@@ -208,10 +208,10 @@ settings, and configuring a remote is the way that can be done.
 - `cache.webhdfs` - name of an HDFS remote with WebHDFS enabled to use as
   external cache.

-> Avoid using the same [DVC remote](/doc/command-reference/remote) (used for
-> `dvc push`, `dvc pull`, etc.) as external cache, because it may cause file
+> ⚠️ Avoid using the same [remote storage](/doc/command-reference/remote) used
+> for `dvc push` and `dvc pull` as external cache, because it may cause file
 > hash overlaps: the hash of an external output could collide with
-> a hash generated locally for another file with different content.
+> that of a local file with different content.

 ### state

diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md
index 1d471955df..87b0645c54 100644
--- a/content/docs/user-guide/external-dependencies.md
+++ b/content/docs/user-guide/external-dependencies.md
@@ -1,8 +1,8 @@
 # External Dependencies

 There are cases when data is so large, or its processing is organized in such a
-way, that its preferable to avoid moving it from its original location. For
-example data on a network attached storage (NAS), processing data on HDFS,
+way, that its preferable to avoid moving it from its current external location.
+For example data on a network attached storage (NAS), processing data on HDFS,
 running [Dask](https://dask.org/) via SSH, or for a script that streams data
 from S3 to process it.

@@ -12,14 +12,14 @@ and version data outside of the project.

 ## How external dependencies work

-External dependencies are considered part of the (extended) DVC
-project: DVC will track them, detecting when they change (triggering stage
-executions on `dvc repro`, for example).
+External dependencies will be tracked by DVC, detecting when they
+change (triggering stage executions on `dvc repro`, for example).

 To define files or directories in an external location as
-[stage](/doc/command-reference/run) dependencies, put their remote URLs or
+[stage](/doc/command-reference/run) dependencies, specify their remote URLs or
 external paths in `dvc.yaml` (`deps` field). Use the same format as the `url` of
-certain `dvc remote` types. Currently, the following protocols are supported:
+certain `dvc remote` types. Currently, the following supported `dvc remote`
+types/protocols:

 - Amazon S3
 - Microsoft Azure Blob Storage
diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md
index 1a1588358d..b7779ea02a 100644
--- a/content/docs/user-guide/managing-external-data.md
+++ b/content/docs/user-guide/managing-external-data.md
@@ -1,52 +1,51 @@
-# Managing External Data
+# External Outputs

-> ⚠️ This is an advanced feature that we don't recommend using unless you really
-> know what you are doing. Artifacts added with --external are not affected by
-> `dvc push/pull/status -c`. You are likely looking for straight
+> ⚠️ This is an advanced feature for very specific situations and not
+> recommended except if there's absolutely no other alternative. In most cases
+> alternatives like the
 > [to-cache](/doc/command-reference/add#example-transfer-to-the-cache) or
 > [to-remote](/doc/command-reference/add#example-transfer-to-remote-storage)
-> transfers, or `dvc import-url`).
+> strategies of `dvc add` and `dvc import-url` are more convenient. **Note**
+> that external outputs are not pushed or pulled from/to
+> [remote storage](/doc/command-reference/remote).

 There are cases when data is so large, or its processing is organized in such a
-way, that its preferable to avoid moving it from its original location. For
-example data on a network attached storage (NAS), processing data on HDFS,
-running [Dask](https://dask.org/) via SSH, or for a script that streams data
-from S3 to process it.
+way, that its impossible to handle it in the local machine disk. For example
+versioning existing data on a network attached storage (NAS), processing data on
+HDFS, running [Dask](https://dask.org/) via SSH, or any code that generates
+massive files directly to the cloud.

-External outputs and
-[external dependencies](/doc/user-guide/external-dependencies) provide ways to
+External outputs (and
+[external dependencies](/doc/user-guide/external-dependencies)) provide ways to
 track and version data outside of the project.

 ## How external outputs work

-External outputs are considered part of the (extended) DVC project:
-DVC will track them for
+External outputs are considered part of the (extended)
+workspace: DVC will track them for
 [versioning](/doc/use-cases/versioning-data-and-model-files), detecting when
 they change (reported by `dvc status`, for example).

-To use existing files or directories in an external location as
-[stage](/doc/command-reference/run) outputs, give their remote URLs or external
-paths to `dvc add`, or put them in `dvc.yaml` (`deps` field). Use the same
-format as the `url` of certain `dvc remote` types. Currently, the following
-protocols are supported:
+To use existing files or directories in an external location as outputs, give
+their remote URLs or external paths to `dvc add`, or put them in `dvc.yaml`
+(`deps` field). Use the same format as the `url` of the following supported
+`dvc remote` types/protocols:

 - Amazon S3
 - SSH
 - HDFS
-- Local files and directories outside the workspace
+- Local files and directories outside the workspace

-External outputs require an
+⚠️ External outputs require an
 [external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
 in the same external/remote file.

-> Note that [remote storage](/doc/command-reference/remote) is a different
-> feature, and that external outputs are not pushed or pulled from/to DVC
-> remotes.
+> Avoid using the same DVC remote used for `dvc push`, `dvc pull`, etc. as
+> external cache, because it may cause data collisions: the hash of an external
+> output could collide with that of a local file with different content.

-> ⚠️ Avoid using the same DVC remote used for `dvc push`, `dvc pull`, etc. for
-> external outputs, because it may cause data collisions: the hash of an
-> external output could collide with that of a local file with different
-> content.
+> Note that [remote storage](/doc/command-reference/remote) is a different
+> feature.

 ## Examples

diff --git a/content/docs/user-guide/project-structure/dvc-files.md b/content/docs/user-guide/project-structure/dvc-files.md
index 2683d512a3..fd60691f8b 100644
--- a/content/docs/user-guide/project-structure/dvc-files.md
+++ b/content/docs/user-guide/project-structure/dvc-files.md
@@ -1,13 +1,12 @@
 # `.dvc` Files

 You can use `dvc add` to track data files or directories located in your current
-workspace, or in supported
-[external locations](/doc/user-guide/managing-external-data). Additionally,
-`dvc import` and `dvc import-url` let you bring data from external locations to
-your project, and start tracking it locally.
+workspace\*. Additionally, `dvc import` and `dvc import-url` let
+you bring data from external locations to your project, and start tracking it
+locally. See [Data Versioning](/doc/start/data-versioning) for more info.

-> See [Data Versioning](/doc/start/data-versioning) and
-> [Data Access](/doc/start/data-access) for more info.
+> \* Certain [external locations](/doc/user-guide/managing-external-data) are
+> also supported.

 Files ending with the `.dvc` extension ("dot DVC file") are created by these
 commands as data placeholders that can be versioned with Git. They contain the