Added user guide page on remote storage #3337

Closed · wants to merge 5 commits
1 change: 1 addition & 0 deletions content/docs/sidebar.json
@@ -154,6 +154,7 @@
"share-many-experiments"
]
},
"remote-storage",
"setup-google-drive-remote",
"large-dataset-optimization",
"external-dependencies",
200 changes: 200 additions & 0 deletions content/docs/user-guide/remote-storage.md
@@ -0,0 +1,200 @@
# Remote Storage

## Introduction

DVC can use remote storage instead of local disk space for storing previously
committed versions of your project. You may need to do this if, for example,
> **Comment on lines +5 to +6** (Contributor), on "committed versions of your project":
>
> Let's clarify that (cache and) remote storage only stores the DVC-tracked data side of the project. The versioning side is done with Git. This is a key aspect of DVC storage in general.


- You don't have enough disk space for storing all the old versions locally.
> **Comment on lines +6 to +8** (@jorgeorpinel, Mar 9, 2022):
>
> Nice to include the motivation/uses of remote storage! And you got some of the main ones right, but
>
> 1. the order should be inverse I think: sharing/collaboration first, then backup and "not enough space", which are basically the same case.
> 2. a missing case is to allow for custom data management designs, e.g. store raw data in one remote, features in another, and models in a 3rd one -- all with different access rights (the authentication layer provided by storage platforms is key).
> 3. I'd try to keep the bullets much shorter to make the list more effective. If needed, some of the details can be moved to later parts of the doc after the intro.
> 4. (minor) previous versions aren't necessarily "old" (that seems to imply "outdated"). They may just be different in other ways. This term is currently used throughout the doc and can be misleading.

Projects with very large data files that change frequently, or projects
involving datasets with a massive number of smaller files (e.g.
[ImageNet](https://www.image-net.org/)) that you are changing (e.g.
preprocessing) in different ways, may run into this problem. By default, DVC
stores this data in `.dvc/cache` in your local project folder, which is
updated every time you issue `dvc commit`. If you are committing frequently,
and making big changes with each commit, you could easily run out of local
storage after a while.
> **Comment on lines +14 to +16** (@jorgeorpinel, Mar 9, 2022):
>
> `dvc commit` is too low-level here. That command is a helper in fact. The operation is included in `dvc add` and, when needed, in `dvc repro`/`dvc exp run`. We don't need to get so specific in this guide though; we can keep it general, e.g. "every time a new version of data dependencies or outputs is saved with DVC" or something like that. Keep in mind the main workflows that lead to these data commits -- again, 1. adding base data and 2. somehow reproing a pipeline or experiment.

> **Comment** (Contributor):
>
> Also, the term "committing" is tricky: are you referring to `git commit` for an entire project version? DVC 'commits' data internally (to cache) even when no Git commit happens (which is why `repro --no-commit` exists, for example).
>
> Please review the entire doc with these comments in mind since at the moment there's lots of mentions of committing and `dvc commit`. Thanks

- You want a backup of your project. Just as Git projects can be pushed to
  [github.com](https://github.com) or similar after a `git commit`, DVC can push
  the project to remote storage after a DVC commit. This is a convenient and
  simple way to make frequent backups, easily integrating with your normal
  workflows.
- You want to share large projects with your team while minimising both data
transfer and cloud storage charges. A huge benefit of DVC remote storage is
space optimisation. Using
[content-addressable storage](/doc/user-guide/project-structure/internal-files),
DVC ensures that duplicate files are stored once and once only, even if they
are meant to be different files with different names. This feature enables
efficient sharing of project files.

The following example illustrates the interplay between the project folder on
the local workstation, the local DVC cache, and remote storage:

![Local cache and remote storage](/img/remote_storage.png)

Multiple old versions of the project (six of them, in fact) are being archived
on remote storage. The current working version of the project (version 7) has
> **Suggested change** (Contributor):
> ~~on remote storage. The current working version of the project (version 7) has~~
> on remote storage. The latest version of the project's data (7) has

> **Comment** (Contributor): Loving the diagrams BTW 👍🏼
just been committed to the local cache with `dvc commit`, but not yet pushed to
remote storage. Locally, only the three most recent versions of the project are
stored. The data scientist recently cleared the local cache using DVC's garbage
collection command, `dvc gc`. Periodically, they issue `dvc push` to send new
> **Comment on lines +38 to +39** (@jorgeorpinel, Mar 9, 2022):
>
> I wouldn't complicate the intro with `dvc gc` mentions for now. That deserves a section or even page of its own, probably (not expected for this PR).

committed versions of the project to remote storage, and can retrieve old
versions using `dvc pull` if needed.

## Connecting and Pushing to Remote Storage

Multiple cloud storage providers can be used with DVC, and connecting is fairly
> **Comment on lines +43 to +45** (@jorgeorpinel, Mar 9, 2022):
>
> Let's make a section specific to setting up and connecting first? OK to mention `dvc push` in the examples (if needed) as the way to confirm connectivity, but the focus should be on general setup (`remote add`/`modify` commands), including general authentication info, ideally listing all supported types of remotes.
>
> In fact it will probably be a series of pages for all that (extracted from https://dvc.org/doc/command-reference/remote/modify#available-parameters-per-storage-type) but no need to go nearly that far for now; we can keep it much more general in this PR.
>
> Then the later section about Sharing can have all the details or examples involving push.

straightforward using the `dvc remote` command. For example, to connect your
project to an [Amazon AWS](https://aws.amazon.com) S3 bucket, name the remote
`s3remote`, and set it as your default remote storage, you would issue the
following command:

```dvc
$ dvc remote add --default s3remote s3://path/to/cache
```

The `dvc remote add` docs outline connecting DVC to other specific providers
such as [Microsoft Azure](https://azure.microsoft.com/), and we've also provided
an example specifically about
[Google Drive as a DVC remote](/doc/user-guide/setup-google-drive-remote). Once
your remotes are configured, you can double check that the remote storage was
added correctly using `dvc remote list`. In this example, two remotes are set
up, one on the cloud and one on a `work` folder on a mounted volume:

```dvc
$ dvc remote list
s3cache s3://yourbucket/yourremotecache/
work /Volumes/ds_team/work
```

💡 Before adding cloud remote storage to your project, you need to ensure that
you have configured access to the remote storage correctly. This depends on
your cloud storage provider and will vary between providers. For example, if
you want to use an Amazon S3 bucket for remote storage, then you need to
[configure DVC for S3 access](/doc/command-reference/remote/add#supported-storage-types)
and [install the S3 version of DVC](/doc/install). Further instructions are
specific to the cloud service providers and may change from time to time.

Once remote storage is correctly set up, there are three main steps you need to
take to store versions of your project there:

1. Commit the project to your local cache with `dvc commit`; this makes a new
local version of the project.
2. Push this new version to the remote storage with `dvc push` (use the `-r`
option to specify which remote storage if you are not using the default
storage).
3. Optionally clear out your local workstation cache with `dvc gc`, the garbage
collector, to free up space on the local filesystem.

A local folder can also be used as remote storage as shown in the example above
where a remote storage named `work` has been configured. This is useful if, for
example, you have a network share mounted and want to use that instead of a
cloud provider.

## Content Addressable Storage Format

DVC optimises the storage space used in the local cache and remotes by ensuring
> **Comment on lines +94 to +96** (Contributor):
>
> Thanks for this section, it's definitely interesting content as I mentioned in the previous PR, but still this is more about the DVC cache mechanism. Please make it into a separate page somewhere and leave it out of this doc, or link to it if/where needed, to make the review process easier.
no duplicate files are stored. Duplicate files accumulate for lots of reasons,
for example:

- You commit a new version of a project without any changes to a very large
file. It makes no sense to actually store two copies of these files, even
though the file appears in two different versions of the project.
- You copy an image dataset so you can add new data to it, but want to keep the
  original dataset unchanged. The new version of the image dataset will contain
  a complete copy of the original dataset.
- You copy a large file, but don't change the copy, so you have two files with
different names and other metadata but the exact same content.

The way that DVC determines if two different files are duplicates comes from a
computer science idea called
[content-addressable storage](/doc/user-guide/project-structure/internal-files).
Basically, a unique hash of the file's _content_ (ignoring file metadata) is
calculated. For every pair of files with different content, therefore, the
unique hashes should be different; but if two files have the same content (even
though they may differ in metadata such as file name and creation time), then
the hashes will be the same. DVC renames files with their hash and stores the
real names elsewhere. In this way, it can track files that change, track files
that don't change, and determine which (apparently different) files are
identical. DVC does a similar trick for directories, ensuring that directory
changes (e.g. adding new files) are also detectable and trackable.

💡 The specific type of hash used by DVC currently is the
[MD5 hash](https://en.wikipedia.org/wiki/MD5).
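As a minimal illustration of the idea (plain Python, not DVC's internal code), the sketch below hashes file *content* only: two files with identical content hash to the same value regardless of their names, while any change to the content produces a new hash.

```python
import hashlib
import os
import tempfile

def content_hash(path):
    """Return the MD5 hex digest of a file's content (name and metadata are ignored)."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()

tmp = tempfile.mkdtemp()
a = os.path.join(tmp, "a.dat")
b = os.path.join(tmp, "copy_of_a.dat")  # different name, same content
for path in (a, b):
    with open(path, "wb") as f:
        f.write(b"some large dataset")

# Identical content -> identical hash, so one stored copy suffices.
assert content_hash(a) == content_hash(b)

# Changing the content changes the hash, so a new object is stored.
with open(b, "ab") as f:
    f.write(b" plus new rows")
assert content_hash(a) != content_hash(b)
```

This is why renaming or copying a tracked file costs no extra storage, while any edit creates exactly one new cache object.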

Let's look at a simple example: the file `a.dat`. After adding the file to the
project with `dvc add a.dat`, a new file `a.dat.dvc` is created in the project
folder. The new file contains the hash of `a.dat`, which can be looked at by
examining the file:

```dvc
$ cat a.dat.dvc
outs:
- md5: bba40b7807c80d3f44787b9c6a4aabee
  size: 1047565
  path: a.dat
```

The original `a.dat` has now been moved into the local cache and renamed to
`bb/a40b7807c80d3f44787b9c6a4aabee` (the first two characters of the hash
become a directory name). The version of `a.dat` that we now see in the
project directory is a
[file link](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
that refers to the hash-named version of the file.
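The mapping from hash to cache path can be sketched in a few lines of Python. This is a simplified illustration of the layout described above, not DVC's actual implementation:

```python
import os

def cache_path(cache_dir, md5_digest):
    # The first two hex characters of the digest become a subdirectory and
    # the remaining characters the file name, mirroring the `bb/a40b...`
    # layout described above.
    return os.path.join(cache_dir, md5_digest[:2], md5_digest[2:])

print(cache_path(os.path.join(".dvc", "cache"), "bba40b7807c80d3f44787b9c6a4aabee"))
# e.g. .dvc/cache/bb/a40b7807c80d3f44787b9c6a4aabee on POSIX systems
```

Splitting on the first two characters keeps any single cache directory from accumulating millions of entries.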

💡 Note that DVC supports file links that are
[reflinks](https://blog.ram.rachum.com/post/620335081764077568/symlinks-and-hardlinks-move-over-make-room-for)
(the default, if possible), hard links, soft links, and basic copies. Please
check your operating system's capabilities and configure DVC to use the
appropriate file link type using `dvc config` if needed.
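To see why a file link avoids duplicating data, here is a small Python sketch illustrating hard links in general (not DVC's own link-creation code): a hard link is simply a second name for the same on-disk data, so "both" files share one stored copy of the content.

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
cache_file = os.path.join(tmp, "bba40b7807c80d3f44787b9c6a4aabee")  # hash-named copy, as in a cache
workspace_file = os.path.join(tmp, "a.dat")  # the name seen in the project folder

with open(cache_file, "wb") as f:
    f.write(b"dataset contents")

# A hard link is a second directory entry pointing at the same on-disk data.
os.link(cache_file, workspace_file)

# Both names refer to the same inode: the content is stored only once.
assert os.stat(cache_file).st_ino == os.stat(workspace_file).st_ino
with open(workspace_file, "rb") as f:
    assert f.read() == b"dataset contents"
```

Note that hard links require both names to live on the same filesystem, which is one reason DVC falls back to copies when linking is not possible.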

When the project is committed and pushed, the version of `a.dat` in the cache
will be copied to remote storage. If `a.dat` changes later and is committed
again, then a new hash for the file will be computed, a new version of the file
will be created in the cache, and the file link in the project directory is
updated.

The next brief example shows a directory `myDir` tracked by DVC containing two
files `a` and `b`:

![Local cache and remote storage](/img/cache_structure.png)
> **Comment on lines +156 to +159** (Contributor):
>
> This 2nd diagram is confusing though. But again, for now we can leave out all these details about the caching mechanism and storage optimization, which are not specifically about remote storage.


The cache in this example contains three versions of the directory. When the
project was first committed, the directory contained only the file `a`, as
illustrated by the directory in the cache with hash starting `6e...`. The
second version of the directory was committed after the file `b` was added,
and a new hash `22...` for the directory was calculated. The third version of
the directory was committed after `b` was changed. This is reflected in new
hashes for both the directory and `b` (`ef...` and `6d...` respectively). DVC
is therefore storing two versions of `b` in the cache, but only one copy of
`a` (since, naturally, file `a` has not changed). Also shown in the figure is
the file `myDir.dvc`, the metadata file specifying the MD5 hash of the
"current" version of `myDir` in the cache.
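The space saving at work here can be sketched with a toy content-addressed store in Python. This is hypothetical illustration code, not DVC's actual directory-hashing scheme: three versions of a directory are recorded, but unchanged files are stored only once.

```python
import hashlib

store = {}  # hash -> content, standing in for the cache

def put(content):
    """Store bytes under their MD5 hash; identical content is stored only once."""
    h = hashlib.md5(content).hexdigest()
    store[h] = content
    return h

# Three committed versions of the directory, each a map of file name -> content hash.
v1 = {"a": put(b"contents of a")}
v2 = {"a": put(b"contents of a"), "b": put(b"contents of b")}
v3 = {"a": put(b"contents of a"), "b": put(b"changed contents of b")}

# Five file entries exist across the three versions, but only three objects
# are stored: one for `a` (unchanged throughout) and two versions of `b`.
assert len(store) == 3
assert v1["a"] == v2["a"] == v3["a"]
```

Every version remains fully reconstructable from its name-to-hash mapping, which is essentially what the directory entries in the cache record.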

Both remote storage and local cache use the same format for organising and
naming your project files. If you have just committed and pushed your changes
from the local workstation to the remote storage, there should be identical
folders in both places.

## Sharing Files via Remote Storage
> **Review comment** (Contributor):
>
> Finally, this section could probably be more hands-on as well, showing push and pull and maybe even another diagram. Take a look for example at this page which we recently removed because it wasn't in the right area of our docs (but the content is still relevant and can be recovered to some extent here).
>
> Also, let's make the title more general? And it's nice to mention ML models sometimes in the context of data management (which includes local and remote storage). So here's a new title suggestion:
>
> **Suggested change:**
> ~~## Sharing Files via Remote Storage~~
> ## Sharing Data and Models

A disadvantage of content addressable storage is that the folder and file
structure of the local cache/remote storages is obfuscated and no longer
readable. The original directory and filenames are lost thanks to the use of
hashes as the new names, and there are many more objects than there are files in
the project because of the versioning (since deleted and changed files and
directories need to be kept around).

Consequently, you cannot simply browse the remote storage folders that DVC
controls in the same way you would browse a normal shared network/cloud drive.
Similarly, new files cannot be copied to the remote folder in the usual ways
(e.g. via drag and drop) if you want to share them. You can, however, share
files with your team using DVC. The basic approach is to follow the standard DVC
workflow:

1. Have the file(s) you want to share in your local project folder and add them
with `dvc add`.
2. Commit the project using `git` and `dvc`, and push the changes to the
DVC-configured remote storage.
3. Have your teammates check out the latest version of the project, then pull
   from the remote storage. They should receive the file(s) being shared into
   their local project folder.
Binary file added: static/img/cache_structure.png

Binary file added: static/img/remote_storage.png