-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added user guide page on remote storage #3337
Changes from all commits
5fff5fd
3c2b48b
973f29e
4c3dad9
0706662
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,200 @@ | ||||||
# Remote Storage | ||||||
|
||||||
## Introduction | ||||||
|
||||||
DVC can use remote storage instead of local disk space for storing previously | ||||||
committed versions of your project. You may need to do this if, for example, | ||||||
|
||||||
- You don't have enough disk space for storing all the old versions locally. | ||||||
Comment on lines
+6
to
+8
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nice to include the motivation/uses of remote storage! And you got some of the main ones right but
|
||||||
Projects with very large data files that change frequently, or projects | ||||||
involving datasets with a massive number of smaller files (e.g. | ||||||
[ImageNet](https://www.image-net.org/)) that you are changing (e.g. | ||||||
preprocessing ) in different ways may run in to this problem. The default | ||||||
place that DVC stores this data is `.dvc\cache` in your local project folder, | ||||||
and this is updated everytime you issue `dvc commit`. If you are committing | ||||||
frequently, and making big changes with each commit, you could easily run out | ||||||
of local storage after a while. | ||||||
Comment on lines
+14
to
+16
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, term "committing" is tricky: are you referring to Please review the entire doc with these comments in mind since at the moment there's lots of mentions of committing and |
||||||
- You want a backup of your project. Just as git projects can be pushed to | ||||||
[github.com](github.com) or similar after a `git commit`, DVC can push the | ||||||
project to remote storage after a DVC commit. This is a convenient and simple | ||||||
way to make frequent backups easily integrating with your normal workflows. | ||||||
- You want to share large projects with your team while minimising both data | ||||||
transfer and cloud storage charges. A huge benefit of DVC remote storage is | ||||||
space optimisation. Using | ||||||
[content-addressable storage](/doc/user-guide/project-structure/internal-files), | ||||||
DVC ensures that duplicate files are stored once and once only, even if they | ||||||
are meant to be different files with different names. This feature enables | ||||||
efficient sharing of project files. | ||||||
|
||||||
The following example illustrates the interplay between the project folder on | ||||||
the local workstation, the local DVC cache, and remote storage: | ||||||
|
||||||
 | ||||||
|
||||||
Multiple old versions of the project (six of them, in fact) are being archived | ||||||
on remote storage. The current working version of the project (version 7) has | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Loving the diagrams BTW 👍🏼 |
||||||
just been committed to the local cache with `dvc commit`, but not yet pushed to | ||||||
remote storage. Locally, only the most recent three versions of the project are | ||||||
stored. The data scientist recently cleared the local cache using DVC's garbage | ||||||
collection command `dvc gc`. Periodically, she/he issues `dvc push` to send new | ||||||
Comment on lines
+38
to
+39
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wouldn't complicate the intro with |
||||||
committed versions of the project to remote storage, and can retrieve old | ||||||
versions using `dvc pull` if needed. | ||||||
|
||||||
## Connecting and Pushing to Remote Storage | ||||||
|
||||||
Multiple cloud storage providers can be used with DVC, and connecting is fairly | ||||||
Comment on lines
+43
to
+45
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's make a section specific to setting up and connecting first? OK to mention
Then the later section about Sharing can have all the details or example involving |
||||||
straightforward using the `dvc remote` command. For example, to connect your | ||||||
project to an [Amazon AWS](https://aws.amazon.com) S3 bucket, name the remote | ||||||
`s3remote`, and set it as your default remote storage, you would issue the | ||||||
following command: | ||||||
|
||||||
```dvc | ||||||
$ dvc remote add --default s3remote s3://path/to/cache | ||||||
``` | ||||||
|
||||||
The `dvc remote add` docs outline connecting DVC to other specific providers | ||||||
such as [Microsoft Azure](https://azure.microsoft.com/), and we've also provided | ||||||
an example specifically about | ||||||
[Google Drive as a DVC remote](/doc/user-guide/setup-google-drive-remote). Once | ||||||
your remotes are configured, you can double check that the remote storage was | ||||||
added correctly using `dvc remote list`. In this example, two remotes are set | ||||||
up, one on the cloud and one on a `work` folder on a mounted volume: | ||||||
|
||||||
```dvc | ||||||
$ dvc remote list | ||||||
s3cache s3://yourbucket/yourremotecache/ | ||||||
work /Volumes/ds_team/work | ||||||
``` | ||||||
|
||||||
💡 Before adding cloud remote storage to your project, you need to ensure that | ||||||
you have configured access to the remote storage correctly. This is dependent on | ||||||
how your cloud storage provider works and will vary depending on the provider. | ||||||
For example, for if you want to use an Amazon S3 bucket for remote storage then | ||||||
you need to | ||||||
[configure DVC for S3 access](/doc/command-reference/remote/add#supported-storage-types) | ||||||
and [install the S3 version of DVC](/doc/install). Further instructions are | ||||||
specific to the cloud service providers and may change from time to time. | ||||||
|
||||||
One remote storage is correctly set up, there are three main steps you need to | ||||||
take store versions of your project there: | ||||||
|
||||||
1. Commit the project to your local cache with `dvc commit`; this makes a new | ||||||
local version of the project. | ||||||
2. Push this new version to the remote storage with `dvc push` (use the `-r` | ||||||
option to specify which remote storage if you are not using the default | ||||||
storage). | ||||||
3. Optionally clear out your local workstation cache with `dvc gc`, the garbage | ||||||
collector, to free up space on the local filesystem. | ||||||
|
||||||
A local folder can also be used as remote storage as shown in the example above | ||||||
where a remote storage named `work` has been configured. This is useful if, for | ||||||
example, you have a network share mounted and want to use that instead of a | ||||||
cloud provider. | ||||||
|
||||||
## Content Addressable Storage Format | ||||||
|
||||||
DVC optimises the storage space used in the local cache and remotes by ensuring | ||||||
Comment on lines
+94
to
+96
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for this section, it's definitely interesting content as I mentioned in the previous PR but still this is more about the DVC cache mechanism. Please make it into a separate page somewhere and leave it out of this doc or link to if if/where needed to make the review process easier. |
||||||
no duplicate files are stored. Duplicate files accumulate for lots of reasons, | ||||||
for example: | ||||||
|
||||||
- You commit a new version of a project without any changes to a very large | ||||||
file. It makes no sense to actually store two copies of these files, even | ||||||
though the file appears in two different versions of the project. | ||||||
- You copy an image dataset so you can add new data to it, but want to keep the | ||||||
original dataset unchanged. The new version of the image dataset will contain | ||||||
an entire copy of the entire original dataset. | ||||||
- You copy a large file, but don't change the copy, so you have two files with | ||||||
different names and other metadata but the exact same content. | ||||||
|
||||||
The way that DVC determines if two different files are duplicates comes from a | ||||||
computer science idea called | ||||||
[content-addressable storage](/doc/user-guide/project-structure/internal-files). | ||||||
Basically, a unique hash of the file's _content_ (ignoring file metadata) is | ||||||
calculated. For every pair of files with different content, therefore, the | ||||||
unique hashes should be different; but if two files have the same content (even | ||||||
though they may differ in metadata such as file name and creation time), then | ||||||
the hashes will be the same. DVC renames files with their hash and stores the | ||||||
real names elsewhere. In this way, it can track files that change, track files | ||||||
that don't change, and determine which (apparently different) files are | ||||||
identical. DVC does a similar trick for directories, ensuring that directory | ||||||
changes (e.g. adding new files) are also detectable and trackable. | ||||||
|
||||||
💡 The specific type of hash used by DVC currently is the | ||||||
[MD5 hash](https://en.wikipedia.org/wiki/MD5). | ||||||
|
||||||
Let's look at a simple example: the file `a.dat`. After adding the file to the | ||||||
project with `dvc add a.dat`, a new file `a.dat.dvc` is created in the project | ||||||
folder. The new file contains the hash of `a.dat`, which can be looked at by | ||||||
examining the file: | ||||||
|
||||||
```dvc | ||||||
$ cat a.dat.dvc | ||||||
outs: | ||||||
- md5: bba40b7807c80d3f44787b9c6a4aabee | ||||||
size: 1047565 | ||||||
path: a.dat | ||||||
``` | ||||||
|
||||||
The original `a.dat` has now been renamed to `bb\a40b7807c80d3f44787b9c6a4aabee` | ||||||
and moved to the local cache. The version of `a.dat` that we now see in the | ||||||
project directory is now a | ||||||
[file link](doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) | ||||||
that refers to the hash-named version of the file. | ||||||
|
||||||
💡 Note that DVC supports file links that are | ||||||
[reflinks](https://blog.ram.rachum.com/post/620335081764077568/symlinks-and-hardlinks-move-over-make-room-for) | ||||||
(the default, if possible), hard links, soft links, and basic copies. Please | ||||||
check your operating systems capabilities and configure DVC to use the | ||||||
appropriate file link type using `dvc config` if needed. | ||||||
|
||||||
When the project is committed and pushed, the version of `a.dat` in the cache | ||||||
will be copied to remote storage. If `a.dat` changes later and is committed | ||||||
again, then a new hash for the file will be computed, a new version of the file | ||||||
will be created in the cache, and the file link in the project directory is | ||||||
updated. | ||||||
|
||||||
The next brief example shows a directory `myDir` tracked by DVC containing two | ||||||
files `a` and `b`: | ||||||
|
||||||
 | ||||||
Comment on lines
+156
to
+159
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This 2nd diagram is confusing though. But again, for now we can leave out all these details about the caching mechanism and storage optimization, which are not specifically about remote storage. |
||||||
|
||||||
The cache in this example contains three versions of the directory. When the | ||||||
project was first committed, the directory contained only the file `a`, as | ||||||
illustrated by the directory in the cache with hash starting `6e..`. The second | ||||||
version of the directory was committed after the file `b` was added and a new | ||||||
hash `22...` for the directory was calculated. The third version of the | ||||||
directory was committed after `b` was changed. This is reflected in new hashes | ||||||
for both the directory and `b` (`ef...` and `6d...` specifically). DVC is | ||||||
therefore storing two versions of `b` in the cache, but only one copy of `a` | ||||||
(since, naturally, file `a` has not changed). Also shown in the figure is the | ||||||
file `myDir.dvc`, the metadata file specifying the MD5 hash of the | ||||||
``current'' version of `myDir` in the cache. | ||||||
|
||||||
Both remote storage and local cache use the same format for organising and | ||||||
naming your project files. If you have just committed and pushed your changes | ||||||
from the local workstation to the remote storage, there should be identical | ||||||
folders in both places. | ||||||
|
||||||
## Sharing Files via Remote Storage | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Finally, this section could probably be more hands-on as well, showing push and pull and maybe even another diagram. Take a look for example at this page which we recently removed because it wasn't in the right area of our docs (but the content is still relevant and can be recovered to some extent here). Also, let's make the title more general? And it's nice to mention ML models some times in the context of data management (which includes local and remote storage). So here's a new title suggestion:
Suggested change
|
||||||
|
||||||
A disadvantage of content addressable storage is that the folder and file | ||||||
structure of the local cache/remote storages is obfuscated and no longer | ||||||
readable. The original directory and filenames are lost thanks to the use of | ||||||
hashes as the new names, and there are many more objects than there are files in | ||||||
the project because of the versioning (since deleted and changed files and | ||||||
directories need to be kept around). | ||||||
|
||||||
Consequently, you cannot simply browse the remote storage folders that DVC | ||||||
controls in the same way you would browse a normal shared network/cloud drive. | ||||||
Similarly, new files cannot be copied to the remote folder in the usual ways | ||||||
(e.g. via drag and drop) if you want to share them. You can, however, share | ||||||
files with your team using DVC. The basic approach is to follow the standard DVC | ||||||
workflow: | ||||||
|
||||||
1. Have the file(s) you want to share in your local project folder and add them | ||||||
with `dvc add`. | ||||||
2. Commit the project using `git` and `dvc`, and push the changes to the | ||||||
DVC-configured remote storage. | ||||||
3. Have your teammates checkout the latest version of the project, then pull | ||||||
from the remote storage. They should receive the file(s) being shared into | ||||||
their local project folder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's clarify that (cache and) remote storage only stores the DVC-tracked data side of the project. The versioning side is done with Git. This is a key aspect of DVC storage in general.