Added user guide page on remote storage #3337
Conversation
Yep, this looks like a guide! Here's a first round of mid-level comments in the intro and high-level comments for the other sections. Thanks!
BTW, I deployed the page here for easier reviewing 🙂
DVC can use remote storage instead of local disk space for storing previously
committed versions of your project. You may need to do this if, for example,
committed versions of your project
Let's clarify that (cache and) remote storage only stores the DVC-tracked data side of the project. The versioning side is done with Git. This is a key aspect of DVC storage in general.
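For instance, a minimal sketch of that split (the file name is just an illustration): DVC stores the data itself, while Git versions only the small metafile.

```
$ dvc add data.xml                 # DVC caches the data and writes data.xml.dvc
$ git add data.xml.dvc .gitignore  # Git versions only the small metafile
$ git commit -m "Track data.xml with DVC"
```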
committed versions of your project. You may need to do this if, for example,

- You don't have enough disk space for storing all the old versions locally.
Nice to include the motivation/uses of remote storage! You got some of the main ones right, but:

- The order should be the inverse, I think: sharing/collaboration first, then backup and "not enough space", which are basically the same case.
- A missing case is allowing for custom data management designs, e.g. store raw data in one remote, features in another, and models in a third, all with different access rights (the authentication layer provided by storage platforms is key). See the sketch after this list.
- I'd try to keep the bullets much shorter to make the list more effective. If needed, some of the details can be moved to later parts of the doc, after the intro.
- (minor) Previous versions aren't necessarily "old" (that seems to imply "outdated"). They may just be different in other ways. This term is currently used throughout the doc and can be misleading.
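To illustrate that multi-remote case, a minimal sketch (the remote names and bucket URLs below are hypothetical):

```
$ dvc remote add raw_data s3://acme-datasets/raw
$ dvc remote add features azure://acme-features/processed
$ dvc remote add -d models gs://acme-models/prod   # -d makes this the default remote
```

Each remote can then be protected with the access controls of its own storage platform.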
![Local cache and remote storage](/img/remote_storage.png)

Multiple old versions of the project (six of them, in fact) are being archived
on remote storage. The current working version of the project (version 7) has
Suggested change:
- on remote storage. The current working version of the project (version 7) has
+ on remote storage. The latest version of the project's data (7) has
Loving the diagrams BTW 👍🏼
and this is updated everytime you issue `dvc commit`. If you are committing
frequently, and making big changes with each commit, you could easily run out
of local storage after a while.
`dvc commit` is too low-level here. That command is a helper, in fact. The operation is included in `dvc add` and, when needed, in `dvc repro`/`dvc exp run`. We don't need to get so specific in this guide though; we can keep it general, e.g. "every time a new version of data dependencies or outputs is saved with DVC" or something like that. Keep in mind the main workflows that lead to these data commits, again: 1. `add`-ing base data and 2. somehow `repro`-ing a pipeline or experiment.
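For reference, a small sketch of those two workflows (the data path is hypothetical, and a pipeline is assumed to be defined):

```
$ dvc add data/raw   # 1. caches a new version of the base data
$ dvc repro          # 2. runs the pipeline and caches any changed outputs
```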
Also, the term "committing" is tricky: are you referring to `git commit`-ing an entire project version? DVC "commits" data internally (to cache) even when no Git commit happens (which is why `repro --no-commit` exists, for example).

Please review the entire doc with these comments in mind, since at the moment there are lots of mentions of committing and `dvc commit`. Thanks
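To illustrate the distinction, a small sketch (assuming a pipeline is already defined):

```
$ dvc repro --no-commit   # runs the pipeline but skips saving outputs to the DVC cache
$ dvc commit              # saves the current outputs to the cache; no Git commit involved
$ git commit -am "Update pipeline results"   # versions the metafiles with Git
```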
stored. The data scientist recently cleared the local cache using DVC's garbage
collection command `dvc gc`. Periodically, she/he issues `dvc push` to send new
I wouldn't complicate the intro with `dvc gc` mentions for now. That deserves a section or even a page of its own, probably (not expected for this PR).
## Connecting and Pushing to Remote Storage

Multiple cloud storage providers can be used with DVC, and connecting is fairly
Let's make a section specific to setting up and connecting first? OK to mention `dvc push` in the examples (if needed) as the way to confirm connectivity, but the focus should be on general setup: `remote add`/`modify` commands, including general authentication info, ideally listing all supported types of remotes.

In fact it will probably be a series of pages for all that (extracted from https://dvc.org/doc/command-reference/remote/modify#available-parameters-per-storage-type), but no need to go nearly that far for now; we can keep it much more general in this PR.

Then the later section about Sharing can have all the details or examples involving `push`.
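A minimal sketch of what that setup-and-connect section could show (the remote name, bucket URL, and credentials below are hypothetical placeholders):

```
$ dvc remote add -d storage s3://mybucket/dvcstore
$ dvc remote modify --local storage access_key_id 'my-key-id'
$ dvc remote modify --local storage secret_access_key 'my-secret'
$ dvc push   # a quick way to confirm the connection works
```

The `--local` flag keeps the credentials out of the Git-tracked config file.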
## Content Addressable Storage Format

DVC optimises the storage space used in the local cache and remotes by ensuring
Thanks for this section, it's definitely interesting content as I mentioned in the previous PR, but still this is more about the DVC cache mechanism. Please make it into a separate page somewhere and leave it out of this doc, or link to it if/where needed, to make the review process easier.
The next brief example shows a directory `myDir` tracked by DVC containing two
files `a` and `b`:

![Local cache and remote storage](/img/cache_structure.png)
This 2nd diagram is confusing though. But again, for now we can leave out all these details about the caching mechanism and storage optimization, which are not specifically about remote storage.
from the local workstation to the remote storage, there should be identical
folders in both places.

## Sharing Files via Remote Storage
Finally, this section could probably be more hands-on as well, showing push and pull and maybe even another diagram. Take a look for example at this page which we recently removed because it wasn't in the right area of our docs (but the content is still relevant and can be recovered to some extent here).

Also, let's make the title more general? And it's nice to mention ML models sometimes in the context of data management (which includes local and remote storage). So here's a new title suggestion:
Suggested change:
- ## Sharing Files via Remote Storage
+ ## Sharing Data and Models
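For the hands-on part, a sketch of the kind of example the section could show (the repository URL is hypothetical):

```
# On the machine that produced the data/models: upload them to the default remote
$ dvc push

# On a collaborator's machine: get the code and metafiles, then the matching data
$ git clone https://github.com/example/project.git
$ cd project
$ dvc pull
```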
Closing as stale for now. Please reopen if needed.
Rel. #2866