Added user guide page on remote storage #3337
Conversation
Yep, this looks like a guide! Here's a first round of mid-level comments in the intro and high-level comments for the other sections. Thanks!
BTW, I deployed the page here for easier reviewing 🙂
DVC can use remote storage instead of local disk space for storing previously
committed versions of your project. You may need to do this if, for example,
committed versions of your project
Let's clarify that (cache and) remote storage only stores the DVC-tracked data side of the project. The versioning side is done with Git. This is a key aspect of DVC storage in general.
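For instance, a minimal sketch of that split (the file name is just an illustration): DVC stores the data itself, while Git versions only the small metafile.

```
$ dvc add data.xml                 # DVC caches the data and writes data.xml.dvc
$ git add data.xml.dvc .gitignore  # Git versions only the small metafile
$ git commit -m "Track data.xml with DVC"
```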
committed versions of your project. You may need to do this if, for example,

- You don't have enough disk space for storing all the old versions locally.
Nice to include the motivation/uses of remote storage! You got some of the main ones right, but:

- The order should be the inverse, I think: sharing/collaboration first, then backup and "not enough space", which are basically the same case.
- A missing case is allowing for custom data management designs, e.g. store raw data in one remote, features in another, and models in a third, all with different access rights (the authentication layer provided by storage platforms is key). See the sketch after this list.
- I'd try to keep the bullets much shorter to make the list more effective. If needed, some of the details can be moved to later parts of the doc, after the intro.
- (minor) Previous versions aren't necessarily "old" (that seems to imply "outdated"). They may just be different in other ways. This term is currently used throughout the doc and can be misleading.
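To illustrate that multi-remote case, a minimal sketch (the remote names and bucket URLs below are hypothetical):

```
$ dvc remote add raw_data s3://acme-datasets/raw
$ dvc remote add features azure://acme-features/processed
$ dvc remote add -d models gs://acme-models/prod   # -d makes this the default remote
```

Each remote can then be protected with the access controls of its own storage platform.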
![Local cache and remote storage](/img/remote_storage.png)

Multiple old versions of the project (six of them, in fact) are being archived
on remote storage. The current working version of the project (version 7) has
Suggested change:
- on remote storage. The current working version of the project (version 7) has
+ on remote storage. The latest version of the project's data (7) has
Loving the diagrams BTW 👍🏼
and this is updated everytime you issue `dvc commit`. If you are committing
frequently, and making big changes with each commit, you could easily run out
of local storage after a while.
`dvc commit` is too low-level here. That command is a helper, in fact. The operation is included in `dvc add` and, when needed, in `dvc repro`/`dvc exp run`. We don't need to get so specific in this guide though; we can keep it general, e.g. "every time a new version of data dependencies or outputs is saved with DVC" or something like that. Keep in mind the main workflows that lead to these data commits, again: 1. `add`-ing base data and 2. somehow `repro`-ing a pipeline or experiment.
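For reference, a small sketch of those two workflows (the data path is hypothetical, and a pipeline is assumed to be defined):

```
$ dvc add data/raw   # 1. caches a new version of the base data
$ dvc repro          # 2. runs the pipeline and caches any changed outputs
```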
Also, the term "committing" is tricky: are you referring to `git commit`-ing an entire project version? DVC "commits" data internally (to cache) even when no Git commit happens (which is why `repro --no-commit` exists, for example).

Please review the entire doc with these comments in mind, since at the moment there are lots of mentions of committing and `dvc commit`. Thanks
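To illustrate the distinction, a small sketch (assuming a pipeline is already defined):

```
$ dvc repro --no-commit   # runs the pipeline but skips saving outputs to the DVC cache
$ dvc commit              # saves the current outputs to the cache; no Git commit involved
$ git commit -am "Update pipeline results"   # versions the metafiles with Git
```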
stored. The data scientist recently cleared the local cache using DVC's garbage
collection command `dvc gc`. Periodically, she/he issues `dvc push` to send new
I wouldn't complicate the intro with `dvc gc` mentions for now. That deserves a section or even a page of its own, probably (not expected for this PR).
## Connecting and Pushing to Remote Storage

Multiple cloud storage providers can be used with DVC, and connecting is fairly
Let's make a section specific to setting up and connecting first? OK to mention `dvc push` in the examples (if needed) as the way to confirm connectivity, but the focus should be on general setup: `remote add`/`modify` commands, including general authentication info, ideally listing all supported types of remotes.

In fact it will probably be a series of pages for all that (extracted from https://dvc.org/doc/command-reference/remote/modify#available-parameters-per-storage-type), but no need to go nearly that far for now; we can keep it much more general in this PR.

Then the later section about Sharing can have all the details or examples involving `push`.
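A minimal sketch of what that setup-and-connect section could show (the remote name, bucket URL, and credentials below are hypothetical placeholders):

```
$ dvc remote add -d storage s3://mybucket/dvcstore
$ dvc remote modify --local storage access_key_id 'my-key-id'
$ dvc remote modify --local storage secret_access_key 'my-secret'
$ dvc push   # a quick way to confirm the connection works
```

The `--local` flag keeps the credentials out of the Git-tracked config file.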
## Content Addressable Storage Format

DVC optimises the storage space used in the local cache and remotes by ensuring
Thanks for this section, it's definitely interesting content as I mentioned in the previous PR, but still this is more about the DVC cache mechanism. Please make it into a separate page somewhere and leave it out of this doc, or link to it if/where needed, to make the review process easier.
The next brief example shows a directory `myDir` tracked by DVC containing two
files `a` and `b`:

![Local cache and remote storage](/img/cache_structure.png)
This 2nd diagram is confusing though. But again, for now we can leave out all these details about the caching mechanism and storage optimization, which are not specifically about remote storage.
from the local workstation to the remote storage, there should be identical
folders in both places.

## Sharing Files via Remote Storage
Finally, this section could probably be more hands-on as well, showing push and pull and maybe even another diagram. Take a look for example at this page which we recently removed because it wasn't in the right area of our docs (but the content is still relevant and can be recovered to some extent here).

Also, let's make the title more general? And it's nice to mention ML models sometimes in the context of data management (which includes local and remote storage). So here's a new title suggestion:
Suggested change:
- ## Sharing Files via Remote Storage
+ ## Sharing Data and Models
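For the hands-on part, a sketch of the kind of example the section could show (the repository URL is hypothetical):

```
# On the machine that produced the data/models: upload them to the default remote
$ dvc push

# On a collaborator's machine: get the code and metafiles, then the matching data
$ git clone https://github.com/example/project.git
$ cd project
$ dvc pull
```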
Closing as stale for now. Please reopen if needed.
Rel. #2866