Shared cache on NFS Introduced #455
link data files from cache to your workspace. It enables symlinks to avoid
copying large files.

`cache.protected true` - to make links `read only` so that we you don't corrupt
Again, explain it better: since we are going to use symlinks between the cache and the workspace in this case (they are located on different file systems), it is important to protect files so that we don't corrupt the cache accidentally. Mention that `dvc unprotect` should be used in this case, and link to https://dvc.org/doc/user-guide/update-tracked-file
@shcheklein do we need `dvc unprotect` only when we are writing to NFS directly?
We need to run `dvc unprotect` in the client's workspace if we want to edit/rewrite a file that is under DVC control.
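To make the protection point in this thread concrete, here is a minimal shell sketch. It does not call DVC: a temporary directory stands in for the NFS cache and the client workspace (both hypothetical paths), and the manual copy only mimics the effect of `dvc unprotect`, which gives the workspace its own editable file so the cached version stays intact. On the DVC side, the settings under discussion would be `dvc config cache.type symlink` and `dvc config cache.protected true`.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-ins: a temp dir plays the role of the NFS cache and the workspace.
root=$(mktemp -d)
cache="$root/nfs-cache"
work="$root/workspace"
mkdir -p "$cache" "$work"

# A cached data file, made read-only the way a protected cache entry would be.
printf 'v1\n' > "$cache/data.csv"
chmod 444 "$cache/data.csv"

# The workspace sees the data through a symlink instead of a copy.
ln -s "$cache/data.csv" "$work/data.csv"

# "Unprotect" by hand (assumption: this only imitates the effect of `dvc unprotect`):
# swap the link for a private, writable copy, then edit that copy.
cp "$cache/data.csv" "$work/data.csv.tmp"
rm -f "$work/data.csv"
mv "$work/data.csv.tmp" "$work/data.csv"
chmod 644 "$work/data.csv"
printf 'v2\n' > "$work/data.csv"

# The cache still holds the original version; only the workspace copy changed.
cached=$(cat "$cache/data.csv")
edited=$(cat "$work/data.csv")
echo "cache: $cached, workspace: $edited"
```

The point of the sketch is the last two lines: editing happens on a copy, so the shared cache entry is never modified in place.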
Now, add first version of the dataset into the DVC cache (this is done once for
a dataset).
```dvc |
Let's simplify this whole workflow. Let's just ask users to SSH into the NFS server machine and do `git clone .../project`. Move the data into `project` and run `dvc add`, `git commit`, `git push` (and, optionally, `dvc push`) after that. All the stuff below can be adjusted a bit.
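A rough sketch of that simplified workflow, written as a dry run so it can be read (or executed) without touching anything. The repository URL is left as a placeholder because the thread elides it, and `DATA_SRC` plus the `data` directory name are hypothetical; only the sequence of commands comes from the suggestion above.

```shell
#!/bin/sh
set -eu

# Dry-run helper: print each command instead of executing it.
# Set DRY_RUN=0 to actually run the steps on the NFS server machine.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical placeholders; the repository URL is not given in the thread.
REPO_URL="${REPO_URL:-<repo-url>}"
DATA_SRC="${DATA_SRC:-/storage/dataset}"

share_dataset() {
    run git clone "$REPO_URL" project
    run mv "$DATA_SRC" project/data
    run cd project
    run dvc add data
    run git add data.dvc .gitignore
    run git commit -m "Add first version of the dataset"
    run git push
    run dvc push   # optional, as noted in the suggestion
}

plan=$(share_dataset)
echo "$plan"
```

Because the data is moved within the server rather than copied to a client first, this matches the motivation discussed in the thread: the dataset goes under DVC control where it already lives.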
@shcheklein i think it will just confuse the user.
`cp -r . /project/` is also very confusing. I would say we need to explain the motivation here: we want to avoid copying existing data to a client machine just to take it under DVC control.
I also think the `git clone` protocol is a standard way to collaborate and update different requirements. It's better to do this from the NFS server machine. It'll emphasize that NFS takes care of the data.
Really good stuff 🎉 Requires a second iteration to clarify/simplify certain things. Let me know if you need some help with it.
@shcheklein please review this.
possible and have a workspace restoration/switching speed as instant as
`git checkout` for your code.

With large data files it is better to set the cache directory to external NFS.
minor: we use `it's` - it's less formal
`git checkout` for your code.

With large data files it is better to set the cache directory to external NFS.
Not only just it will cache the data faster but also version the data. Suppose,
`cache faster` - I'm not sure I understand this
of a complete dataset. With `cache directory` set to `NFS server` you would
avoid copying large files from NFS server to the machine and DVC will manage the
links from the workspace to cache.
The paragraph above is good, but it repeats the first paragraph of the document and has minor problems. What information are you trying to deliver here? Can you summarize it here in the comments, please? Then we'll see how we can improve the text.
@@ -0,0 +1,148 @@
# Shared Storage on NFS

In the modern software development environment, teams are working together on
`software development` -> `machine learning`
(I even agree that it's software engineering, but it's better to delineate them for now)
@@ -0,0 +1,148 @@
# Shared Storage on NFS
I want to rename it... it's not only about NFS. It's about any network-attached storage. We can do something like: `Share Storage on NAS (NFS)`
In the modern software development environment, teams are working together on
same dataset to get the results. It became necessary that data is accessible and
every team member has a same updated dataset. NFS (Network File System) storage
NAS (NFS is one common example) is widely ...
We can also mention something like: "Here we would like to show you how to set up a shared cache on NFS, but the same idea applies to any other NAS"
team member is using the same cache location.

After configuring NFS on both server and client side. Let's create an export
directory on server side where all data will be stored.
it's better to use `:` at the end of a sentence that introduces a code block
From `/mnt/dataset/` you will be able to access `/storage` directory present in
host server from your local machine.

## Configuring Cache location
`location` -> `Location`
Next, you can easily get this appear in your workspace by:

```dvc
$ cd /home/user/project/
```
before, the project path was `/project`
information on `.dvc` file format, visit
[here](/doc/user-guide/dvc-file-format).

`data` directory will now be a symbolic link to the NFS storage.
Worth writing something similar to the last paragraph of the introduction to reiterate why links are important. Also worth showing the output of `ls -a`.
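As a rough idea of what such a listing could demonstrate, here is a small sketch that does not involve DVC at all. Temporary directories stand in for the NFS mount point and the project workspace (hypothetical paths modeled on `/mnt/dataset/` and the project directory), and the layout is simplified: a real DVC cache stores content under hashed names.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-ins for the NFS mount point and the project workspace.
root=$(mktemp -d)
mkdir -p "$root/mnt/dataset/cache/data" "$root/project"
printf 'sample\n' > "$root/mnt/dataset/cache/data/train.csv"

# Link the workspace `data` directory to the copy that lives on the shared storage.
ln -s "$root/mnt/dataset/cache/data" "$root/project/data"

# List the workspace: `data` is shown as a link (data -> ...), not a copied directory.
ls -la "$root/project"

# The link target and the file contents are reachable through the workspace path.
target=$(readlink "$root/project/data")
content=$(cat "$root/project/data/train.csv")
echo "data -> $target"
```

The listing makes the motivation visible: the workspace holds only a link, so nothing large is copied off the shared storage.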
It looks great! We are almost there. Please check some comments. Also, I'll try to come up with an image, similar to what we have for other use cases. Good stuff.
@ryokugyu any updates on this? :) It's almost done as far as I can tell, would be great to get it merged.
@shcheklein will work on it. Sorry for the delay!
I think that the "Mounted DVC Storage" (which is explained in this interactive example: https://katacoda.com/dvc/courses/examples/mounted-storage) is more general than just NFS and it deprecates this one.
My concern is that it's very specific because of SSHFS, and it's not emphasized enough that NFS, NAS (whatever else?) is covered.
Don't think so. Especially the way interactive tutorials are made - they are extremely dry and do not explain the motivation very well, do not explain what is happening behind the scenes and what the commands are doing.
In this case it is just an interactive example (not a tutorial) and it is referenced from a User Guide page: https://dvc-org-pr-784.herokuapp.com/doc/user-guide/data-sharing/mounted-storage
kk. It's just that in your initial comment you mentioned only the interactive tutorial, and I hadn't had enough time to see the UG changes. Will get back to this one when I have time to read the epic PR :)
I think it is still relevant. Unfortunately, no easy way to reopen it now.
fix #103