Shared cache on NFS Introduced #455
Changes from all commits
0b4f581
d072afc
234172c
f64f3e8
fa047e8
7402588
2fbff29
c44563d
e8ed103
27f8572
c99cc04
1b24399
31c5d42
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,148 @@ | ||
# Shared Storage on NFS | ||
|
||
In the modern software development environment, teams are working together on | ||
Review comment: software development -> machine learning (I even agree that it's software engineering, but it's better to delineate them for now) |
||
the same dataset to get results. It has become necessary that data is accessible and | ||
that every team member has the same up-to-date dataset. NFS (Network File System) storage | ||
Review comment: we can also mention something like: "Here we would like to show you how to setup a shared cache on NFS, but the same idea applies to any other NAS" |
||
is widely used for storing and sharing files on the network. This allows | ||
better resource utilization, such as the ability to store large datasets on a | ||
single host machine. | ||
|
||
With DVC, you can easily set up a shared cache storage on the NFS server that | ||
will allow your team to share and store data for your projects as effectively as | ||
possible and have workspace restoration/switching as fast as | ||
`git checkout` is for your code. | ||
|
||
With large data files it is better to set the cache directory to external NFS. | ||
Review comment: minor: we use |
||
Not only will it cache the data faster, it will also version the data. Suppose | ||
|
||
we have a dataset with 1 million images. With DVC, we can have multiple versions | ||
of a dataset without affecting each other's work and without creating duplicates | ||
of the complete dataset. With the `cache directory` set to the `NFS server` you | ||
avoid copying large files from the NFS server to the machine, and DVC will manage the | ||
links from the workspace to the cache. | ||
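The "multiple versions without duplicates" idea can be illustrated with a minimal sketch that uses only standard shell tools. It is not DVC's implementation: the directory names, the `store` helper, and the use of `cksum` as the content key are all assumptions made for illustration (DVC itself keys cache objects by MD5 hash).

```shell
# Hypothetical stand-ins for a cache and two versions of a dataset.
ROOT="$(mktemp -d)"
CACHE="$ROOT/cache"
mkdir -p "$CACHE" "$ROOT/v1" "$ROOT/v2"

# Version 1 has two files; version 2 keeps one of them and adds a new one.
printf 'cat\n' > "$ROOT/v1/cat.jpg"
printf 'dog\n' > "$ROOT/v1/dog.jpg"
printf 'cat\n' > "$ROOT/v2/cat.jpg"
printf 'fox\n' > "$ROOT/v2/fox.jpg"

# Store a file in the cache under a name derived from its content.
# cksum is used only because it is available everywhere (assumption for
# this sketch); identical content always maps to the same cache object.
store() {
  key="$(cksum < "$1" | cut -d' ' -f1)"
  [ -e "$CACHE/$key" ] || cp "$1" "$CACHE/$key"
}

for f in "$ROOT"/v1/* "$ROOT"/v2/*; do
  store "$f"
done

# Four files across two versions, but the unchanged file is stored once.
OBJECTS="$(ls "$CACHE" | wc -l | tr -d ' ')"
echo "$OBJECTS"
```

Because the file shared by both versions is stored only once, the cache ends up with three objects for four workspace files.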
|
||
Review comment: the paragraph above is good but feels repetitive to the first paragraph in the document and has minor problems. What information you are trying to deliver here? Can you summarize it here in the comments, please? And we'll see how can we improve the text. |
||
## Preparation | ||
|
||
First configure NFS server and client machine, following this | ||
[link](https://vitux.com/install-nfs-server-and-client-on-ubuntu/). | ||
|
||
To make this work on a shared server, after configuring the NFS server and | ||
client we need to set up a shared cache location for your projects, so that every | ||
team member uses the same cache location. | ||
|
||
After configuring NFS on both the server and client side, let's create an export | ||
directory on the server side where all data will be stored. | ||
Review comment: it's better to use |
||
|
||
```dvc | ||
$ mkdir -p /storage | ||
``` | ||
|
||
You will have to make sure that the directory has proper permissions set up, so | ||
that everyone on your team can read and write to it and can access cache files | ||
written by others. The most straightforward way to do that is to make sure that | ||
you and your colleagues are members of the same group (e.g. 'users') and that | ||
your shared directory is owned by that group and has the respective permissions. | ||
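One way to apply the group-ownership advice above is sketched here. To keep the sketch runnable without root access, it works on a temporary directory and falls back to the current user's primary group; on a real server the directory would be `/storage` and the group something like `users`. Both variable names are assumptions made for illustration.

```shell
# Stand-in for the exported directory; on a real server this would be /storage.
STORAGE="$(mktemp -d)/storage"
mkdir -p "$STORAGE"

# Hypothetical shared group. The current user's primary group is used here so
# the sketch runs without extra setup; a team would use e.g. 'users'.
SHARED_GROUP="$(id -gn)"

# Hand the directory over to the shared group.
chgrp "$SHARED_GROUP" "$STORAGE"

# rwx for owner and group, r-x for others. The leading 2 sets the setgid bit,
# so files created inside inherit the directory's group.
chmod 2775 "$STORAGE"

# Show the resulting mode and ownership.
ls -ld "$STORAGE"
```

Note that each team member's `umask` also matters: with a `umask` of `022`, newly written files are not group-writable even inside a setgid directory, so a value such as `002` may be needed.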
|
||
Let's create a mount point on the client side. | ||
|
||
```dvc | ||
$ mkdir -p /mnt/dataset/ | ||
``` | ||
|
||
From `/mnt/dataset/` you will be able to access the `/storage` directory on the | ||
host server from your local machine. | ||
|
||
## Configuring Cache location | ||
Review comment: location -> Location |
||
|
||
After mounting the shared directory on the client side, and assuming the project code is | ||
present in `/project`, let's initialize a `dvc repo`. | ||
|
||
```dvc | ||
$ cd /project/ | ||
$ git init | ||
$ dvc init | ||
$ git add .dvc .gitignore | ||
$ git commit . -m "initialize DVC" | ||
``` | ||
|
||
With `dvc init`, we initialized a DVC repository. For more information, visit | ||
[here](/doc/get-started/initialize). | ||
|
||
**Tell DVC to use the directory we've set up as an external cache location by | ||
running:** | ||
|
||
```dvc | ||
$ dvc cache dir /mnt/dataset/storage | ||
``` | ||
|
||
`dvc cache dir /path/to/cache/directory` - sets cache directory location. | ||
|
||
```dvc | ||
$ dvc config cache.type "reflink,symlink,hardlink,copy" | ||
``` | ||
|
||
`cache.type "reflink,symlink,hardlink,copy"` - link type that DVC should use to | ||
|
||
link data files from the cache to your workspace. It enables symlinks to avoid | ||
copying large files. For more information, visit | ||
[here](/doc/user-guide/large-dataset-optimization). | ||
|
||
```dvc | ||
$ dvc config cache.protected true | ||
``` | ||
|
||
`cache.protected true` - makes links `read only` so that you don't corrupt | ||
Review comment: again explain it better that since we are going to use symlinks in this case between cache and workspace (since they are located on different file systems) it important to protect files so that we don't corrupt the cache accidentally. Mention that
Review comment: @shcheklein we need
Review comment: we need to run |
||
data accidentally present in the workspace. We are using `symlinks` | ||
between the cache and the local workspace because the two are located on different | ||
filesystems. | ||
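Why read-only links matter can be seen in a minimal simulation with plain shell tools. This is not what DVC runs internally: the directory layout, the cache object name `a1b2c3`, and the file name `image.jpg` are all hypothetical.

```shell
# Hypothetical stand-ins for the NFS-backed cache and the local workspace.
ROOT="$(mktemp -d)"
CACHE="$ROOT/cache"
WORKSPACE="$ROOT/workspace"
mkdir -p "$CACHE" "$WORKSPACE"

# A file as it might sit in the cache.
printf 'image bytes\n' > "$CACHE/a1b2c3"

# Mark the cached file read-only, which is the effect protection aims for.
chmod 444 "$CACHE/a1b2c3"

# The workspace gets a symlink instead of a copy, so no data is duplicated.
ln -s "$CACHE/a1b2c3" "$WORKSPACE/image.jpg"

# Reading through the link works as usual.
cat "$WORKSPACE/image.jpg"

# The mode of the cached file shows no write permission for anyone.
ls -l "$CACHE/a1b2c3"
```

Since the workspace entry is only a link, any edit made through it would land on the cached file itself, which is why the cached copy is kept read-only.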
|
||
Also, let Git know about the changes we have done. | ||
|
||
```dvc | ||
$ git add .dvc .gitignore | ||
$ git commit . -m "DVC cache location updated" | ||
``` | ||
|
||
## Add data to DVC cache | ||
|
||
Now, add the first version of the dataset into the DVC cache (this is done once for | ||
|
||
a dataset). | ||
|
||
```dvc | ||
Review comment: let's simplify all this workflow. Let's just ask users to SSH into NFS server machine, do
Review comment: @shcheklein i think it will just confuse the user.
Review comment: I also, think |
||
$ cd /mnt/dataset/ | ||
$ cp -r . /project/ | ||
$ cd /project | ||
$ mv /mnt/dataset/project_data/ data/ | ||
$ dvc add data | ||
``` | ||
|
||
After copying the data, we moved the data present in | ||
`/mnt/dataset/project_data/` to the `./data` directory. This is only done once for a | ||
dataset. | ||
|
||
`dvc add data` will take the files in the `data` directory under DVC control. By default | ||
an added file is committed to the DVC cache. After `dvc add`, DVC will | ||
`unprotect` all the data. For more information, visit | ||
[here](/doc/user-guide/update-tracked-file). | ||
|
||
Now, commit the changes to `data.dvc` and `.gitignore`, and push them to your Git remote: | ||
|
||
```dvc | ||
$ git add data.dvc .gitignore | ||
$ git commit . -m "add first version of the dataset" | ||
$ git tag -a "v1.0" -m "dataset v1.0" | ||
$ git push origin HEAD | ||
$ git push origin v1.0 | ||
``` | ||
|
||
Next, you can make the data appear in your workspace by running: | ||
|
||
```dvc | ||
$ cd /home/user/project/ | ||
Review comment: before the project path was |
||
$ git pull | ||
$ dvc checkout | ||
``` | ||
|
||
After `git pull`, you will be able to see a `data.dvc` file. For more | ||
information on the `.dvc` file format, visit | ||
[here](/doc/user-guide/dvc-file-format). | ||
|
||
The `data` directory will now be a symbolic link to the NFS storage. | ||
Review comment: worth writing something similar to the last paragraph in the introduction to reiterate on why links are important, worth showing an output of the |
Review comment: I want to rename it ... it's not about NFS only. It's about any network attached storage. We can do something like:
Share Storage on NAS (NFS)