Shared cache on NFS Introduced #455
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,144 @@ | ||
# Shared Storage on NFS | ||
|
||
In the modern software development environment, teams are working together on | ||
Review comment: software development -> machine learning (I even agree that it's software engineering, but it's better to delineate them for now)
the same dataset to get results. It has become necessary that the data is accessible and | ||
that every team member has the same up-to-date dataset. For this example, we will be using | ||
NFS (Network File System) for storing and sharing files on the network. This | ||
allows better resource utilization, such as the ability to store a large, | ||
disk-consuming dataset on a single host machine. | ||
|
||
To optimize performance, we can set the cache directory on the NFS server | ||
by changing the DVC config file, which is located at `.dvc/config`. With DVC, | ||
you can easily set up a | ||
shared cache storage on the NFS server that will allow your team to share and | ||
store data for your projects as effectively as possible, with workspace | ||
restoration/switching as instant as `git checkout` is for your code. | ||
|
||
With large data files it is better to set the cache directory to external NFS. | ||
Review comment: minor: we use
Not only will it cache the data faster, it will also version the data. Suppose | ||
we have a dataset with 1 million images. With DVC, we can have multiple versions | ||
of a dataset without affecting each other's work and without creating duplicates | ||
of the complete dataset. With the cache directory set on the NFS server, you | ||
avoid copying large files from the NFS server to your machine, and DVC manages the | ||
links from the workspace to the cache. For more information, visit | ||
[Data and Model Files Versioning](/doc/use-cases/data-and-model-files-versioning). | ||
|
||
Review comment: the paragraph above is good but feels repetitive to the first paragraph in the document and has minor problems. What information are you trying to deliver here? Can you summarize it here in the comments, please? And we'll see how we can improve the text.
### Preparation | ||
|
||
First, configure the NFS server and client machines by following this | ||
[guide](https://vitux.com/install-nfs-server-and-client-on-ubuntu/). | ||
|
||
To make this work on a shared server, after configuring the NFS server and | ||
client we need to set up a shared cache location for your projects, so that every | ||
team member uses the same cache location. | ||
|
||
After configuring NFS on both the server and the client side, let's create an export | ||
directory on the server side where all data will be stored. | ||
Review comment: it's better to use
|
||
```dvc | ||
$ mkdir -p /storage | ||
``` | ||
|
||
You will have to make sure that the directory has proper permissions set up, so | ||
that everyone on your team can read and write to it and can access cache files | ||
written by others. The most straightforward way to do that is to make sure that | ||
you and your colleagues are members of the same group (e.g. 'users') and that | ||
your shared directory is owned by that group and has respective permissions. | ||
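A minimal sketch of the group-based setup described above, assuming a POSIX shell with `chgrp` and `chmod`. `SHARED_DIR` and `SHARED_GROUP` are hypothetical variables introduced for illustration: on the NFS server you would point them at `/storage` and at your team's group (e.g. `users`), and you may need `sudo`. The defaults use a temporary directory and your own primary group so the snippet can be tried safely.

```shell
# Hypothetical defaults so the sketch is safe to run anywhere; on the NFS
# server you would set SHARED_DIR=/storage and SHARED_GROUP=users (or similar).
SHARED_DIR="${SHARED_DIR:-$(mktemp -d)}"
SHARED_GROUP="${SHARED_GROUP:-$(id -gn)}"

# Hand the shared directory (and anything already in it) to the team's group.
chgrp -R "$SHARED_GROUP" "$SHARED_DIR"

# Let group members read and write. Capital X adds execute (traverse)
# permission to directories and to files that are already executable.
chmod -R g+rwX "$SHARED_DIR"

# Set the setgid bit on the top-level directory so that new files and
# subdirectories created in it inherit the directory's group.
chmod g+s "$SHARED_DIR"

# Show the resulting owner, group, and mode.
ls -ld "$SHARED_DIR"
```

Whether newly created files end up group-writable also depends on each user's `umask`, so this is a starting point rather than a complete permissions policy.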
|
||
Let's create a mount point on the client side. | ||
|
||
```dvc | ||
$ mkdir -p /mnt/dataset/ | ||
``` | ||
|
||
From `/mnt/dataset/` you will be able to access the `/storage` directory on the | ||
host server from your local machine. | ||
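Before pointing DVC at the mount, it may be worth checking that the mount point exists and is writable for your user. The helper below is a hypothetical sketch (the name `check_shared_dir` is not part of NFS or DVC) and uses only standard shell tools.

```shell
# check_shared_dir DIR: report whether DIR exists and the current user can
# create files in it. Hypothetical helper, not an NFS or DVC command.
check_shared_dir() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  # Try to create and then remove a small probe file.
  probe="$dir/.write_probe.$$"
  if touch "$probe" 2>/dev/null; then
    rm -f "$probe"
    echo "writable: $dir"
  else
    echo "not writable: $dir"
    return 1
  fi
}
```

On a client you would call it as `check_shared_dir /mnt/dataset`. A non-zero exit status suggests the export is not mounted or the permissions need another look.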
|
||
### Configuring Cache location | ||
|
||
After mounting the shared directory on the client side, and assuming the project code is | ||
in `/home/user/project1`, let's initialize a DVC repository. | ||
|
||
```dvc | ||
$ cd /home/user/project1/ | ||
$ git init | ||
$ dvc init | ||
$ git add .dvc .gitignore | ||
$ git commit . -m "initialize DVC" | ||
``` | ||
|
||
With `dvc init`, we initialized a DVC repository. DVC can now track the data | ||
files we add to it. | ||
|
||
Tell DVC to use the directory we've set up as an external cache location by | ||
running: | ||
|
||
```dvc | ||
$ dvc config cache.dir /mnt/dataset/storage | ||
$ dvc config cache.type "reflink,symlink,hardlink,copy" | ||
$ dvc config cache.protected true | ||
$ git add .dvc .gitignore | ||
$ git commit . -m "DVC cache location updated" | ||
``` | ||
|
||
By default, the cache is located in `.dvc/cache`. `dvc config cache.dir` | ||
changes the location of the cache directory to `/mnt/dataset/storage`. | ||
|
||
`cache.dir /path/to/cache/directory` - sets the cache directory location. | ||
Alternatively, we can also use `dvc cache dir /path/to/cache/directory`. | ||
|
||
`cache.type "reflink,symlink,hardlink,copy"` - link type that DVC should use to | ||
link data files from the cache to your workspace, tried in the order listed. Including | ||
`symlink` lets DVC avoid copying large files. | ||
|
||
`cache.protected true` - makes links read-only so that you don't corrupt | ||
Review comment: again explain it better that since we are going to use symlinks in this case between cache and workspace (since they are located on different file systems) it is important to protect files so that we don't corrupt the cache accidentally. Mention that
Review comment: @shcheklein we need
Review comment: we need to run
the cached data accidentally through the workspace. | ||
|
||
For more information on `config` options, see the | ||
[configuration sections](https://dvc.org/doc/commands-reference/config#configuration-sections). | ||
|
||
Also, let Git know about the changes we have made. | ||
|
||
#### Add data to DVC cache | ||
|
||
Now, add the first version of the dataset to the DVC cache (this is done once for | ||
a dataset). | ||
|
||
Review comment: let's simplify all this workflow. Let's just ask users to SSH into NFS server machine, do
Review comment: @shcheklein i think it will just confuse the user.
Review comment: I also, think
```dvc | ||
$ cd /mnt/dataset/ | ||
$ cp -r . /home/user/project1/ | ||
$ cd /home/user/project1 | ||
$ mv /mnt/dataset/project1_data/ data/ | ||
$ dvc add data | ||
``` | ||
|
||
After copying the data, we moved the data from | ||
`/mnt/dataset/project1_data/` to the `./data` directory. This is only done once for | ||
a dataset. | ||
|
||
`dvc add data` puts the files in the `data` directory under DVC control. By default, | ||
added files are committed to the DVC cache. | ||
|
||
Now, commit `data.dvc` and `.gitignore`, tag this dataset version, and push to your Git remote: | ||
|
||
```dvc | ||
$ git add data.dvc .gitignore | ||
$ git commit . -m "add first version of the dataset" | ||
$ git tag -a "v1.0" -m "dataset v1.0" | ||
$ git push origin HEAD | ||
$ git push origin v1.0 | ||
``` | ||
|
||
Next, you can easily make this dataset appear in your workspace by running: | ||
|
||
```dvc | ||
$ cd /home/user/project1/ | ||
$ git pull | ||
$ dvc checkout | ||
``` | ||
|
||
After `git pull`, you will be able to see a `data.dvc` file. For more | ||
information on the `.dvc` file format, see the | ||
[DVC file format](/doc/user-guide/dvc-file-format) guide. | ||
|
||
The `data` directory will now be a symbolic link to the NFS storage. | ||
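To see what a symlinked workspace looks like, the sketch below simulates one with plain shell tools rather than DVC itself. The `cache` and `workspace` directories and the file names are hypothetical stand-ins and do not reflect DVC's actual cache layout.

```shell
# Simulate a cache and a workspace in a throwaway directory (hypothetical layout).
demo="$(mktemp -d)"
mkdir -p "$demo/cache" "$demo/workspace"
printf 'pretend this is a large data file\n' > "$demo/cache/blob"

# Link the cached file into the workspace instead of copying it.
ln -s "$demo/cache/blob" "$demo/workspace/data.bin"

# -L is true only for symbolic links; readlink prints the link target.
if [ -L "$demo/workspace/data.bin" ]; then
  echo "data.bin is a symlink to: $(readlink "$demo/workspace/data.bin")"
fi

# Reading through the link returns the cached content without a second copy.
cat "$demo/workspace/data.bin"
```

In a real project, running `ls -l` or `readlink` on the linked path in the workspace should similarly show a target inside the shared cache directory.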
Review comment: worth writing something similar to the last paragraph in the introduction to reiterate on why links are important, worth showing an output of the
Review comment: I want to rename it ... it's not about NFS only. It's about any network-attached storage. We can do something like:
Share Storage on NAS (NFS)