Shared cache on NFS Introduced #455

Closed
wants to merge 13 commits into from
6 changes: 4 additions & 2 deletions src/Documentation/sidebar.json
@@ -36,12 +36,14 @@
 "files": [
 "data-and-model-files-versioning.md",
 "share-data-and-model-files.md",
-"multiple-data-scientists-on-a-single-machine.md"
+"multiple-data-scientists-on-a-single-machine.md",
+"shared-storage-on-nfs.md"
 ],
 "labels": {
 "data-and-model-files-versioning.md": "Data & Model Files Versioning",
 "share-data-and-model-files.md": "Share Data & Model Files",
-"multiple-data-scientists-on-a-single-machine.md": "Shared Development Machine"
+"multiple-data-scientists-on-a-single-machine.md": "Shared Development Machine",
+"shared-storage-on-nfs.md": "Shared Storage on NFS"
 }
 },
 {
148 changes: 148 additions & 0 deletions static/docs/use-cases/shared-storage-on-nfs.md
@@ -0,0 +1,148 @@
# Shared Storage on NFS
Member

I want to rename it ... it's not about NFS only. It's about any network-attached storage. We can do something like:

Share Storage on NAS (NFS)


In the modern software development environment, teams are working together on
Member

software development -> machine learning

(I even agree that it's software engineering, but it's better to delineate them for now)

the same dataset to get results. It has become necessary that the data is
accessible and that every team member has the same up-to-date dataset. NFS (Network File System) storage
Member

NAS (NFS is one common example) is widely ...

Member

we can also mention something like: "Here we would like to show you how to set up a shared cache on NFS, but the same idea applies to any other NAS"

is widely used for storing and sharing files on the network. This allows for
better resource utilization, such as the ability to store large datasets on a
single host machine.

With DVC, you can easily set up a shared cache storage on the NFS server that
will allow your team to share and store data for your projects as effectively
as possible, and to restore or switch workspaces as quickly as `git checkout`
does for your code.

With large data files it is better to set the cache directory to external NFS.
Member

minor: we use it's - it's less formal

Not only just it will cache the data faster but also version the data. Suppose,
Member

cache faster - I'm not sure I understand this

we have a dataset with 1 million images. With DVC, we can have multiple versions
of a dataset without affecting each other's work and without creating duplicates
of the complete dataset. With `cache directory` set to `NFS server` you
avoid copying large files from the NFS server to the machine, and DVC manages the
links from the workspace to the cache.

Member

the paragraph above is good but feels repetitive after the first paragraph in the document and has minor problems. What information are you trying to deliver here? Can you summarize it here in the comments, please? And we'll see how we can improve the text.

## Preparation

First configure NFS server and client machine, following this
[link](https://vitux.com/install-nfs-server-and-client-on-ubuntu/).

To make it work on a shared server, after configuring the NFS server and
client we need to set up a shared cache location for your projects, so that every
team member is using the same cache location.

After configuring NFS on both server and client side. Let's create an export
directory on server side where all data will be stored.
Member

it's better to use : when you have a code block you are writing about in the sentence


```dvc
$ mkdir -p /storage
```

You will have to make sure that the directory has proper permissions set up, so
that everyone on your team can read and write to it and can access cache files
written by others. The most straightforward way to do that is to make sure that
you and your colleagues are members of the same group (e.g. 'users') and that
your shared directory is owned by that group and has respective permissions.
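
The permission setup described above can be sketched as a small shell helper. This is only an illustration under assumptions: the function name `setup_shared_dir` is hypothetical, the group (e.g. `users`) must already exist and contain your team members, and changing `/storage` on a real server will normally require root.

```bash
# Sketch: create a directory that is owned by a shared group and group-writable.
# Assumption: the caller passes an existing group that all team members belong to.
setup_shared_dir() {
  local dir="$1" group="$2"
  mkdir -p "$dir"
  # Hand the directory over to the shared group.
  chgrp "$group" "$dir"
  # 2775 = setgid bit (new entries inherit the group) + rwx for owner and group, r-x for others.
  chmod 2775 "$dir"
}
```

On the server this might be called as `setup_shared_dir /storage users` from a root shell. Team members may also need a group-writable umask (such as `002`) so that files they create stay writable by others.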

Let's create a mount point on the client side.

```dvc
$ mkdir -p /mnt/dataset/
```

From `/mnt/dataset/` you will be able to access the `/storage` directory on the
host server from your local machine.
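
The mount command itself is not shown in this section. A minimal sketch of that step, assuming a hypothetical server name such as `nfs-server` and that `/storage` is already exported to this client (the helper name `mount_nfs_if_needed` is also made up for illustration):

```bash
# Sketch: mount an NFS export on a mount point unless it is already mounted.
# Assumptions: the "server:/export" source is supplied by the caller,
# and the function runs with enough privileges to call mount.
mount_nfs_if_needed() {
  local source="$1" target="$2"
  if mountpoint -q "$target"; then
    echo "already mounted: $target"
  else
    mount -t nfs "$source" "$target"
  fi
}
```

For example, `mount_nfs_if_needed nfs-server:/storage /mnt/dataset` run as root. Checking with `mountpoint` first avoids stacking a second mount on the same directory.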

## Configuring Cache location
Member

location -> Location


After mounting the shared directory on the client side, and assuming the project
code is present in `/project`, let's initialize a `dvc repo`.

```dvc
$ cd /project/
$ git init
$ dvc init
$ git add .dvc .gitignore
$ git commit . -m "initialize DVC"
```

With `dvc init`, we initialized a DVC repository. For more information, visit
[here](/doc/get-started/initialize).

**Tell DVC to use the directory we've set up as an external cache location by
running:**

```dvc
$ dvc cache dir /mnt/dataset/storage
```

`dvc cache dir /path/to/cache/directory` - sets cache directory location.

```dvc
$ dvc config cache.type "reflink,symlink,hardlink,copy"
```

`cache.type "reflink,symlink,hardlink,copy"` - link type that DVC should use to
link data files from cache to your workspace. It enables symlinks to avoid
copying large files. For more information, visit
[here](/doc/user-guide/large-dataset-optimization).

```dvc
$ dvc config cache.protected true
```

`cache.protected true` - to make links `read only` so that you don't corrupt
Member

again, explain it better: since we are going to use symlinks in this case between cache and workspace (since they are located on different file systems), it is important to protect files so that we don't corrupt the cache accidentally. Mention that dvc unprotect should be used in this case, link to https://dvc.org/doc/user-guide/update-tracked-file

Contributor Author

@ryokugyu ryokugyu Jul 22, 2019

@shcheklein we need dvc unprotect only when we are writing to NFS directly?

Member

we need to run dvc unprotect in the client's workspace if we want to edit/rewrite the file that is under DVC control.

data in the cache accidentally from the workspace. This is important because we
are using `symlinks` between the cache and the local workspace, as both are
located on different file systems.

Also, let Git know about the changes we have made.

```dvc
$ git add .dvc .gitignore
$ git commit . -m "DVC cache location updated"
```

## Add data to DVC cache

Now, add the first version of the dataset into the DVC cache (this is done once for
a dataset).

Member

let's simplify all this workflow. Let's just ask users to SSH into the NFS server machine and do git clone .../project. Move the data into the project and run dvc add, git commit, git push (dvc push optional) after that. All the stuff below can be adjusted a bit.

Contributor Author

@shcheklein i think it will just confuse the user.

Member

cp -r . /project/ is very confusing also. I would say we need to explain the motivation here - we want to avoid copying existing data to a client machine to take it under DVC control.

I also think the git clone protocol is a standard way to collaborate and update different requirements. It's better to do this from the NFS server machine. It'll emphasize that NFS takes care of the data.

```dvc
$ cd /mnt/dataset/
$ cp -r . /project/
$ cd /project
$ mv /mnt/dataset/project_data/ data/
$ dvc add data
```

After copying the data, we have moved the data that was present in
`/mnt/dataset/project_data/` to the `./data` directory. This is only done once for a
dataset.

`dvc add data` will take the files in the `data` directory under DVC control. By
default, an added file is committed to the DVC cache. After `dvc add`, DVC will
`unprotect` all the data. For more information, visit
[here](/doc/user-guide/update-tracked-file).

Now, commit `data.dvc` and `.gitignore` and push them to your Git remote:

```dvc
$ git add data.dvc .gitignore
$ git commit . -m "add first version of the dataset"
$ git tag -a "v1.0" -m "dataset v1.0"
$ git push origin HEAD
$ git push origin v1.0
```

Next, you can easily make this data appear in your workspace by running:

```dvc
$ cd /home/user/project/
$ git pull
$ dvc checkout
```

Member

before the project path was /project

After `git pull`, you will be able to see a `data.dvc` file. To see more
information on `.dvc` file format, visit
[here](/doc/user-guide/dvc-file-format).

The `data` directory will now be a symbolic link to the NFS storage.
Member

worth writing something similar to the last paragraph in the introduction to reiterate on why links are important, worth showing an output of the ls -a
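
One way to inspect the links without relying on any particular listing output is a small helper like the sketch below. The function name `describe_link` is hypothetical, and whether the link sits on the `data` directory itself or on the files inside it may depend on the DVC version, so the helper accepts any path.

```bash
# Sketch: report whether a workspace path is a symlink and, if so, where it points.
describe_link() {
  local path="$1"
  if [ -L "$path" ]; then
    # readlink prints the stored link target without resolving it further.
    printf '%s -> %s\n' "$path" "$(readlink "$path")"
  else
    printf '%s is not a symlink\n' "$path"
  fi
}
```

It could be run as `describe_link data` from the project directory after `dvc checkout`. A target under the mounted cache directory would indicate that the data was linked rather than copied.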