Shared cache on NFS Introduced #455 (Closed) · wants to merge 13 commits
6 changes: 4 additions & 2 deletions src/Documentation/sidebar.json

```diff
@@ -36,12 +36,14 @@
       "files": [
         "data-and-model-files-versioning.md",
         "share-data-and-model-files.md",
-        "multiple-data-scientists-on-a-single-machine.md"
+        "multiple-data-scientists-on-a-single-machine.md",
+        "shared-storage-on-nfs.md"
       ],
       "labels": {
         "data-and-model-files-versioning.md": "Data & Model Files Versioning",
         "share-data-and-model-files.md": "Share Data & Model Files",
-        "multiple-data-scientists-on-a-single-machine.md": "Shared Development Machine"
+        "multiple-data-scientists-on-a-single-machine.md": "Shared Development Machine",
+        "shared-storage-on-nfs.md": "Shared Storage on NFS"
       }
     },
     {
```
144 changes: 144 additions & 0 deletions static/docs/use-cases/shared-storage-on-nfs.md
# Shared Storage on NFS
In a modern machine learning environment, teams work together on the same
dataset to get results, so the data must be accessible and every team member
needs the same, up-to-date copy of it. For this example, we will use NFS
(Network File System), although the same approach applies to any
network-attached storage (NAS), for storing and sharing files over the network.
This allows for better resource utilization, such as the ability to keep a
large, disk-consuming dataset on a single host machine.

To optimize performance, we can place the cache directory on the NFS server by
editing the DVC config file, located at `.dvc/config`. With DVC, you can easily
set up a shared cache on the NFS server that lets your team store and share
data for your projects as effectively as possible, with workspace
restoration/switching as instant as `git checkout` is for your code.

With large data files, it's better to set the cache directory to the external
NFS. Suppose we have a dataset with 1 million images: with DVC, we can keep
multiple versions of that dataset without them affecting each other, and
without duplicating the complete dataset. With the cache directory set to the
NFS server, you avoid copying large files from the server to each machine, and
DVC manages the links from the workspace to the cache. For more information, see
[Data and Model Files Versioning](/doc/use-cases/data-and-model-files-versioning).


### Preparation

First, configure the NFS server and client machines, following this
[guide](https://vitux.com/install-nfs-server-and-client-on-ubuntu/).

To make this work on a shared server, we need to set up a shared cache
location for your projects, so that every team member uses the same cache.
After configuring NFS on both the server and the client side, let's create an
export directory on the server where all data will be stored:

```dvc
$ mkdir -p /storage
```
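The new directory also needs to be exported over NFS. As a rough sketch (the
export options here are assumptions; consult the guide above for the options
that fit your setup):

```dvc
$ echo "/storage *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
$ sudo exportfs -ra
```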

You will have to make sure that the directory has proper permissions set up, so
that everyone on your team can read and write to it, and can access cache files
written by others. The most straightforward way to do that is to make sure that
you and your colleagues are members of the same group (e.g. `users`), and that
the shared directory is owned by that group and has the respective permissions.
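For example, assuming the shared group is called `users` (the group name is an
assumption; adjust it to your team's setup):

```dvc
$ sudo chown -R root:users /storage
$ sudo chmod -R g+rwX /storage
$ sudo chmod g+s /storage
```

The setgid bit (`g+s`) makes files created later inherit the group, so cache
files written by one team member stay accessible to the others.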

Next, let's create a mount point on the client side:

```dvc
$ mkdir -p /mnt/dataset/
```

From `/mnt/dataset/`, you will be able to access the `/storage` directory of
the host server from your local machine.
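The actual mount can then be performed on the client, for example (the server
hostname `nfs-server` is an assumption; use your server's address):

```dvc
$ sudo mount -t nfs nfs-server:/storage /mnt/dataset/
```

To make the mount persist across reboots, add a corresponding entry to
`/etc/fstab`.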

### Configuring the Cache Location

After mounting the shared directory on the client side, and assuming the
project code lives in `/home/user/project1`, let's initialize a DVC repository:

```dvc
$ cd /home/user/project1/
$ git init
$ dvc init
$ git add .dvc .gitignore
$ git commit . -m "initialize DVC"
```

With `dvc init`, we initialized a DVC repository. DVC is now ready to track the
data files that we add to it.

Tell DVC to use the directory we've set up as an external cache location by
running:

```dvc
$ dvc config cache.dir /mnt/dataset/storage
$ dvc config cache.type "reflink,symlink,hardlink,copy"
$ dvc config cache.protected true
$ git add .dvc .gitignore
$ git commit . -m "DVC cache location updated"
```

By default, the cache is located in `.dvc/cache`.

`cache.dir /path/to/cache/directory` - sets the cache directory location, in
our case `/mnt/dataset/storage`. Alternatively, we can use
`dvc cache dir /path/to/cache/directory`.

`cache.type "reflink,symlink,hardlink,copy"` - the link types DVC may use to
link data files from the cache into your workspace, in order of preference.
Because in this setup the cache is on a different file system than the
workspace, DVC falls back to symlinks, which avoids copying large files.

`cache.protected true` - makes the linked files in your workspace read-only.
Since the cache and the workspace live on different file systems and are
connected by symlinks, editing a linked file in place would corrupt the cache.
If you need to edit or rewrite a file that is under DVC control, run
`dvc unprotect` on it first; see
[Updating Tracked Files](https://dvc.org/doc/user-guide/update-tracked-file)
for details.

For more information on `config` options, see the
[`config` command reference](https://dvc.org/doc/commands-reference/config#configuration-sections).
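After the commands above, the cache section of `.dvc/config` should look
roughly like this (a sketch; the exact formatting may differ between DVC
versions):

```dvc
[cache]
dir = /mnt/dataset/storage
type = "reflink,symlink,hardlink,copy"
protected = true
```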

Note that the `git commit` above also records these configuration changes in
Git, so the whole team shares the same setup.

#### Add data to DVC cache

Now, add the first version of the dataset to the DVC cache (this is done once
per dataset):

```dvc
$ cd /home/user/project1
$ mv /mnt/dataset/project1_data/ data
$ dvc add data
```

Here, we moved the data from `/mnt/dataset/project1_data/` into the `./data`
directory of the workspace, which avoids keeping an extra copy on the client
machine.

`dvc add data` takes the files in the `data` directory under DVC control. By
default, an added file is committed to the DVC cache (which now lives on the
NFS storage) and linked back into the workspace.
Now, commit the `data.dvc` file that DVC created and push it to your Git remote:

```dvc
$ git add data.dvc .gitignore
$ git commit . -m "add first version of the dataset"
$ git tag -a "v1.0" -m "dataset v1.0"
$ git push origin HEAD
$ git push origin v1.0
```

Next, other team members can easily make the dataset appear in their copy of
the workspace:

```dvc
$ cd /home/user/project1/
$ git pull
$ dvc checkout
```

After `git pull`, you will see the `data.dvc` file. For more information on the
`.dvc` file format, see [DVC File Format](/doc/user-guide/dvc-file-format).

The `data` directory will now be a symbolic link into the NFS storage: since
switching between dataset versions only rewrites links rather than copying
data, it is nearly as instant as `git checkout` is for code.
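You can verify this with `ls -l`; the output below is illustrative (the `...`
elides the usual permission and size columns), and the actual cache path
depends on the dataset's checksum:

```dvc
$ ls -l
... data -> /mnt/dataset/storage/<first-two-checksum-chars>/<rest-of-checksum>
... data.dvc
```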