This repository has been archived by the owner on Feb 20, 2024. It is now read-only.

Merge pull request #93 from nginyc/dev
[V0.0.9] Add model access rights, downloading of trained models & selecting models for training
nginyc authored Dec 18, 2018
2 parents 02e7514 + 472362d commit 317a8a2
Showing 56 changed files with 1,306 additions and 718 deletions.
18 changes: 12 additions & 6 deletions .env.sh
@@ -1,10 +1,15 @@
# Core configuration for Rafiki
# Core external configuration for Rafiki
export DOCKER_NETWORK=rafiki
export RAFIKI_VERSION=0.0.8
export RAFIKI_IP_ADDRESS=127.0.0.1
export DOCKER_SWARM_ADVERTISE_ADDR=127.0.0.1
export RAFIKI_VERSION=0.0.9
export RAFIKI_ADDR=127.0.0.1
export ADMIN_EXT_PORT=3000
export ADMIN_WEB_EXT_PORT=3001
export ADVISOR_EXT_PORT=3002
export POSTGRES_EXT_PORT=5433
export REDIS_EXT_PORT=6380
export DATA_WORKDIR_PATH=$PWD/data # Shares a data folder with containers
export LOGS_WORKDIR_PATH=$PWD/logs # Shares a folder with containers that stores components' logs

# Internal credentials for Rafiki's components
export POSTGRES_USER=rafiki
@@ -22,7 +27,8 @@ export REDIS_HOST=rafiki_cache
export REDIS_PORT=6379
export PREDICTOR_PORT=3003
export ADMIN_WEB_HOST=rafiki_admin_web
export LOCAL_WORKDIR_PATH=$PWD
export DATA_DOCKER_WORKDIR_PATH=/root/rafiki/data
export LOGS_DOCKER_WORKDIR_PATH=/root/rafiki/logs
export DOCKER_WORKDIR_PATH=/root/rafiki
export CONDA_ENVIORNMENT=rafiki

@@ -34,8 +40,8 @@ export RAFIKI_IMAGE_WORKER=rafikiai/rafiki_worker
export RAFIKI_IMAGE_PREDICTOR=rafikiai/rafiki_predictor

# Docker images for dependent services
export IMAGE_POSTGRES=postgres:10.5
export IMAGE_REDIS=redis:5.0-rc
export IMAGE_POSTGRES=postgres:10.5-alpine
export IMAGE_REDIS=redis:5.0.3-alpine3.8

# Utility configuration
export PYTHONPATH=$PWD # Ensures that `rafiki` module can be imported at project root
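
As a rough illustration of how the external settings in `.env.sh` fit together, the following Python sketch reads them from the environment and derives the `host:port` pair for each externally exposed service. The helper name `external_endpoints` is hypothetical and not part of Rafiki; the defaults simply mirror the values exported above.

```python
import os

# Defaults mirror the values exported in .env.sh at this commit.
DEFAULTS = {
    'RAFIKI_ADDR': '127.0.0.1',
    'ADMIN_EXT_PORT': '3000',
    'ADMIN_WEB_EXT_PORT': '3001',
    'ADVISOR_EXT_PORT': '3002',
    'POSTGRES_EXT_PORT': '5433',
    'REDIS_EXT_PORT': '6380',
}

SUFFIX = '_EXT_PORT'


def external_endpoints(env=None):
    """Hypothetical helper: map each externally exposed service to 'host:port'.

    `env` is any mapping of variable names to strings; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    addr = env.get('RAFIKI_ADDR', DEFAULTS['RAFIKI_ADDR'])
    endpoints = {}
    for name, default_port in DEFAULTS.items():
        if not name.endswith(SUFFIX):
            continue  # Skip RAFIKI_ADDR itself
        service = name[:-len(SUFFIX)].lower()  # e.g. ADMIN_WEB_EXT_PORT -> admin_web
        endpoints[service] = '{}:{}'.format(addr, env.get(name, default_port))
    return endpoints


if __name__ == '__main__':
    for service, endpoint in sorted(external_endpoints().items()):
        print(service, endpoint)
```

Running this after `source .env.sh` would list the addresses for the admin, admin web, advisor, PostgreSQL and Redis services under the current configuration.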
2 changes: 1 addition & 1 deletion README.md
@@ -24,7 +24,7 @@ Prerequisites: MacOS or Linux environment
bash scripts/stop.sh
```
More instructions are available in [Rafiki's Developer Guide](https://nginyc.github.io/rafiki/docs/latest/docs/src/dev/setup.html).
More instructions are available in [Rafiki's Developer Guide](https://nginyc.github.io/rafiki/docs/latest/docs/src/dev).

## Acknowledgements

32 changes: 25 additions & 7 deletions docs/src/dev/development.rst
@@ -27,8 +27,8 @@ By default, you can connect to the PostgreSQL DB using a PostgreSQL client (e.g

::

POSTGRES_HOST=localhost
POSTGRES_PORT=5433
RAFIKI_ADDR=127.0.0.1
POSTGRES_EXT_PORT=5433
POSTGRES_USER=rafiki
POSTGRES_DB=rafiki
POSTGRES_PASSWORD=rafiki
@@ -46,8 +46,8 @@ You can connect to Redis DB with `rebrow <https://github.com/marians/rebrow>`_:

::

REDIS_HOST=rafiki_cache
REDIS_PORT=6379
RAFIKI_ADDR=127.0.0.1
REDIS_EXT_PORT=6380

Building Images Locally
--------------------------------------------------------------------
@@ -85,14 +85,32 @@ Build & view Rafiki's Sphinx documentation on your machine with the following co
Troubleshooting
--------------------------------------------------------------------

While building Rafiki's images locally, if you encounter an error like "No space left on device", you might be running out of space allocated for Docker. Try removing all containers & images:
While building Rafiki's images locally, if you encounter errors like "No space left on device",
you might be running out of space allocated for Docker. Try one of the following:

.. code-block:: shell
::

# Prunes dangling images
docker system prune

::

# Delete all containers
docker rm $(docker ps -a -q)
# Delete all images
docker rmi $(docker images -q)

From Mac Mojave onwards, due to Mac's new `privacy protection feature <https://www.howtogeek.com/361707/how-macos-mojaves-privacy-protection-works/>`_,
you might need to explicitly give Docker *Full Disk Access*, restart Docker, or even do a factory reset of Docker.
you might need to explicitly give Docker *Full Disk Access*, restart Docker, or even do a factory reset of Docker.


Using Rafiki Admin's HTTP interface
--------------------------------------------------------------------

To make calls to the HTTP endpoints of Rafiki Admin, you'll first need to authenticate with email & password
against the `POST /tokens` endpoint to obtain an authentication token `token`,
and subsequently add the `Authorization` header for every other call:

::

Authorization: Bearer {{token}}
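
The authentication flow described above can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions, not Rafiki's documented API: the function names are hypothetical, and the JSON request body with `email` and `password` keys is an assumption about how `POST /tokens` accepts credentials.

```python
import json
import urllib.request


def build_token_request(admin_url, email, password):
    """Build (but do not send) the request against POST /tokens.

    ASSUMPTION: credentials are sent as a JSON body with `email` and
    `password` keys; the source only says to authenticate with both.
    """
    body = json.dumps({'email': email, 'password': password}).encode('utf-8')
    return urllib.request.Request(
        admin_url.rstrip('/') + '/tokens',
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


def auth_headers(token):
    """Return the header to attach to every call after authentication."""
    return {'Authorization': 'Bearer {}'.format(token)}
```

To send the request, pass the result of `build_token_request` to `urllib.request.urlopen` and read the token from the response. How the token is encoded in the response body is not shown in this diff, so check the actual response before relying on a particular field layout.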
79 changes: 44 additions & 35 deletions docs/src/dev/setup.rst
@@ -31,68 +31,77 @@ To destroy Rafiki's complete stack:
bash scripts/stop.sh
Adding Nodes to Rafiki
Scaling Rafiki
--------------------------------------------------------------------

Rafiki's default setup runs on a single node, and only runs on CPUs.

Rafiki has its dynamic stack (e.g. train workers, inference workers, predictors)
running as `Docker Swarm Services <https://docs.docker.com/engine/swarm/services/>`_.
running as `Docker Swarm Services <https://docs.docker.com/engine/swarm/services/>`_.
It runs in a `Docker routing-mesh overlay network <https://docs.docker.com/network/overlay/>`_,
using it for networking amongst nodes.

Horizontal scaling can be done by adding more nodes to the swarm.
Rafiki's workers run in Docker containers that extend the Docker image ``nvidia/cuda:9.0-runtime-ubuntu16.04``,
and are capable of leveraging on `CUDA-Capable GPUs <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions>`_
across Rafiki's nodes.

Perform the following for *each* worker node to be added:
To scale Rafiki horizontally and enable running on GPUs, do the following:

1. Connect the node to the same network as the master, so that the node can `join the master's Docker Swarm <https://docs.docker.com/engine/swarm/join-nodes/>`_.
1. If Rafiki is running, stop Rafiki, and have the master node leave its Docker Swarm

2. Configure the node with the script:
2. Put every worker node and the master node into a common network,
and change ``DOCKER_SWARM_ADVERTISE_ADDR`` in ``.env.sh`` to the IP address of the master node
in *the network that your worker nodes are in*

.. code-block:: shell
3. For every node, including the master node, ensure the `firewall rules
allow TCP & UDP traffic on ports 2377, 7946 and 4789
<https://docs.docker.com/network/overlay/#operations-for-all-overlay-networks>`_

bash scripts/setup_node.sh
4. For every node that has GPUs:

4.1. `Install NVIDIA drivers <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for CUDA *9.0* or above

Exposing Rafiki Publicly
--------------------------------------------------------------------
4.2. `Install nvidia-docker2 <https://github.com/NVIDIA/nvidia-docker>`_

4.3. Set the ``default-runtime`` of Docker to `nvidia` (e.g. `instructions here <https://lukeyeager.github.io/2018/01/22/setting-the-default-docker-runtime-to-nvidia.html>`_)

Rafiki runs in a `Docker routing-mesh overlay network <https://docs.docker.com/network/overlay/>`_, with
Rafiki Admin and Rafiki Admin Web running only on the master node.
5. Start Rafiki with ``bash scripts/start.sh``

Edit the following line in ``.env.sh`` with the IP address of the master node in the network you intend to expose Rafiki:
6. For every worker node, have the node `join the master node's Docker Swarm <https://docs.docker.com/engine/swarm/join-nodes/>`_

.. code-block:: shell
7. On the *master* node, for *every* node (including the master node), configure it with the script:

export RAFIKI_IP_ADDRESS=127.0.0.1
::

bash scripts/setup_node.sh

Re-deploy Rafiki. Rafiki Admin and Rafiki Admin Web will be available at that IP address over ports 3000 and 3001,
assuming incoming connections to these ports are allowed.

Enabling GPU for Rafiki's Workers
Exposing Rafiki Publicly
--------------------------------------------------------------------

Rafiki's workers run in Docker containers that extend the Docker image ``nvidia/cuda:9.0-runtime-ubuntu16.04``,
and are capable of leveraging on `CUBA-Capable GPUs <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions>`_
across Rafiki's nodes.
Rafiki Admin and Rafiki Admin Web run on the master node.
Change ``RAFIKI_ADDR`` in ``.env.sh`` to the IP address of the master node
in the network you intend to expose Rafiki in.

Rafiki's default setup would only configure its workers to run on CPUs across Rafiki's nodes. To allow model
training in workers to run on GPUs, perform the following configuration on *each* node in Rafiki:
Example:

::

1. `Install NVIDIA drivers <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for CUDA *9.0* or above
export RAFIKI_ADDR=172.28.176.35

2. `Install nvidia-docker2 <https://github.com/NVIDIA/nvidia-docker>`_
Re-deploy Rafiki. Rafiki Admin and Rafiki Admin Web will be available at that IP address,
over ports 3000 and 3001 (by default), assuming incoming connections to these ports are allowed.

3. Set the ``default-runtime`` of Docker to `nvidia` (e.g. `instructions here <https://lukeyeager.github.io/2018/01/22/setting-the-default-docker-runtime-to-nvidia.html>`_)

Reading Rafiki's logs
--------------------------------------------------------------------

You can read logs of Rafiki Admin, Rafiki Advisor & any of Rafiki's services at in the project's `./logs` directory.
By default, you can read logs of Rafiki Admin, Rafiki Advisor & any of Rafiki's services
in `./logs` directory at the root of the project's directory of the master node.

Using Rafiki Admin's HTTP interface
--------------------------------------------------------------------

To make calls to the HTTP endpoints of Rafiki Admin, you'll need first authenticate with email & password
against the `POST /tokens` endpoint to obtain an authentication token `token`,
and subsequently add the `Authorization` header for every other call:

::
Troubleshooting
--------------------------------------------------------------------

Authorization: Bearer {{token}}
Q: There seems to be connectivity issues amongst containers across nodes!
A: Ensure that containers are able to communicate with one another through the Docker Swarm overlay network: https://docs.docker.com/network/network-tutorial-overlay/#use-an-overlay-network-for-standalone-containers
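
When diagnosing connectivity between nodes, a quick first check is whether the Docker Swarm TCP ports are reachable from another node. The sketch below is an illustration only, with hypothetical function names. It probes TCP ports 2377 (cluster management, open on manager nodes) and 7946 (node-to-node communication). It does not cover the UDP side of 7946 or the overlay traffic on UDP 4789, which a plain TCP connect cannot verify.

```python
import socket

# TCP ports used by Docker Swarm; UDP 7946 and UDP 4789 are not checked here.
SWARM_TCP_PORTS = (2377, 7946)


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False


def check_tcp_ports(host, ports=SWARM_TCP_PORTS, timeout=2.0):
    """Hypothetical helper: map each port to whether it accepted a TCP connection."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

For example, `check_tcp_ports('<master_node_ip>')` run from a worker node would hint at whether a firewall is blocking swarm traffic. A closed port 2377 on a non-manager node is expected and does not by itself indicate a problem.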
4 changes: 4 additions & 0 deletions docs/src/python/rafiki.constants.rst
@@ -14,3 +14,7 @@ rafiki.constants
.. autoclass:: rafiki.constants.BudgetType

.. autoclass:: rafiki.constants.UserType

.. autoclass:: rafiki.constants.ModelDependency

.. autoclass:: rafiki.constants.ModelAccessRight
8 changes: 4 additions & 4 deletions docs/src/user/client-create-inference-job.include.rst
@@ -1,4 +1,4 @@
To create an model deployment job, you need to submit the app name associated with a *completed* train job.
To create an model deployment job, you need to submit the app name associated with a *stopped* train job.
The inference job would be created from the best trials from the train job.

Example:
@@ -13,8 +13,8 @@ Example:
{'app': 'fashion_mnist_app',
'app_version': 1,
'id': '74b8f43a-c4f8-4ebc-a643-18a879dbbd1d',
'predictor_host': '127.0.0.1:30000',
'train_job_id': '3f3b3bdd-43ac-4354-99a5-d4d86006b68a'}
'id': '0477d03c-d312-48c5-8612-f9b37b368949',
'predictor_host': '127.0.0.1:30001',
'train_job_id': 'ec4db479-b9b2-4289-8086-52794ffc71c8'}
.. seealso:: :meth:`rafiki.client.Client.create_inference_job`
6 changes: 3 additions & 3 deletions docs/src/user/client-create-train-job.include.rst
@@ -21,9 +21,9 @@ Example:

.. code-block:: python
{'app': 'fashion_mnist_app',
'app_version': 1,
'id': '3f3b3bdd-43ac-4354-99a5-d4d86006b68a'}
{'app': 'fashion_mnist_app',
'app_version': 1,
'id': 'ec4db479-b9b2-4289-8086-52794ffc71c8'}
.. note::

8 changes: 4 additions & 4 deletions docs/src/user/client-list-inference-jobs.include.rst
@@ -8,13 +8,13 @@ Example:

.. code-block:: python
[{'app': 'fashion_mnist_app',
{'app': 'fashion_mnist_app',
'app_version': 1,
'datetime_started': 'Sun, 18 Nov 2018 10:04:13 GMT',
'datetime_started': 'Mon, 17 Dec 2018 07:15:12 GMT',
'datetime_stopped': None,
'id': '74b8f43a-c4f8-4ebc-a643-18a879dbbd1d',
'id': '0477d03c-d312-48c5-8612-f9b37b368949',
'predictor_host': '127.0.0.1:30000',
'status': 'RUNNING',
'train_job_id': '3f3b3bdd-43ac-4354-99a5-d4d86006b68a'}]
'train_job_id': 'ec4db479-b9b2-4289-8086-52794ffc71c8'}
.. seealso:: :meth:`rafiki.client.Client.get_inference_jobs_of_app`
14 changes: 8 additions & 6 deletions docs/src/user/client-list-models.include.rst
@@ -8,20 +8,22 @@ Example:

.. code-block:: python
[{'datetime_created': 'Sun, 18 Nov 2018 09:56:03 GMT',
[{'access_right': 'PRIVATE',
'datetime_created': 'Mon, 17 Dec 2018 07:06:03 GMT',
'dependencies': {'tensorflow': '1.12.0'},
'docker_image': 'rafikiai/rafiki_worker:0.0.7',
'docker_image': 'rafikiai/rafiki_worker:0.0.9',
'model_class': 'TfFeedForward',
'name': 'TfFeedForward',
'task': 'IMAGE_CLASSIFICATION',
'user_id': '9fdefa23-c838-4c56-8eb5-f625ff4245ab'},
{'datetime_created': 'Sun, 18 Nov 2018 09:56:04 GMT',
'user_id': 'fb5671f1-c673-40e7-b53a-9208eb1ccc50'},
{'access_right': 'PRIVATE',
'datetime_created': 'Mon, 17 Dec 2018 07:06:03 GMT',
'dependencies': {'scikit-learn': '0.20.0'},
'docker_image': 'rafikiai/rafiki_worker:0.0.7',
'docker_image': 'rafikiai/rafiki_worker:0.0.9',
'model_class': 'SkDt',
'name': 'SkDt',
'task': 'IMAGE_CLASSIFICATION',
'user_id': '9fdefa23-c838-4c56-8eb5-f625ff4245ab'}]
'user_id': 'fb5671f1-c673-40e7-b53a-9208eb1ccc50'}]
.. seealso:: :meth:`rafiki.client.Client.get_models_of_task`

6 changes: 3 additions & 3 deletions docs/src/user/client-list-train-jobs.include.rst
@@ -11,9 +11,9 @@ Example:
[{'app': 'fashion_mnist_app',
'app_version': 1,
'budget': {'MODEL_TRIAL_COUNT': 2},
'datetime_completed': None,
'datetime_started': 'Sun, 18 Nov 2018 09:56:36 GMT',
'id': '3f3b3bdd-43ac-4354-99a5-d4d86006b68a',
'datetime_started': 'Mon, 17 Dec 2018 07:08:05 GMT',
'datetime_stopped': None,
'id': 'ec4db479-b9b2-4289-8086-52794ffc71c8',
'status': 'RUNNING',
'task': 'IMAGE_CLASSIFICATION',
'test_dataset_uri': 'https://github.com/nginyc/rafiki-datasets/blob/master/fashion_mnist/fashion_mnist_for_image_classification_test.zip?raw=true',
4 changes: 3 additions & 1 deletion docs/src/user/datasets.rst
@@ -7,7 +7,9 @@ Dataset URIs must have the protocols of either ``http`` or ``https``.

.. note::

You can alternatively use relative or absolute filepaths as dataset URIs, only if you have deployed the full Rafiki stack on your own machine.
You can alternatively use relative (e.g. ``data/dataset.zip``) filepaths as dataset URIs,
only if you have deployed the full Rafiki stack on your own machine. This filepath is relative to
the root of the project directory.

.. note::

11 changes: 9 additions & 2 deletions docs/src/user/quickstart-admins.rst
@@ -1,8 +1,15 @@
Quick Start (Rafiki Admins)
Quick Start (Admins)
====================================================================

.. contents:: Table of Contents

As an Admin, you can manage users, manage models, and manage train & inference jobs on Rafiki.

This quickstart only highlights the key methods available to manage users.
To learn about how to manage models, go to :ref:`quickstart-model-developers`,
To learn about how to manage train & inference jobs, go to :ref:`quickstart-app-developers`,
To learn more about what you can do on Rafiki, explore the methods of :class:`rafiki.client.Client`.

We assume that you have access to a running instance of *Rafiki Admin* at ``<rafiki_host>:<admin_port>``
and *Rafiki Admin Web* at ``<rafiki_host>:<admin_web_port>``.

@@ -12,7 +19,7 @@ Installation
.. include:: ./client-installation.include.rst


Initializing the Client
Initializing the client
--------------------------------------------------------------------

Example: