This repository has been archived by the owner on Feb 20, 2024. It is now read-only.

Exclusively allocate GPU to each trial #119

Merged
merged 41 commits into from
Jun 14, 2019
Commits
10c722d
Validate access right in model creation
nginyc Jun 7, 2019
5e47f6c
Allow duplicate model names across users
nginyc Jun 7, 2019
e0bdd53
Modify get models to show all available models to user; disallow mode…
nginyc Jun 7, 2019
5994fe1
Add deleting of models; Fix new model management methods
nginyc Jun 7, 2019
485f79d
Pass model IDs in creation of train jobs; manage models by IDs
nginyc Jun 7, 2019
f158ded
Throw error upon deleting model with referencing train job
nginyc Jun 7, 2019
33fda51
Merge remote-tracking branch 'origin/v0.1.0' into delete-models
nginyc Jun 7, 2019
c4d99d9
Fix example scripts
nginyc Jun 7, 2019
93dacf9
Update docs on model API
nginyc Jun 7, 2019
bb578fb
Change to `GPU_COUNT`
nginyc Jun 8, 2019
49b5848
Have client read from environment vars
nginyc Jun 8, 2019
cc198a5
resolve coflict and merge into v0.1.0
Jun 11, 2019
308b8bf
Allow `.env.sh` config of `APP_SECRET` and `SUPERADMIN_PASSWORD`; reo…
nginyc Jun 11, 2019
7647f06
Merge branch 'exclusive-gpu' of https://github.com/nginyc/rafiki into…
nginyc Jun 11, 2019
530b45c
Fix image build; reorganize `requirements.txt`s
nginyc Jun 11, 2019
582867b
Use lighter Node image
nginyc Jun 12, 2019
024d447
Allocate exclusively GPUs for training
nginyc Jun 12, 2019
b434313
Fix LSTM example model
nginyc Jun 12, 2019
376b40f
Remove extra requirements
nginyc Jun 12, 2019
abe60ee
Update docs on GPU usage in models
nginyc Jun 12, 2019
4e8f80f
Remove table of contents
nginyc Jun 12, 2019
5b9f76f
Ensure docs are fully re-built
nginyc Jun 12, 2019
ee67598
Merge commit '76dc3dd9fd3d0b2fdb614562f283371c097143b8' into delete-m…
nginyc Jun 12, 2019
990cc1c
Merge commit '76dc3dd9fd3d0b2fdb614562f283371c097143b8' into exclusiv…
nginyc Jun 12, 2019
befdfef
Add script to clean database dump & folders
nginyc Jun 12, 2019
3a6a04b
Warn user about updated client API
nginyc Jun 12, 2019
2a7791c
Add underscore prefix for private methods
nginyc Jun 12, 2019
dbe2a8c
Inform user of model's install command; don't install dependencies in…
nginyc Jun 12, 2019
c7c06ae
Add docs on installing Python correctly
nginyc Jun 12, 2019
9ad9cd3
Fix bug of invalid authorization header in train worker
nginyc Jun 12, 2019
4aea969
Fix bug where train job's status was 'STARTED' when deployment fails
nginyc Jun 12, 2019
26f0a6b
Increase model trial count to 5 for examples
nginyc Jun 12, 2019
7bed8d5
Allow config of model trials in quickstart
nginyc Jun 12, 2019
28b4019
Correct .gitignore to not ignore `docs/src`
nginyc Jun 13, 2019
5c26c76
FIx bug where train jobs & trials can have incorrect statuses upon st…
nginyc Jun 13, 2019
6644857
Confirm purging of metadata & data folders
nginyc Jun 13, 2019
54cc2fe
Add section on testing code changes to docs
nginyc Jun 13, 2019
8c18214
Merge branch 'delete-models' into exclusive-gpu
nginyc Jun 13, 2019
9ebfc6c
Warn users about using removed `ENABLE_GPU`
nginyc Jun 13, 2019
6d2026d
Add integration test framework & tests for client
nginyc Jun 13, 2019
3fe4706
Disallow cross-user access of train & inference jobs; allow multiple …
nginyc Jun 14, 2019
23 changes: 14 additions & 9 deletions .env.sh
@@ -1,3 +1,8 @@
# Core secrets for Rafiki - change these in production!
export POSTGRES_PASSWORD=rafiki
export SUPERADMIN_PASSWORD=rafiki
export APP_SECRET=rafiki

# Core external configuration for Rafiki
export DOCKER_NETWORK=rafiki
export DOCKER_SWARM_ADVERTISE_ADDR=127.0.0.1
@@ -8,16 +13,15 @@ export ADMIN_WEB_EXT_PORT=3001
export ADVISOR_EXT_PORT=3002
export POSTGRES_EXT_PORT=5433
export REDIS_EXT_PORT=6380
export DATA_WORKDIR_PATH=$PWD/data # Folder shared with containers that contains datasets
export PARAMS_WORKDIR_PATH=$PWD/params # Folder shared with containers that contains model parameters
export LOGS_WORKDIR_PATH=$PWD/logs # Folder shared with containers that stores components' logs
export HOST_WORKDIR_PATH=$PWD
export APP_MODE=DEV # DEV or PROD
export POSTGRES_DUMP_FILE_PATH=$PWD/db_dump.sql # PostgreSQL database dump file
export DOCKER_NODE_LABEL_AVAILABLE_GPUS=available_gpus # Docker node label for the GPUs currently available on the node
export DOCKER_NODE_LABEL_NUM_SERVICES=num_services # Docker node label for no. of services currently running on the node

# Internal credentials for Rafiki's components
export POSTGRES_USER=rafiki
export POSTGRES_DB=rafiki
export POSTGRES_PASSWORD=rafiki

# Internal hosts & ports and configuration for Rafiki's components
export POSTGRES_HOST=rafiki_db
@@ -30,10 +34,10 @@ export REDIS_HOST=rafiki_cache
export REDIS_PORT=6379
export PREDICTOR_PORT=3003
export ADMIN_WEB_HOST=rafiki_admin_web
export DATA_DOCKER_WORKDIR_PATH=/root/rafiki/data
export LOGS_DOCKER_WORKDIR_PATH=/root/rafiki/logs
export PARAMS_DOCKER_WORKDIR_PATH=/root/rafiki/params
export DOCKER_WORKDIR_PATH=/root/rafiki
export DOCKER_WORKDIR_PATH=/root
export DATA_DIR_PATH=data # Shares a data folder with containers, relative to workdir
export LOGS_DIR_PATH=logs # Shares a folder with containers that stores components' logs, relative to workdir
export PARAMS_DIR_PATH=params # Shares a folder with containers that stores model parameters, relative to workdir
export CONDA_ENVIORNMENT=rafiki

# Docker images for Rafiki's custom components
@@ -48,4 +52,5 @@ export IMAGE_POSTGRES=postgres:10.5-alpine
export IMAGE_REDIS=redis:5.0.3-alpine3.8

# Utility configuration
export PYTHONPATH=$PWD # Ensures that `rafiki` module can be imported at project root
export PYTHONPATH=$PWD # Ensures that `rafiki` module can be imported at project root
export PYTHONUNBUFFERED=1 # Ensures logs from Python appear instantly
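The new "Core secrets" block asks for these values to be changed in production. A minimal sketch of a pre-deployment check is shown here, assuming the three variables above are exported into the environment; the helper `find_default_secrets` is hypothetical and not part of Rafiki.

```python
import os

# Default values shipped in `.env.sh` (all three default to "rafiki").
DEFAULT_SECRETS = {
    "POSTGRES_PASSWORD": "rafiki",
    "SUPERADMIN_PASSWORD": "rafiki",
    "APP_SECRET": "rafiki",
}


def find_default_secrets(env):
    """Return the names of secrets that are unset or still at their default value."""
    return sorted(
        name
        for name, default in DEFAULT_SECRETS.items()
        if env.get(name, default) == default
    )


if __name__ == "__main__":
    unchanged = find_default_secrets(os.environ)
    if unchanged:
        print("Change these before deploying to production:", ", ".join(unchanged))
```

Treating an unset variable as "still default" is a deliberate choice in this sketch: it errs on the side of warning.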
2 changes: 1 addition & 1 deletion .gitignore
@@ -15,7 +15,7 @@ __pycache__/

# Sphinx documentation
.doctrees/
docs/**/
docs/*
!docs/src

# Datasets
22 changes: 11 additions & 11 deletions dockerfiles/admin.Dockerfile
@@ -15,21 +15,21 @@ RUN pip install --upgrade pip
ENV PYTHONUNBUFFERED 1

ARG DOCKER_WORKDIR_PATH
RUN mkdir $DOCKER_WORKDIR_PATH
RUN mkdir -p $DOCKER_WORKDIR_PATH
WORKDIR $DOCKER_WORKDIR_PATH
ENV PYTHONPATH $DOCKER_WORKDIR_PATH

# Install python dependencies
COPY rafiki/utils/requirements.txt utils/requirements.txt
RUN pip install -r utils/requirements.txt
COPY rafiki/db/requirements.txt db/requirements.txt
RUN pip install -r db/requirements.txt
COPY rafiki/model/requirements.txt model/requirements.txt
RUN pip install -r model/requirements.txt
COPY rafiki/container/requirements.txt container/requirements.txt
RUN pip install -r container/requirements.txt
COPY rafiki/admin/requirements.txt admin/requirements.txt
RUN pip install -r admin/requirements.txt
COPY rafiki/requirements.txt rafiki/requirements.txt
RUN pip install -r rafiki/requirements.txt
COPY rafiki/utils/requirements.txt rafiki/utils/requirements.txt
RUN pip install -r rafiki/utils/requirements.txt
COPY rafiki/db/requirements.txt rafiki/db/requirements.txt
RUN pip install -r rafiki/db/requirements.txt
COPY rafiki/container/requirements.txt rafiki/container/requirements.txt
RUN pip install -r rafiki/container/requirements.txt
COPY rafiki/admin/requirements.txt rafiki/admin/requirements.txt
RUN pip install -r rafiki/admin/requirements.txt

COPY rafiki/ rafiki/
COPY scripts/ scripts/
4 changes: 2 additions & 2 deletions dockerfiles/admin_web.Dockerfile
@@ -1,8 +1,8 @@
FROM node:11.1
FROM node:11.1-alpine

ARG DOCKER_WORKDIR_PATH

RUN mkdir $DOCKER_WORKDIR_PATH
RUN mkdir -p $DOCKER_WORKDIR_PATH
WORKDIR $DOCKER_WORKDIR_PATH

COPY web/package.json web/package.json
14 changes: 7 additions & 7 deletions dockerfiles/advisor.Dockerfile
@@ -15,17 +15,17 @@ RUN pip install --upgrade pip
ENV PYTHONUNBUFFERED 1

ARG DOCKER_WORKDIR_PATH
RUN mkdir $DOCKER_WORKDIR_PATH
RUN mkdir -p $DOCKER_WORKDIR_PATH
WORKDIR $DOCKER_WORKDIR_PATH
ENV PYTHONPATH $DOCKER_WORKDIR_PATH

# Install python dependencies
COPY rafiki/utils/requirements.txt utils/requirements.txt
RUN pip install -r utils/requirements.txt
COPY rafiki/model/requirements.txt model/requirements.txt
RUN pip install -r model/requirements.txt
COPY rafiki/advisor/requirements.txt advisor/requirements.txt
RUN pip install -r advisor/requirements.txt
COPY rafiki/requirements.txt rafiki/requirements.txt
RUN pip install -r rafiki/requirements.txt
COPY rafiki/utils/requirements.txt rafiki/utils/requirements.txt
RUN pip install -r rafiki/utils/requirements.txt
COPY rafiki/advisor/requirements.txt rafiki/advisor/requirements.txt
RUN pip install -r rafiki/advisor/requirements.txt

COPY rafiki/ rafiki/
COPY scripts/ scripts/
20 changes: 11 additions & 9 deletions dockerfiles/predictor.Dockerfile
@@ -15,19 +15,21 @@ RUN pip install --upgrade pip
ENV PYTHONUNBUFFERED 1

ARG DOCKER_WORKDIR_PATH
RUN mkdir $DOCKER_WORKDIR_PATH
RUN mkdir -p $DOCKER_WORKDIR_PATH
WORKDIR $DOCKER_WORKDIR_PATH
ENV PYTHONPATH $DOCKER_WORKDIR_PATH

# Install python dependencies
COPY rafiki/utils/requirements.txt utils/requirements.txt
RUN pip install -r utils/requirements.txt
COPY rafiki/db/requirements.txt db/requirements.txt
RUN pip install -r db/requirements.txt
COPY rafiki/cache/requirements.txt cache/requirements.txt
RUN pip install -r cache/requirements.txt
COPY rafiki/predictor/requirements.txt predictor/requirements.txt
RUN pip install -r predictor/requirements.txt
COPY rafiki/requirements.txt rafiki/requirements.txt
RUN pip install -r rafiki/requirements.txt
COPY rafiki/utils/requirements.txt rafiki/utils/requirements.txt
RUN pip install -r rafiki/utils/requirements.txt
COPY rafiki/db/requirements.txt rafiki/db/requirements.txt
RUN pip install -r rafiki/db/requirements.txt
COPY rafiki/cache/requirements.txt rafiki/cache/requirements.txt
RUN pip install -r rafiki/cache/requirements.txt
COPY rafiki/predictor/requirements.txt rafiki/predictor/requirements.txt
RUN pip install -r rafiki/predictor/requirements.txt

COPY rafiki/ rafiki/
COPY scripts/ scripts/
24 changes: 11 additions & 13 deletions dockerfiles/worker.Dockerfile
@@ -42,23 +42,21 @@ RUN pip install --upgrade pip
ENV PYTHONUNBUFFERED 1

ARG DOCKER_WORKDIR_PATH
RUN mkdir $DOCKER_WORKDIR_PATH
RUN mkdir -p $DOCKER_WORKDIR_PATH
WORKDIR $DOCKER_WORKDIR_PATH
ENV PYTHONPATH $DOCKER_WORKDIR_PATH

# Install python dependencies
COPY rafiki/utils/requirements.txt utils/requirements.txt
RUN pip install -r utils/requirements.txt
COPY rafiki/db/requirements.txt db/requirements.txt
RUN pip install -r db/requirements.txt
COPY rafiki/cache/requirements.txt cache/requirements.txt
RUN pip install -r cache/requirements.txt
COPY rafiki/model/requirements.txt model/requirements.txt
RUN pip install -r model/requirements.txt
COPY rafiki/client/requirements.txt client/requirements.txt
RUN pip install -r client/requirements.txt
COPY rafiki/worker/requirements.txt worker/requirements.txt
RUN pip install -r worker/requirements.txt
COPY rafiki/requirements.txt rafiki/requirements.txt
RUN pip install -r rafiki/requirements.txt
COPY rafiki/utils/requirements.txt rafiki/utils/requirements.txt
RUN pip install -r rafiki/utils/requirements.txt
COPY rafiki/db/requirements.txt rafiki/db/requirements.txt
RUN pip install -r rafiki/db/requirements.txt
COPY rafiki/cache/requirements.txt rafiki/cache/requirements.txt
RUN pip install -r rafiki/cache/requirements.txt
COPY rafiki/worker/requirements.txt rafiki/worker/requirements.txt
RUN pip install -r rafiki/worker/requirements.txt

COPY rafiki/ rafiki/
COPY scripts/ scripts/
17 changes: 7 additions & 10 deletions docs/src/dev/development.rst
@@ -3,8 +3,6 @@
Development
====================================================================

.. contents:: Table of Contents

Before running any individual scripts, make sure to run the shell configuration script:

.. code-block:: shell
@@ -13,24 +11,23 @@ Before running any individual scripts, make sure to run the shell configuration

Refer to :ref:`architecture` and :ref:`folder-structure` for a developer's overview of Rafiki.

Building Images Locally
Testing Latest Code Changes
--------------------------------------------------------------------

The quickstart instructions pull pre-built `Rafiki's images <https://hub.docker.com/r/rafikiai/>`_ from Docker Hub. To build Rafiki's images locally (e.g. to reflect latest code changes):
To test the latest code changes, e.g. in the ``dev`` branch, you'll need to do the following:

1. Build Rafiki's images on each participating node (the quickstart instructions pull pre-built `Rafiki's images <https://hub.docker.com/r/rafikiai/>`_ from Docker Hub):

.. code-block:: shell

bash scripts/build_images.sh

.. note::
2. Purge all of Rafiki's data (since there might be database schema changes):

If you're testing latest code changes on multiple nodes, you'll need to build Rafiki's images on those nodes as well.
.. code-block:: shell

Starting Parts of the Stack
--------------------------------------------------------------------
bash scripts/clean.sh

The quickstart instructions set up a single node Docker Swarm on your machine. Separate shell scripts in the `./scripts/` folder configure and start parts of Rafiki's stack. Refer to the commands in
`./scripts/start.sh`.

Connecting to Rafiki's DB
--------------------------------------------------------------------
9 changes: 2 additions & 7 deletions docs/src/dev/setup.rst
@@ -3,9 +3,6 @@
Setup & Configuration
====================================================================

.. contents:: Table of Contents


.. _`quick-setup`:

Quick Setup
@@ -20,7 +17,7 @@ We assume development or deployment in a MacOS or Linux environment.

If you're not a user in the ``docker`` group, you'll instead need ``sudo`` access and prefix every bash command with ``sudo -E``.

2. Install Python 3.6 (`Ubuntu <http://ubuntuhandbook.org/index.php/2017/07/install-python-3-6-1-in-ubuntu-16-04-lts/>`__, `MacOS <https://www.python.org/downloads/mac-osx/>`__)
2. Install Python 3.6 such that the ``python`` and ``pip`` commands point to the correct installation of Python 3.6 (see :ref:`installing-python`).

3. Clone the project at https://github.com/nginyc/rafiki (e.g. with `Git <https://git-scm.com/downloads>`__)

@@ -103,9 +100,7 @@ Re-deploy Rafiki. Rafiki Admin and Rafiki Admin Web will be available at that IP
over ports 3000 and 3001 (by default), assuming incoming connections to these ports are allowed.

**Before you expose Rafiki to the public,
it is highly recommended to change the master passwords for superadmin (located in `./rafiki/config.py` as `SUPERADMIN_PASSWORD`)
and the database (located in `.env.sh` as `POSTGRES_PASSWORD`)**

it is highly recommended to change the master passwords for superadmin, server and the database (located in `.env.sh` as `SUPERADMIN_PASSWORD`, `APP_SECRET` & `POSTGRES_PASSWORD`, respectively)**

Reading Rafiki's logs
--------------------------------------------------------------------
4 changes: 2 additions & 2 deletions docs/src/user/client-create-train-job.include.rst
@@ -3,7 +3,7 @@ You'll need to prepare your dataset in a format specified by the target task, an

After creating a train job, you can monitor it on Rafiki Admin Web (see :ref:`using-admin-web`).

Refer to the parameters of :meth:`rafiki.client.Client.create_train_job()` for configuring how your train job runs on Rafiki, such as enabling GPU usage.
Refer to the parameters of :meth:`rafiki.client.Client.create_train_job()` for configuring how your train job runs on Rafiki, such as enabling GPU usage & specifying which models to use.

Example:

@@ -14,7 +14,7 @@ Example:
task='IMAGE_CLASSIFICATION',
train_dataset_uri='https://github.com/nginyc/rafiki-datasets/blob/master/fashion_mnist/fashion_mnist_for_image_classification_train.zip?raw=true',
test_dataset_uri='https://github.com/nginyc/rafiki-datasets/blob/master/fashion_mnist/fashion_mnist_for_image_classification_test.zip?raw=true',
budget={ 'MODEL_TRIAL_COUNT': 2 }
budget={ 'MODEL_TRIAL_COUNT': 5 }
)

Output:
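The ``budget`` argument in the example above is a plain dict. The sketch below builds one with basic validation. The helper `make_budget` is hypothetical and not part of Rafiki, and the `GPU_COUNT` key is an assumption inferred from the commit title "Change to `GPU_COUNT`"; this diff does not show it being used.

```python
def make_budget(model_trial_count=5, gpu_count=0):
    """Build a budget dict for a train job (hypothetical helper, not part of Rafiki)."""
    if model_trial_count < 1:
        raise ValueError("model_trial_count must be at least 1")
    if gpu_count < 0:
        raise ValueError("gpu_count must not be negative")

    budget = {"MODEL_TRIAL_COUNT": model_trial_count}
    # Assumption: `GPU_COUNT` is accepted as a budget key.
    # It is only included when GPUs are actually requested.
    if gpu_count > 0:
        budget["GPU_COUNT"] = gpu_count
    return budget
```

The default of 5 trials mirrors the updated example; check the parameters of ``create_train_job()`` for the keys the deployed version actually accepts.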
4 changes: 2 additions & 2 deletions docs/src/user/client-installation.include.rst
@@ -1,10 +1,10 @@
1. Install Python 3.6 (`Ubuntu <http://ubuntuhandbook.org/index.php/2017/07/install-python-3-6-1-in-ubuntu-16-04-lts/>`__, `MacOS <https://www.python.org/downloads/mac-osx/>`__)
1. Install Python 3.6 such that the ``python`` and ``pip`` commands point to the correct installation of Python 3.6 (see :ref:`installing-python`)

2. Clone the project at https://github.com/nginyc/rafiki (e.g. with `Git <https://git-scm.com/downloads>`__)

3. Within the project's root folder, install Rafiki Client's Python dependencies by running:

::

pip3.6 install -r ./rafiki/client/requirements.txt
pip install -r ./rafiki/requirements.txt

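Since the instructions now rely on the bare ``python`` and ``pip`` commands, it can help to confirm which interpreter ``python`` resolves to. This is a minimal sketch, not part of Rafiki; the helper name `is_required_python` is hypothetical.

```python
import sys

REQUIRED = (3, 6)  # Python version the installation steps ask for


def is_required_python(version_info=sys.version_info):
    """Return True if the given interpreter version is Python 3.6.x."""
    return tuple(version_info[:2]) == REQUIRED


if __name__ == "__main__":
    verdict = "matches" if is_required_python() else "does NOT match"
    print(sys.executable, verdict, "Python 3.6")
```

Running the file with ``python`` prints the path of the interpreter in use, which makes a mismatched installation easy to spot.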
12 changes: 4 additions & 8 deletions docs/src/user/client-list-models.include.rst
@@ -2,7 +2,7 @@ Example:

.. code-block:: python

client.get_models_of_task(task='IMAGE_CLASSIFICATION')
client.get_available_models(task='IMAGE_CLASSIFICATION')

Output:

@@ -11,20 +11,16 @@ Example:
[{'access_right': 'PRIVATE',
'datetime_created': 'Mon, 17 Dec 2018 07:06:03 GMT',
'dependencies': {'tensorflow': '1.12.0'},
'docker_image': 'rafikiai/rafiki_worker:0.0.9',
'model_class': 'TfFeedForward',
'id': '45df3f34-53d7-4fb8-a7c2-55391ea10030',
'name': 'TfFeedForward',
'task': 'IMAGE_CLASSIFICATION',
'user_id': 'fb5671f1-c673-40e7-b53a-9208eb1ccc50'},
{'access_right': 'PRIVATE',
'datetime_created': 'Mon, 17 Dec 2018 07:06:03 GMT',
'dependencies': {'scikit-learn': '0.20.0'},
'docker_image': 'rafikiai/rafiki_worker:0.0.9',
'model_class': 'SkDt',
'id': 'd0ea96ce-478b-4167-8a84-eb36ae631235',
'name': 'SkDt',
'task': 'IMAGE_CLASSIFICATION',
'user_id': 'fb5671f1-c673-40e7-b53a-9208eb1ccc50'}]

.. seealso:: :meth:`rafiki.client.Client.get_models_of_task`


.. seealso:: :meth:`rafiki.client.Client.get_available_models`
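The updated output identifies each model by ``id``, and the commit titles indicate that train jobs now reference models by ID. The sketch below shows one way to look up IDs by name from a list shaped like the output above. The helper `model_ids_by_name` is hypothetical, and the sample data is trimmed to the two fields it uses.

```python
def model_ids_by_name(models, names):
    """Map each requested model name to its ID, given dicts with 'id' and 'name' keys."""
    ids_by_name = {model["name"]: model["id"] for model in models}
    missing = [name for name in names if name not in ids_by_name]
    if missing:
        raise KeyError("No available model named: " + ", ".join(missing))
    return [ids_by_name[name] for name in names]


# Sample data copied from the example output (trimmed to the fields used here).
models = [
    {"id": "45df3f34-53d7-4fb8-a7c2-55391ea10030", "name": "TfFeedForward"},
    {"id": "d0ea96ce-478b-4167-8a84-eb36ae631235", "name": "SkDt"},
]
print(model_ids_by_name(models, ["SkDt"]))
```

Model names may now be duplicated across users, so a lookup by name alone can be ambiguous; in that case this sketch keeps the last match, and filtering on ``user_id`` first would be safer.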
2 changes: 1 addition & 1 deletion docs/src/user/client-list-train-jobs.include.rst
@@ -10,7 +10,7 @@ Example:

[{'app': 'fashion_mnist_app',
'app_version': 1,
'budget': {'MODEL_TRIAL_COUNT': 2},
'budget': {'MODEL_TRIAL_COUNT': 5},
'datetime_started': 'Mon, 17 Dec 2018 07:08:05 GMT',
'datetime_stopped': None,
'id': 'ec4db479-b9b2-4289-8086-52794ffc71c8',