Exclusively allocate GPU to each trial #119

nginyc · 2019-06-12T20:58:12Z

Fixes #106

Rafiki's super administrator specifies the GPU numbers available at each node when running bash scripts/setup_node.sh.
App/model developers can specify the number of GPUs to allocate to a train job with the new GPU_COUNT budget option (the ENABLE_GPU option has been removed).
In a train job, the number of workers deployed for each model depends on the number of GPUs allocated to the job and the number of models:
- E.g. if GPU_COUNT=4 and 2 models are used, 2 GPU-enabled workers are deployed for each model
- E.g. if GPU_COUNT=3 and 2 models are used, model 1 gets 2 GPU workers and model 2 gets 1 GPU worker
- E.g. if GPU_COUNT=1 and 2 models are used, model 1 gets 1 GPU worker and model 2 gets 1 CPU worker
In each trial, a model is exclusively allocated a single GPU if GPU is enabled on that worker.
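
The three examples above are consistent with a round-robin split of the GPU budget across models, with a single CPU worker as fallback for any model that receives no GPU. Below is a minimal sketch of that rule. The function name `plan_workers`, the dictionary keys, and the behaviour for GPU_COUNT=0 are assumptions made for illustration, not Rafiki's actual implementation.

```python
from typing import Dict, List


def plan_workers(gpu_count: int, model_names: List[str]) -> Dict[str, Dict[str, int]]:
    """Split a train job's GPU budget across models, in model order.

    Hypothetical helper for illustration; assumes model names are unique.
    """
    if gpu_count < 0:
        raise ValueError("gpu_count must be non-negative")

    plan = {name: {"gpu_workers": 0, "cpu_workers": 0} for name in model_names}
    if not model_names:
        return plan

    # Hand out GPUs one at a time, so earlier models receive any remainder.
    for i in range(gpu_count):
        plan[model_names[i % len(model_names)]]["gpu_workers"] += 1

    # Assumption: a model left without any GPU still trains on one CPU worker.
    for counts in plan.values():
        if counts["gpu_workers"] == 0:
            counts["cpu_workers"] = 1
    return plan


if __name__ == "__main__":
    for budget in (4, 3, 1):
        print(budget, plan_workers(budget, ["model 1", "model 2"]))
```

Each GPU worker in this plan corresponds to one GPU, which matches the statement that a trial is allocated a single GPU exclusively.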

vaanforz · 2019-06-13T08:22:14Z

Hi Yun Chuan,

I tested this branch on NCRA with the usual steps... clone, checkout, source env, build images, start.sh, setup_node.sh etc...

However, at the create_train_job stage, the job doesn't use any GPU despite my setting 'ENABLE_GPU': 1; instead, it continues running in the background using only the CPU. At the setup_node stage, the prompt asked me to designate GPU availability. I tried inputting various strings such as '0,2' or '2', but the GPU remains unutilized.

Additionally, after running scripts/stop.sh, the created models seem to remain saved, whereas previously they were wiped.

nginyc · 2019-06-13T14:15:24Z

Thanks for the feedback! Did you use the new option GPU_COUNT instead of ENABLE_GPU, which has been removed? Refer to the details of the PR.
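
To illustrate why a budget that still uses ENABLE_GPU falls back to CPU, here is a minimal sketch of a budget check. The function `check_budget`, the warning text, and the default of 0 for GPU_COUNT are assumptions for illustration only; they are not taken from Rafiki's code.

```python
import warnings
from typing import Any, Dict


def check_budget(budget: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the budget with the removed ENABLE_GPU option dropped.

    Hypothetical helper: warns the caller instead of silently ignoring the key.
    """
    cleaned = dict(budget)
    if "ENABLE_GPU" in cleaned:
        warnings.warn(
            "ENABLE_GPU has been removed; use GPU_COUNT instead.",
            stacklevel=2,
        )
        cleaned.pop("ENABLE_GPU")

    # Assumption: with no GPU_COUNT given, the job is allocated no GPUs.
    cleaned.setdefault("GPU_COUNT", 0)
    return cleaned


if __name__ == "__main__":
    print(check_budget({"ENABLE_GPU": 1}))  # warns, then falls back to CPU only
    print(check_budget({"GPU_COUNT": 2}))
```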

For your last point, yes, Rafiki now saves metadata across restarts (another PR has been merged into the v0.1.0 branch). If you want to purge all metadata, delete the file db_dump.sql at the root of the project before running scripts/start.sh.
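
As a small sketch of the purge step described above, the following removes db_dump.sql from a given project root if it exists. The function name `purge_metadata` is hypothetical; only the file name and its location at the project root come from this thread.

```python
from pathlib import Path


def purge_metadata(project_root: str) -> bool:
    """Delete db_dump.sql at the project root; return True if a file was removed."""
    dump = Path(project_root) / "db_dump.sql"
    if dump.is_file():
        dump.unlink()
        return True
    return False


if __name__ == "__main__":
    # Run from the root of the project, before scripts/start.sh.
    removed = purge_metadata(".")
    print("Removed db_dump.sql" if removed else "No db_dump.sql found")
```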

vaanforz · 2019-06-14T11:04:33Z

I pulled the latest commit and reran with GPU_COUNT on NCRA; I believe it is working correctly now.

I designated GPUs 1,2 as available during setup and, as expected, only 1 service ID is deployed exclusively on each GPU. Additionally, changing GPU_COUNT to 1 correctly assigns 1 service ID to the first GPU and the other to CPU, despite there being an additional free GPU.
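
The behaviour observed above can be sketched as follows: parse the GPU list given at setup, cap it at GPU_COUNT, and give each service at most one GPU. The helper names are hypothetical, and using the CUDA_VISIBLE_DEVICES environment variable to enforce exclusivity is an assumption about a common technique, not a statement of how Rafiki does it.

```python
from typing import Dict, List, Optional


def parse_available_gpus(spec: str) -> List[str]:
    """Turn a setup string such as '1,2' into a list of GPU IDs."""
    return [part.strip() for part in spec.split(",") if part.strip()]


def assign_gpus(
    service_ids: List[str], available_gpus: List[str], gpu_count: int
) -> Dict[str, Optional[str]]:
    """Give each service at most one GPU; None means the service runs on CPU."""
    usable = available_gpus[: max(gpu_count, 0)]
    return {
        sid: (usable[i] if i < len(usable) else None)
        for i, sid in enumerate(service_ids)
    }


def worker_env(gpu_id: Optional[str]) -> Dict[str, str]:
    """Environment for one worker: expose only its GPU, or none at all."""
    # An empty CUDA_VISIBLE_DEVICES hides every GPU from CUDA applications.
    return {"CUDA_VISIBLE_DEVICES": gpu_id if gpu_id is not None else ""}


if __name__ == "__main__":
    gpus = parse_available_gpus("1,2")
    for count in (2, 1):
        assignment = assign_gpus(["service-a", "service-b"], gpus, count)
        print(count, {sid: worker_env(g) for sid, g in assignment.items()})
```

With GPU_COUNT=1 the second service gets no GPU even though GPU 2 is free, which mirrors the result reported in this comment.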

nginyc · 2019-06-14T15:26:39Z

Great. Thanks for helping to verify that this works! I'll be merging this.

nginyc and others added 30 commits June 7, 2019 08:51

Validate access right in model creation

10c722d

Allow duplicate model names across users

5e47f6c

Modify get models to show all available models to user; disallow model developers from accessing others' models

e0bdd53

Add deleting of models; Fix new model management methods

5994fe1

Pass model IDs in creation of train jobs; manage models by IDs

485f79d

Throw error upon deleting model with referencing train job

f158ded

Merge remote-tracking branch 'origin/v0.1.0' into delete-models

33fda51

Fix example scripts

c4d99d9

Update docs on model API

93dacf9

Change to GPU_COUNT

bb578fb

Have client read from environment vars

49b5848

resolve conflict and merge into v0.1.0

cc198a5

Allow .env.sh config of APP_SECRET and SUPERADMIN_PASSWORD; reorganize environment variable config for folder mounting

308b8bf

Merge branch 'exclusive-gpu' of https://github.com/nginyc/rafiki into exclusive-gpu

7647f06

Fix image build; reorganize requirements.txts

530b45c

Use lighter Node image

582867b

Allocate exclusively GPUs for training

024d447

Fix LSTM example model

b434313

Remove extra requirements

376b40f

Update docs on GPU usage in models

abe60ee

Remove table of contents

4e8f80f

Ensure docs are fully re-built

5b9f76f

Merge commit '76dc3dd9fd3d0b2fdb614562f283371c097143b8' into delete-models

ee67598

Merge commit '76dc3dd9fd3d0b2fdb614562f283371c097143b8' into exclusive-gpu

990cc1c

# Conflicts: # rafiki/worker/train.py

Add script to clean database dump & folders

befdfef

Warn user about updated client API

3a6a04b

Add underscore prefix for private methods

2a7791c

Inform user of model's install command; don't install dependencies in `test_model_class`

dbe2a8c

Add docs on installing Python correctly

c7c06ae

Fix bug of invalid authorization header in train worker

9ad9cd3

nginyc added 5 commits June 13, 2019 05:42

Fix bug where train job's status was 'STARTED' when deployment fails

4aea969

Increase model trial count to 5 for examples

26f0a6b

Allow config of model trials in quickstart

7bed8d5

Correct .gitignore to not ignore docs/src

28b4019

Fix bug where train jobs & trials can have incorrect statuses upon stopping; associate trial to worker ID

5c26c76

nginyc added 6 commits June 13, 2019 09:01

Confirm purging of metadata & data folders

6644857

Add section on testing code changes to docs

54cc2fe

Merge branch 'delete-models' into exclusive-gpu

8c18214

# Conflicts: # docs/src/user/client-installation.include.rst

Warn users about using removed ENABLE_GPU

9ebfc6c

Add integration test framework & tests for client

6d2026d

Disallow cross-user access of train & inference jobs; allow multiple app devs to share same app name

3fe4706

nginyc merged commit 9d3b8d7 into v0.1.0 Jun 14, 2019

nginyc mentioned this pull request Jun 14, 2019

Release v0.1.0 #121

Merged

nginyc deleted the exclusive-gpu branch June 14, 2019 21:10
