This repository has been archived by the owner on Feb 20, 2024. It is now read-only.

Exclusively allocate GPU to each trial #119

Merged
merged 41 commits into from
Jun 14, 2019

nginyc
Owner

@nginyc nginyc commented Jun 12, 2019

Fixes #106

  • Rafiki's super administrator specifies the GPU numbers available at each node by running bash scripts/setup_node.sh
  • App/model developers can specify the number of GPUs to allocate to a train job with the new GPU_COUNT budget option (the ENABLE_GPU option has been removed)
  • In a train job, the number of workers deployed for each model depends on the number of GPUs allocated to the job and the number of models:
    • E.g. if GPU_COUNT=4 and 2 models are used, 2 GPU-enabled workers are deployed for each model
    • E.g. if GPU_COUNT=3 and 2 models are used, model 1 gets 2 GPU workers and model 2 gets 1 GPU worker
    • E.g. if GPU_COUNT=1 and 2 models are used, model 1 gets 1 GPU worker and model 2 gets 1 CPU worker
  • In each trial, a model is exclusively allocated a single GPU if GPU is enabled on that worker
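
The three examples above are consistent with a simple even split, sketched below. This is only an illustration, not the code in this PR: the function name `plan_workers` and the returned dictionary layout are hypothetical, and the behaviour for `GPU_COUNT=0` (one CPU worker per model) is an assumption extrapolated from the `GPU_COUNT=1` example.

```python
from typing import Dict, List


def plan_workers(gpu_count: int, models: List[str]) -> Dict[str, Dict[str, int]]:
    """Hypothetical helper: decide how many workers each model gets.

    GPUs are split as evenly as possible, with earlier models receiving
    any remainder. A model that receives no GPU falls back to a single
    CPU worker (assumed from the GPU_COUNT=1 example in the description).
    """
    if gpu_count < 0:
        raise ValueError("gpu_count must be non-negative")
    if not models:
        return {}

    # base GPUs for every model, plus one extra for the first `extra` models
    base, extra = divmod(gpu_count, len(models))

    plan = {}
    for index, model in enumerate(models):
        gpu_workers = base + (1 if index < extra else 0)
        cpu_workers = 1 if gpu_workers == 0 else 0
        plan[model] = {"gpu_workers": gpu_workers, "cpu_workers": cpu_workers}
    return plan


if __name__ == "__main__":
    for count in (4, 3, 1):
        print(count, plan_workers(count, ["model_1", "model_2"]))
```

Each GPU worker in such a plan would correspond to one GPU, which matches the rule that a model in a trial is exclusively allocated a single GPU.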

nginyc and others added 30 commits June 7, 2019 08:51
…rganize environment variable config for folder mounting
@vaanforz
Collaborator

Hi Yun Chuan,

I tested this branch on NCRA with the usual steps: clone, checkout, source env, build images, start.sh, setup_node.sh, etc.

However, at the create_train_job stage, the job doesn't use any GPU despite setting 'ENABLE_GPU': 1. Instead, it continues running in the background using only the CPU. At the setup_node stage, the prompt asked me to designate GPU availability; I tried inputting various strings such as '0,2' or '2', but the GPU remained unutilized.

Additionally, after running scripts/stop.sh, the created models seem to remain saved, whereas previously they were wiped.

@nginyc
Owner Author

nginyc commented Jun 13, 2019

Thanks for the feedback! Did you use the new option GPU_COUNT instead of ENABLE_GPU, which has been removed? Refer to the details of the PR.

For your last point, yes, Rafiki now saves metadata across restarts (another PR has been merged into the v0.1.0 branch). If you want to purge all metadata, delete the file db_dump.sql at the root of the project before running scripts/start.sh.

@vaanforz
Collaborator

I pulled the latest commit and reran with GPU_COUNT on NCRA, and I believe it is working correctly now.

I designated GPUs 1,2 as available during setup and, as expected, exactly 1 service ID is deployed exclusively on each GPU. Additionally, after changing GPU_COUNT to 1, it correctly assigns 1 service ID to the first GPU and the other to CPU, despite an additional GPU being free.

@nginyc
Owner Author

nginyc commented Jun 14, 2019

Great. Thanks for helping to verify that this works! I'll be merging this.

@nginyc nginyc merged commit 9d3b8d7 into v0.1.0 Jun 14, 2019
@nginyc nginyc mentioned this pull request Jun 14, 2019
@nginyc nginyc deleted the exclusive-gpu branch June 14, 2019 21:10