Exclusively allocate GPU to each trial #119
Conversation
…l developers from accessing others' models
…rganize environment variable config for folder mounting
…e-gpu # Conflicts: # rafiki/worker/train.py
… `test_model_class`
…opping; associate trial to worker ID
Hi Yun Chuan, I tested this branch on NCRA with the usual steps: clone, checkout, source the env, build images, `start.sh`, `setup_node.sh`, etc. However, at the `create_train_job` stage, the job did not use any GPU despite setting `'ENABLE_GPU': 1`; it kept running in the background on CPU only. At the `setup_node` stage, the prompt asked me to designate GPU availability, and I tried inputting various strings such as '0,2' or '2', but the GPUs remained unutilized. Additionally, after running `scripts/stop.sh`, the created models remain saved, whereas previously they were wiped.
Thanks for the feedback! Did you use the new `GPU_COUNT` budget option? The `ENABLE_GPU` option has been removed in this branch, so setting it has no effect. For your last point, yes, Rafiki now saves metadata across restarts (that change came from another PR that has already been merged).
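For reference, submitting a train job with the new budget option would look roughly like the sketch below. This is a minimal illustration only: the host, credentials, app/task names, dataset URIs and the `MODEL_TRIAL_COUNT` key are placeholder assumptions, and the exact client signature may differ; only `GPU_COUNT` replacing `ENABLE_GPU` comes from this PR.

```python
# Minimal sketch of a train job submission using the new GPU_COUNT budget
# option. Hosts, credentials, dataset URIs and MODEL_TRIAL_COUNT are
# placeholder assumptions, not taken from this PR.
from rafiki.client import Client

client = Client(admin_host='localhost', admin_port=3000)
client.login(email='superadmin@rafiki', password='rafiki')

client.create_train_job(
    app='fashion_mnist_app',
    task='IMAGE_CLASSIFICATION',
    train_dataset_uri='data/fashion_mnist_train.zip',
    test_dataset_uri='data/fashion_mnist_test.zip',
    budget={
        'MODEL_TRIAL_COUNT': 5,
        'GPU_COUNT': 1,  # new option; the old ENABLE_GPU flag no longer applies
    },
)
```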
# Conflicts: # docs/src/user/client-installation.include.rst
…app devs to share same app name
I pulled the latest commit and reran with `GPU_COUNT` on NCRA, and I believe it is working correctly now. I designated GPUs 1 and 2 as available during setup and, as expected, exactly one service ID is deployed exclusively on each GPU. Additionally, after changing `GPU_COUNT` to 1, it correctly assigns one service ID to the first GPU and the other to CPU, despite there being an additional free GPU.
Great. Thanks for helping to verify that this works! I'll be merging this.
Fixes #106
- `bash scripts/setup_node.sh` now prompts for the GPUs available on the node
- New `GPU_COUNT` budget option (the `ENABLE_GPU` option is removed). GPUs are divided among the job's models (see the sketch after this list). For example:
  - With `GPU_COUNT=4` and 2 models, 2 GPU-enabled workers are deployed for each model
  - With `GPU_COUNT=3` and 2 models, model 1 gets 2 GPU workers and model 2 gets 1 GPU worker
  - With `GPU_COUNT=1` and 2 models, model 1 gets 1 GPU worker and model 2 gets 1 CPU worker
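The examples above suggest a simple policy: split the available GPUs as evenly as possible across the job's models, and fall back to a CPU worker for any model left without a GPU. A rough sketch of that rule is below; it is not the PR's actual implementation, and the function and field names are hypothetical.

```python
# Illustrative sketch of the allocation rule described above (hypothetical
# names; not the PR's actual code). GPUs are split as evenly as possible
# across models; a model that ends up with no GPU gets one CPU worker.
def plan_workers(model_names, gpu_count):
    plan = {}
    base, extra = divmod(gpu_count, len(model_names))
    for i, name in enumerate(model_names):
        gpus = base + (1 if i < extra else 0)
        if gpus > 0:
            plan[name] = {'gpu_workers': gpus, 'cpu_workers': 0}
        else:
            plan[name] = {'gpu_workers': 0, 'cpu_workers': 1}
    return plan

# GPU_COUNT=4, 2 models -> 2 GPU workers each
# GPU_COUNT=3, 2 models -> model 1: 2 GPU workers, model 2: 1 GPU worker
# GPU_COUNT=1, 2 models -> model 1: 1 GPU worker, model 2: 1 CPU worker
```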