
Multiple designated GPU question #775

Closed
LeoLau94 opened this issue Feb 22, 2019 · 4 comments
Comments

@LeoLau94

LeoLau94 commented Feb 22, 2019

Short summary about the issue/question:
Hello! I share a server with other people, so I want to run 3 trials concurrently on 3 designated free GPUs. According to the tutorial, I only need to set trialConcurrency to 3 and trial->gpuNum to 1 in config.yaml. However, I found this impossible: no matter how I restrict the visible GPU devices, the automatically generated run.sh for each trial always contains an export CUDA_VISIBLE_DEVICES statement whose indices start from 0. I then tried another way: docker run -e NVIDIA_VISIBLE_DEVICES=1,2,3, which starts a container in which only GPUs 1, 2 and 3 are visible and are renumbered to 0, 1, 2. That failed as well, because the export CUDA_VISIBLE_DEVICES statement generated in run.sh still used the host indices starting from 1, which is simply wrong inside the container.
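For concreteness, here is roughly the setup described above. This is a minimal sketch: only trialConcurrency, gpuNum and the NVIDIA_VISIBLE_DEVICES flag come from this issue; the remaining config.yaml fields and the trial command are illustrative placeholders.

    # config.yaml (sketch; only trialConcurrency and gpuNum are taken from the issue)
    trialConcurrency: 3            # run 3 trials at the same time
    trainingServicePlatform: local
    trial:
      command: python3 train.py    # hypothetical trial command
      codeDir: .
      gpuNum: 1                    # each trial should get one GPU

    # container started so that only host GPUs 1, 2, 3 are visible (renumbered 0, 1, 2 inside)
    docker run -e NVIDIA_VISIBLE_DEVICES=1,2,3 ...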
Finally, I checked the source code (nni/src/nni_manager/training_service/local/gpuScheduler.ts). It seems that you use node-nvidia-smi to collect the GPU information, so the scheduler always sees the real host GPU indices rather than the renumbered ones. Indeed, running nvidia-smi -q inside the container mentioned above returns 1, 2, 3 instead of the renumbered 0, 1, 2 shown by plain nvidia-smi. This is quite inflexible, and there does not seem to be any way to designate which GPUs NNI should use.
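To illustrate the mechanism described above (a rough sketch, not the actual NNI code): collecting GPU indices through node-nvidia-smi looks roughly like the TypeScript below. node-nvidia-smi wraps nvidia-smi -q -x, and the minor_number field in that output is the host-level device index, which is why the scheduler keeps seeing 1, 2, 3 even inside the container.

    // Sketch only; gpuScheduler.ts in NNI may differ in structure and field handling.
    const nvidiaSmi = require('node-nvidia-smi');

    function collectGpuIndices(callback: (indices: number[]) => void): void {
      nvidiaSmi((err: Error, data: any) => {
        if (err) {
          callback([]);
          return;
        }
        // `nvidia-smi -q -x` emits one <gpu> element per device
        // (a single object when there is only one GPU).
        const gpus = data.nvidia_smi_log.gpu;
        const gpuList = Array.isArray(gpus) ? gpus : [gpus];
        // minor_number is the host device index; it is not remapped by
        // NVIDIA_VISIBLE_DEVICES, unlike the indices printed by plain nvidia-smi.
        callback(gpuList.map((g: any) => parseInt(g.minor_number, 10)));
      });
    }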
Well, I don't know how to deal with this. What else can I say? "Fxxk you, nVidia"? (Linus Torvalds approved.)
Please respond as soon as possible; this is urgent for me. Thank you for your open-source work and contributions!

nni Environment:

  • nni version: 0.5.1
  • nni mode (local|pai|remote): local
  • OS: Ubuntu 16.04
  • python version: 3.6.5
  • is conda or virtualenv used?: conda
  • is running in docker?: yes

Anything else we need to know:
Nope.

@leelaylay
Contributor

Thank you for your suggestion. It seems that NNI cannot do such a thing yet.
I think it would be more flexible if users could assign a GPU index to every trial.

@LeoLau94
Author

LeoLau94 commented Feb 24, 2019

Thank you for your suggestion. It seems that NNI cannot do such a thing yet.
I think it would be more flexible if users could assign a GPU index to every trial.

Sad 😭...
Actually, I'm not asking to assign a GPU index to each trial. All I want is to control which GPU devices are visible to NNI.
That seems feasible as long as you don't have to rely on the real (host) GPU indices.

@JohnAllen

I have what seems to be a similar problem: I have 3 GPUs on one machine and want to run separate but concurrent trials. Since I can't change the parameters passed to my training script, I cannot vary the GPU ID in any way. I would have to do something like maintain 3 separate configs and 3 experiments, which obviously doesn't really work, as I could end up repeating hyper-parameters in my search.
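For context, the run.sh mentioned throughout this thread is a small per-trial wrapper that NNI generates; a minimal sketch of what it amounts to, assuming a hypothetical trial command (the real script contains more bookkeeping):

    #!/bin/bash
    # The scheduler exports GPU visibility before the trial command runs,
    # so the training script does not need a GPU-id parameter:
    # CUDA-aware frameworks simply see the allowed device as device 0.
    export CUDA_VISIBLE_DEVICES=0
    cd /path/to/trial/code          # placeholder path
    python3 train.py                # hypothetical trial command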

@scarlett2018
Member

@LeoLau94 - we have released support for multiple designated GPUs in 0.7, try it out =).
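For readers landing here later: a sketch of how designating GPUs can be configured, assuming the option landed as the gpuIndices setting documented for NNI's local mode in later releases (the exact field name and format in 0.7 may differ; check the docs for your NNI version):

    # config.yaml (sketch; gpuIndices as in later NNI docs, verify for your version)
    trialConcurrency: 3
    trainingServicePlatform: local
    localConfig:
      gpuIndices: 1,2,3            # schedule trials only on host GPUs 1, 2 and 3
    trial:
      command: python3 train.py    # hypothetical trial command
      codeDir: .
      gpuNum: 1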
