Short summary about the issue/question:
Hello. I share a server with other people, so I want to run 3 trials concurrently on 3 specific free GPUs. However, no matter how I restrict the visible GPU devices, the automatically generated trial script run.sh always emits its export CUDA_VISIBLE_DEVICES statement starting from index 0. I then tried controlling GPU visibility with docker run -e NVIDIA_VISIBLE_DEVICES=1,2,3, but that did not help either: inside Docker, nvidia-smi renumbers the GPUs, while the export CUDA_VISIBLE_DEVICES statement you generate still uses 1, 2, 3, so the generated command is wrong inside the container. I then looked at the source code (nni/src/nni_manager/training_service/local/gpuScheduler.ts); it seems you use node-nvidia-smi to obtain the GPU indices. Running nvidia-smi -q inside the container confirms that this returns the real (host) GPU indices, which is quite inflexible (shrug, sigh), and you don't seem to provide any way to specify which GPUs to use.
I really hope you can reply soon; this is urgent and important for me. Thank you for your open-source work and contribution.
Hello, I want to start 3 trials concurrently on 3 designated free GPUs. According to the tutorial, I just need to set trialConcurrency to 3 and trial->gpuNum to 1 in config.yaml. However, I found this impossible, because the automatically generated run.sh of each trial always contains an export CUDA_VISIBLE_DEVICES statement that starts from index 0, no matter how I set the visible devices. Then I tried another way: using -e NVIDIA_VISIBLE_DEVICES=1,2,3 with docker run to build a new container in which only GPUs 1, 2, 3 are visible and are renumbered to 0, 1, 2. This failed as well, because the automatically generated export CUDA_VISIBLE_DEVICES statement in run.sh starts from 1, which is simply wrong inside the container.
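To make the mismatch concrete, here is a small probe one could run inside such a container (my own illustration, not NNI code); it only assumes the container was started with -e NVIDIA_VISIBLE_DEVICES=1,2,3:

```python
# probe_gpus.py -- compare the container-local GPU ordinals with the host
# indices that end up in CUDA_VISIBLE_DEVICES (illustration only, not NNI code).
import os
import subprocess

print("NVIDIA_VISIBLE_DEVICES:", os.environ.get("NVIDIA_VISIBLE_DEVICES"))  # e.g. 1,2,3 (host indices)
print("CUDA_VISIBLE_DEVICES:  ", os.environ.get("CUDA_VISIBLE_DEVICES"))    # what NNI's run.sh exported

# `nvidia-smi -L` lists the devices as the container sees them: GPU 0, 1, 2.
print(subprocess.check_output(["nvidia-smi", "-L"], universal_newlines=True))

# `nvidia-smi -q` still reports a host-side "Minor Number" for each device,
# which is apparently the index the scheduler picks up via node-nvidia-smi.
report = subprocess.check_output(["nvidia-smi", "-q"], universal_newlines=True)
print("\n".join(line.strip() for line in report.splitlines() if "Minor Number" in line))
```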
Finally, I checked the source code (nni/src/nni_manager/training_service/local/gpuScheduler.ts). It seems you use node-nvidia-smi to get the GPU information, so you end up with the real (host) device numbers rather than the renumbered ones. Indeed, running nvidia-smi -q in the container mentioned above gives me 1, 2, 3 instead of the renumbered 0, 1, 2 that plain nvidia-smi shows. I don't know how to deal with this. What else can I say? Fxxk you, nVidia? (Linus Torvalds approved)
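Until the scheduler understands container-local numbering, the only idea left is remapping the index inside the trial itself. A minimal sketch of that idea (a hypothetical workaround, not an NNI feature; it assumes the container was started with -e NVIDIA_VISIBLE_DEVICES=1,2,3 and must run before any CUDA framework is imported):

```python
# remap_gpu.py -- hypothetical trial-side workaround (not an NNI feature):
# translate the host GPU index that run.sh puts into CUDA_VISIBLE_DEVICES
# into the container-local ordinal. Import this before torch/tensorflow.
import os

host_ids = os.environ.get("NVIDIA_VISIBLE_DEVICES", "").split(",")  # e.g. ["1", "2", "3"]
assigned = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")    # e.g. ["2"], a host index from NNI

# Host index -> container ordinal: "1" -> "0", "2" -> "1", "3" -> "2".
host_to_local = {h: str(i) for i, h in enumerate(host_ids)}
remapped = [host_to_local[g] for g in assigned if g in host_to_local]

if remapped:
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(remapped)
```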
Please respond as soon as possible. Thank you for your great contribution!
nni Environment:
nni version: 0.5.1
nni mode (local|pai|remote): local
OS: Ubuntu 16.04
python version: 3.6.5
is conda or virtualenv used?: conda
is running in docker?: yes
Anything else we need to know:
Nope.
Thank you for your suggestion. It seems that NNI cannot do this at the moment.
I think it would be more flexible if users could assign a GPU index to every trial.
Sad😭...
Actually, I'm not asking for the ability to assign a GPU index to each trial. All I want is to control which GPU devices are visible to NNI.
And that seems feasible as long as you don't rely on the real (host) numbers of the GPU devices.
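To illustrate the kind of behavior being asked for (purely a sketch of the request, not NNI code): the scheduler could filter whatever nvidia-smi reports through a visibility setting before handing GPUs to trials, for example:

```python
# Illustration of the requested behavior, not NNI code: restrict scheduling to
# the GPUs the user declares visible, instead of the raw indices from nvidia-smi.
import os

def schedulable_gpus(detected_indices):
    """Return the subset of detected GPU indices the user allows NNI to use.

    `detected_indices` stands in for whatever node-nvidia-smi reports; the
    visibility list here is read from CUDA_VISIBLE_DEVICES purely as an example.
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:  # no restriction -> every detected GPU is schedulable
        return list(detected_indices)
    allowed = {int(x) for x in visible.split(",") if x.strip()}
    return [i for i in detected_indices if i in allowed]

# Example: the host reports GPUs 0-3, and the user exported
# CUDA_VISIBLE_DEVICES=1,2,3 before launching nnictl -> only 1, 2, 3 are used.
print(schedulable_gpus([0, 1, 2, 3]))
```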
I have what seems to be a similar problem: I have 3 GPUs on a machine and want to be able to run separate but concurrent trials. Since I can't change the params passed to my training script, I cannot vary the GPU id in any way. I would have to do something like run 3 separate configs and 3 experiments, which obviously doesn't really work, as I could end up repeating hyper-parameters in my search.
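For what it's worth, if NNI is the one exporting CUDA_VISIBLE_DEVICES per trial, the training script should not need a GPU id parameter at all. A minimal sketch (assuming a PyTorch trial, which is my assumption rather than something stated above):

```python
# Illustration only (assumes a PyTorch trial): with CUDA_VISIBLE_DEVICES set to
# a single index per trial by run.sh, "cuda:0" inside the process is already
# the GPU assigned to this trial, so no --gpu command-line argument is needed.
import os
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(10, 1).to(device)  # placeholder model for the example
```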