Short summary about the issue/question: While spawning trials on two machines in remote mode, the NNI dispatcher can create two parallel tasks on the same machine given concurrency=2. Expected behavior: spawn the new task on a free machine, not on one that is already running something.
Brief what process you are following: Input: two worker machines (8 GPUs each). A third machine runs the NNI experiment in remote mode, spawning trials to those two workers. One trial takes all 8 GPUs on a machine, and concurrency is 2; an attempt to run two trials on the same machine leads to OOM.
How to reproduce it: Run an experiment with concurrency=2 on two worker machines. Expected behavior: maintain one trial per machine. Observed behavior: a trial is started on one of the machines at random, ignoring the fact that it may already be running a task.
nni Environment:
nni version: 1.0
nni mode(local|pai|remote): remote
OS: Ubuntu 16.04.6 LTS
python version: 3.6.8
is conda or virtualenv used?: no
is running in docker?: yes
need to update document(yes/no): no
Anything else we need to know: Thank you for the help! The screenshot below illustrates the issue: the FAILED attempts were spawned on the node where another trial was already in progress. Given 2 machines and concurrency=2, is there a way to limit it to 1 task per machine?
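For context, here is a minimal sketch of the setup described above. The reporter's actual file was never posted, so the IPs, paths, trial command, and tuner choice below are placeholders:

```yaml
# Hypothetical reconstruction of the reported setup (all values are placeholders).
authorName: default
experimentName: remote_two_workers
trialConcurrency: 2              # two trials may run at the same time
maxExecDuration: 24h
maxTrialNum: 100
trainingServicePlatform: remote  # dispatch trials to the machines listed below
searchSpacePath: search_space.json
useAnnotation: false
tuner:
  builtinTunerName: TPE
trial:
  command: python3 train.py      # placeholder trial command
  codeDir: .
  # note: gpuNum is NOT set here -- see the discussion below
machineList:
  - ip: 10.0.0.1                 # worker 1, 8 GPUs
    username: user
    sshKeyPath: ~/.ssh/id_rsa
  - ip: 10.0.0.2                 # worker 2, 8 GPUs
    username: user
    sshKeyPath: ~/.ssh/id_rsa
```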
apatsekin changed the title from "Bug in trials spawning when using remote training" to "Trials spawning logic when using remote training" on Sep 24, 2019
apatsekin changed the title from "Trials spawning logic when using remote training" to "Limit one trial per machine in remote mode" on Sep 24, 2019
Hi,
If you set concurrency=2 in your configuration file, only two trial jobs will run simultaneously; a third trial job starts only after one of the first two finishes. If your trial jobs use GPUs, NNI detects GPU status and will not allocate a new trial job to an occupied GPU unless you set maxTrialNumPerGpu.
Could you please provide your configuration file and nnimanager.log here?
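For reference, a sketch of where maxTrialNumPerGpu would sit in a remote-mode config. In later 1.x releases it is a per-machine field under machineList; exact availability and placement in v1.0 is not something this thread confirms:

```yaml
machineList:
  - ip: 10.0.0.1               # placeholder
    username: user
    sshKeyPath: ~/.ssh/id_rsa
    maxTrialNumPerGpu: 1       # allow at most one trial per GPU on this machine
    useActiveGpu: false        # treat GPUs with running processes as occupied
```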
Hey apatsekin, thanks for reporting this issue. Have you explicitly set gpuNum in your NNI config? It looks like you didn't specify this entry, so one trial can use all of the GPUs on a machine; if two trials are scheduled onto the same machine, one of them fails due to OOM.
@yds05 gpuNum solved the problem, thanks! It would be great to mention it in the remote-mode tutorials.
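For readers hitting the same issue: the fix is to declare the trial's GPU footprint in the trial section, so the scheduler can see that one trial fills a whole machine. A minimal sketch (the command is a placeholder):

```yaml
trial:
  command: python3 train.py    # placeholder trial command
  codeDir: .
  gpuNum: 8                    # each trial reserves all 8 GPUs, so only one trial fits per machine
```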
@SparkSnail due to the absence of gpuNum in my configuration, NNI randomly assigned tasks to one of the two machines. So even if machine #1 has one RUNNING task at the moment and machine #2 is totally idle, a new task might be assigned to machine #1. From one perspective this is imbalanced dispatching; from another, it makes sense not to waste resources if two jobs are supposed to fit on one machine.