Short summary about the issue/question: While spawning trials on two machines in remote mode, the NNI dispatcher can create two parallel tasks on the same machine given concurrency=2. Expected behavior: spawn the new task on a free machine, not on one that is already running something.
Brief what process you are following: Input: two worker machines (8 GPUs each). A third machine runs the NNI experiment in remote mode, spawning trials to those two workers. One trial takes all 8 GPUs on a machine, and concurrency is 2; an attempt to run two trials on the same machine leads to OOM.
How to reproduce it: Run an experiment with concurrency=2 on two worker machines. Expected behavior: maintain one trial per machine. Observed behavior: a trial is started on one of the machines at random, ignoring the fact that it may already be running a task.
nni Environment:
nni version: 1.0
nni mode(local|pai|remote): remote
OS: Ubuntu 16.04.6 LTS
python version: 3.6.8
is conda or virtualenv used?: no
is running in docker?: yes
need to update document(yes/no): no
Anything else we need to know: Thank you for the help! The screenshot below illustrates the issue: the FAILED attempts were spawned on the node where another trial was already in progress. Given 2 machines and concurrency=2, is there a way to limit it to 1 task per machine?
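For context, here is a minimal sketch of the setup described above. The reporter's actual file was never posted, so the IPs, paths, trial command, and tuner choice below are placeholders:

```yaml
# Hypothetical reconstruction of the reported setup (all values are placeholders).
authorName: default
experimentName: remote_two_workers
trialConcurrency: 2              # two trials may run at the same time
maxExecDuration: 24h
maxTrialNum: 100
trainingServicePlatform: remote  # dispatch trials to the machines listed below
searchSpacePath: search_space.json
useAnnotation: false
tuner:
  builtinTunerName: TPE
trial:
  command: python3 train.py      # placeholder trial command
  codeDir: .
  # note: gpuNum is NOT set here -- see the discussion below
machineList:
  - ip: 10.0.0.1                 # worker 1, 8 GPUs
    username: user
    sshKeyPath: ~/.ssh/id_rsa
  - ip: 10.0.0.2                 # worker 2, 8 GPUs
    username: user
    sshKeyPath: ~/.ssh/id_rsa
```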
apatsekin changed the title from "Bug in trials spawning when using remote training" to "Trials spawning logic when using remote training" on Sep 24, 2019
apatsekin changed the title from "Trials spawning logic when using remote training" to "Limit one trial per machine in remote mode" on Sep 24, 2019
Hi,
If you set concurrency=2 in your configuration file, only two trial jobs will run simultaneously; a third trial job starts only after one of the first two finishes. If your trial jobs use GPUs, NNI detects GPU status and will not allocate a new trial job to an occupied GPU unless you set maxTrialNumPerGpu.
Could you please provide your configuration file and nnimanager.log here?
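For reference, a sketch of where maxTrialNumPerGpu would sit in a remote-mode config. In later 1.x releases it is a per-machine field under machineList; exact availability and placement in v1.0 is not something this thread confirms:

```yaml
machineList:
  - ip: 10.0.0.1               # placeholder
    username: user
    sshKeyPath: ~/.ssh/id_rsa
    maxTrialNumPerGpu: 1       # allow at most one trial per GPU on this machine
    useActiveGpu: false        # treat GPUs with running processes as occupied
```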
Hey apatsekin, thanks for reporting this issue. Have you explicitly set gpuNum in your NNI config? It looks like you didn't specify this entry, so one trial can use all of the GPUs on a machine; if two trials are scheduled onto the same machine, one of them fails due to OOM.
@yds05 gpuNum solved the problem, thanks! It would be great to mention it in the remote-mode tutorials.
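For readers hitting the same issue: the fix is to declare the trial's GPU footprint in the trial section, so the scheduler can see that one trial fills a whole machine. A minimal sketch (the command is a placeholder):

```yaml
trial:
  command: python3 train.py    # placeholder trial command
  codeDir: .
  gpuNum: 8                    # each trial reserves all 8 GPUs, so only one trial fits per machine
```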
@SparkSnail due to the absence of gpuNum in my configuration, NNI randomly assigned tasks to one of the two machines. So even if machine #1 has one RUNNING task at the moment and machine #2 is totally idle, a new task might be assigned to machine #1. From one perspective this is imbalanced dispatching; from another, it makes sense not to waste resources if two jobs are supposed to fit on one machine.