This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Limit one trial per machine in remote mode #1565

Closed
apatsekin opened this issue Sep 24, 2019 · 3 comments

Comments

@apatsekin
Contributor

Short summary about the issue/question: When spawning trials on two machines in remote mode, the NNI dispatcher can create two parallel tasks on the same machine when concurrency=2. Expected behavior: spawn the new task on a free machine, not on one that is already running something.

Brief what process you are following: Input: two worker machines (8 GPUs each). A third machine runs the NNI experiment in remote mode, spawning trials on the two workers. One trial takes all 8 GPUs on a machine. Concurrency is 2. Attempting to run two trials on the same machine leads to OOM.

How to reproduce it: Run an experiment with concurrency=2 on two worker machines. Expected behavior: one trial per machine. Observed behavior: trials start on a randomly chosen machine, ignoring the fact that it may already be running a task.

nni Environment:

  • nni version: 1.0
  • nni mode(local|pai|remote): remote
  • OS: Ubuntu 16.04.6 LTS
  • python version: 3.6.8
  • is conda or virtualenv used?: no
  • is running in docker?: yes

need to update document (yes/no): no

Anything else we need to know: Thank you for the help! The screenshot below illustrates the issue: the FAILED attempts were spawned on the node where another trial was already in progress. Given 2 machines and concurrency=2, is there a way to limit it to 1 task per machine?

[screenshot: trial list showing FAILED trials on the already-busy node]

@apatsekin apatsekin changed the title Bug in trials spawning when using remote training Trials spawning logic when using remote training Sep 24, 2019
@apatsekin apatsekin changed the title Trials spawning logic when using remote training Limit one trial per machine in remote mode Sep 24, 2019
@SparkSnail
Contributor

Hi,
If you set concurrency=2 in your configuration file, only two trial jobs will run simultaneously; a third trial job starts only after one of the first two finishes. If your trial jobs use GPUs, NNI detects the GPU status and will not allocate a new trial job to an occupied GPU, unless you set maxTrialNumPerGpu.
Could you please provide your configuration file and the nnimanager.log here?
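For reference, GPU sharing in remote mode is controlled per machine in the machineList section. A hedged sketch of what that might look like (IPs and credentials are placeholders, and the maxTrialNumPerGpu / useActiveGpu fields come from later NNI 1.x schemas, so verify them against your version's configuration reference):

```yaml
# Placeholder values; field availability depends on NNI version.
machineList:
  - ip: 10.0.0.1
    username: worker
    passwd: ****
    maxTrialNumPerGpu: 1   # at most 1 trial per GPU (raise to share GPUs)
    useActiveGpu: false    # skip GPUs that already have active processes
```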

@yds05
Contributor

yds05 commented Sep 26, 2019

Hey, apatsekin, thanks for reporting this issue. Have you explicitly set gpuNum in your NNI config? It looks like you didn't specify this entry, so one trial can use all of the GPUs on the machine. If two trials are scheduled onto the same machine, one of them will fail due to OOM.

@apatsekin
Contributor Author

apatsekin commented Sep 26, 2019

@yds05
gpuNum solved the problem, thanks! It would be great to mention it in the remote tutorials.
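For anyone hitting the same issue, a minimal sketch of the relevant config: setting trial.gpuNum to the machine's full GPU count makes the scheduler reserve the whole machine, so a second trial cannot land on it. This assumes the NNI 1.x YAML schema; experiment names, commands, IPs, and credentials below are placeholders:

```yaml
# Hypothetical NNI 1.x experiment config (placeholder values).
experimentName: example_remote
trainingServicePlatform: remote
trialConcurrency: 2
tuner:
  builtinTunerName: TPE
trial:
  command: python3 train.py
  codeDir: .
  gpuNum: 8        # reserve all 8 GPUs per trial, so at most
                   # one trial is scheduled per 8-GPU machine
machineList:
  - ip: 10.0.0.1
    username: worker
    passwd: ****
  - ip: 10.0.0.2
    username: worker
    passwd: ****
```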

@SparkSnail due to the absence of gpuNum in my configuration, NNI randomly assigned tasks to one of the two machines. So even if machine #1 had one RUNNING task and machine #2 was totally idle, the new task might still be assigned to machine #1. From one perspective this is imbalanced dispatching; from another, it makes sense not to waste resources if two jobs are supposed to fit on one machine.

Thank you for help and response!
