no machine can be scheduled, return TMP_NO_AVAILABLE_GPU #1712

apatsekin · 2019-11-06T18:19:52Z

Short summary about the issue/question: After updating to 1.1, my remote experiment stopped working and reported the error in the title. I tracked down the cause and propose some modifications that would make debugging clearer.

Details
When NNI runs in remote mode, the master machine generates a bash script from a hard-coded template, uploads it to every worker, and runs it there as gpu_metrics_collector.sh. The script simply reports the current state of the GPUs on each worker. I see two weak points here:

If this script fails for any reason, the master gets nothing back and, for some reason (!!), treats this as TMP_NO_AVAILABLE_GPU, which of course is not the actual cause. Handling this scenario correctly would be really helpful, e.g. with a message such as "No response from workers, check the script output on the worker machine".
gpu_metrics_collector.sh has a hard-coded call to /usr/bin/python3, where NNI expects its module to be installed. That was the cause of the error in my case: the default python3 that has the NNI module was not installed at, or linked to, this particular path. I would avoid the hard-coded Python path and also surface some stderr output from gpu_metrics_collector.sh.
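
The first point can be sketched as follows. This is not NNI's scheduler code: `ScheduleStatus`, `NO_GPU_METRICS`, and `schedule` are hypothetical names used only for illustration, and the sketch assumes the master keeps, per worker, either a list of free GPU indices or `None` when the collector script has reported nothing.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class ScheduleStatus(Enum):
    SUCCESS = "SUCCESS"
    # Name taken from the error message in this issue.
    TMP_NO_AVAILABLE_GPU = "TMP_NO_AVAILABLE_GPU"
    # Hypothetical status (an assumption, not part of NNI).
    NO_GPU_METRICS = "NO_GPU_METRICS"


def schedule(
    free_gpus: Dict[str, Optional[List[int]]], required: int
) -> Tuple[ScheduleStatus, Optional[str], str]:
    """Pick a worker with enough free GPUs.

    `free_gpus` maps a worker name to its free GPU indices, or to None
    when that worker's collector script has not reported anything.
    """
    # Keep only the workers that actually sent metrics.
    reporting = {w: g for w, g in free_gpus.items() if g is not None}
    if not reporting:
        # Silence from every worker is a collector problem, not a GPU shortage.
        return (
            ScheduleStatus.NO_GPU_METRICS,
            None,
            "No response from workers, check the script output on the worker machine",
        )
    for worker, gpus in reporting.items():
        if len(gpus) >= required:
            return ScheduleStatus.SUCCESS, worker, f"scheduled on {worker}"
    # Metrics arrived, but no worker has enough free GPUs right now.
    return ScheduleStatus.TMP_NO_AVAILABLE_GPU, None, "all reporting workers are busy"
```

With this split, `schedule({"w1": None, "w2": None}, 1)` yields the hypothetical `NO_GPU_METRICS` status, while `schedule({"w1": [], "w2": None}, 1)` still yields `TMP_NO_AVAILABLE_GPU`, so the two situations are no longer reported under the same name.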

Thanks!

How to reproduce it: Run a remote experiment with two workers. The worker machines should not have a default Python with NNI installed at /usr/bin/python3.
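
To check whether a worker matches this condition, a small diagnostic along the following lines could be run on the worker. It is only a sketch: `import_error` and `check_interpreters` are hypothetical helper names, and the only facts taken from this issue are the path `/usr/bin/python3` and the module name `nni`.

```python
import shutil
import subprocess
import sys
from typing import Dict, Optional


def import_error(interpreter: str, module: str = "nni") -> Optional[str]:
    """Return None if `interpreter` can import `module`, else the error text."""
    try:
        result = subprocess.run(
            [interpreter, "-c", f"import {module}"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Covers a missing or non-executable interpreter as well as a hang.
        return f"could not run {interpreter}: {exc}"
    if result.returncode != 0:
        # Pass the interpreter's stderr along instead of swallowing it.
        return result.stderr.strip() or f"exit code {result.returncode}"
    return None


def check_interpreters(module: str = "nni") -> Dict[str, Optional[str]]:
    """Check the path named in this issue, the first python3 on PATH,
    and the interpreter running this check."""
    candidates = ["/usr/bin/python3", shutil.which("python3"), sys.executable]
    results: Dict[str, Optional[str]] = {}
    for path in candidates:
        if path and path not in results:
            results[path] = import_error(path, module)
    return results
```

Printing the items of `check_interpreters()` on a worker shows, for each interpreter, either `None` or the import error, which makes it visible when `/usr/bin/python3` and the python3 on PATH disagree about whether NNI is installed.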

nni Environment:
nni version: 1.0 (Upgrade to 1.1 didn't solve the problem)
nni mode(local|pai|remote): remote
OS: Ubuntu 16.04.6 LTS
python version: 3.6.8
is conda or virtualenv used?: no
is running in docker?: yes

need to update document(yes/no):

Anything else we need to know:

liuzhe-lz · 2019-11-07T06:11:09Z

Thanks for your feedback.
We have noticed that the GPU resource detector is error-prone and that its error message is vague.
We are planning to refactor the detection and reporting mechanism, and to show error messages in the web UI.
There is a hotfix PR #1707 for another issue, which may solve this problem (to some extent) as well.

scarlett2018 · 2019-11-18T02:56:41Z

The fix will go out with the upcoming release; please stay tuned.

liuzhe-lz · 2019-12-06T08:58:04Z

According to the code, I think this script uses the first python3 that appears in PATH, not a hard-coded directory.
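
What "the first python3 that appears in PATH" means can be shown with a small, self-contained sketch. It does not reproduce the NNI script; `make_fake_executable` and `resolve` are hypothetical helpers, and the example assumes a POSIX system, where `shutil.which` searches the given directories in order and returns the first executable match.

```python
import os
import shutil
import stat
import tempfile
from typing import List, Optional


def make_fake_executable(directory: str, name: str) -> str:
    """Create a tiny executable shell script called `name` in `directory`."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\nexit 0\n")
    # Mark the file as executable for its owner so it counts as a command.
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def resolve(name: str, directories: List[str]) -> Optional[str]:
    """Resolve `name` against `directories` in order, like a PATH lookup."""
    return shutil.which(name, path=os.pathsep.join(directories))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
        make_fake_executable(first, "python3")
        make_fake_executable(second, "python3")
        # The directory listed first wins, whichever interpreter has NNI installed.
        print(resolve("python3", [first, second]))
        print(resolve("python3", [second, first]))
```

The practical consequence for this issue is that the interpreter that gets picked depends on the PATH of the non-interactive session that runs the script on the worker, which can differ from the PATH of an interactive login shell.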

scarlett2018 · 2020-04-15T06:31:39Z

Can we close this issue? Is the problem fixed upstream? @liuzhe-lz

xuehui1991 added the user raised label Nov 7, 2019

xuehui1991 pinned this issue Nov 7, 2019

scarlett2018 added the GPU-usage label Nov 8, 2019

scarlett2018 assigned liuzhe-lz Nov 18, 2019

scarlett2018 mentioned this issue Nov 28, 2019: Iteration Plan for Dec 2019 #1794 (Closed, 44 tasks)

scarlett2018 unpinned this issue Dec 16, 2019

leckie-chn mentioned this issue Dec 25, 2019: Endgame for Iteration Dec. 2019 #1872 (Closed, 19 tasks)

scarlett2018 mentioned this issue Dec 30, 2019: Iteration Plan for Jan-Feb 2020 #1900 (Closed, 51 tasks)

scarlett2018 added this to the 2020 Jan - 1.4 candidate milestone Dec 30, 2019

liuzhe-lz modified the milestones: 2020 Jan - 1.4 candidate, Backlog Feb 24, 2020

xingwangsfu mentioned this issue Mar 16, 2020: failed to train with remote mode #1578 (Closed)

scarlett2018 added the NNI SDK, bug, and support labels Apr 15, 2020

apatsekin closed this as completed Apr 15, 2020
