This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

no machine can be scheduled, return TMP_NO_AVAILABLE_GPU #1712

Closed
apatsekin opened this issue Nov 6, 2019 · 4 comments

@apatsekin
Contributor

apatsekin commented Nov 6, 2019

Short summary of the issue/question: after updating to 1.1, my remote experiment stopped working with the error in the subject. I tracked down the source and propose some modifications that would make debugging clearer.

Details
When NNI runs in remote mode, the master machine generates a bash script from a hard-coded template, uploads it to all workers, and runs it there as gpu_metrics_collector.sh. This script just reports the current state of the GPUs on each worker. There are two weak points here, in my view:

  1. If this script fails for any reason, the master gets no response and, surprisingly, reports this as TMP_NO_AVAILABLE_GPU, which is of course not the actual problem. Correct handling of this scenario would be really helpful, e.g. "No response from workers; check the script output on the worker machine".
  2. gpu_metrics_collector.sh contains a hard-coded call to /usr/bin/python3, where NNI expects its module to be installed. This was the cause of my error: the python3 that had the NNI module installed was not located (or linked) at that path. I would refrain from hard-coding the python path, and would also surface stderr output from gpu_metrics_collector.sh.

Thanks!

How to reproduce it: run a remote experiment with two workers. The worker machines should not have a python with the NNI module installed at /usr/bin/python3.
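To confirm the worker setup described above, something along these lines can be run on each worker by hand (a hedged sketch, not NNI tooling):

```shell
# Show which python3 would be picked up from PATH, and whether the
# nni module is importable with it. The "|| echo" fallbacks keep the
# checks from aborting so all three lines always run.
command -v python3 || echo "no python3 in PATH" >&2
python3 -c "import sys; print(sys.executable)"
python3 -c "import nni" 2>/dev/null || echo "nni module not importable" >&2
```

If the interpreter printed here is not /usr/bin/python3, the hard-coded path in gpu_metrics_collector.sh will miss the NNI installation.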

nni Environment:
nni version: 1.0 (upgrading to 1.1 didn't solve the problem)
nni mode(local|pai|remote): remote
OS: Ubuntu 16.04.6 LTS
python version: 3.6.8
is conda or virtualenv used?: no
is running in docker?: yes

need to update document(yes/no):

Anything else we need to know:

@xuehui1991 xuehui1991 pinned this issue Nov 7, 2019
@liuzhe-lz
Contributor

Thanks for your feedback.
We have noticed that the GPU resource detector is error-prone and its error messages are vague.
We are planning to refactor the detection and reporting mechanism, and to show error messages in the web UI.
There is a hotfix PR #1707 for another issue, which may solve this problem (to some extent) as well.

@scarlett2018
Member


The fix will go out with the upcoming release, please stay tuned.

@liuzhe-lz
Contributor

liuzhe-lz commented Dec 6, 2019

According to the code, I think this script uses the first python3 that appears in PATH, not a hard-coded path.

@scarlett2018
Member

Can we close this issue? Is the problem fixed upstream? @liuzhe-lz
