This repository has been archived by the owner on Sep 18, 2024. It is now read-only.
Short summary about the issue/question: After updating to 1.1, my remote experiment stopped working with the error in the subject. I found the cause and propose some changes that would make debugging clearer.
Details
When NNI runs in remote mode, the master machine generates a bash script from a hard-coded template, uploads it to all workers, and runs it there as gpu_metrics_collector.sh. This script just reports the current state of the GPUs on the workers. I see two weak points here:
If this script fails for any reason, the master gets nothing in response and, for some reason, treats this as TMP_NO_AVAILABLE_GPU, which of course is not the case. Proper handling of this scenario would be really helpful, e.g. a message like "No response from workers, check the script output on the worker machine".
gpu_metrics_collector.sh has a hardcoded call to /usr/bin/python3, where NNI expects its module to be installed. This was the cause of the error in my case, since the default python3 that had the NNI module was not installed at or linked to this path. I would avoid hardcoding the path to python and would also surface some stderr feedback from gpu_metrics_collector.sh.
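To illustrate the first point, here is a minimal sketch of how the master could tell "the collector gave no usable output" apart from "the worker answered and every GPU is busy". This is not NNI's actual code: the function name, the status enum, and the JSON report format (a `gpus` list with a `free` flag) are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class GpuStatus(Enum):
    AVAILABLE = "available"
    NO_AVAILABLE_GPU = "no_available_gpu"  # worker answered, all GPUs busy
    NO_RESPONSE = "no_response"            # collector produced nothing usable


@dataclass
class GpuCheck:
    status: GpuStatus
    message: str


def classify_gpu_report(report: Optional[str], stderr: str = "") -> GpuCheck:
    """Classify a worker's collector output (hypothetical JSON format)."""
    detail = stderr.strip() or "no stderr captured"

    # An empty or missing report means the script failed, not that GPUs are busy.
    if report is None or not report.strip():
        return GpuCheck(
            GpuStatus.NO_RESPONSE,
            "No response from worker, check the collector script output "
            f"on the worker machine ({detail})",
        )

    try:
        data = json.loads(report)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        return GpuCheck(
            GpuStatus.NO_RESPONSE,
            f"Unreadable response from worker ({detail})",
        )

    # Assumed report shape: {"gpus": [{"index": 0, "free": true}, ...]}
    free = [g for g in data.get("gpus", []) if g.get("free")]
    if free:
        return GpuCheck(GpuStatus.AVAILABLE, f"{len(free)} free GPU(s)")
    return GpuCheck(GpuStatus.NO_AVAILABLE_GPU, "All GPUs are busy")
```

With this split, a failure such as a missing interpreter shows up as `NO_RESPONSE` together with whatever stderr was captured, instead of being reported as a lack of free GPUs.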
Thanks!
How to reproduce it: Run a remote experiment with two workers. The worker machines should not have a Python with NNI installed at /usr/bin/python3.
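To check whether a worker matches this setup, one can test whether a given interpreter is able to import a module. The sketch below is only an illustration: `can_import` is a hypothetical helper, not part of NNI, and the module name `nni` in the usage note is assumed to be the importable package name.

```python
import subprocess
import sys
from typing import Tuple


def can_import(python_path: str, module: str) -> Tuple[bool, str]:
    """Return (ok, stderr) for running `<python_path> -c "import <module>"`."""
    try:
        proc = subprocess.run(
            [python_path, "-c", f"import {module}"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # text mode, works on Python 3.6 too
        )
    except OSError as exc:
        # Covers the interpreter not existing at this path or not being executable.
        return False, str(exc)
    return proc.returncode == 0, proc.stderr.strip()
```

Running `can_import("/usr/bin/python3", "nni")` on a worker would return `False` plus the captured error text in the situation described above, while `can_import(sys.executable, "nni")` checks the interpreter currently in use.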
nni Environment:
nni version: 1.0 (Upgrade to 1.1 didn't solve the problem)
nni mode(local|pai|remote): remote
OS: Ubuntu 16.04.6 LTS
python version: 3.6.8
is conda or virtualenv used?: no
is running in docker?: yes
need to update document(yes/no):
Anything else we need to know:
Thanks for your feedback.
We have noticed the GPU resource detector is error-prone and the error message is vague.
We are planning to refactor the detecting and reporting mechanism, and to show error messages in web UI.
There is a hotfix PR #1707 for another issue, which may solve this problem (to some extent) as well.
The fix will ship with the upcoming release, please stay tuned.