Process hangs on 'Setting up PyTorch plugin "bias_act_plugin"...' when using multiple GPUs #41

Closed
markemus opened this issue Feb 18, 2021 · 6 comments

markemus commented Feb 18, 2021

I added these lines to train.py as lines 13 and 14 (right under import os):

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2,3,4"

I tested the process with --gpus 1 and it spent a few minutes on Setting up PyTorch plugin "bias_act_plugin"... before proceeding to train. However, with --gpus 4 it has been hanging on this line for an hour and a half:

Creating output directory...
Launching processes...
Loading training set...

Num images:  505487
Image shape: [3, 256, 256]
Label shape: [0]

Constructing networks...
Setting up PyTorch plugin "bias_act_plugin"...

Here's the nvidia-smi printout as well. As you can see, three of the GPUs (2, 3, 4) show 100% utilization while the first one (0) shows 0%. The memory usage does not seem to be changing.


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   33C    P0    57W / 300W |   2088MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:1B:00.0 Off |                    0 |
| N/A   34C    P0    59W / 300W |  31147MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:3D:00.0 Off |                    0 |
| N/A   35C    P0    68W / 300W |   4261MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:3E:00.0 Off |                    0 |
| N/A   31C    P0    68W / 300W |   4345MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:88:00.0 Off |                    0 |
| N/A   33C    P0    71W / 300W |   4201MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Do I just need to be more patient? On a single GPU it really only took a couple of minutes to begin training.

EDIT: note that the selected GPUs (0, 2, 3, 4) are not consecutive.

nurpax (Contributor) commented Feb 18, 2021

No, it definitely shouldn't take that long; it should take about the same time as on a single GPU.

I'd try a couple of things if the problem persists:

  1. Set CUDA_VISIBLE_DEVICES in the shell before starting the Python process, just so that there isn't anything funky going on with multiprocessing.
  2. This could be a case of a stale multiprocess lock in ~/.cache/torch_extensions (default on Linux). Try rm -rf'ing the torch_extensions directory and rerun.

If you're running Docker, you should NOT need to set CUDA_VISIBLE_DEVICES separately; I think it's enough to configure the available devices with the --gpus parameter. Also, within Docker, CUDA_VISIBLE_DEVICES might in fact need to list consecutive devices (or probably doesn't need to be specified inside the container at all), and you should specify the real device mapping when you start Docker. I'm a bit on thin ice here, as I've never run with such a configuration.
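
For context on why suggestion 2 helps: PyTorch serializes custom-extension builds with a lock file handled by torch.utils.file_baton.FileBaton (the same class custom_ops.py imports), so a lock left behind by a killed or crashed run makes every later process wait on it indefinitely. Below is a minimal sketch of that mechanism, using a made-up demo lock path:

import os
import tempfile

from torch.utils.file_baton import FileBaton

# Hypothetical demo path; the real lock lives inside the bias_act_plugin build
# directory under the torch_extensions cache.
lock_path = os.path.join(tempfile.gettempdir(), "bias_act_demo.lock")
baton = FileBaton(lock_path)

if baton.try_acquire():      # we own the lock: the lock file was just created
    try:
        print("build would happen here")
    finally:
        baton.release()      # removes the lock file so other processes can proceed
else:
    # Another process holds the lock. If that process died without releasing it,
    # this wait never returns, which is exactly the hang described above.
    baton.wait()

Deleting the cache directory (or just the stale lock file) breaks that wait, and the plugin is rebuilt from scratch on the next run.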

markemus (Author) commented:

@nurpax Thank you! rm -rf ~/.cache/torch_extensions solved the issue and it's training now on 4 GPUs.

For posterity: I left the CUDA_VISIBLE_DEVICES definition in the code. This is not running in Docker; it's running in Anaconda.
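
For anyone copying this approach, here is a sketch of those two lines with comments. The placement matters: CUDA_VISIBLE_DEVICES is only honored if it is set before anything in the process initializes CUDA, and the per-GPU worker processes spawned by train.py inherit the parent's environment.

import os

# Set right under `import os`, before torch is imported or any CUDA work happens.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"    # enumerate GPUs in PCI bus order, matching nvidia-smi
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2,3,4"    # expose only these four GPUs to this process and its workers

Equivalently, export the two variables in the shell before running python train.py, as suggested above.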

tasinislam21 commented:

How do you do this on Windows? It used to work perfectly, but when I ran the project a few days later it got stuck.

nurpax (Contributor) commented Feb 21, 2021

I'm not sure what the exact location is, and I don't have Windows access right now, but here's how you should be able to figure it out:

Change torch_utils/custom_ops.py as follows:

diff --git a/torch_utils/custom_ops.py b/torch_utils/custom_ops.py
index 4cc4e43..4dfcef7 100755
--- a/torch_utils/custom_ops.py
+++ b/torch_utils/custom_ops.py
@@ -20,7 +20,7 @@ from torch.utils.file_baton import FileBaton
 #----------------------------------------------------------------------------
 # Global options.
 
-verbosity = 'brief' # Verbosity level: 'none', 'brief', 'full'
+verbosity = 'full' # Verbosity level: 'none', 'brief', 'full'
 
 #----------------------------------------------------------------------------
 # Internal helper funcs.

Then run, for example, generate.py with default options and check the logs. On my computer, it prints something like this:

Using /scratch/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /scratch/.cache/torch_extensions/bias_act_plugin/build.ninja...

This should reveal the Windows location for you.
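
If you'd rather not edit custom_ops.py at all, the sketch below asks PyTorch for the same location directly. Note that _get_build_directory is a private helper, so this is a best-effort approach that may differ between PyTorch versions (and it may create the directory if it doesn't exist yet); the TORCH_EXTENSIONS_DIR environment variable, if set, overrides the default root.

from torch.utils.cpp_extension import _get_build_directory

# Prints the build directory PyTorch would use for the bias_act_plugin extension.
print(_get_build_directory("bias_act_plugin", verbose=False))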

tasinislam21 commented Feb 22, 2021

Thank you! On Windows the cache can be found at 'C:\Users\<user_name>\AppData\Local\torch_extensions\torch_extensions\Cache'. I was able to delete it, but I also had to reinstall ninja so that bias_act_plugin could be rebuilt. In the end, it worked.
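
Putting the reported locations together, here is a small cross-platform cleanup sketch. It assumes the defaults mentioned in this thread (Linux: ~/.cache/torch_extensions, Windows: %LOCALAPPDATA%\torch_extensions\torch_extensions\Cache) and honors TORCH_EXTENSIONS_DIR if it is set. Deleting the cache is harmless in the sense that the plugins are simply rebuilt on the next run.

import os
import shutil
from pathlib import Path

def torch_extensions_root() -> Path:
    # TORCH_EXTENSIONS_DIR overrides the default cache location if it is set.
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    if os.name == "nt":
        local = os.environ.get("LOCALAPPDATA", str(Path.home() / "AppData" / "Local"))
        return Path(local) / "torch_extensions" / "torch_extensions" / "Cache"
    return Path.home() / ".cache" / "torch_extensions"

root = torch_extensions_root()
if root.exists():
    shutil.rmtree(root)
    print(f"Removed {root}; PyTorch plugins will be rebuilt on the next run.")
else:
    print(f"No torch_extensions cache found at {root}.")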

tedschw commented Nov 26, 2023

Removing the stale lock file ~/.cache/torch_extensions/py310_cu121/bias_act_plugin/3cb576a0039689487cfba59279dd6d46-nvidia-geforce-rtx-2060/lock worked for me.
