
Toxic Ctest behavior on multi gpu systems #5130

Open

PDoakORNL opened this issue Aug 16, 2024 · 3 comments

@PDoakORNL (Contributor)

**Describe the bug**
Don't export a CUDA_VISIBLE_DEVICES before you run ctest. I ran ctest on a 16-GPU DGX server. It appears to run one copy of each test per GPU, at least once it gets to the new-drivers MPI tests. It is unclear whether the copies interfere with each other, i.e. in terms of pass/fail. test_new_driver_mpi-rx also seems to have another issue that makes a test launched by ctest take many times longer to complete than the same test run on its own with mpiexec on my machine.

This is pretty obnoxious default behavior; high-GPU-count nodes and workstations are commonplace at this point. Is there anything we can do to provide a default that isn't quite so toxic? Perhaps pick a sane test launch setup at configure time and print a notice explaining what to do if some other setup is desired. I could see using all the GPUs, but why run an instance of the test per GPU?

Exploring this problem further, I tried

export CUDA_VISIBLE_DEVICES=9,10,11,12
ctest -R 'unit.*' -V

This is what I saw while the r2 test was running:
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    9   N/A  N/A   1536602      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|    9   N/A  N/A   1536603      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   10   N/A  N/A   1536602      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   10   N/A  N/A   1536603      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|   11   N/A  N/A   1536602      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   11   N/A  N/A   1536603      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   12   N/A  N/A   1536602      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   12   N/A  N/A   1536603      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
+-----------------------------------------------------------------------------------------+
and during the r3 test:
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    9   N/A  N/A   1537618      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|    9   N/A  N/A   1537619      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|    9   N/A  N/A   1537620      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   10   N/A  N/A   1537618      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   10   N/A  N/A   1537619      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|   10   N/A  N/A   1537620      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   11   N/A  N/A   1537618      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   11   N/A  N/A   1537619      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   11   N/A  N/A   1537620      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|   12   N/A  N/A   1537618      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   12   N/A  N/A   1537619      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
|   12   N/A  N/A   1537620      C   ...CDrivers/tests/test_new_drivers_mpi        306MiB |
+-----------------------------------------------------------------------------------------+
...
etc.

With

export CUDA_VISIBLE_DEVICES=1

I saw this during the r4 test:

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    9   N/A  N/A   1543180      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|    9   N/A  N/A   1543181      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|    9   N/A  N/A   1543182      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
|    9   N/A  N/A   1543183      C   ...CDrivers/tests/test_new_drivers_mpi        404MiB |
+-----------------------------------------------------------------------------------------+


This is related to #3879 I think.

**To Reproduce**
Steps to reproduce the behavior:
don't export a restricting CUDA_VISIBLE_DEVICES when many GPUs are available.

**Expected behavior**
Even if many GPUs are available, at the most aggressive don't use more GPUs than there are ranks.

**System:**
 - sdgx-server.sns.gov

**Additional context**
spack openmpi 4.2.1
clang 18
@ye-luo (Contributor) commented Aug 16, 2024

GPU binding of MPI processes is managed by the MPI launcher (mpirun/mpiexec); ctest is not aware of MPI capability. Due to the variety of MPI libraries and subtle differences among machine configurations, we don't handle GPU affinity. I have not seen a clean way to make MPI and ctest interact regarding GPU affinity.

We do have some control over logical-processor binding on the CPU side via the PROCESSORS and PROCESSOR_AFFINITY test properties.
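
For reference, a minimal CMake sketch of those two properties; the test and target names are hypothetical, not ones from our CMakeLists:

# Minimal sketch of the CPU-side controls mentioned above.
# "example_unit_test" / "test_example" are hypothetical names, not actual QMCPACK targets.
add_test(NAME example_unit_test
         COMMAND ${MPIEXEC_EXECUTABLE} ${MPIEXEC_NUMPROC_FLAG} 4 $<TARGET_FILE:test_example>)
set_tests_properties(example_unit_test PROPERTIES
                     PROCESSORS 4           # ctest accounts this test as using 4 logical processors
                     PROCESSOR_AFFINITY ON) # and asks ctest to pin it to the processors it was allotted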

You may customize the MPIRUN options to drive multiple GPUs, although that applies to every test which uses MPI.

I know of ways to allow ctest to assign tests to different GPUs, but I don't know a simple way to make ctest and MPI cooperate in assigning some jobs more GPUs and some jobs fewer.
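
For the "assign tests to different GPUs" part, the generic mechanism is CTest's resource allocation feature (CMake 3.16+): a test declares RESOURCE_GROUPS, ctest is started with a resource spec file describing the GPUs, and each test is told at run time which ids it was granted. This is only a hedged sketch with a hypothetical test name, not something our CMakeLists does today:

# Hedged sketch of CTest resource allocation; "some_mpi_gpu_test" is a hypothetical test name.
set_tests_properties(some_mpi_gpu_test PROPERTIES
                     RESOURCE_GROUPS "2,gpus:1")  # two groups, each wanting 1 slot of a "gpus" resource
# Run with: ctest --resource-spec-file gpus.json -j16
# For each test ctest then exports variables such as
#   CTEST_RESOURCE_GROUP_COUNT=2
#   CTEST_RESOURCE_GROUP_0=gpus
#   CTEST_RESOURCE_GROUP_0_GPUS=id:9,slots:1
# and a launcher wrapper still has to translate those ids into CUDA_VISIBLE_DEVICES.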

@ye-luo (Contributor) commented Aug 16, 2024

CUDA_VISIBLE_DEVICES should not be an issue. Each test can only drive one GPU no matter how many GPUs are exposed, unless your mpirun is smart and adjusts CUDA_VISIBLE_DEVICES. When GPU features are on, we only run one test at a time to prevent over-subscription. I believe the issue you saw here is due to not using -j, so all the MPI processes run on a single core. Please try -j16.
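
To illustrate the two points above: the standard CTest way to keep a test from running concurrently with any other test is the RUN_SERIAL property, and -j only controls how many tests ctest schedules at once. Hypothetical test name, not necessarily the exact mechanism in our CMakeLists:

# RUN_SERIAL prevents a test from running concurrently with any other test;
# "some_gpu_heavy_test" is a hypothetical name used only to illustrate the over-subscription guard.
set_tests_properties(some_gpu_heavy_test PROPERTIES RUN_SERIAL ON)
# The parallel invocation suggested above:
#   ctest -R 'unit.*' -j16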

@prckent (Contributor) commented Aug 16, 2024

On our multi-GPU machines the resource locking in cmake/ctest appears to work successfully, e.g. in the nightlies we never use both cards on our dual-MI210 machine.

It would be very beneficial to revisit the GPU resource locking so that we could use multi-GPU machines fully and correctly. Other projects solve this with some extra scripting so that cmake knows the correct number of GPUs and then visibility is set via the appropriate environment variables, etc.
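
A rough configure-time sketch of that kind of scripting; the nvidia-smi query, the file name, and the overall approach are assumptions here, not code taken from any existing project:

# Rough sketch: count the visible GPUs at configure time and write a CTest resource spec
# file that can then be passed to `ctest --resource-spec-file <file>`.
execute_process(COMMAND nvidia-smi --query-gpu=index --format=csv,noheader
                OUTPUT_VARIABLE _gpu_index_output
                RESULT_VARIABLE _nvsmi_rc
                OUTPUT_STRIP_TRAILING_WHITESPACE)
if(_nvsmi_rc EQUAL 0 AND NOT _gpu_index_output STREQUAL "")
  string(REPLACE "\n" ";" _gpu_indices "${_gpu_index_output}")
  list(LENGTH _gpu_indices _n_gpus)
  set(_gpu_entries "")
  foreach(_idx IN LISTS _gpu_indices)
    string(STRIP "${_idx}" _idx)
    list(APPEND _gpu_entries "{\"id\": \"${_idx}\", \"slots\": 1}")
  endforeach()
  list(JOIN _gpu_entries ", " _gpu_entries)
  file(WRITE "${CMAKE_BINARY_DIR}/ctest_resource_spec.json"
       "{\"version\": {\"major\": 1, \"minor\": 0}, \"local\": [{\"gpus\": [${_gpu_entries}]}]}")
  message(STATUS "Detected ${_n_gpus} GPUs; wrote ${CMAKE_BINARY_DIR}/ctest_resource_spec.json")
endif()

Tests would still need RESOURCE_GROUPS set and a small wrapper that converts the CTEST_RESOURCE_GROUP_*_GPUS variables into CUDA_VISIBLE_DEVICES (or ROCR_VISIBLE_DEVICES on the MI210 machine).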

=> Most likely something is "special" in your current software setup. (?)
