
[ENH] Multiproc always looks for GPUs making it impossible to run pipelines on CPU-only machines #3717

Open
man-shu opened this issue Mar 3, 2025 · 10 comments · May be fixed by #3718

@man-shu

man-shu commented Mar 3, 2025

Hello,

I noticed that in #3642 GPU support has been added, which is indeed much appreciated.

However, the current implementation always checks for GPU availability and there's no way to turn this off. Users (like me) might need to do so, for example, when trying to run a pipeline on a CPU-only HPC (where nvidia-smi is not installed).

Could we maybe have another parameter that would skip checking for GPUs? I would be up for making a PR if you think this makes sense. Please let me know.
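For concreteness, this is roughly what I have in mind; the "disable_gpu_check" key below is made up for illustration and is not an existing nipype option:

# Hypothetical sketch only -- "disable_gpu_check" does not exist in nipype today.
workflow.run(
    plugin="MultiProc",
    plugin_args={"n_procs": 60, "disable_gpu_check": True},
)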

Thanks!

@effigies
Member

effigies commented Mar 3, 2025

cc @mauriliogenovese

@mauriliogenovese
Contributor

mauriliogenovese commented Mar 3, 2025

How would you handle the node failure? I mean, if we skip the check in the plugin and force nodes to act as if a GPU were available, I'm afraid the execution will fail.

Edit: maybe I misunderstood your point. The implementation checks whether a node is a "GPU node" and handles a separate process queue for those. So if a pipeline does not include GPU nodes, or handles different paths for CUDA and non-CUDA systems, the workflow should run without problems.

@effigies
Member

effigies commented Mar 3, 2025

Seems like a minimal reproducible example would be helpful here.

@mauriliogenovese
Contributor

A further step would be updating interfaces and nodes to support inputs.use_gpu = True/False when the tool has a GPU version.

@man-shu
Author

man-shu commented Mar 4, 2025

So if a pipeline does not include GPU nodes, or handles different paths for CUDA and non-CUDA systems, the workflow should run without problems.

Yes exactly! That is indeed my thinking -- if I know that none of the nodes in my pipeline are using GPUs, then I should just be able to turn off GPU checking.

Seems like a minimal reproducible example would be helpful here.

I do not have a minimal example as of now, but all the code for the pipeline I am working on is here: https://github.com/man-shu/diffusion-preprocessing. Here's the error I get on our CPU-only HPC:

Traceback (most recent call last):
  File "/storage/store3/work/haggarwa/diffusion/diffusion-preprocessing/runners/run_tracto_drago_downsampled.py", line 33, in <module>
    tracto.run(plugin="MultiProc", plugin_args={"n_procs": 60})
  File "/storage/store3/work/haggarwa/nipype/nipype/pipeline/engine/workflows.py", line 610, in run
    runner = plugin_mod(plugin_args=plugin_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/storage/store3/work/haggarwa/nipype/nipype/pipeline/plugins/multiproc.py", line 136, in __init__
    self.n_gpus_visible = gpu_count()
                          ^^^^^^^^^^^
  File "/storage/store3/work/haggarwa/nipype/nipype/pipeline/plugins/tools.py", line 187, in gpu_count
    return len(GPUtil.getGPUs())
               ^^^^^^^^^^^^^^^^
  File "/data/parietal/store3/work/haggarwa/miniconda3/envs/dwiprep/lib/python3.12/site-packages/GPUtil/GPUtil.py", line 102, in getGPUs
    deviceIds = int(vals[i])
                ^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."

A further step would be updating interfaces and nodes to support inputs.use_gpu = True/False when the tool has a GPU version.

I am using FSL's BEDPOSTX and ProbTrackX2, and at least BEDPOSTX has such a parameter.
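For illustration, this is the kind of node-level control I would want. It assumes the nipype interface exposes a use_gpu input mirroring the tool's own switch, which may not be true for every interface or nipype version:

from nipype import Node
from nipype.interfaces import fsl

# Sketch only: assumes the BEDPOSTX interface has a use_gpu input; guard with
# hasattr because not every interface or nipype version exposes it.
bedpostx = Node(fsl.BEDPOSTX5(), name="bedpostx")
if hasattr(bedpostx.inputs, "use_gpu"):
    bedpostx.inputs.use_gpu = False  # force the CPU code path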

@effigies
Member

effigies commented Mar 4, 2025

What's the rest of the traceback? Just running on my local (CPU-only) system:

In [2]: import GPUtil

In [3]: GPUtil.getGPUs()
Out[3]: []

I think the fix should probably be in nipype.pipeline.plugins.tools.gpu_count: return 0 on any exception. Probably log a warning at ERROR level so that GPU support isn't invisibly disabled for people expecting GPU processing.
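Roughly, something like this (a sketch of the idea only, not the actual patch):

import logging

logger = logging.getLogger("nipype.workflow")

def gpu_count():
    """Number of GPUs visible to the process; 0 if the query fails for any reason."""
    try:
        import GPUtil
        return len(GPUtil.getGPUs())
    except Exception as exc:  # missing driver, broken nvidia-smi output, import error, ...
        logger.error("GPU query failed (%s); continuing as if no GPUs are available", exc)
        return 0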

@man-shu
Author

man-shu commented Mar 4, 2025

What's the rest of the traceback?

Sorry, the last line somehow kept disappearing in my tmux window. I've updated the error in my previous comment.

@effigies
Member

effigies commented Mar 4, 2025

Got it. Well, it looks like GPUtil is basically unmaintained and can't be used on Python 3.12+:

https://github.com/anderskm/gputil
https://pypi.org/project/GPUtil/#history

We may want to consider vendoring just the bits we need.

In particular, I think we need these patches:

Everything outside of getGPUs could be stripped out.
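If we do vendor it, the piece we actually need is small. A rough sketch of a driver query that avoids GPUtil entirely (illustration only, not the proposed patch; the nvidia-smi flags are standard, but the error handling would need review):

import shutil
import subprocess

def count_visible_gpus():
    """Count GPUs by asking nvidia-smi directly; return 0 if the tool or driver is unusable."""
    nvidia_smi = shutil.which("nvidia-smi")
    if nvidia_smi is None:
        return 0
    try:
        proc = subprocess.run(
            [nvidia_smi, "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        )
    except (subprocess.SubprocessError, OSError):
        return 0
    return len([line for line in proc.stdout.splitlines() if line.strip()])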

mauriliogenovese linked a pull request Mar 4, 2025 that will close this issue
@mauriliogenovese
Contributor

@man-shu I sent PR #3718, could you test it under Windows, please?

@man-shu
Author

man-shu commented Mar 5, 2025

I do have an old personal machine with Windows on it. I'll try this on it over the weekend.
