
unify gpu checking around gpustat #35581

Closed
wants to merge 2 commits

Conversation

mattip
Contributor

@mattip mattip commented May 21, 2023

Why are these changes needed?

Consistently use the already-required gpustat package to detect GPU availability. There are currently fall-back paths that check /proc/driver/nvidia/gpus on Linux, which requires root permissions. Also update gpustat to 1.1 to pick up a fix for wookayin/gpustat#142.
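For reference, a minimal sketch of what gpustat-based detection looks like (the function name and return shape here are illustrative, not the exact code in this diff):

import gpustat

def detect_nvidia_gpus():
    # Query NVML through gpustat; no root access needed, unlike reading
    # /proc/driver/nvidia/gpus.
    try:
        stats = gpustat.new_query()
    except Exception:
        return 0, None  # no NVIDIA driver / NVML library on this node
    gpus = list(stats)
    model = gpus[0].name if gpus else None
    return len(gpus), model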

Related issue number

Toward #28064 (this comment, about fixing the dashboard server's check of / disk usage, which also requires root permissions, is not part of this PR and should probably be a separate issue)

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@mattip
Contributor Author

mattip commented May 21, 2023

Some of the errors do not seem to be related to my PR. Is there a "known good commit" I can test against?

@RobFirth

RobFirth commented Jul 6, 2023

I'm interested in the progress of this PR as I'm having issues with permissions i.e. #28064
It looks like this is basically awaiting review?

@mattip
Contributor Author

mattip commented Jul 18, 2023

I rebased this to clear merge conflicts; I think it is ready for review.

@jjyao
Collaborator

jjyao commented Sep 6, 2023

@mattip

Sorry for missing this one. Currently the team is busy with Ray Summit; we will review after that.

@mattip
Contributor Author

mattip commented Oct 5, 2023

Rebased, in the hope that someone will review.

@rkooo567 rkooo567 self-assigned this Oct 5, 2023
@rkooo567
Contributor

rkooo567 commented Oct 5, 2023

I will take a look at it soon

@thoraxe

thoraxe commented Oct 24, 2023

+1 to this. By default, containers can't run as root in OpenShift environments, so a GPU detection mechanism that doesn't require root is important.

@jjyao
Collaborator

jjyao commented Nov 2, 2023

@mattip sorry for the late review. This looks great to me. Could you rebase with master, since I moved the auto-detection code to nvidia_gpu.py?

@mattip
Contributor Author

mattip commented Nov 2, 2023

I will try, but it is frustrating to have to do this work twice. I hope this time it gets a review.

Signed-off-by: mattip <matti.picus@gmail.com>
Signed-off-by: mattip <matti.picus@gmail.com>
@mattip
Contributor Author

mattip commented Nov 2, 2023

At least one of the failures is connected to this PR. I think it is due to the mocking?

::test_gpu_info_parsing 2023-11-02 20:18:07,551	ERROR nvidia_gpu.py:66 -- Could not parse gpu information.
Traceback (most recent call last):
  File "C:\Miniconda3\lib\site-packages\pynvml.py", line 1975, in _LoadNvmlLibrary
    nvmlLib = CDLL("libnvidia-ml.so.1")
  File "C:\Miniconda3\lib\ctypes\__init__.py", line 381, in __init__
    self._handle = _dlopen(self._name, mode)
FileNotFoundError: Could not find module 'libnvidia-ml.so.1' (or one of its dependencies). Try using the full path with constructor syntax.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\install\ray\python\ray\_private\accelerators\nvidia_gpu.py", line 59, in get_current_node_accelerator_type
    gpu_list = gpustat.new_query()
  File "C:\Miniconda3\lib\site-packages\gpustat\core.py", line 745, in new_query
    return GPUStatCollection.new_query()
  File "C:\Miniconda3\lib\site-packages\gpustat\core.py", line 402, in new_query
    N.nvmlInit()
  File "C:\Miniconda3\lib\site-packages\pynvml.py", line 1947, in nvmlInit
    nvmlInitWithFlags(0)
  File "C:\Miniconda3\lib\site-packages\pynvml.py", line 1930, in nvmlInitWithFlags
    _LoadNvmlLibrary()
  File "C:\Miniconda3\lib\site-packages\pynvml.py", line 1977, in _LoadNvmlLibrary
    _nvmlCheckReturn(NVML_ERROR_LIBRARY_NOT_FOUND)
  File "C:\Miniconda3\lib\site-packages\pynvml.py", line 899, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_LibraryNotFound: NVML Shared Library Not Found
+++ Error creating PyTest summary
[Errno 2] No such file or directory: '::test_gpu_info_parsing.txt'
FAILED

================================== FAILURES ===================================
____________________________ test_gpu_info_parsing ____________________________

mock_listdir = <MagicMock name='listdir' id='2045615548880'>
mock_isdir = <MagicMock name='isdir' id='2045615601360'>
mock_find_spec = <MagicMock name='find_spec' id='2045615679040'>

    @patch("importlib.util.find_spec", return_value=False)
    @patch("os.path.isdir", return_value=True)
    @patch("os.listdir", return_value=["1"])
    @patch("sys.platform", "linux")
    def test_gpu_info_parsing(mock_listdir, mock_isdir, mock_find_spec):
        info_string = """Model:           Tesla V100-SXM2-16GB
    IRQ:             107
    GPU UUID:        GPU-8eaaebb8-bb64-8489-fda2-62256e821983
    Video BIOS:      88.00.4f.00.09
    Bus Type:        PCIe
    DMA Size:        47 bits
    DMA Mask:        0x7fffffffffff
    Bus Location:    0000:00:1e.0
    Device Minor:    0
    Blacklisted:     No
        """
        with patch("builtins.open", mock_open(read_data=info_string)):
>           assert NvidiaGPUAcceleratorManager.get_current_node_accelerator_type() == "V100"
E           AssertionError: assert None == 'V100'
E             +None
E             -'V100'
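If the mocking is indeed the culprit, one possible direction is to also stub out gpustat so the parsing test never reaches NVML on machines without libnvidia-ml. A rough sketch, where the patch target and the fake entry shape are assumptions rather than the actual fix:

from unittest.mock import MagicMock, patch

from ray._private.accelerators.nvidia_gpu import NvidiaGPUAcceleratorManager

# Pretend gpustat sees a single V100 so the code under test never calls
# into NVML (which is what raises NVMLError_LibraryNotFound on the CI box).
fake_gpu = MagicMock()
fake_gpu.name = "Tesla V100-SXM2-16GB"

with patch("gpustat.new_query", return_value=[fake_gpu]):
    assert NvidiaGPUAcceleratorManager.get_current_node_accelerator_type() == "V100"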

--



# removed: Windows-specific WMIC fallback
cmdargs = ["WMIC", "PATH", "Win32_VideoController", "GET", props]
lines = subprocess.check_output(cmdargs).splitlines()[1:]
num_gpus = len([x.rstrip() for x in lines if x.startswith(b"NVIDIA")])
# added: unified gpustat path
num_gpus = gpustat.gpu_count()
Collaborator

Current gpustat is only installed for ray[default] so I think we still need the old code that checks "/proc/driver/nvidia/gpus" for minimal installed ray?
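A rough sketch of the kind of guard this implies, assuming the /proc fallback is kept only for installs without gpustat (the function name is hypothetical):

import importlib.util
import os

def count_nvidia_gpus():
    # Prefer gpustat (available with ray[default]); it goes through NVML
    # and does not need root.
    if importlib.util.find_spec("gpustat") is not None:
        import gpustat
        return gpustat.gpu_count()
    # Minimal install: fall back to the proc filesystem on Linux.
    # Note: this is the path that can require root permissions (#28064).
    proc_dir = "/proc/driver/nvidia/gpus"
    if os.path.isdir(proc_dir):
        return len(os.listdir(proc_dir))
    return 0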

Contributor Author

Would it be acceptable to make gpustat an unconditional dependency for working with GPUs and Ray? That code is very fragile.

Collaborator

It might be hard since it has external dependencies as well:

install_requires = [
    'nvidia-ml-py>=12.535.108',  # see #107, #143, #161
    'psutil>=5.6.0',    # GH-1447
    'blessed>=1.17.1',  # GH-126
    'typing_extensions',
]

Collaborator

Should we copy the auto-detect code that PyTorch has in torch.cuda.device_count()? I think it doesn't depend on GPUtil or gpustat.
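For comparison, a minimal sketch of counting devices with pynvml directly, which is roughly what torch.cuda.device_count() bottoms out in and the direction #41020 later took (error handling simplified):

import pynvml

def pynvml_device_count():
    try:
        pynvml.nvmlInit()  # loads libnvidia-ml.so.1 / nvml.dll via ctypes
    except pynvml.NVMLError:
        return 0  # driver or NVML library not present on this machine
    try:
        return pynvml.nvmlDeviceGetCount()
    finally:
        pynvml.nvmlShutdown()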

Contributor Author

That goes to this code, which eventually uses ctypes and libnvidia-ml.so.1. How does this work on Windows and macOS?

@jjyao
Collaborator

jjyao commented Nov 3, 2023

I will try, but it is frustrating to have to do this work twice. I hope this time it gets a review.

Completely understand the frustration. Sorry again for the late review.

@mattip
Contributor Author

mattip commented Nov 8, 2023

#41020 takes a different (and probably better) approach of using pynvml directly.

@jjyao
Collaborator

jjyao commented Nov 14, 2023

Replaced by #41020

@jjyao jjyao closed this Nov 14, 2023