
unify gpu checking around gpustat #35581

Closed
wants to merge 2 commits

Conversation

mattip
Contributor

@mattip mattip commented May 21, 2023

Why are these changes needed?

Consistently use the already-required gpustat package to detect GPU availability. There are currently fall-back paths that check /proc/driver/nvidia/gpus on Linux, which requires root permissions. Also update gpustat to 1.1 to pick up a fix for wookayin/gpustat#142.
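For reference, a minimal sketch of what gpustat-based detection looks like (the function name and return shape here are illustrative, not the exact code in this diff):

import gpustat

def detect_nvidia_gpus():
    # Query NVML through gpustat; no root access needed, unlike reading
    # /proc/driver/nvidia/gpus.
    try:
        stats = gpustat.new_query()
    except Exception:
        return 0, None  # no NVIDIA driver / NVML library on this node
    gpus = list(stats)
    model = gpus[0].name if gpus else None
    return len(gpus), model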

Related issue number

Toward #28064 (this comment, about fixing the dashboard server's check of / disk usage, which also requires root permissions, is not part of this PR and should probably be a separate issue)

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@mattip
Contributor Author

mattip commented May 21, 2023

Some of the errors do not seem to be related to my PR. Is there a "known good commit" I can test against?

@RobFirth

RobFirth commented Jul 6, 2023

I'm interested in the progress of this PR as I'm having issues with permissions i.e. #28064
It looks like this is basically awaiting review?

@mattip
Contributor Author

mattip commented Jul 18, 2023

I rebased this to clear merge conflicts; I think it is ready for review.

@jjyao
Collaborator

jjyao commented Sep 6, 2023

@mattip

Sorry for missing this one. Currently the team is busy with Ray Summit; we will review after that.

@mattip
Contributor Author

mattip commented Oct 5, 2023

Rebased, in the hope that someone will review.

@rkooo567 rkooo567 self-assigned this Oct 5, 2023
@rkooo567
Contributor

rkooo567 commented Oct 5, 2023

I will take a look at it soon

@thoraxe

thoraxe commented Oct 24, 2023

+1 to this. By default, containers can't run as root in OpenShift environments, so a GPU detection mechanism that doesn't require root is important.

@jjyao
Collaborator

jjyao commented Nov 2, 2023

@mattip sorry for the late review. This looks great to me. Could you rebase with master, since I moved the auto-detection code to nvidia_gpu.py?

@mattip
Contributor Author

mattip commented Nov 2, 2023

I will try, but it is frustrating to have to do this work twice. I hope this time it gets a review.

Signed-off-by: mattip <matti.picus@gmail.com>
Signed-off-by: mattip <matti.picus@gmail.com>
@mattip
Contributor Author

mattip commented Nov 2, 2023

At least one of the failures is connected to this PR. I think it is due to the mocking?

::test_gpu_info_parsing 2023-11-02 20:18:07,551	ERROR nvidia_gpu.py:66 -- Could not parse gpu information.
Traceback (most recent call last):
  File "C:\Miniconda3\lib\site-packages\pynvml.py", line 1975, in _LoadNvmlLibrary
    nvmlLib = CDLL("libnvidia-ml.so.1")
  File "C:\Miniconda3\lib\ctypes\__init__.py", line 381, in __init__
    self._handle = _dlopen(self._name, mode)
FileNotFoundError: Could not find module 'libnvidia-ml.so.1' (or one of its dependencies). Try using the full path with constructor syntax.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\install\ray\python\ray\_private\accelerators\nvidia_gpu.py", line 59, in get_current_node_accelerator_type
    gpu_list = gpustat.new_query()
  File "C:\Miniconda3\lib\site-packages\gpustat\core.py", line 745, in new_query
    return GPUStatCollection.new_query()
  File "C:\Miniconda3\lib\site-packages\gpustat\core.py", line 402, in new_query
    N.nvmlInit()
  File "C:\Miniconda3\lib\site-packages\pynvml.py", line 1947, in nvmlInit
    nvmlInitWithFlags(0)
  File "C:\Miniconda3\lib\site-packages\pynvml.py", line 1930, in nvmlInitWithFlags
    _LoadNvmlLibrary()
  File "C:\Miniconda3\lib\site-packages\pynvml.py", line 1977, in _LoadNvmlLibrary
    _nvmlCheckReturn(NVML_ERROR_LIBRARY_NOT_FOUND)
  File "C:\Miniconda3\lib\site-packages\pynvml.py", line 899, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_LibraryNotFound: NVML Shared Library Not Found
+++ Error creating PyTest summary
[Errno 2] No such file or directory: '::test_gpu_info_parsing.txt'
FAILED

================================== FAILURES ===================================
____________________________ test_gpu_info_parsing ____________________________

mock_listdir = <MagicMock name='listdir' id='2045615548880'>
mock_isdir = <MagicMock name='isdir' id='2045615601360'>
mock_find_spec = <MagicMock name='find_spec' id='2045615679040'>

    @patch("importlib.util.find_spec", return_value=False)
    @patch("os.path.isdir", return_value=True)
    @patch("os.listdir", return_value=["1"])
    @patch("sys.platform", "linux")
    def test_gpu_info_parsing(mock_listdir, mock_isdir, mock_find_spec):
        info_string = """Model:           Tesla V100-SXM2-16GB
    IRQ:             107
    GPU UUID:        GPU-8eaaebb8-bb64-8489-fda2-62256e821983
    Video BIOS:      88.00.4f.00.09
    Bus Type:        PCIe
    DMA Size:        47 bits
    DMA Mask:        0x7fffffffffff
    Bus Location:    0000:00:1e.0
    Device Minor:    0
    Blacklisted:     No
        """
        with patch("builtins.open", mock_open(read_data=info_string)):
>           assert NvidiaGPUAcceleratorManager.get_current_node_accelerator_type() == "V100"
E           AssertionError: assert None == 'V100'
E             +None
E             -'V100'
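If the mocking is indeed the culprit, one possible direction is to also stub out gpustat so the parsing test never reaches NVML on machines without libnvidia-ml. A rough sketch, where the patch target and the fake entry shape are assumptions rather than the actual fix:

from unittest.mock import MagicMock, patch

from ray._private.accelerators.nvidia_gpu import NvidiaGPUAcceleratorManager

# Pretend gpustat sees a single V100 so the code under test never calls
# into NVML (which is what raises NVMLError_LibraryNotFound on the CI box).
fake_gpu = MagicMock()
fake_gpu.name = "Tesla V100-SXM2-16GB"

with patch("gpustat.new_query", return_value=[fake_gpu]):
    assert NvidiaGPUAcceleratorManager.get_current_node_accelerator_type() == "V100"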

--



# removed: Windows-specific WMIC fallback
cmdargs = ["WMIC", "PATH", "Win32_VideoController", "GET", props]
lines = subprocess.check_output(cmdargs).splitlines()[1:]
num_gpus = len([x.rstrip() for x in lines if x.startswith(b"NVIDIA")])
# added: unified gpustat path
num_gpus = gpustat.gpu_count()
Collaborator

Current gpustat is only installed for ray[default] so I think we still need the old code that checks "/proc/driver/nvidia/gpus" for minimal installed ray?
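A rough sketch of the kind of guard this implies, assuming the /proc fallback is kept only for installs without gpustat (the function name is hypothetical):

import importlib.util
import os

def count_nvidia_gpus():
    # Prefer gpustat (available with ray[default]); it goes through NVML
    # and does not need root.
    if importlib.util.find_spec("gpustat") is not None:
        import gpustat
        return gpustat.gpu_count()
    # Minimal install: fall back to the proc filesystem on Linux.
    # Note: this is the path that can require root permissions (#28064).
    proc_dir = "/proc/driver/nvidia/gpus"
    if os.path.isdir(proc_dir):
        return len(os.listdir(proc_dir))
    return 0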

Contributor Author

Would it be acceptable to make gpustat an unconditional dependency for working with GPUs and Ray? That code is very fragile.

Collaborator

It might be hard since it has external dependencies as well:

install_requires = [
    'nvidia-ml-py>=12.535.108',  # see #107, #143, #161
    'psutil>=5.6.0',    # GH-1447
    'blessed>=1.17.1',  # GH-126
    'typing_extensions',
]

Collaborator

Should we copy the auto-detect code that PyTorch has in torch.cuda.device_count()? I think it doesn't depend on GPUtil or gpustat.
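For comparison, a minimal sketch of counting devices with pynvml directly, which is roughly what torch.cuda.device_count() bottoms out in and the direction #41020 later took (error handling simplified):

import pynvml

def pynvml_device_count():
    try:
        pynvml.nvmlInit()  # loads libnvidia-ml.so.1 / nvml.dll via ctypes
    except pynvml.NVMLError:
        return 0  # driver or NVML library not present on this machine
    try:
        return pynvml.nvmlDeviceGetCount()
    finally:
        pynvml.nvmlShutdown()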

Contributor Author

That goes to this code, which eventually uses ctypes and libnvidia-ml.so.1. How does this work on Windows and macOS?

@jjyao
Collaborator

jjyao commented Nov 3, 2023

I will try, but it is frustrating to have to do this work twice. I hope this time it gets a review.

Completely understand the frustration. Sorry again for the late review.

@mattip
Contributor Author

mattip commented Nov 8, 2023

#41020 takes a different (and probably better) approach of using pynvml directly.

@jjyao
Collaborator

jjyao commented Nov 14, 2023

Replaced by #41020

@jjyao jjyao closed this Nov 14, 2023