Add pynvml compatibility by monkey-patching #143

Merged
merged 4 commits into from
Dec 1, 2022

@wookayin wookayin commented Nov 27, 2022

The purpose of this PR is to relax the strict pynvml requirement (<= 11.495.46) introduced in #107.

pynvml 11.510.69 broke backward compatibility by removing `nvml.nvmlDeviceGetComputeRunningProcesses_v2` in favor of the v3 API (`nvml.nvmlDeviceGetComputeRunningProcesses_v3`), but the v3 function does not exist in NVIDIA drivers older than 510.39.01.

We therefore pinned pynvml at 11.495.46 in gpustat v1.0 (#107), but recent pynvml versions are actually required for the latest NVIDIA drivers. To make compute/graphics process information work correctly when old NVIDIA drivers (`< 510.39`) are combined with `pynvml >= 11.510.69`, we need to monkey-patch the pynvml functions ourselves so that, for instance, when pynvml switches to a v3 API that the installed driver lacks, we can simply fall back to the v2 API to retrieve the process information.

This commit adds unit tests for #107, where legacy and supported
NVIDIA drivers behave differently on process-related APIs (e.g.,
nvmlDeviceGetComputeRunningProcesses_v2).

Note: As already pointed out in #107, this test (and gpustat's process
information) fails with nvidia-ml-py > 11.495.46, which breaks backward
compatibility.
@wookayin wookayin self-assigned this Nov 27, 2022
@wookayin wookayin added this to the 1.1 milestone Nov 27, 2022
nvmlDeviceGetMemoryInfo_v2 was added in driver 510.39.01, which broke
the v1 API without backward compatibility. A matching version of
pynvml (11.510.69+) is needed to use the v2 API and get correct memory
usage information on NVIDIA drivers 510.39 or higher.

Fixes #141.
@wookayin
wookayin commented Dec 1, 2022

To fix #141, pynvml.nvmlDeviceGetMemoryInfo is also monkey-patched in efa355a.

| | NVIDIA >= 510.39 (nvmlMemory_v2) | NVIDIA < 510.39 (nvmlMemory_v1) |
|---|---|---|
| nvidia-ml-py>=11.510 | ✅ call version=nvmlMemory_v2 | ⚠️ call v2 API -> Function not found.<br>✅ Fallback to v1 API -> correct result |
| nvidia-ml-py<11.510 | ❌ call v1 API, incorrect result (#141).<br>⚠️ UserWarning printed | ✅ call v1 API, correct result |
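
The four cases in the table can be sketched as a small decision function. This is only an illustration under stated assumptions: `make_memory_info`, `FunctionNotFound`, the fake `v1`/`v2` callables, and the driver-version tuples are hypothetical and stand in for the real pynvml calls; the commit referenced above may detect the driver and binding versions differently.

```python
import warnings

# From the PR text: the v2 memory API appears in driver 510.39.
V2_MIN_DRIVER = (510, 39)


class FunctionNotFound(Exception):
    """Hypothetical stand-in for pynvml.NVMLError_FunctionNotFound."""


def make_memory_info(get_v1, get_v2=None, driver_version=(0, 0)):
    """Return a memory-info getter that follows the table's four cases.

    get_v2 is None when the installed binding is too old to know the v2 API.
    """
    def memory_info(handle):
        if get_v2 is not None:
            try:
                return get_v2(handle)      # new binding, new driver
            except FunctionNotFound:
                return get_v1(handle)      # new binding, old driver
        if driver_version >= V2_MIN_DRIVER:
            # Old binding, new driver: v1 still answers, but inaccurately.
            warnings.warn(
                "v1 memory API used with a driver that expects v2; "
                "reported memory usage may be incorrect",
                UserWarning,
            )
        return get_v1(handle)              # old binding (either driver)

    return memory_info


# Fake API functions used only to show which path is taken.
def v1(handle):
    return "v1"


def v2_ok(handle):
    return "v2"


def v2_missing(handle):
    raise FunctionNotFound("v2 is not exported by this driver")


results = {
    "new binding, new driver": make_memory_info(v1, v2_ok, (510, 39))(0),
    "new binding, old driver": make_memory_info(v1, v2_missing, (470, 0))(0),
    "old binding, old driver": make_memory_info(v1, None, (470, 0))(0),
}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    results["old binding, new driver"] = make_memory_info(v1, None, (510, 39))(0)
```

In this sketch, the old-binding, new-driver case cannot be repaired from within the program, which is why it only warns: the fix is to upgrade nvidia-ml-py.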

@wookayin wookayin merged commit 647bd34 into master Dec 1, 2022
@wookayin wookayin deleted the pynvml branch December 1, 2022 22:53
wookayin added a commit that referenced this pull request Dec 1, 2022
wookayin added a commit that referenced this pull request Dec 1, 2022