Report GPU power limit using the correct NVML API #446

amarathe84 · 2023-07-11T15:26:33Z

Description

This PR replaces the nvmlDeviceGetPowerManagementLimit() call with nvmlDeviceGetEnforcedPowerLimit(). The former API does not report the updated GPU power limit if the limit has been modified via an out-of-band interface. For example, if the GPU power limit is implicitly updated by the node power limiting interface on Lassen, the change is not reflected in the output of the former API. The latter API (i.e., nvmlDeviceGetEnforcedPowerLimit()) captures changes in the GPU power limit applied through both the in-band interface (e.g., nvidia-smi or NVML) and the out-of-band interface (e.g., the IBM node power limit interface).
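
For illustration, the shape of the change can be sketched as below. This is a minimal sketch, not the code in this PR: the helper name get_enforced_power_limit_watts and the SKETCH_USE_REAL_NVML macro are hypothetical, and the stand-in declarations exist only so the snippet compiles on a machine without the NVML headers. The sketch assumes the limit is returned in milliwatts, as the NVML documentation states, and converts it to watts.

```c
#include <assert.h>
#include <stdio.h>

#ifdef SKETCH_USE_REAL_NVML /* hypothetical macro: define it to build against real NVML */
#include <nvml.h>
#else
/* Stand-ins that mimic the NVML declarations so the sketch compiles
 * without nvml.h. They are NOT the real library. */
typedef int nvmlReturn_t;
typedef void *nvmlDevice_t;
#define NVML_SUCCESS 0

static nvmlReturn_t nvmlDeviceGetEnforcedPowerLimit(nvmlDevice_t device,
                                                    unsigned int *limit)
{
    (void)device;
    *limit = 300000; /* pretend the enforced limit is 300 W, in milliwatts */
    return NVML_SUCCESS;
}
#endif

/* Hypothetical helper: query the enforced power limit and convert it to
 * watts. Returns 0 on success and -1 if the NVML call fails. */
int get_enforced_power_limit_watts(nvmlDevice_t device, double *watts)
{
    unsigned int limit_mw = 0;
    nvmlReturn_t rc;

    assert(watts != NULL);

    /* The enforced limit reflects the limit currently in effect, whichever
     * interface last changed it. */
    rc = nvmlDeviceGetEnforcedPowerLimit(device, &limit_mw);
    if (rc != NVML_SUCCESS) {
        /* Failure check: report the error instead of using a stale value. */
        fprintf(stderr, "nvmlDeviceGetEnforcedPowerLimit failed (code %d)\n",
                (int)rc);
        return -1;
    }

    *watts = limit_mw / 1000.0; /* milliwatts to watts */
    return 0;
}
```

With the real library, the device handle would come from nvmlDeviceGetHandleByIndex() after a successful nvmlInit(); the helper itself stays the same.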

Fixes #445

Type of change

Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

This fix has been tested on Lassen. The following test cases were run:

Test 0: Report default GPU power limit. Result: Pass
Test 1: Report GPU power limit after changing it with nvidia-smi (i.e., in-band). Result: Pass
Test 2: Report GPU power limit after resetting it to default (300 W) with nvidia-smi (in-band). Result: Pass
Test 3: Report GPU power limit after implicitly changing it with the IBM node power limit interface (out-of-band). Result: Pass
Test 4: Report GPU power limit after implicitly resetting it with the IBM node power limit interface (out-of-band). Result: Pass

Checklist:

I have run ./scripts/check-code-format.sh and confirm my code follows the style guidelines of variorum
My changes generate no new warnings (build with -DENABLE_WARNINGS=ON)
New and existing unit tests pass with my changes
Added failure checks previously missing for the API call

Thank you for taking the time to contribute to Variorum!

Report GPU power limit using the correct NVML API

b55d376

amarathe84 added the status-ready-for-review (formatted, and tested on multiple systems) and type-bug labels Jul 11, 2023

tpatki merged commit 861a155 into LLNL:dev Jul 11, 2023

tpatki added this to the Production: v0.8.0 Release milestone Jul 11, 2023
