Report GPU power limit using the correct NVML API #446

Merged: 1 commit, Jul 11, 2023

@amarathe84 (Collaborator) commented Jul 11, 2023

Description

This PR replaces the nvmlDeviceGetPowerManagementLimit() call with nvmlDeviceGetEnforcedPowerLimit(). The former API does not report the updated GPU power limit if the limit has been modified via an out-of-band interface. For example, if the GPU power limit is implicitly updated by the node power limiting interface on Lassen, the new limit is not reflected in the output of the former API. The latter API, nvmlDeviceGetEnforcedPowerLimit(), captures changes to the GPU power limit applied through both the in-band interface (e.g., nvidia-smi or NVML) and the out-of-band interface (e.g., the IBM node power limit interface).
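The pattern of querying the limit and checking the return code can be sketched as follows. This is a minimal illustration, not the code in this PR: `power_limit_query_fn`, `get_power_limit_watts`, and the two `fake_query_*` functions are hypothetical names, and the function pointer stands in for the NVML call so the sketch compiles without `nvml.h`. It assumes only what NVML documents: the query writes the limit in milliwatts and returns 0 (`NVML_SUCCESS`) on success.

```c
#include <stdio.h>

/* Hypothetical stand-in for an NVML power-limit query. With real NVML, a thin
 * adapter with this shape would call nvmlDeviceGetEnforcedPowerLimit(), which
 * writes the limit in milliwatts and returns NVML_SUCCESS (0) on success. */
typedef int (*power_limit_query_fn)(void *device, unsigned int *limit_mw);

/* Query the limit and convert it to watts.
 * Returns 0 on success; returns -1 and leaves *limit_w untouched on failure. */
static int get_power_limit_watts(power_limit_query_fn query, void *device,
                                 double *limit_w)
{
    unsigned int limit_mw = 0;
    int rc;

    if (query == NULL || limit_w == NULL)
    {
        return -1;
    }

    rc = query(device, &limit_mw);
    if (rc != 0)
    {
        /* Failure check: do not report a stale or uninitialized value. */
        fprintf(stderr, "power limit query failed with code %d\n", rc);
        return -1;
    }

    *limit_w = limit_mw / 1000.0; /* milliwatts to watts */
    return 0;
}

/* Illustrative fake queries so the sketch can be exercised without a GPU. */
static int fake_query_300w(void *device, unsigned int *limit_mw)
{
    (void)device;
    *limit_mw = 300000; /* 300 W, the default limit mentioned in the tests */
    return 0;
}

static int fake_query_error(void *device, unsigned int *limit_mw)
{
    (void)device;
    (void)limit_mw;
    return 1; /* any nonzero code is treated as a failure */
}
```

Passing the query as a function pointer is only a convenience for the sketch; in a real build the wrapper would call the NVML function directly and could print `nvmlErrorString()` for the failing return code.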

Fixes #445

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

This fix has been tested on Lassen. The following test cases were run:

  • Test 0: Report default GPU power limit. Result: Pass
  • Test 1: Report GPU power limit after changing it with nvidia-smi (i.e., in-band). Result: Pass
  • Test 2: Report GPU power limit after resetting it to default (300 W) with nvidia-smi (in-band). Result: Pass
  • Test 3: Report GPU power limit after implicitly changing it with the IBM node power limit interface (out-of-band). Result: Pass
  • Test 4: Report GPU power limit after implicitly resetting it with IBM node power limit interface (out-of-band). Result: Pass

Checklist:

  • I have run ./scripts/check-code-format.sh and confirm my code follows the style guidelines of Variorum
  • My changes generate no new warnings (build with -DENABLE_WARNINGS=ON)
  • New and existing unit tests pass with my changes
  • Added failure checks previously missing for the API call

Thank you for taking the time to contribute to Variorum!

@amarathe84 added the status-ready-for-review and type-bug labels Jul 11, 2023
@tpatki merged commit 861a155 into LLNL:dev Jul 11, 2023
@tpatki added this to the Production: v0.8.0 Release milestone Jul 11, 2023
Linked issue that merging this pull request may close: GPU power limit not reported correctly when the limit is enforced out-of-band