add violation metrics: 1.8 version changes adding partition-level violation stats #33
Merged
Hardware GPU without violation enabled:
```bash
[root@exporter-amdgpu-metrics-exporter-bjmpj ~]# gpuctl show gpu all
--- snipped ---
VRAM usage:
Total VRAM (in MB) : 196592
Used VRAM (in MB) : 1525
Free VRAM (in MB) : 195067
Total visible VRAM (in MB) : 196592
Used visible VRAM (in MB) : 1525
Free visible VRAM (in MB) : 195067
Total GTT (in MB) : 773918
Used GTT (in MB) : 26
Free GTT (in MB) : 773892
Accumulated energy consumed (in uJ) : 445856595574784.00
GFX activity accumulated : 189101231
Memory activity accumulated : 4132638
Link 2 data read (in KB) : 1
Link 2 data written (in KB) : 1
Link 3 data read (in KB) : 1
Link 3 data written (in KB) : 1
Link 4 data read (in KB) : 1
Link 4 data written (in KB) : 1
Link 5 data read (in KB) : 1
Link 5 data written (in KB) : 1
Link 6 data read (in KB) : 1
Link 6 data written (in KB) : 1
Link 7 data read (in KB) : 1
Link 7 data written (in KB) : 1
Link 8 data read (in KB) : 1
Link 8 data written (in KB) : 1
Current accumulated counter : 2467541803
Processor hot residency accumulated : 0
PPT residency accumulated : 297918
Socket thermal residency accumulated : 3295
VR thermal residency accumulated : 0
HBM thermal residency accumulated : 0
Processor hot residency percentage : 0%
PPT residency percentage : 0%
Socket thermal residency percentage : 0%
VR thermal residency percentage : 0%
HBM thermal residency percentage : 0%
[root@exporter-amdgpu-metrics-exporter-bjmpj ~]# gpuctl show gpu -y
violationstats:
currentaccumulatedcounter: 2468312606
processorhotresidencyaccumulated: 0
pptresidencyaccumulated: 297918
socketthermalresidencyaccumulated: 3295
vrthermalresidencyaccumulated: 0
hbmthermalresidencyaccumulated: 0
processorhotresidencypercentage: 0
pptresidencypercentage: 0
socketthermalresidencypercentage: 0
vrthermalresidencypercentage: 0
hbmthermalresidencypercentage: 0
gfxbelowhostlimitpoweraccumulated:
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
gfxbelowhostlimitthmaccumulated:
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
gfxlowutilizationaccumulated:
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
gfxbelowhostlimittotalaccumulated:
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
gfxbelowhostlimitpowerpercentage:
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
gfxbelowhostlimitthmpercentage:
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
gfxlowutilizationpercentage:
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
gfxbelowhostlimittotalpercentage:
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
- 18446744073709551615
xxx_nounkeyedliteral: {}
xxx_unrecognized: []
xxx_sizecache: 0
xxx_nounkeyedliteral: {}
xxx_unrecognized: []
xxx_sizecache: 0
xxx_nounkeyedliteral: {}
xxx_unrecognized: []
xxx_sizecache: 0
```
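In the hardware output above, every per-partition field reports 18446744073709551615, which is 2^64 - 1, the largest unsigned 64-bit value. The sketch below assumes this value is a "not available" sentinel rather than a real reading; the PR does not state that explicitly, and the helper names `mask_unavailable` and `has_partition_stats` are hypothetical, not part of gpuctl or the exporter.

```python
# Assumption: 2**64 - 1 marks a per-partition stat that the GPU did not report.
UINT64_MAX = 2**64 - 1  # 18446744073709551615


def mask_unavailable(values):
    """Return a copy of values with the assumed sentinel replaced by None."""
    return [None if v == UINT64_MAX else v for v in values]


def has_partition_stats(values):
    """True if at least one partition carries something other than the sentinel."""
    return any(v != UINT64_MAX for v in values)


if __name__ == "__main__":
    # Values copied from the gfxbelowhostlimitpoweraccumulated list above.
    hw_values = [18446744073709551615] * 8
    print(mask_unavailable(hw_values))
    print(has_partition_stats(hw_values))
```

A consumer of the YAML output could apply such a check before exporting a metric, so that the sentinel does not show up as an enormous counter value.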
GPU mock:
```bash
[root@b15a1a105561 /]# gpuctl show gpu all
Id : 0003ac11-420f-4242-0f51-d89b47c87169 (15)
--- snipped ---
Accumulated energy consumed (in uJ) : 25293978861511.00
Current accumulated counter : 123478
Processor hot residency accumulated : 23443
PPT residency accumulated : 34523
Socket thermal residency accumulated : 45687
VR thermal residency accumulated : 56753
HBM thermal residency accumulated : 67869
Processor hot residency percentage : 7%
PPT residency percentage : 22%
Socket thermal residency percentage : 41%
VR thermal residency percentage : 44%
HBM thermal residency percentage : 8%
GPU GFX clock host limit accumulated:
Power : 1225 1219 1241 1231 1171 1255 1221 1235
Thermal : 2285 2299 2348 2368 2430 2403 2285 2393
Low Utilization : 3501 3426 3465 3414 3409 3516 3442 3446
Total : 4561 4620 4609 4535 4599 4599 4602 4565
GPU GFX clock host limit percentage:
Power : 19% 90% 10% 90% 43% 17% 46% 54%
Thermal : 65% 27% 20% 48% 64% 34% 28% 36%
Low Utilization : 27% 50% 53% 1% 65% 65% 82% 12%
Total : 59% 20% 34% 68% 39% 49% 35% 62%
```
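The per-partition rows in the mock output above put one value per partition after the label. A minimal sketch of reading such a row back into numbers follows; `parse_partition_row` is a hypothetical helper for illustration, and it assumes the `label : v1 v2 ...` layout shown here, with an optional trailing `%` on each value.

```python
def parse_partition_row(line):
    """Split 'Label : v1 v2 ...' into (label, [int, ...]).

    Only meant for the per-partition rows (Power, Thermal, Low Utilization,
    Total); a section header ending in ':' yields an empty list.
    """
    label, _, rest = line.partition(":")
    values = [int(token.rstrip("%")) for token in rest.split()]
    return label.strip(), values


if __name__ == "__main__":
    # Rows copied from the mock output above.
    print(parse_partition_row("Power : 1225 1219 1241 1231 1171 1255 1221 1235"))
    print(parse_partition_row("Low Utilization : 27% 50% 53% 1% 65% 65% 82% 12%"))
```

Each row yields eight values, matching the eight entries per list in the YAML output of the hardware run.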
(cherry picked from commit 788653d006e52ea9513972792222e4b4ac37d00a)
sarat-k (Collaborator) approved these changes on Nov 7, 2025 and left a comment:
LGTM
compilation successful
-labsl_cordz_functions -labsl_exponential_biased -labsl_status -labsl_statusor -labsl_flags_commandlineflag_internal -labsl_flags_commandlineflag -labsl_flags_private_handle_accessor -labsl_flags_marshalling -labsl_flags_program_name -labsl_flags_config -labsl_flags_internal -labsl_flags_reflection -labsl_flags -labsl_flags_usage_internal -labsl_flags_usage -labsl_flags_parse -lpthread -lz -lm -lrt -ldl -l:libev.a -l:libzmq.a -l:libssl.a -l:libcrypto.a -o /usr/src/github.com/ROCm/gpu-agent/sw/nic/build/x86_64/sim/bin/gpuagent_mock
make gpuctl
make[1]: Entering directory '/usr/src/github.com/ROCm/gpu-agent/sw/nic/gpuagent'
building gpuctl
CGO_ENABLED=0 go build -C cli -o /usr/src/github.com/ROCm/gpu-agent/sw/nic/build/x86_64/sim/bin/gpuctl