Users want some tool to monitor their GPUs. Currently there are no good options:

- XPUM is officially supported (that is, tested) only with data center GPUs, and as a result:
  - Its binary releases rely on Intel repo packages, which can conflict with distro packages
  - Its latest container release is too old to support Xe / the latest GPUs
- collectd v6-RC includes a Sysman plugin, but has neither a final release nor binary releases
  - And its development has completely stalled
- Neither of them is included in any distro
To fix that, I'm proposing that the `zello_sysman` binary be installed when compute-runtime is built, and that it be included in its release packages. That way it should eventually become available in the distros as well.

While its output is not as nicely laid out as that of `xpu-smi`, it does provide all the metrics available from the L0 backend.
There are a few things that could be done to productize it better for end users:

- Add a manual page (I could help with that)
- Change the help output a bit to indicate that it outputs metrics
  - e.g. `selectively run fan black box test` -> `run fan tests and provide resulting metrics`
- Maybe rename it to `ze_sysman_tool` or something
(Its source code is not that large, so another option would be to include it in the `doc/` dir as an L0 usage example.)
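To give an idea of what a minimal L0 usage example could look like, here is a rough sketch (not taken from `zello_sysman`) that counts the devices visible through Sysman. It assumes the Level Zero loader is installed under one of the guessed library names and that it exports `zesInit`, `zesDriverGet` and `zesDeviceGet`. The helper names `load_ze_loader` and `count_sysman_devices` are made up for illustration.

```python
import ctypes
import ctypes.util

ZE_RESULT_SUCCESS = 0  # ze_result_t value for success


def load_ze_loader():
    """Return the Level Zero loader as a ctypes library, or None if not found."""
    # Assumption: the loader is installed under one of these names.
    candidates = [
        ctypes.util.find_library("ze_loader"),
        "libze_loader.so.1",
        "ze_loader.dll",
    ]
    for name in candidates:
        if not name:
            continue
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None


def count_sysman_devices(lib):
    """Return the number of Sysman devices over all drivers, or None on failure."""
    try:
        zes_init = lib.zesInit
        driver_get = lib.zesDriverGet
        device_get = lib.zesDeviceGet
    except AttributeError:
        return None  # library lacks the zesInit-based Sysman entry points

    u32_ptr = ctypes.POINTER(ctypes.c_uint32)
    handle_ptr = ctypes.POINTER(ctypes.c_void_p)
    zes_init.argtypes = [ctypes.c_uint32]
    driver_get.argtypes = [u32_ptr, handle_ptr]
    device_get.argtypes = [ctypes.c_void_p, u32_ptr, handle_ptr]
    for func in (zes_init, driver_get, device_get):
        func.restype = ctypes.c_uint32

    if zes_init(0) != ZE_RESULT_SUCCESS:
        return None

    # First call with a NULL array only queries the driver count.
    n_drivers = ctypes.c_uint32(0)
    if driver_get(ctypes.byref(n_drivers), None) != ZE_RESULT_SUCCESS:
        return None
    if n_drivers.value == 0:
        return 0

    # Second call fills in the driver handles.
    drivers = (ctypes.c_void_p * n_drivers.value)()
    if driver_get(ctypes.byref(n_drivers), drivers) != ZE_RESULT_SUCCESS:
        return None

    total = 0
    for handle in drivers:
        n_devices = ctypes.c_uint32(0)
        if device_get(handle, ctypes.byref(n_devices), None) != ZE_RESULT_SUCCESS:
            return None
        total += n_devices.value
    return total


if __name__ == "__main__":
    loader = load_ze_loader()
    if loader is None:
        print("Level Zero loader not found")
    else:
        print("Sysman devices:", count_sysman_devices(loader))
```

Actual metrics (temperature, fan, power, ...) would need the per-domain Sysman enumeration calls and their structs on top of this, which is what makes a ready-made tool more attractive for end users than writing such code themselves.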