Provide "zello_sysman" tool with binary releases #787

Open
eero-t opened this issue Dec 19, 2024 · 0 comments
Labels
Feature request New driver feature L0 Sysman Issue related to L0 Sysman

Comments

eero-t commented Dec 19, 2024

Users want some tool to monitor their GPUs.

Currently there are no good options:

  • XPUM officially supports (is tested with) only data center GPUs and, as a result, its:
    • Binary releases rely on Intel repo packages which can conflict with distro packages
    • Latest container release is too old to support Xe / latest GPUs
  • collectd v6-RC includes a Sysman plugin, but has no final release and no binary releases
    • And its development has completely stalled
  • Neither of them is included in any distro

To fix that, I propose installing the zello_sysman binary when compute-runtime is built and including it in the release packages. That way, it should eventually become available in the distros as well.

While its output is not as nicely laid out as that of xpu-smi, it does provide all the metrics available from the L0 backend.

There are a few things that could be done to productize it better for end users:

  • Add a manual page (I could help with that)
  • Change the help output a bit to indicate that it outputs metrics
    • e.g. selectively run fan black box test -> run fan tests and provide resulting metrics
  • Maybe rename it to ze_sysman_tool or something similar

(Its source code is not that large, so another option would be to include it in the doc/ dir as an L0 usage example.)

@JablonskiMateusz JablonskiMateusz added Feature request New driver feature L0 Sysman Issue related to L0 Sysman labels Dec 20, 2024