Add initial support for monitoring GPUs on Linux #1998
Sample metrics from a Vega64:
# HELP node_drm_card_info Card information
# TYPE node_drm_card_info gauge
node_drm_card_info{card="card0",memory_vendor="samsung",power_performance_level="manual",unique_id="1234567890",vendor="amd"} 1
# HELP node_drm_gpu_busy_percent How busy the GPU is as a percentage.
# TYPE node_drm_gpu_busy_percent gauge
node_drm_gpu_busy_percent{card="card0"} 10
# HELP node_drm_memory_gtt_size_bytes The size of the graphics translation table (GTT) block in bytes.
# TYPE node_drm_memory_gtt_size_bytes gauge
node_drm_memory_gtt_size_bytes{card="card0"} 8.573157376e+09
# HELP node_drm_memory_gtt_used_bytes The used amount of the graphics translation table (GTT) block in bytes.
# TYPE node_drm_memory_gtt_used_bytes gauge
node_drm_memory_gtt_used_bytes{card="card0"} 1.48447232e+08
# HELP node_drm_memory_vis_vram_size_bytes The size of visible VRAM in bytes.
# TYPE node_drm_memory_vis_vram_size_bytes gauge
node_drm_memory_vis_vram_size_bytes{card="card0"} 2.68435456e+08
# HELP node_drm_memory_vis_vram_used_bytes The used amount of visible VRAM in bytes.
# TYPE node_drm_memory_vis_vram_used_bytes gauge
node_drm_memory_vis_vram_used_bytes{card="card0"} 1.13287168e+08
# HELP node_drm_memory_vram_size_bytes The size of VRAM in bytes.
# TYPE node_drm_memory_vram_size_bytes gauge
node_drm_memory_vram_size_bytes{card="card0"} 8.573157376e+09
# HELP node_drm_memory_vram_used_bytes The used amount of VRAM in bytes.
# TYPE node_drm_memory_vram_used_bytes gauge
node_drm_memory_vram_used_bytes{card="card0"} 1.773531136e+09
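To show how these gauges might be consumed, here is a minimal Python sketch that parses exposition text like the sample above and derives the fraction of VRAM in use. The helper names `parse_samples` and `vram_used_ratio` are hypothetical and not part of node_exporter. The parser is deliberately simplistic: it assumes one sample per line with no trailing timestamp.

```python
def parse_samples(text):
    """Parse Prometheus text-format samples into {(name, labels): value}.

    Simplified on purpose: assumes no trailing timestamps on sample lines.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        samples[(name, labels.rstrip("}"))] = float(value)
    return samples


def vram_used_ratio(samples, card="card0"):
    """Return used VRAM divided by total VRAM for one card."""
    labels = f'card="{card}"'
    used = samples[("node_drm_memory_vram_used_bytes", labels)]
    size = samples[("node_drm_memory_vram_size_bytes", labels)]
    return used / size


# Values copied from the Vega64 sample in this thread.
SAMPLE = """\
# HELP node_drm_memory_vram_size_bytes The size of VRAM in bytes.
# TYPE node_drm_memory_vram_size_bytes gauge
node_drm_memory_vram_size_bytes{card="card0"} 8.573157376e+09
# HELP node_drm_memory_vram_used_bytes The used amount of VRAM in bytes.
# TYPE node_drm_memory_vram_used_bytes gauge
node_drm_memory_vram_used_bytes{card="card0"} 1.773531136e+09
"""

if __name__ == "__main__":
    ratio = vram_used_ratio(parse_samples(SAMPLE))
    print(f"VRAM used: {ratio:.1%}")
```

With the sample values, the ratio comes out at a little under 21% of VRAM in use. In a real setup the same division would more likely be written as a PromQL expression over the two gauges.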
FreeBSD supports the same Linux driver, but I'm not sure whether it exposes the DRM information through sysfs.
Nice! But we should move the parsing to https://github.com/prometheus/procfs. Can you submit a PR there? That'd be great!
@discordianfish prometheus/procfs#370
@SuperQ when will the next
@discordianfish this is refactored and ready for review 😄
I think the only required change would be to set the
Expose GPU metrics using `sysfs/drm`.
`amdgpu` is the only driver which exposes this information through DRM.

Signed-off-by: Siavash Safi <siavash.safi@gmail.com>
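As a rough illustration of what reading GPU data from `sysfs/drm` involves, the Python sketch below walks a DRM class directory and reads a few per-card attributes. It is not the collector's implementation, which relies on the parsing moved to prometheus/procfs. The attribute file names (`gpu_busy_percent`, `mem_info_vram_total`, `mem_info_vram_used`) are assumed to match what `amdgpu` exposes under `device/`, and `read_drm_cards` and the result keys are hypothetical names.

```python
import re
from pathlib import Path

# Match card directories only; skip connector entries such as card0-DP-1.
CARD_DIR = re.compile(r"^card\d+$")

# Hypothetical result key -> assumed amdgpu sysfs attribute under device/.
ATTRIBUTES = {
    "gpu_busy_percent": "gpu_busy_percent",
    "vram_size_bytes": "mem_info_vram_total",
    "vram_used_bytes": "mem_info_vram_used",
}


def read_drm_cards(root="/sys/class/drm"):
    """Return {card: {key: int}} for every card exposing at least one attribute."""
    cards = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return cards  # no DRM class directory on this system
    for entry in sorted(root_path.iterdir()):
        if not CARD_DIR.match(entry.name):
            continue
        values = {}
        for key, attr in ATTRIBUTES.items():
            try:
                values[key] = int((entry / "device" / attr).read_text().strip())
            except (OSError, ValueError):
                continue  # attribute missing or unreadable for this driver
        if values:
            cards[entry.name] = values
    return cards


if __name__ == "__main__":
    for card, values in read_drm_cards().items():
        print(card, values)
```

Cards whose driver does not provide these files simply produce no entry, which mirrors the observation that only `amdgpu` exposes this information through DRM.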
Awesome, thanks!
Any update on this?
@SuperQ Can you take a look?
LGTM, thanks!
NOTE: In order to support globs in the textfile collector path, filenames exposed by `node_textfile_mtime_seconds` now contain the full path name.

* [CHANGE] Add path label to rapl collector #2146
* [CHANGE] Exclude filesystems under /run/credentials #2157
* [FEATURE] Add lnstat collector for metrics from /proc/net/stat/ #1771
* [FEATURE] Add darwin powersupply collector #1777
* [FEATURE] Add support for monitoring GPUs on Linux #1998
* [FEATURE] Add Darwin thermal collector #2032
* [FEATURE] Add os release collector #2094
* [FEATURE] Add netdev.address-info collector #2105
* [ENHANCEMENT] Support glob textfile collector directories #1985
* [ENHANCEMENT] ethtool: Expose node_ethtool_info metric #2080
* [ENHANCEMENT] Use include/exclude flags for ethtool filtering #2165
* [ENHANCEMENT] Add flag to disable guest CPU metrics #2123
* [ENHANCEMENT] Add DMI collector #2131
* [ENHANCEMENT] Add threads metrics to processes collector #2164
* [ENHANCEMENT] Reduce timer GC delays in the Linux filesystem collector #2169
* [BUGFIX] ethtool: Sanitize metric names #2093
* [BUGFIX] Fix ethtool collector for multiple interfaces #2126
* [BUGFIX] Fix possible panic on macOS #2133
* [BUGFIX] Collect flag_info and bug_info only for one core #2156

Signed-off-by: Ben Kochie <superq@gmail.com>