Will ARC be supported? #74
Comments
XPU Manager mainly targets Intel data center GPUs. For some missing metrics, please refer to issue 26. Which metrics are supported depends on the underlying HW, its FW, and the kernel + user-space drivers. Not every metric supported by XPU Manager is provided by every HW or driver stack. |
While XPUM is validated only for those, it uses the LevelZero Sysman API to query the metrics: https://spec.oneapi.io/level-zero/latest/sysman/api.html And the Intel GPU L0 backend releases do list ARC (DG2) as having "production" level support: https://github.com/intel/compute-runtime/ PS. Release testing for the Sysman part of L0 still seems somewhat spotty, as over the years I've noticed a couple of regressions, the latest one being: intel/compute-runtime#707 The user-space driver trying to support 3 Intel kernel GPU driver uAPIs at the same time may have something to do with it:
Driver releases are currently built with support for the first two uAPIs, but it's possible that the changes to support the last one could regress them. So in addition to the latest driver, one could also try one or two older ones, especially for HW that's been out for a while, like ARC. |
On a quick test with A770 (0x56a0) on a TGL-H host, with GuC 70.8.0 FW, using the "6.5.0-18-generic" HWE kernel (=upstream with Ubuntu patches) on Ubuntu 22.04.4 LTS, with compute-runtime "23.48.27912.11" (own build), I get the following GPU metrics from the driver:
(There may be some kernel DKMS drivers + user-space driver combo which would provide also GPU memory BW, temperature and maybe also error counters, but at least one of those will need out-of-band metrics kernel driver instead of GPU one.) PS. I'm checking these with the tester in the corresponding
( |
This is a very much needed feature. The zello_sysman command provided is not that friendly. |
@QiXuanWang Just use XPUM then? If |
I gave it a try... It seems that RAS and temperature metrics are currently not supported. Temperature is such a needed metric, imo. I opened a report on the i915 kernel driver repository.
Also, power is a surprise here. Thankfully no more 30W idle power usage. |
Those are OoB (Out of Band) metrics, i.e. not provided by I get temperature metrics for A770 both with [1] https://cgit.freedesktop.org/drm-tip/ |
Unfortunately, I still can't get temperature metrics even with the 6.9.5 kernel. I am using Arch Linux and I don't mind compiling a kernel with the patches which enables the metrics.
To me, this feels like the support belongs in the (i915) drm kernel driver rather than in the intel_pmt driver (unless the DKMS driver from the Intel repo adds a new intel_pmt driver). I am trying to find the commit which enables it. Edit: And this is where it should have been, but it isn't. |
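One way to see what the installed kernel actually exposes, independent of XPUM or Level Zero, is to look at the standard hwmon sysfs entries. The sketch below is only an illustration: it assumes the GPU driver registers its hwmon device under the name "i915" or "xe", and the function name `gpu_hwmon_sensors` is made up for this example. Whether any `temp*_input` file exists at all depends on the kernel version and HW, which is exactly the question discussed above.

```python
from pathlib import Path


def gpu_hwmon_sensors(hwmon_root="/sys/class/hwmon", driver_names=("i915", "xe")):
    """Return {hwmon_dir: {attribute: raw_value}} for hwmon devices of GPU drivers.

    Only "*_input" attributes are read. The driver names are an assumption
    about how the GPU kernel driver names its hwmon device.
    """
    found = {}
    root = Path(hwmon_root)
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        name_file = dev / "name"
        if not name_file.is_file():
            continue
        if name_file.read_text().strip() not in driver_names:
            continue
        readings = {}
        for attr in sorted(dev.glob("*_input")):
            try:
                readings[attr.name] = int(attr.read_text().strip())
            except (OSError, ValueError):
                # An attribute may exist but fail to read on this HW; skip it.
                continue
        found[dev.name] = readings
    return found
```

Calling `gpu_hwmon_sensors()` with no arguments scans the running system. If the result has entries but none of them contains a `temp1_input` key, the kernel driver in use does not export GPU temperature through hwmon. Per the hwmon sysfs convention, temperature inputs are in millidegrees Celsius.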
Are these enabled in your kernel builds?
It does:
Note that |
Yes:
hmm... I will try to build an arch linux package later. Thank you for your help! |
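Checking whether options are enabled in a kernel build can be scripted. The following is a minimal sketch, not something posted in the thread: the function name `config_states` is made up, and since the option names asked about above are not preserved here, the caller has to supply them. It parses the usual kernel config text format (`CONFIG_X=y`, `CONFIG_X=m`, `# CONFIG_X is not set`).

```python
def config_states(config_text, options):
    """Map each requested option to "y", "m", another value, "n", or "absent".

    "n" means the config explicitly says "# CONFIG_X is not set";
    "absent" means the option is not mentioned at all.
    """
    states = {opt: "absent" for opt in options}
    suffix = " is not set"
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and line.endswith(suffix):
            name = line[2:-len(suffix)]
            if name in states:
                states[name] = "n"
        elif "=" in line and not line.startswith("#"):
            name, value = line.split("=", 1)
            if name in states:
                states[name] = value
    return states
```

The config text can typically be read from `/boot/config-<kernel release>` on Ubuntu, or decompressed from `/proc/config.gz` with Python's `gzip` module on kernels that provide that file.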
Just adding here that having ARC support would be greatly appreciated as I use an Arc card to develop on before trying to run on the MAX 1550. Or, maybe at least have plans to support BattleMage GPUs whenever they are released? |
As commented above, XPUM should work fine with Arc. What metrics are available depends on what FW / kernel / L0 driver versions are installed.
For Max, you need to use kernel and user-space drivers from Intel's driver repository: https://dgpu-docs.intel.com/driver/installation.html
They should also work with XPUM, as long as you have the correct kernel + user-space drivers installed. |
We recently got an Arc B580 the What is this Even if I am able to install Coming from a single Without On 6.12.4 kernel xanmod with Ubuntu 24.04 os. |
@Qubitium I've never heard of package named Where you got pytorch and what
XPU releases seem to be built against Intel repo packages. I added separate #89 bug about building them against distro driver packages. But all related Intel SW is open source, so you could also file bug against the distro you're using, so that it adds package for the missing project ( I'm not sure what would be the best workaround in the meanwhile:
One temporary solution could be to install distro level-zero dev package (
[1] Background info

The reason you do not see this "mess" with Nvidia is that Nvidia (CUDA) drivers are proprietary, so distros do not include their own versions of them. That means things will not work out of the box, but the user can install the driver from Nvidia after accepting its license.

Intel driver projects use their own names for the driver packages they release. However, when distros eventually packaged those drivers, they chose different names for their own packages, and in some cases also for the library binaries compiled from those sources (sadly, this happened with both Debian/Ubuntu and Fedora/RHEL). Packages in the Intel repo naturally use Intel package names for their dependencies, and distro packages use distro-specific names for theirs.

Additionally, distro driver packages are built against the upstream kernel uAPI, whereas Intel repository drivers are (or at least have been) built against the out-of-tree uAPI provided by the DKMS kernel driver in the Intel repository, which appeared earlier and was for a long time more extensive. While that kernel API difference is nowadays mostly relevant for the media driver, as a general rule you should not mix (kernel and user-space) packages from different repositories, as they might not be configured with support for the uAPI used by a kernel driver from another repository. |
@eero-t Thanks for the quick reply. Here is my intel repos in cuda-ubuntu2404-x86_64.list
# deb [signed-by=/usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg] https://apt.repos.intel.com/intel-for-pytorch-gpu-dev all main
intel-for-pytorch-gpu-dev.list
# https://repositories.intel.com/gpu/ubuntu noble client
intel-gpu-noble.list
## deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main
oneAPI.list
ubuntu.sources
Info on
dpkg -l | grep intel-gpu-compute
ii intel-gpu-compute 2441.19.0-2~24.04 amd64 Install Intel GPU compute runtime packages
apt policy intel-gpu-compute
intel-gpu-compute:
Installed: 2441.19.0-2~24.04
Candidate: 2441.19.0-2~24.04
Version table:
*** 2441.19.0-2~24.04 500
500 https://repositories.intel.com/gpu/ubuntu noble/client amd64 Packages
100 /var/lib/dpkg/status
2437.26.0-2~24.04 500
500 https://repositories.intel.com/gpu/ubuntu noble/client amd64 Packages
2423.31.0-2~24.04 500
500 https://repositories.intel.com/gpu/ubuntu noble/client amd64 Packages
output of:
|
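Since the advice earlier in the thread is not to mix packages from different repositories, it can help to check mechanically which repository the installed version of each driver package comes from. The sketch below parses the text printed by `apt policy <package>`, as in the output above. The function name `installed_origins` is made up for this example, and the parsing assumes the English-locale layout shown in this thread.

```python
import re

# "*** 2441.19.0-2~24.04 500" or "2437.26.0-2~24.04 500"
VERSION_LINE = re.compile(r"^\s*(\*\*\*)?\s*(\S+)\s+(-?\d+)\s*$")
# "500 https://... noble/client amd64 Packages" or "100 /var/lib/dpkg/status"
SOURCE_LINE = re.compile(r"^\s*(-?\d+)\s+((?:[a-z][a-z0-9+.-]*:|/)\S.*)$")


def installed_origins(policy_text):
    """Return (installed_version, [source lines listed for that version])."""
    installed = None
    current = None   # version whose source lines are currently being read
    sources = {}
    for line in policy_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Installed:"):
            installed = stripped.split(":", 1)[1].strip()
            continue
        src = SOURCE_LINE.match(line)
        if src and current is not None:
            sources[current].append(src.group(2))
            continue
        ver = VERSION_LINE.match(line)
        if ver:
            current = ver.group(2)
            sources.setdefault(current, [])
    return installed, sources.get(installed, [])
```

Running this over the policy output of each kernel and user-space driver package gives a quick view of whether they all come from the same repository. An entry of `/var/lib/dpkg/status` only means the version is installed locally.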
Note that I am able to now install
dmesg output with relevant info for intel arc b580
sudo dmesg | grep "96:00"
[ 4.440816] pci 0000:96:00.0: [8086:e20b] type 00 class 0x030000 PCIe Endpoint
[ 4.440834] pci 0000:96:00.0: BAR 0 [mem 0xe4000000-0xe4ffffff 64bit]
[ 4.440847] pci 0000:96:00.0: BAR 2 [mem 0xe13f800000000-0xe13fbffffffff 64bit pref]
[ 4.440870] pci 0000:96:00.0: ROM [mem 0xe5000000-0xe51fffff pref]
[ 4.440954] pci 0000:96:00.0: PME# supported from D0 D3hot
[ 6.518118] pci 0000:96:00.0: vgaarb: bridge control possible
[ 6.518118] pci 0000:96:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[ 21.306002] xe 0000:96:00.0: [drm] Found BATTLEMAGE (device ID e20b) display version 14.01 stepping B0
[ 21.307637] xe 0000:96:00.0: [drm] Using GuC firmware from xe/bmg_guc_70.bin version 70.29.2
[ 21.322426] xe 0000:96:00.0: [drm] Using GuC firmware from xe/bmg_guc_70.bin version 70.29.2
[ 21.325224] xe 0000:96:00.0: [drm] Using HuC firmware from xe/bmg_huc.bin version 8.2.10
[ 21.369744] xe 0000:96:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[ 21.370530] xe 0000:96:00.0: [drm] VISIBLE VRAM: 0x000e13f800000000, 0x0000000400000000
[ 21.371189] xe 0000:96:00.0: [drm] VRAM[0, 0]: Actual physical size 0x0000000300000000, usable size exclude stolen 0x00000002fb800000, CPU accessible size 0x00000002fb800000
[ 21.371191] xe 0000:96:00.0: [drm] VRAM[0, 0]: DPA range: [0x0000000000000000-300000000], io range: [0x000e13f800000000-e13fafb800000]
[ 21.371192] xe 0000:96:00.0: [drm] Total VRAM: 0x000e13f800000000, 0x0000000300000000
[ 21.371193] xe 0000:96:00.0: [drm] Available VRAM: 0x000e13f800000000, 0x00000002fb800000
[ 21.389740] xe 0000:96:00.0: [drm] Finished loading DMC firmware i915/bmg_dmc.bin (v2.6)
[ 21.566082] xe 0000:96:00.0: [drm] ccs2 fused off
[ 21.566084] xe 0000:96:00.0: [drm] ccs3 fused off
[ 21.587146] xe 0000:96:00.0: [drm] vcs1 fused off
[ 21.587149] xe 0000:96:00.0: [drm] vcs3 fused off
[ 21.587150] xe 0000:96:00.0: [drm] vcs4 fused off
[ 21.587150] xe 0000:96:00.0: [drm] vcs5 fused off
[ 21.587150] xe 0000:96:00.0: [drm] vcs6 fused off
[ 21.587151] xe 0000:96:00.0: [drm] vcs7 fused off
[ 21.587151] xe 0000:96:00.0: [drm] vecs2 fused off
[ 21.587152] xe 0000:96:00.0: [drm] vecs3 fused off
[ 21.587155] xe 0000:96:00.0: [drm] gsccs disabled due to lack of FW
[ 21.646506] [drm] Initialized xe 1.1.0 for 0000:96:00.0 on minor 13
[ 21.701619] xe 0000:96:00.0: [drm] Cannot find any crtc or sizes
[ 21.813685] xe 0000:96:00.0: [drm] Cannot find any crtc or sizes
[ 21.869647] snd_hda_intel 0000:97:00.0: bound 0000:96:00.0 (ops i915_audio_component_bind_ops [xe])
[ 21.869693] xe 0000:96:00.0: [drm] Cannot find any crtc or sizes
This line looks concerning.
|
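When comparing setups, as done earlier in this thread with GuC versions, the relevant bits of a log like the one above are the firmware versions and the fused-off engines. A small sketch that pulls those out of dmesg text follows. The function name `summarize_xe_dmesg` is made up, and the patterns are derived only from the message wording visible in this log, so other kernel versions may phrase things differently.

```python
import re

# e.g. "Using GuC firmware from xe/bmg_guc_70.bin version 70.29.2"
FW_LINE = re.compile(r"Using (\w+) firmware from (\S+) version (\S+)")
# e.g. "[drm] ccs2 fused off"
FUSED_LINE = re.compile(r"\[drm\] (\w+) fused off")


def summarize_xe_dmesg(dmesg_text):
    """Return ({fw_kind: (path, version)}, [fused-off engine names])."""
    firmware = {}
    fused_off = []
    for line in dmesg_text.splitlines():
        fw = FW_LINE.search(line)
        if fw:
            kind, path, version = fw.groups()
            firmware[kind] = (path, version)  # repeated lines overwrite
            continue
        fused = FUSED_LINE.search(line)
        if fused:
            fused_off.append(fused.group(1))
    return firmware, fused_off
```

Note that the DMC line in the log uses different wording ("Finished loading DMC firmware ..."), so this sketch deliberately does not report it.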
Ok. Found the conflict. The issue was with the
Trying to install
apt install intel-level-zero-gpu intel-gsc
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
intel-level-zero-gpu is already the newest version (1.3.29735.27-914~24.04).
The following packages were automatically installed and are no longer required:
intel-metrics-discovery intel-metrics-library libigsc0 libmetee4
Use 'apt autoremove' to remove them.
The following NEW packages will be installed:
intel-gsc
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 0 B/60.9 kB of archives.
After this operation, 211 kB of additional disk space will be used.
Do you want to continue? [Y/n] y
(Reading database ... 110342 files and directories currently installed.)
Preparing to unpack .../intel-gsc_0.8.16+88~u24.04_amd64.deb ...
Unpacking intel-gsc (0.8.16+88~u24.04) ...
dpkg: error processing archive /var/cache/apt/archives/intel-gsc_0.8.16+88~u24.04_amd64.deb (--unpack):
trying to overwrite '/usr/lib/x86_64-linux-gnu/libigsc.so.0', which is also in package libigsc0 0.9.3-104~u24.04
Errors were encountered while processing:
/var/cache/apt/archives/intel-gsc_0.8.16+88~u24.04_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1) |
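The failure above is a file conflict: the archive being unpacked ships a file that an already installed package owns. For bug reports it can be handy to extract that information from the log automatically. The following is a rough sketch. The function name `file_conflicts` is made up, and the patterns assume the English-locale dpkg messages exactly as they appear in this output.

```python
import re

ARCHIVE = re.compile(r"dpkg: error processing archive (\S+)")
OVERWRITE = re.compile(
    r"trying to overwrite '([^']+)', which is also in package (\S+) (\S+)"
)


def file_conflicts(dpkg_output):
    """Return a list of dicts describing each file-overwrite conflict."""
    conflicts = []
    archive = None  # most recent archive named in an error line
    for line in dpkg_output.splitlines():
        m = ARCHIVE.search(line)
        if m:
            archive = m.group(1)
            continue
        m = OVERWRITE.search(line)
        if m:
            path, owner, owner_version = m.groups()
            conflicts.append({
                "archive": archive,
                "path": path,
                "owner": owner,
                "owner_version": owner_version,
            })
    return conflicts
```

For the log above this reports that `/usr/lib/x86_64-linux-gnu/libigsc.so.0` from the intel-gsc archive is already owned by `libigsc0`, i.e. two differently named packages ship the same library.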
@eero-t Sorry for the multiple posts; I need to break up the long outputs so that I don't have to edit one super long message.
# test.py
import torch
print("cuda", torch.cuda.is_available())
print("xpu", torch.xpu.is_available())
Check pytorch + xpu
(base) root# python test.py
cuda False
/root/miniconda3/lib/python3.11/site-packages/torch/xpu/__init__.py:60: UserWarning: XPU device count is zero! (Triggered internally at /pytorch/c10/xpu/XPUFunctions.cpp:50.)
return torch._C._xpu_getDeviceCount()
xpu False
Now install
Check again
I am as confused as you are about |
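For repeating the check above after each package change, a slightly more defensive variant of test.py may be convenient. This is only a sketch: the function name `probe_torch_backends` is made up, and it assumes nothing beyond `torch.cuda.is_available()`, `torch.xpu.is_available()` and `torch.xpu.device_count()`, guarding for PyTorch builds that have no `torch.xpu` module.

```python
import importlib


def probe_torch_backends():
    """Report which accelerator backends the installed PyTorch can see."""
    report = {"torch": None, "cuda": False, "xpu": False, "xpu_devices": 0}
    try:
        torch = importlib.import_module("torch")
    except (ImportError, OSError):
        return report  # PyTorch is missing or its libraries fail to load
    report["torch"] = str(torch.__version__)
    report["cuda"] = bool(torch.cuda.is_available())
    xpu = getattr(torch, "xpu", None)  # absent in builds without XPU support
    if xpu is not None:
        report["xpu"] = bool(xpu.is_available())
        report["xpu_devices"] = int(xpu.device_count())
    return report
```

A result with `xpu` false and `xpu_devices` 0 matches the "XPU device count is zero" warning shown above, which points at the driver stack rather than at PyTorch itself.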
GSC [1] FW is needed only for viewing protected content, i.e. it is a media-related thing, not needed for compute or metrics. Do you have the FW package installed from the Intel repo, or from Ubuntu? Battlemage GSC FW is missing from upstream, so at least the Ubuntu package cannot include it. I filed a bug about that: [1] more info: https://lore.kernel.org/all/20231027222928.1981633-1-daniele.ceraolospurio@intel.com/T/ |
Both In theory, they should be interchangeable as L0 frontend should abstract what backend is loaded, and be able to load either of them. Higher level app and lib packages should not be depending on the L0 backend, only on frontend, so this seems like
What its package description states: |
Please add info from your comment to bug #89. (I wonder what XPUM needs the GSC lib for, and why that also depends directly on L0 backend...) |
There's currently no way to get most performance statistics on ARC GPUs. intel_gpu_top doesn't have memory usage, and while it appears xpu-smi has some metrics it's missing a lot on ARC.
I'm working on a multi-GPU ARC system and it's hard to troubleshoot certain things without knowing what the GPUs are doing outside of code.
Thanks!