Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Device assert broken on gfx1030 with "Bus error", or hanging after synchronize #3368

Open
JackAKirk opened this issue Nov 23, 2023 · 27 comments

Comments

@JackAKirk
Copy link

JackAKirk commented Nov 23, 2023

If for example i add assert(0); to the kernel in the vectorAdd sample: https://github.com/ROCm-Developer-Tools/HIP-Examples/blob/master/vectorAdd/vectoradd_hip.cpp

via

diff --git a/vectorAdd/vectoradd_hip.cpp b/vectorAdd/vectoradd_hip.cpp
index 0362c8a..a20bd2c 100644
--- a/vectorAdd/vectoradd_hip.cpp
+++ b/vectorAdd/vectoradd_hip.cpp
@@ -47,7 +47,7 @@ __global__ void
 vectoradd_float(float* __restrict__ a, const float* __restrict__ b, const float* __restrict__ c, int width, int height) 
 
   {
- 
+assert(0);
       int x = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;
       int y = hipBlockDim_y * hipBlockIdx_y + hipThreadIdx_y;
 

Then on mi250x I get the expected behavior

......
vectoradd_hip.cpp:50: void vectoradd_float(float *__restrict, const float *__restrict, const float *__restrict, int, int): Device-side assertion `0' failed.
vectoradd_hip.cpp:50: void vectoradd_float(float *__restrict, const float *__restrict, const float *__restrict, int, int): Device-side assertion `0' failed.
:0:rocdevice.cpp            :2778: 1891319054196 us: 83888: [tid:0x7fd1f8497700] Callback: Queue 0x7fcfcce00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
make: *** [Makefile:24: test] Aborted (core dumped)

etc

However on gfx1030 using ubuntu 2204.03 and rocm5.7.1 (an officially supported combination), I get:

System minor 3
 System major 10
 agent prop name AMD Radeon PRO W6800
hip Device prop succeeded 
Bus error

i.e. the assert message diagnostic is removed and replaced with "Bus error".

@iassiour
Copy link
Contributor

I am using rocm5.7.1 on ubuntu 2204.03 and gfx1030, the only difference is that my card is a RX 6900 XT but I cannot reproduce the issue.
I will try to get access to a Radeon PRO W6800, in the meantime can you also confirm the kernel mode driver version you are using:

~/HIP-Examples/vectorAdd$ apt show amdgpu-dkms

Package: amdgpu-dkms
Version: 1:6.2.4.50701-1664922.22.04
Priority: optional
Section: misc
Maintainer: Advanced Micro Devices (AMD) gpudriverdevsupport@amd.com
Installed-Size: 443 MB
Provides: rock-dkms
Depends: dkms (>= 1.95), libc-dev | libc6-dev, autoconf, automake, initramfs-tools, shim-signed, amdgpu-dkms-firmware (= 1:6.2.4.50701-1664922.22.04)
Conflicts: rock-dkms (<< 1:6.2.4.50701-1664922.22.04)
Breaks: rock-dkms (<< 1:6.2.4.50701-1664922.22.04)
Replaces: rock-dkms (<< 1:6.2.4.50701-1664922.22.04)
Download-Size: 10.2 MB
APT-Manual-Installed: yes
APT-Sources: https://repo.radeon.com/amdgpu/5.7.1/ubuntu jammy/main amd64 Packages
Description: amdgpu driver in DKMS format.

@JackAKirk
Copy link
Author

apt show amdgpu-dkms

Thanks

$ apt show amdgpu-dkms
Package: amdgpu-dkms
Version: 1:6.1.5.50601-1649308.22.04
Priority: optional
Section: misc
Maintainer: Advanced Micro Devices (AMD) gpudriverdevsupport@amd.com
Installed-Size: 441 MB
Provides: rock-dkms
Depends: dkms (>= 1.95), libc-dev | libc6-dev, autoconf, automake, initramfs-tools, shim-signed, amdgpu-dkms-firmware (= 1:6.1.5.50601-1649308.22.04)
Conflicts: rock-dkms (<< 1:6.1.5.50601-1649308.22.04)
Breaks: rock-dkms (<< 1:6.1.5.50601-1649308.22.04)
Replaces: rock-dkms (<< 1:6.1.5.50601-1649308.22.04)
Download-Size: 10.1 MB
APT-Manual-Installed: yes
APT-Sources: https://repo.radeon.com/amdgpu/5.6.1/ubuntu jammy/main amd64 Packages
Description: amdgpu driver in DKMS format.

@iassiour
Copy link
Contributor

iassiour commented Nov 24, 2023

Thanks @JackAKirk I think this is a slightly older version of the driver that seems to correspond to 5.6 so you can try to upgrade that but first can you also check if PCIe atomics are supported on the gfx1030 system and if there is any difference comparing to the MI250 system in that respect. I think that device assert is one of the calls that require PCIe atomics in order to work correctly.

@JackAKirk
Copy link
Author

JackAKirk commented Nov 24, 2023

PCIe atomics

I've also verified that the same error occurs if I use rocm5.6.1 with that 5.6.1 driver. I can't easily check the 5.7.1 driver.
I think that the gfx1030 system doesn't support PCIe atomics, but I need to check. This won't tell me exactly what I need however. What I really want to know is:

  • Does assert 100% require PCIe atomics for all amd cards? If so, is this for llvm.trap (SIGABRT)? and if so which atomic instruction does it need?
  • If assert doesn't require PCIe atomics for some amd cards then which ones?

In general are there any amd docs on PCIe atomics requirements for parts of the hip runtime?

Thanks.

@iassiour
Copy link
Contributor

@JackAKirk please see https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html
The PCIe atomic requirement for ROCm is listed under "CPU support" and it applies to all supported amd cards.

device assert (similar to printf() and device-side malloc) is implemented based on a hostcall service that in turn requires the system to support PCIe atomics. (Although for printf() specifically there is a non-hostcall implementation introduced in 5.7 https://rocm.docs.amd.com/en/docs-5.7.0/release.html#)

@JackAKirk
Copy link
Author

@JackAKirk please see https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html The PCIe atomic requirement for ROCm is listed under "CPU support" and it applies to all supported amd cards.

device assert (similar to printf() and device-side malloc) is implemented based on a hostcall service that in turn requires the system to support PCIe atomics. (Although for printf() specifically there is a non-hostcall implementation introduced in 5.7 https://rocm.docs.amd.com/en/docs-5.7.0/release.html#)

Thanks very much for this information.

Could you confirm that this hardware does not support pcie atomics from the lspci output:
lspci_root_amdgpu-4.txt

I think the relevant part is probably:

PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] (prog-if 00 [Normal decode])
DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- LN System CLS Not Supported, TPHComp+ ExtTPHComp- ARIFwd-
AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-

?

Thanks

@iassiour
Copy link
Contributor

Hi @JackAKirk yes I think that the relevant part is this:
AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
can you also provide the output of lspci -t

@JackAKirk
Copy link
Author

I can't get lspci -t on that machine. But here is the output on another machine where I have the same issue:

lspci -t -vv
-[0000:00]-+-00.0  Intel Corporation Device 4660
           +-01.0-[01-03]----00.0-[02-03]----00.0-[03]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 GL-XL [Radeon PRO W6800]
           |                                            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 HDMI Audio [Radeon RX 6800/6800 XT / 6900 XT]
           +-02.0  Intel Corporation AlderLake-S GT1
           +-04.0  Intel Corporation Alder Lake Innovation Platform Framework Processor Participant
           +-06.0-[04]----00.0  Toshiba Corporation XG6 NVMe SSD Controller
           +-08.0  Intel Corporation 12th Gen Core Processor Gaussian & Neural Accelerator
           +-14.0  Intel Corporation Device 7ae0
           +-14.2  Intel Corporation Device 7aa7
           +-15.0  Intel Corporation Device 7acc
           +-16.0  Intel Corporation Device 7ae8
           +-17.0  Intel Corporation Device 7ae2
           +-1c.0-[05]----00.0  Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader
           +-1f.0  Intel Corporation Device 7a88
           +-1f.3  Intel Corporation Device 7ad0
           +-1f.4  Intel Corporation Device 7aa3
           +-1f.5  Intel Corporation Device 7aa4
           \-1f.6  Intel Corporation Ethernet Connection (17) I219-LM

@iassiour
Copy link
Contributor

iassiour commented Dec 6, 2023

Hi @JackAKirk on this new machine can you now check the atomics for 00:01.0
sudo lspci -s 00:01.0 -vv | grep AtomicOpsCap

I expect it will show something like AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS- indicating that the atomics are disabled.
If that is not the case can you please attach the full lspci verbose output as in the previous machine.

@JackAKirk
Copy link
Author

JackAKirk commented Dec 7, 2023

Hi @JackAKirk on this new machine can you now check the atomics for 00:01.0 sudo lspci -s 00:01.0 -vv | grep AtomicOpsCap

I expect it will show something like AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS- indicating that the atomics are disabled. If that is not the case can you please attach the full lspci verbose output as in the previous machine.

This is the output of the command.

 lspci -s 00:01.0 -vv | grep AtomicOpsCap
                         AtomicOpsCap: Routing+ 32bit+ 64bit+ 128bitCAS+

See the full output attached [
lspci_w6800.txt
](url)

@JackAKirk
Copy link
Author

Hi @JackAKirk on this new machine can you now check the atomics for 00:01.0 sudo lspci -s 00:01.0 -vv | grep AtomicOpsCap
I expect it will show something like AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS- indicating that the atomics are disabled. If that is not the case can you please attach the full lspci verbose output as in the previous machine.

This is the output of the command.

 lspci -s 00:01.0 -vv | grep AtomicOpsCap
                         AtomicOpsCap: Routing+ 32bit+ 64bit+ 128bitCAS+

See the full output attached [ lspci_w6800.txt ](url)

I'm pretty sure this indicates that the card supports pcie atomics right? So in this case I don't think that can be the issue.

Do you have any unit testing set up for kernel asserts on w6800?

@iassiour
Copy link
Contributor

iassiour commented Dec 8, 2023

I'm pretty sure this indicates that the card supports pcie atomics right? So in this case I don't think that can be the issue.

Can you please try the following as well:

  1. enable logging by setting AMD_LOG_LEVEL=4 and re-run the test. Do the logs show up any error related to missing pcie atomics? For example Pcie atomics not enabled, hostcall not supported
  2. Is there any atomics related error showing in dmesg output after running the test?

@JackAKirk
Copy link
Author

I'm pretty sure this indicates that the card supports pcie atomics right? So in this case I don't think that can be the issue.

Can you please try the following as well:

1. enable logging by setting  AMD_LOG_LEVEL=4 and re-run the test. Do the logs show up any error related to missing pcie atomics? For example `Pcie atomics not enabled, hostcall not supported`

2. Is there any atomics related error showing in dmesg output after running the test?

Here is the log:

AMD_LOG_LEVEL=7.txt

I don't see any errors relating to missing pcie atomics.

@JackAKirk
Copy link
Author

I'm pretty sure this indicates that the card supports pcie atomics right? So in this case I don't think that can be the issue.

Can you please try the following as well:

1. enable logging by setting  AMD_LOG_LEVEL=4 and re-run the test. Do the logs show up any error related to missing pcie atomics? For example `Pcie atomics not enabled, hostcall not supported`

2. Is there any atomics related error showing in dmesg output after running the test?

Here is the log:

AMD_LOG_LEVEL=7.txt

I don't see any errors relating to missing pcie atomics.

$ dmesg -wH
dmesg: read kernel buffer failed: Operation not permitted

@iassiour
Copy link
Contributor

iassiour commented Dec 8, 2023

dmesg: read kernel buffer failed: Operation not permitted

Can you try to run dmesg with sudo. Also, can you post the output of this:
grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties

@JackAKirk
Copy link
Author

dmesg: read kernel buffer failed: Operation not permitted

Can you try to run dmesg with sudo. Also, can you post the output of this: grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties

Sure
dmesg.txt

# grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties
/sys/class/kfd/kfd/topology/nodes/0/io_links/0/properties:flags 3
/sys/class/kfd/kfd/topology/nodes/1/io_links/0/properties:flags 1

Thanks

@iassiour
Copy link
Contributor

@JackAKirk there is no indication of missing pcie atomics from the logs as far as I can see.

  1. Can you try to call printf from the kernel. Does that show the same problem?
  2. Can you run the test under a debugger. When it crashes please get the backtrace and post it here. This won't contain much information to begin with as the debug symbols are missing but can still give some pointers.
    For next step we might need to get a debug build.

@JackAKirk
Copy link
Author

JackAKirk commented Dec 12, 2023

@JackAKirk there is no indication of missing pcie atomics from the logs as far as I can see.

1. Can you try to call printf from the kernel. Does that show the same problem?

2. Can you run the test under a debugger. When it crashes please get the backtrace and post it here. This won't contain much information to begin with as the debug symbols are missing but can still give some pointers.
   For next step we might need to get a debug build.

Do you have testing for printf/kernel asserts on w6800? Does it work for you?

@iassiour
Copy link
Contributor

iassiour commented Dec 12, 2023

@JackAKirk printf is part of unit tests https://github.com/ROCm/hip-tests/tree/develop/catch/unit/printf and these are quite well tested on gfx1030.
Although I do not know the frequency each particular card is being used and this changes over time, w6800 is on the list of officially supported hardware https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html so I do believe the function has been well tested there.
For me both printf/device_assert work on a RX 6900 XT (also gfx1030).

@JackAKirk
Copy link
Author

@JackAKirk printf is part of unit tests https://github.com/ROCm/hip-tests/tree/develop/catch/unit/printf and these are quite well tested on gfx1030. Although I do not know the frequency each particular card is being used and this changes over time, w6800 is on the list of officially supported hardware https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html so I do believe the function has been well tested there. For me both printf/device_assert work on a RX 6900 XT (also gfx1030).

Thanks for the info. I've ran hip_tests on the w6800. The ones with printf in their name:

          Start 1067: Unit_printf_flags
1067/1268 Test #1067: Unit_printf_flags ............................................................   Passed    0.19 sec
          Start 1068: Unit_printf_specifier
1068/1268 Test #1068: Unit_printf_specifier ........................................................   Passed    0.18 sec

The only failing tests are:

The following tests FAILED:
	1060 - Unit_hipIpcOpenMemHandle_Negative_Open_In_Two_Contexts_Same_Device (Failed)
	1061 - Unit_hipIpcGetMemHandle_Positive_Unique_Handles_Separate_Allocations (Failed)
	1062 - Unit_hipIpcGetMemHandle_Positive_Unique_Handles_Reused_Memory (Failed)
	1065 - Unit_hipIpcCloseMemHandle_Positive_Reference_Counting (Failed)
	1166 - Unit_hipIpcMemAccess_Semaphores (Timeout)
	1167 - Unit_hipIpcMemAccess_ParameterValidation (Failed)

and there are no printf tests are not in the set that are skipped.

@iassiour
Copy link
Contributor

Hi @JackAKirk so to confirm, If you replace assert() with printf() in the vectoradd test does it go through or it still fail with a bus error.

@JackAKirk
Copy link
Author

Hi @JackAKirk so to confirm, If you replace assert() with printf() in the vectoradd test does it go through or it still fail with a bus error.
Yeah if I replace assert(0); with printf ("hi"); I also get "Bus error"

@iassiour
Copy link
Contributor

Hi @JackAKirk but at the same time the printf unit tests pass, which is interesting. I would recommend we start with the printf unit test as a reference. The code of the test is the following you can try to just compile as a standalone outside unit tests using the same flags as the failing test.

#include <hip/hip_runtime.h>

__global__ void test_kernel() {
  printf("%08d\n", 42);
  printf("%08i\n", -42);
  printf("%08u\n", 42);
  printf("%08g\n", 123.456);
  printf("%0+8d\n", 42);
  printf("%+d\n", -42);
  printf("%+08d\n", 42);
  printf("%-8s\n", "xyzzy");
  printf("% i\n", -42);
  printf("%-16.8d\n", 42);
  printf("%16.8d\n", 42);
}

int main() {
  test_kernel<<<1, 1>>>();
  static_cast<void>(hipDeviceSynchronize());
}

If it works as a standalone, you can then try to strip down the failing test to match this. For example, the unit test uses a hipDeviceSynchronize after the kernel but the vectorAdd test does not. Does the vectorAdd test still fail with hipDeviceSynchronize. The unit tests launches 1 thread, try to do the same in vectorAdd etc. In this way we can likely narrow it down.

@JackAKirk
Copy link
Author

JackAKirk commented Dec 13, 2023

Hi @JackAKirk but at the same time the printf unit tests pass, which is interesting. I would recommend we start with the printf unit test as a reference. The code of the test is the following you can try to just compile as a standalone outside unit tests using the same flags as the failing test.

#include <hip/hip_runtime.h>

__global__ void test_kernel() {
  printf("%08d\n", 42);
  printf("%08i\n", -42);
  printf("%08u\n", 42);
  printf("%08g\n", 123.456);
  printf("%0+8d\n", 42);
  printf("%+d\n", -42);
  printf("%+08d\n", 42);
  printf("%-8s\n", "xyzzy");
  printf("% i\n", -42);
  printf("%-16.8d\n", 42);
  printf("%16.8d\n", 42);
}

int main() {
  test_kernel<<<1, 1>>>();
  static_cast<void>(hipDeviceSynchronize());
}

If it works as a standalone, you can then try to strip down the failing test to match this. For example, the unit test uses a hipDeviceSynchronize after the kernel but the vectorAdd test does not. Does the vectorAdd test still fail with hipDeviceSynchronize. The unit tests launches 1 thread, try to do the same in vectorAdd etc. In this way we can likely narrow it down.

Hi @iassiour Your example with printf passes. However if I add an assert it hangs. Can you try this to reproduce it:

#include <hip/hip_runtime.h>

__global__ void test_kernel() {
  assert(0);
}

int main() {
  test_kernel<<<1, 1>>>();
  static_cast<void>(hipDeviceSynchronize());
}

However if I comment out

//static_cast<void>(hipDeviceSynchronize());

it doesn't hang. So the problem seems to be calling a device sync following an assert. Do your unit tests cover this?

@iassiour
Copy link
Contributor

iassiour commented Dec 13, 2023

Hi @JackAKirk the example with the assert works for me. I think that with the current implementation both the assert and printf require a synchronization on the host side before exiting (either hipDeviceSynchronize or implicitly with a blocking call like hipMemCpy) otherwise the program exits too soon. i.e removing hipDeviceSynchronize() I think it just hides the issue.

I would suggest to focus on printf testing only for the time being. In that case we know that a test succeeds and a test fails with the bus error. Can we narrow down what is different in the failing test that causes the error? This may shed some light on what happens with assert as well.

@JackAKirk
Copy link
Author

Hi @JackAKirk the example with the assert works for me. I think that with the current implementation both the assert and printf require a synchronization on the host side before exiting (either hipDeviceSynchronize or implicitly with a blocking call like hipMemCpy) otherwise the program exits too soon. i.e removing hipDeviceSynchronize() I think it just hides the issue.

I would suggest to focus on printf testing only for the time being. In that case we know that a test succeeds and a test fails with the bus error. Can we narrow down what is different in the failing test that causes the error? This may shed some light on what happens with assert as well.

Do you think that you could arrange for someone to test this on a w6800, to check whether you reproduce the hanging issue with assert?

@iassiour
Copy link
Contributor

Hi @JackAKirk I managed to reproduce the hanging issue with assert on a w6800 machine on windows.
I will create an internal ticket to investigate the issue and will come back as soon as I have more details.

@JackAKirk JackAKirk changed the title Device assert broken on gfx1030 with "Bus error" Device assert broken on gfx1030 with "Bus error", or hanging after synchronize Dec 14, 2023
JackAKirk added a commit to JackAKirk/llvm that referenced this issue Mar 8, 2024
Due to ROCm/HIP#3368

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants