Device assert broken on gfx1030 with "Bus error", or hanging after synchronize #3368
I am using ROCm 5.7.1 on Ubuntu 22.04.3 with gfx1030; the only difference is that my card is an RX 6900 XT, and I cannot reproduce the issue.
~/HIP-Examples/vectorAdd$ apt show amdgpu-dkms
Package: amdgpu-dkms
Thanks.
$ apt show amdgpu-dkms
Thanks @JackAKirk. I think this is a slightly older version of the driver, one that seems to correspond to 5.6, so you can try upgrading it. First, though, can you check whether PCIe atomics are supported on the gfx1030 system, and whether there is any difference from the MI250 system in that respect? I think device assert is one of the calls that requires PCIe atomics in order to work correctly.
I've also verified that the same error occurs if I use rocm5.6.1 with that 5.6.1 driver. I can't easily check the 5.7.1 driver.
In general, are there any AMD docs on PCIe atomics requirements for parts of the HIP runtime? Thanks.
@JackAKirk please see https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html. Device assert (like printf() and device-side malloc) is implemented on top of a hostcall service, which in turn requires the system to support PCIe atomics. (For printf() specifically, though, a non-hostcall implementation was introduced in 5.7: https://rocm.docs.amd.com/en/docs-5.7.0/release.html#)
Thanks very much for this information. Could you confirm from the lspci output that this hardware does not support PCIe atomics? I think the relevant part is probably: PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] (prog-if 00 [Normal decode]). Thanks
Hi @JackAKirk, yes, I think that the relevant part is this:
I can't get lspci -t on that machine. But here is the output on another machine where I have the same issue:
Hi @JackAKirk, on this new machine can you now check the atomics for 00:01.0? I expect it will show something like AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-, indicating that the atomics are disabled.
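The AtomicOpsCap check described here can also be scripted when several bridges need inspecting. The following is a minimal sketch in standard C++, assuming the field appears in the form quoted in this comment (name tokens suffixed with `+` or `-`). The function names `parseAtomicOpsCap` and `supportsAtomics` are hypothetical, and treating "32bit and 64bit both set" as sufficient is an assumption for illustration, not a documented ROCm rule.

```cpp
#include <map>
#include <sstream>
#include <string>

// Parse one "AtomicOpsCap:" line (as printed by `lspci -vvv`) into
// capability-name -> enabled flags. Returns false if the line does not
// contain the field at all.
bool parseAtomicOpsCap(const std::string& line,
                       std::map<std::string, bool>& caps) {
    const std::string key = "AtomicOpsCap:";
    const std::string::size_type pos = line.find(key);
    if (pos == std::string::npos) return false;

    caps.clear();
    std::istringstream tokens(line.substr(pos + key.size()));
    std::string tok;
    while (tokens >> tok) {
        const char flag = tok[tok.size() - 1];
        if (flag != '+' && flag != '-') break;  // not a capability token
        caps[tok.substr(0, tok.size() - 1)] = (flag == '+');
    }
    return true;
}

// Hypothetical helper (assumption): report support only when both the
// 32-bit and 64-bit atomic capabilities are flagged with '+'.
bool supportsAtomics(const std::string& line) {
    std::map<std::string, bool> caps;
    if (!parseAtomicOpsCap(line, caps)) return false;
    const auto has = [&caps](const char* name) {
        const auto it = caps.find(name);
        return it != caps.end() && it->second;
    };
    return has("32bit") && has("64bit");
}
```

Feeding it each line of the lspci output for the bridge in question (00:01.0 here) and calling `supportsAtomics` gives a quick yes/no; the line quoted in this comment would yield `false`.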
This is the output of the command.
See the full output attached [
I'm pretty sure this indicates that the card supports PCIe atomics, right? So in this case I don't think that can be the issue. Do you have any unit testing set up for kernel asserts on the W6800?
Can you please try the following as well:
Here is the log. I don't see any errors relating to missing PCIe atomics.
$ dmesg -wH
Can you try running dmesg with sudo? Also, can you post the output of this:
Sure
Thanks
@JackAKirk there is no indication of missing PCIe atomics in the logs, as far as I can see.
Do you have testing for printf/kernel asserts on the W6800? Does it work for you?
@JackAKirk printf is part of the unit tests (https://github.com/ROCm/hip-tests/tree/develop/catch/unit/printf), and these are quite well tested on gfx1030.
Thanks for the info. I've run hip_tests on the W6800. The ones with printf in their name:
The only failing tests are:
and there are no
Hi @JackAKirk, so to confirm: if you replace assert() with printf() in the vectorAdd test, does it go through, or does it still fail with a bus error?
Hi @JackAKirk, but at the same time the printf unit tests pass, which is interesting. I would recommend we start with the printf unit test as a reference. The code of the test is the following; you can try compiling it as a standalone program outside the unit tests, using the same flags as the failing test.
If it works as a standalone, you can then try to strip down the failing test to match it. For example, the unit test calls hipDeviceSynchronize after the kernel, but the vectorAdd test does not: does the vectorAdd test still fail with hipDeviceSynchronize? The unit test launches 1 thread; try doing the same in vectorAdd, and so on. In this way we can likely narrow it down.
Hi @iassiour, your example with printf passes. However, if I add an assert, it hangs. Can you try this to reproduce it:
However if I comment out
it doesn't hang. So the problem seems to be calling a device sync following an assert. Do your unit tests cover this?
Hi @JackAKirk, the example with the assert works for me. I think that with the current implementation both assert and printf require a synchronization on the host side before exiting (either hipDeviceSynchronize or, implicitly, a blocking call such as hipMemcpy); otherwise the program exits too soon. That is, I think removing hipDeviceSynchronize() just hides the issue. I would suggest focusing on printf testing only for the time being. In that case we know that one test succeeds and another fails with the bus error. Can we narrow down what is different in the failing test that causes the error? This may shed some light on what happens with assert as well.
Do you think that you could arrange for someone to test this on a W6800, to check whether you can reproduce the hanging issue with assert?
Hi @JackAKirk, I managed to reproduce the hanging issue with assert on a W6800 machine on Windows.
Due to ROCm/HIP#3368 Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
If, for example, I add
assert(0);
to the kernel in the vectorAdd sample: https://github.com/ROCm-Developer-Tools/HIP-Examples/blob/master/vectorAdd/vectoradd_hip.cpp via
Then on MI250X I get the expected behavior
etc
However, on gfx1030 using Ubuntu 22.04.3 and ROCm 5.7.1 (an officially supported combination), I get:
i.e. the assert message diagnostic is removed and replaced with "Bus error".