Issue: HCC STATUS_CHECK Error: HSA_STATUS_ERROR_VARIABLE_UNDEFINED (0x1015) while executing Thrust API on HIP/ROCm Path #588
Comments
After migrating to ROCm 1.7.60, the HCC status check error no longer appears. Instead we encounter the following error: Memory access fault by GPU node-1 on address (nil). Reason: Page not present or supervisor privilege. Control now enters main(), but the error is reported at the end, while exercising the API.
@sriharikarnam Based on the new error log after upgrading to ROCm 1.7, I suspect it could be:
please help:
Updated Information
@whchung
Configuration Details
HIP version : 1.4.17494
== hcc ==
Linux Kernel
@whchung |
@sriharikarnam Based on the runtime log However, I'm not sure if the attached IR/ISA dump is correct. It seems it's from another Also, if there is indeed a mismatch between the host-side kernel arguments and what the GPU kernel expects, there are several places to mend it:
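One way to cross-check the host-side argument layout mentioned above is to compute the size the argument list would occupy and compare it with the kernarg segment size in the ISA dump. The following is only a sketch: `align_up` and `KernargSize` are hypothetical helpers, not part of HCC or HIP, and the layout rule (each argument placed at its natural alignment) is an assumption about the target ABI.

```cpp
#include <cstddef>

// Round `offset` up to the next multiple of `align` (assumed to be a power of two).
constexpr std::size_t align_up(std::size_t offset, std::size_t align) {
    return (offset + align - 1) & ~(align - 1);
}

// Hypothetical helper: walks a list of argument types and accumulates the
// byte offset, assuming each argument sits at its natural alignment.
template <typename... Args>
struct KernargSize;

// Base case: no arguments left, so the running offset is the total size.
template <>
struct KernargSize<> {
    static constexpr std::size_t at(std::size_t offset) { return offset; }
};

// Recursive case: align for T, add sizeof(T), then continue with the rest.
template <typename T, typename... Rest>
struct KernargSize<T, Rest...> {
    static constexpr std::size_t at(std::size_t offset) {
        return KernargSize<Rest...>::at(align_up(offset, alignof(T)) + sizeof(T));
    }
};
```

For a hypothetical kernel taking a `float*`, an `int`, and a `double`, `KernargSize<float*, int, double>::at(0)` evaluates to 24 on a typical 64-bit Linux target. The figure for the real kernel's argument types would be the one to compare against the dump.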
@whchung |
@sriharikarnam Thanks. Based on the updated IR/ISA dump, the faulty kernel is:
From the LLVM IR dump, Unfortunately, all kernel arguments seem to be primitive types, and judging by the ISA dump we are passing the correct amount of kernel arguments (96 bytes). So we can basically rule out a compiler issue; the problem is more likely on the application side. Please help by commenting out code in the kernel to identify the offending lines. Looking at the IR dump, it seems the kernel is doing pointer arithmetic on the 1st kernel argument, which is a
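A fault at address (nil) is at least consistent with a null pointer being passed as a kernel argument, although it does not prove it. As a rough sketch of a host-side check that could be placed just before the kernel launch, the helper below reports any null pointer among the arguments. `NamedPtr` and `count_null_args` are hypothetical names introduced here for illustration; they are not part of HIP, HCC, or Thrust.

```cpp
#include <cstdio>
#include <initializer_list>

// Hypothetical helper type: a pointer argument together with a label for logging.
struct NamedPtr {
    const char* name;
    const void* ptr;
};

// Reports every null pointer in `args` on stderr and returns how many were found.
inline int count_null_args(std::initializer_list<NamedPtr> args) {
    int nulls = 0;
    for (const NamedPtr& a : args) {
        if (a.ptr == nullptr) {
            std::fprintf(stderr, "kernel argument '%s' is null\n", a.name);
            ++nulls;
        }
    }
    return nulls;
}
```

A call such as `count_null_args({{"first", first_ptr}, {"result", result_ptr}})`, with the application's own pointer variables substituted, would return 0 when every pointer is non-null. This only catches null pointers, not other invalid addresses, so it complements rather than replaces commenting out kernel code.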
@whchung Thanks for your input. We have identified a temporary workaround for the issue. Currently no regression is observed; if one appears, we will re-open this issue or raise a new ticket.
The error below is seen while executing some of the Thrust APIs on the HIP/ROCm path. It occurs before control enters the application's main().
Error Log:
Runtime Error: HCC STATUS_CHECK Error: HSA_STATUS_ERROR_VARIABLE_UNDEFINED (0x1015) at file:mcwamp_hsa.cpp line:2936
Aborted (core dumped)
Backtrace:
0x00007fc1258714de: Kalmar::HSADevice::BuildOfflineFinalizedProgramImpl(void*, int) + 0x32e
0x00007fc1258609aa: Kalmar::HSADevice::BuildProgram(void*, void*) + 0x23a
0x000000000040e4e4: Kalmar::KalmarBootstrap::KalmarBootstrap() + 0x124
0x000000000040e399: __hcc_shared_library_init + 0x29
0x000000000044902d: __libc_csu_init + 0x4d
0x00007fc12716c7bf: __libc_start_main + 0x7f
0x000000000040f4f9: _start + 0x29
### HCC STATUS_CHECK Error: HSA_STATUS_ERROR_VARIABLE_UNDEFINED (0x1015) at file:mcwamp_hsa.cpp line:2936
Aborted (core dumped)
Analysis: Backtrace log
Thread 1 "summary_statist" received signal SIGABRT, Aborted.
0x00007ffff5f58428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(ROCm-gdb) bt
#0 0x00007ffff5f58428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff5f5a02a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff4648532 in BuildOfflineFinalizedProgramImpl () at /home/jenkins/jenkins-root/workspace/compute-rocm-rel-1.6/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:2940
#3 0x00007ffff46379aa in BuildProgram () at /home/jenkins/jenkins-root/workspace/compute-rocm-rel-1.6/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:2403
#4 0x000000000040e4e4 in BuildProgram () at /home/tcs/HCC/hcc/lib/mcwamp.cpp:355
#5 KalmarBootstrap () at /home/tcs/HCC/hcc/lib/mcwamp.cpp:406
#6 0x000000000040e399 in __hcc_shared_library_init () at /home/tcs/HCC/hcc/lib/mcwamp.cpp:416
#7 0x000000000044902d in __libc_csu_init ()
#8 0x00007ffff5f437bf in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x000000000040f4f9 in _start ()
(ROCm-gdb)
$ ldd summary_statistics.cpp.out
linux-vdso.so.1 => (0x00007ffe7f930000)
libhip_hcc.so => /opt/rocm/hip/lib/libhip_hcc.so (0x00007f1c84f79000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f1c84d53000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1c84a4a000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1c8482d000)
libunwind.so.8 => /usr/lib/x86_64-linux-gnu/libunwind.so.8 (0x00007f1c84611000)
libhc_am.so => /opt/rocm/hcc-1.1/lib/libhc_am.so (0x00007f1c843a7000)
libhsa-runtime64.so.1 => /opt/rocm/hsa/lib/libhsa-runtime64.so.1 (0x00007f1c840fe000)
libhsakmt.so.1 => /opt/rocm/libhsakmt/lib/libhsakmt.so.1 (0x00007f1c83edf000)
libCXLActivityLogger.so => /opt/rocm/profiler/CXLActivityLogger/bin/x86_64/libCXLActivityLogger.so (0x00007f1c83c7d000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f1c838fb000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f1c836e4000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c8331a000)
/lib64/ld-linux-x86-64.so.2 (0x0000562e8e3fb000)
liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f1c830f8000)
libelf.so.1 => /usr/lib/x86_64-linux-gnu/libelf.so.1 (0x00007f1c82edf000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f1c82cd7000)
libpci.so.3 => /lib/x86_64-linux-gnu/libpci.so.3 (0x00007f1c82ac9000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f1c828af000)
libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x00007f1c82693000)
libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007f1c82673000)
Steps to reproduce:
$ git clone https://github.com/ROCmSoftwarePlatform/Thrust.git
$ cd Thrust
$ export HIP_PLATFORM=hcc    # for the HCC platform
$ cd examples
$ cp summary_statistics.cu summary_statistics.cpp
$ /opt/rocm/bin/hipcc summary_statistics.cpp -I../ -o summary_statistics.cpp.out
$ ./summary_statistics.cpp.out    # run on AMD hardware