Issue: HCC STATUS_CHECK Error: HSA_STATUS_ERROR_VARIABLE_UNDEFINED (0x1015) while executing Thrust API on HIP/ROCm Path #588

Closed
sriharikarnam opened this issue Dec 18, 2017 · 8 comments
@sriharikarnam

The error below occurs when executing some of the Thrust APIs on the HIP/ROCm path. It is raised before control enters the application's main().

Error Log:
Runtime Error: HCC STATUS_CHECK Error: HSA_STATUS_ERROR_VARIABLE_UNDEFINED (0x1015) at file:mcwamp_hsa.cpp line:2936
Aborted (core dumped)

Backtrace:
0x00007fc1258714de: Kalmar::HSADevice::BuildOfflineFinalizedProgramImpl(void*, int) + 0x32e
0x00007fc1258609aa: Kalmar::HSADevice::BuildProgram(void*, void*) + 0x23a
0x000000000040e4e4: Kalmar::KalmarBootstrap::KalmarBootstrap() + 0x124
0x000000000040e399: __hcc_shared_library_init + 0x29
0x000000000044902d: __libc_csu_init + 0x4d
0x00007fc12716c7bf: __libc_start_main + 0x7f
0x000000000040f4f9: _start + 0x29

### HCC STATUS_CHECK Error: HSA_STATUS_ERROR_VARIABLE_UNDEFINED (0x1015) at file:mcwamp_hsa.cpp line:2936

Aborted (core dumped)

Analysis: Backtrace log

Thread 1 "summary_statist" received signal SIGABRT, Aborted.
0x00007ffff5f58428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(ROCm-gdb) bt
#0 0x00007ffff5f58428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff5f5a02a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff4648532 in BuildOfflineFinalizedProgramImpl () at /home/jenkins/jenkins-root/workspace/compute-rocm-rel-1.6/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:2940
#3 0x00007ffff46379aa in BuildProgram () at /home/jenkins/jenkins-root/workspace/compute-rocm-rel-1.6/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:2403
#4 0x000000000040e4e4 in BuildProgram () at /home/tcs/HCC/hcc/lib/mcwamp.cpp:355
#5 KalmarBootstrap () at /home/tcs/HCC/hcc/lib/mcwamp.cpp:406
#6 0x000000000040e399 in __hcc_shared_library_init () at /home/tcs/HCC/hcc/lib/mcwamp.cpp:416
#7 0x000000000044902d in __libc_csu_init ()
#8 0x00007ffff5f437bf in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x000000000040f4f9 in _start ()
(ROCm-gdb)

$ ldd summary_statistics.cpp.out
linux-vdso.so.1 => (0x00007ffe7f930000)
libhip_hcc.so => /opt/rocm/hip/lib/libhip_hcc.so (0x00007f1c84f79000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f1c84d53000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1c84a4a000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1c8482d000)
libunwind.so.8 => /usr/lib/x86_64-linux-gnu/libunwind.so.8 (0x00007f1c84611000)
libhc_am.so => /opt/rocm/hcc-1.1/lib/libhc_am.so (0x00007f1c843a7000)
libhsa-runtime64.so.1 => /opt/rocm/hsa/lib/libhsa-runtime64.so.1 (0x00007f1c840fe000)
libhsakmt.so.1 => /opt/rocm/libhsakmt/lib/libhsakmt.so.1 (0x00007f1c83edf000)
libCXLActivityLogger.so => /opt/rocm/profiler/CXLActivityLogger/bin/x86_64/libCXLActivityLogger.so (0x00007f1c83c7d000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f1c838fb000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f1c836e4000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c8331a000)
/lib64/ld-linux-x86-64.so.2 (0x0000562e8e3fb000)
liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f1c830f8000)
libelf.so.1 => /usr/lib/x86_64-linux-gnu/libelf.so.1 (0x00007f1c82edf000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f1c82cd7000)
libpci.so.3 => /lib/x86_64-linux-gnu/libpci.so.3 (0x00007f1c82ac9000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f1c828af000)
libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x00007f1c82693000)
libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007f1c82673000)

Steps to reproduce:
$ git clone https://github.com/ROCmSoftwarePlatform/Thrust.git
$ cd Thrust
$ export HIP_PLATFORM=hcc    # for the HCC platform
$ cd examples
$ cp summary_statistics.cu summary_statistics.cpp
$ /opt/rocm/bin/hipcc summary_statistics.cpp -I../ -o summary_statistics.cpp.out
$ ./summary_statistics.cpp.out    # executable to be run on AMD hardware

@sriharikarnam
Author

After migrating to ROCm 1.7.60, we no longer see the HCC status check error. Instead we encounter the error below:

Memory access fault by GPU node-1 on address (nil). Reason: Page not present or supervisor privilege.

Control now enters main(), but this error is reported at the end, while exercising the functionality of the API.

@whchung
Collaborator

whchung commented Jan 3, 2018

@sriharikarnam Based on the new error log after upgrading to ROCm 1.7, I suspect it could be one of:

  1. an issue in the kernel logic: an issue inside Thrust, or in the port you are building
  2. an issue in the kernel argument preparation logic: an issue inside HIP / HCC

Please help with the following:

  1. Revise the reproduction steps, clearly indicating:

  2. Reduce your kernel logic to identify where the offending lines are.

  3. Collect the LLVM IR and GCN ISA dumps for your test. To do this, rebuild your test with the env vars KMDUMPLLVM=1 KMDUMPISA=1. Paste the files onto a place like pastebin, or anywhere you can share them.

  4. Collect runtime logs of the kernel arguments passed from the host side to the kernel. To do this, run your application with the env vars HIP_TRACE_API=2 HCC_DB=0x48A. Paste the log onto a place like pastebin, or anywhere you can share it.
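Steps 3 and 4 can be scripted so both runs use exactly the requested variables. The following is only a sketch: the helper name `env_with` is hypothetical, the hipcc path and file names are taken from the reproduction steps in this thread, and the script assumes it is run from the Thrust `examples` directory. It does nothing if hipcc is not installed at that path.

```python
import os
import subprocess


def env_with(extra):
    """Return a copy of the current environment with extra variables set."""
    env = dict(os.environ)
    env.update(extra)
    return env


# Step 3: variables for the rebuild that dumps LLVM IR and GCN ISA.
build_env = env_with({"KMDUMPLLVM": "1", "KMDUMPISA": "1"})
# Step 4: variables for the run that logs the kernel arguments.
run_env = env_with({"HIP_TRACE_API": "2", "HCC_DB": "0x48A"})

HIPCC = "/opt/rocm/bin/hipcc"  # path used in the reproduction steps

if __name__ == "__main__" and os.path.exists(HIPCC):
    # Rebuild the test with the dump variables set.
    subprocess.run(
        [HIPCC, "summary_statistics.cpp", "-I../",
         "-o", "summary_statistics.cpp.out"],
        env=build_env, check=True)
    # Run the test and keep stdout and stderr together in one log file.
    with open("summary_statistics_runtime_log.txt", "w") as log:
        subprocess.run(["./summary_statistics.cpp.out"], env=run_env,
                       stdout=log, stderr=subprocess.STDOUT, check=False)
```

The run step uses `check=False` on purpose: the program is expected to abort, and the log is still wanted in that case.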

@sriharikarnam
Author

Updated Information @whchung
Steps to reproduce:
$ git clone https://github.com/ROCmSoftwarePlatform/Thrust.git    # master branch
$ cd Thrust
$ export HIP_PLATFORM=hcc    # for the HCC platform
$ cd examples
$ cp summary_statistics.cu summary_statistics.cpp
$ /opt/rocm/bin/hipcc summary_statistics.cpp -I../ -o summary_statistics.cpp.out
$ ./summary_statistics.cpp.out    # executable to be run on AMD hardware

Configuration Details
Package: rocm-dev
Status: install ok installed
Priority: optional
Section: devel
Installed-Size: 13
Maintainer: Advanced Micro Devices Inc.
Architecture: amd64
Version: 1.7.60
Depends: hsa-rocr-dev, hsa-ext-rocr-dev, rocm-device-libs, rocm-utils, hcc, hip_base, hip_doc, hip_hcc, hip_samples, rocm-smi, hsakmt-roct, hsakmt-roct-dev, hsa-amd-aqlprofile
Description: Radeon Open Compute (ROCm) Runtime software stack
Homepage: https://github.com/RadeonOpenCompute/ROCm

HIP version : 1.4.17494
== hipconfig
HIP_PATH : /opt/rocm
HIP_PLATFORM : hcc
CPP_CONFIG : -D__HIP_PLATFORM_HCC__= -I/opt/rocm/include -I/opt/rocm/hcc/include

== hcc
HSA_PATH : /opt/rocm/hsa
HCC_HOME : /opt/rocm/hcc
HCC clang version 6.0.0 (ssh://gerritgit/compute/ec/hcc-tot/clang 42ceed861a212d9bd0aef883ee7981144f3ecc02) (ssh://gerritgit/compute/ec/hcc-tot/llvm 23e086be6f627e6e983c6789d2e77da6bf85ebb6) (based on HCC 1.1.17493-2f85d8a-42ceed8-23e086b )
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/hcc/bin
Can't exec "/opt/rocm/hcc/compiler/bin/llc": No such file or directory at /opt/rocm/bin/hipconfig line 132.
HCC-cxxflags : -hc -std=c++amp -I/opt/rocm/hcc-1.0/include -I/opt/rocm/include
HCC-ldflags : -hc -std=c++amp -L/opt/rocm/hcc-1.0/lib -Wl,--rpath=/opt/rocm/hcc-1.0/lib -ldl -lm -lpthread -lunwind -lhc_am -Wl,--whole-archive -lmcwamp -Wl,--no-whole-archive

== Linux Kernel
Hostname : tcs-amd
Linux tcs-amd 4.4.0-104-generic #127-Ubuntu SMP Mon Dec 11 12:16:42 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial

summary_statistics_runtime_log.txt

@sriharikarnam
Author

@whchung
Attached are the LLVM IR and ISA dumps for the issue.
Thrust_LLVM_ISA_Dump.zip

@whchung
Collaborator

whchung commented Jan 18, 2018

@sriharikarnam Based on the runtime log summary_statistics_runtime_log.txt, the host side passed 96 bytes to the kernel. But inside dump-gfx803.isa from Thrust_LLVM_ISA_Dump.zip there seems to be NO kernel which takes 96 bytes. You can grep for KernargSegmentSize to verify that. There could be a mismatch between the host logic and the GPU kernel.
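The KernargSegmentSize check can also be done with a short script instead of grep. This is a sketch only: it assumes each kernel's entry in the ISA dump appears as `KernargSegmentSize: <number>`, which may not match every dump format, and the function names and sample text are made up for illustration.

```python
import re


def kernarg_segment_sizes(isa_text):
    """Return every KernargSegmentSize value found in the text, in order."""
    # Assumed line shape: "KernargSegmentSize: 96"
    return [int(n) for n in re.findall(r"KernargSegmentSize:\s*(\d+)", isa_text)]


def has_kernel_with_kernarg_size(isa_text, wanted_bytes):
    """True if at least one kernel declares exactly wanted_bytes of kernargs."""
    return wanted_bytes in kernarg_segment_sizes(isa_text)


# Hypothetical excerpt, not taken from the attached dump.
sample = """\
CodeProps:
  KernargSegmentSize: 64
CodeProps:
  KernargSegmentSize: 80
"""

print(kernarg_segment_sizes(sample))
print(has_kernel_with_kernarg_size(sample, 96))
```

For a real dump, read the file first, for example `open("dump-gfx803.isa").read()`, and pass the result in place of `sample`.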

However, I'm not sure the attached IR/ISA dump is the right one. It seems to be from another test, minmax, not summary_statistics. Please provide the exact IR/ISA dump for summary_statistics so that we can be 100% sure.

Also, if there is indeed a mismatch between the host-side kernel arguments and what the GPU kernel expects, there are several places to mend it:

  1. At kernels, remove hipLaunchParm from every HIPified GPU kernel.
  2. At kernel call sites, stop using hipLaunchKernel; use hipLaunchKernelGGL instead.
  3. At kernel call sites, make sure the data type passed to the kernel at hipLaunchKernelGGL matches EXACTLY the type the kernel asks for. You cannot pass a long where the kernel expects an int.
  4. Check out hcc commit # 81b2672 , which contains a crucial update in LLVM code generation for passing aggregate types by value to GPU kernels. Please remember to run git submodule update after you check out that commit, to make sure all submodules are at the right commit.
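To see why point 3 matters, here is a simplified model of how arguments are laid out in the kernel argument buffer. It is an illustration, not the actual HIP/HCC marshalling code: it assumes LP64 sizes and that each argument is placed at the next offset that is a multiple of its own size. The kernel signature used is hypothetical.

```python
# Assumed LP64 sizes in bytes.
SIZES = {"int": 4, "long": 8, "float": 4, "pointer": 8}


def layout(arg_types):
    """Return (offsets, bytes_used) under natural alignment."""
    offsets, offset = [], 0
    for type_name in arg_types:
        size = SIZES[type_name]
        offset = (offset + size - 1) // size * size  # round up to alignment
        offsets.append(offset)
        offset += size
    return offsets, offset


# Hypothetical kernel: (int a, int b, float* p)
kernel_expects = ["int", "int", "pointer"]
# Hypothetical call site that passes a long for the first argument.
host_passes = ["long", "int", "pointer"]

expected_offsets, expected_size = layout(kernel_expects)
actual_offsets, actual_size = layout(host_passes)

print("kernel reads at", expected_offsets, "from", expected_size, "bytes")
print("host wrote at  ", actual_offsets, "into", actual_size, "bytes")
```

In this model the kernel reads the pointer at offset 8, while the host wrote it at offset 16, so the kernel would dereference whatever bytes happen to sit at offset 8. That is one way a mismatch could end in a memory access fault.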

@sriharikarnam
Author

@whchung
As requested above, we are attaching the IR/ISA dump for summary_statistics.
Thrust_LLVM_ISA_Dump.zip

@whchung
Collaborator

whchung commented Jan 22, 2018

@sriharikarnam Thanks. Based on the updated IR/ISA dump, the faulty kernel is:

_ZZZN6thrust6system4cuda6detail5bulk_6detail23triple_chevron_launcherILj0ENS4_9cuda_taskINS3_14parallel_groupINS3_16concurrent_groupINS3_5agentILm1EEELm0EEELm0EEENS4_7closureINS2_13reduce_detail17reduce_partitionsENS_5tupleINS4_6cursorILj1EEENS_18transform_iteratorI22summary_stats_unary_opIfENS_6detail15normal_iteratorINS_10device_ptrIfEEEE18summary_stats_dataIfENS_11use_defaultEEENS2_21uniform_decompositionIlEENSN_INS_7pointerISS_NS2_3tagEST_ST_EEEESS_23summary_stats_binary_opIfENS_9null_typeES13_S13_S13_EEEEEELb1EE6launchEjjmP12ihipStream_tS16_EN10workaround14supported_pathEjjmS19_S16_EN67HIP_kernel_functor_name_begin_unnamed_HIP_kernel_functor_name_end_119__cxxamp_trampolineEPflllPSS_fffffffiiiiiii

To inspect the LLVM IR dump, dump-gfx900.opt.bc, disassemble it with llvm-dis dump-gfx900.opt.bc (which writes dump-gfx900.opt.ll) and look for this symbol. You can also find the same symbol in the GCN ISA dump, dump-gfx900.isa.

Unfortunately, all kernel arguments seem to be primitive types, and according to the ISA dump we are passing the correct amount of kernel arguments (96 bytes). So we can basically rule out a compiler issue; the problem is more likely on the application side.
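The 96-byte figure can be cross-checked against the parameter list at the end of the mangled kernel name, `PflllPSS_fffffffiiiiiii`. The sketch below is a toy decoder, not a general demangler: it handles only the codes that occur in this suffix (`P` for pointer, `f` float, `i` int, `l` long, and `S..._` substitutions after a `P`), and it assumes LP64 sizes, 8-byte pointers, and natural alignment.

```python
BUILTIN_SIZES = {"f": 4, "i": 4, "l": 8}  # float, int, long (LP64 assumption)
POINTER_SIZE = 8


def parameter_sizes(encoding):
    """Return the byte size of each parameter in the mangled suffix."""
    sizes, i = [], 0
    while i < len(encoding):
        code = encoding[i]
        if code == "P":
            i += 1  # skip 'P', then skip the pointee type
            if encoding[i] == "S":
                i = encoding.index("_", i) + 1  # substitution such as "SS_"
            else:
                i += 1  # single-letter builtin such as "f"
            sizes.append(POINTER_SIZE)
        else:
            sizes.append(BUILTIN_SIZES[code])
            i += 1
    return sizes


def kernarg_bytes(sizes):
    """Total bytes when each argument is aligned to its own size."""
    offset = 0
    for size in sizes:
        offset = (offset + size - 1) // size * size
        offset += size
    return offset


suffix = "PflllPSS_fffffffiiiiiii"  # trailing parameters of the faulty kernel
sizes = parameter_sizes(suffix)
print(len(sizes), "parameters,", kernarg_bytes(sizes), "bytes")
```

Under these assumptions the suffix decodes to two pointers, three longs, seven floats and seven ints, which adds up to 5 × 8 + 14 × 4 = 96 bytes, consistent with the size seen in the runtime log and the ISA dump.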

Please help by commenting out code in the kernel to identify the offending lines.

Looking at the IR dump, it seems the kernel does some pointer arithmetic on the 1st kernel argument, which is a float*, and then loads once again from the result. Please check whether those are the offending lines in the kernel source code; try commenting them out and see if the program runs (even with an incorrect result).

@sriharikarnam
Author

@whchung Thanks for your input. We have identified a temporary workaround for the issue. Currently no regression is observed; if one appears, we will re-open this ticket or raise a new one.
