Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use natural dispatch syntax #246

Merged

Conversation

AlexVlx
Copy link
Contributor

@AlexVlx AlexVlx commented Nov 1, 2017

This switches HIP from its currently convoluted macro + pfe based dispatch mechanism to a more natural one partially based on the existing module API. The basic idea is that HCC will always correctly emit global functions: as empty-bodied stubs, on host, and as kernels, on device. It then becomes trivial to obtain the mangled name on host, at dispatch, from the function's address, and then to use the mangled name to retrieve the kernel. This should address all problems stemming from serialisation, dubious mismatches due to the manufactured functor, macro-isms et al. It also immediately enables support for generalised globals as a consequence of that being available in the module API. Finally, it will make debug much easier, since the actual names of the global functions will automatically be used in traces etc. One detail is that due to how dispatch works now (hipLaunchKernel and hipLaunchKernelGGL are themselves variadic function templates which deduce the function type of the callee), in certain cases it may be necesssary to insert explicit casts to ensure that the variadic argument list selects a viable overload - this can be observed in some unit tests. Eventually we may be able to remove this limitation, but for now it does not appear terribly onerous. The code is not extremely HIPpie, nor is it fully optimised, but rather is intended as a starting point for the HIP team to make its own.

…patch mechanism to a more natural one partially based on the existing module API. The basic idea is that HCC will always correctly emit __global__ functions: as empty-bodied stubs, on host, and as kernels, on device. It then becomes trivial to obtain the mangled name on host, at dispatch, from the function's address, and then to use the mangled name to retrieve the kernel. This should address all problems stemming from serialisation, dubious mismatches due to the manufactured functor, macro-isms et al. It also immediately enables support for generalised globals as a consequence of that being available in the module API. Finally, it will make debug much easier, since the actual names of the __global__ functions will automatically be used in traces etc. One detail is that due to how dispatch works now (hipLaunchKernel and hipLaunchKernelGGL are themselves variadic function templates which deduce the function type of the callee), in certain cases it may be necesssary to insert explicit casts to ensure that the variadic argument list selects a viable overload - this can be observed in some unit tests. Eventually we may be able to remove this limitation, but for now it does not appear terribly onerous. The code is not extremely HIPpie, nor is it fully optimised, but rather is intended as a starting point for the HIP team to make its own.

auto agent = target_agent(stream);

const auto it1 = find_if(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be std::find_if, or is there a reason to leave this up to ADL?

inline
section* find_section_if(elfio& reader, P p)
{
const auto it = find_if(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, here is another ADL-based find_if.

Copy link
Contributor

@bensander bensander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR seems to change a number of files for "other" reasons such as reformatting or refactoring. Some of the test changes (for example replacing hipFree of device memory with delete[] in some of the tests) seem unrelated and perhaps incorrect.

Would you focus this PR on just the new dispatch functionality?

…m visible VA == so_base_va + st_value(function_symbol). Remove quaint usage of pfe for hipMemset (which is actually fill_n).
@AlexVlx
Copy link
Contributor Author

AlexVlx commented Nov 2, 2017

ROCm/hcc-clang-upgrade#104 is needed in HCC's upstream clang to support this.

@whchung whchung self-requested a review November 7, 2017 20:54
@whchung
Copy link
Contributor

whchung commented Nov 7, 2017

@AlexVlx I think this PR deserves some high-level overview, and possibly be broken into multiple PRs, as it seems to try achieve several different targets at the same time.

  • remove trailing whitespaces and re-indent some kernel call sites: better be filed as a separate PR
  • use clang_offload_bundler to bundle code objects: the syntax of the triple, and the header used for clang_offload_bundler should better be documented.
  • revise kernel compilation / launch logic
  • revise hipMemset logic

@whchung
Copy link
Contributor

whchung commented Nov 7, 2017

@AlexVlx one extra question: would hccgenco.sh be obsolete with this PR?

@AlexVlx
Copy link
Contributor Author

AlexVlx commented Nov 7, 2017

@pfultz2 : good eyes:)
@bensander: please see below for a partial answer.
@whchung:
a) trailing whitespace - I'll separate that, it's accidental but maybe useful;
b) re-indentation - that is actually a side-effect and the least important part of modifying the call-site to accommodate differences introduced by the new launch mechanism (argument pack must match a valid overload), therefore moving it elsewhere is undesirable since context would be lost;
c) not sure what you mean by "use clang_offload_bundler to..." - I think that this is what we use and have been using for more than a year now, so this merely builds on that; the format is documented here (perhaps adding a link in the header would be of use) https://code.woboq.org/llvm/clang/tools/clang-offload-bundler/ClangOffloadBundler.cpp.html#173; the triple is still possibly subject to change, so it appears uncertain that we want to lock that in just yet;
d) changing hipMemset to use natural launch syntax seems rather related, and avoids quaint side-effects / interactions with PFE; the alternative would be to change it later, but that may add noise in the current validation;
e) hccgenco.sh should become obsolete, yes, since we can directly load a bundled code object from an ELF file and can also easily build the functionality that takes a code object packaged as a binary blob (foo.hipbin or whatever, if we want to retain alignment with CUDA) without needing script magic.
Thank you for the feedback gentlemen!

AlexVlx added a commit to AlexVlx/HIP that referenced this pull request Nov 8, 2017
@AlexVlx
Copy link
Contributor Author

AlexVlx commented Nov 8, 2017

@whchung @bensander #255 handles the noisy whitespace differences.

mangupta added a commit that referenced this pull request Nov 8, 2017
Clean up trailing whitespace so as to reduce noise in #246.
@whchung
Copy link
Contributor

whchung commented Nov 8, 2017

@AlexVlx i guess my question for clang-offload-bundler is do we follow the same triple syntax currently used in hcc when we create a bundle:

hcc-amdgcn--amdhsa-$AMDGPU_TARGET

from the logic I think the answer is yes, but i'd like to double check with you.

for hipMemset I agree it's definitely good to use the new natural dispatch syntax.

since hccgenco.sh would be made obsolete, i believe relevant tests / samples should also be modified. do you expect to amend them in this PR, or they would come in a separate PR?

@whchung whchung requested a review from mangupta November 8, 2017 05:12
@whchung
Copy link
Contributor

whchung commented Nov 8, 2017

invite @mangupta as directed tests / samples which depend on hccgenco.sh may have to be changed.

…based_dispatch_instead_of_pfe

# Conflicts:
#	tests/src/runtimeApi/stream/hipStreamSync2.cpp
@AlexVlx
Copy link
Contributor Author

AlexVlx commented Nov 8, 2017

@whchung @mangupta making hccgenco.sh obsolete is a two step process, which I think can follow going through separately, after we're done with this:
a) due to ROCm/hcc-clang-upgrade#104 we should no longer need to use WrapperGen / define GENERIC_GRID_LAUNCH=0 for genco, since all global functions will get emitted; we can and should explore this change now, since it may fix some lingering bugs in / due to WrapperGen;
b) we want to consider extending --filetype in llc (perhaps) to include --filetype=hsaco to emit just the code object for a source file / collection of source files (or a similar mechanism); once that is in place, genco's more nefarious role as gutripper is no longer necessary, and it can be fully removed.

@whchung
Copy link
Contributor

whchung commented Nov 8, 2017

@bensander / @mangupta , ROCm/hcc-clang-upgrade#104 has been merged into HCC mainline now. I think we may need your extra round of review for this PR.

@whchung
Copy link
Contributor

whchung commented Nov 8, 2017

@AlexVlx i agree it's time to try get rid of WrapperGen now.

for llc --filetype=hsaco I'm not sure if that's a good idea. doing this seems to mix 2 things: llc (IR -> object / assembly) and lld (object -> ELF) together.

@bensander
Copy link
Contributor

bensander commented Nov 8, 2017 via email

@AlexVlx
Copy link
Contributor Author

AlexVlx commented Nov 8, 2017

@bensander: the changes in the tip of HCC are non-intrusive, it will keep working with existing HIP.
@whchung: yeah, agreed, I think that instead of llc what I probably should have said is that we should consider introducing a flag which can be consumed by HCC and which yields a .hsaco file which is basically a serialised code object; that would simplify this workflow a lot.

ChrisKitching and others added 9 commits November 28, 2017 17:23
I'm particularly running into issues with `device_types.h` in real
CUDA code...
…bitwise conversion functions, by using simple reinterpret_casts, as is idiomatic. These functions are supposed to be re-entrant, correct and efficient. Sadly, they were neither: they hid a massive race condition against a value stored in global memory, which means that they were also unreasonably slow if they ever managed to be correct, and relied on union based type punning which is in a grey area of the standard. It is difficult to ascertain what may have been the reason for coming up with this quirky solution.
…cpy for bitcasting, and not rely on undefined behaviour of a different flavour as a substitute for the original undefined behaviour. Note that the compiler will (should) optimise down to the same emitted code, since this is a pattern it understands.
Change-Id: I67943859a6344c5eec0eaa23418c9b802ef72468
…based_dispatch_instead_of_pfe

# Conflicts:
#	src/hip_module.cpp
@AlexVlx
Copy link
Contributor Author

AlexVlx commented Nov 29, 2017

@bensander done and done - if you'd care to experiment with it and counter-verify that things pass on your end as well, including on old(er) HCC, it would be neat. Thank you.

bensander
bensander previously approved these changes Nov 29, 2017
@bensander
Copy link
Contributor

@AlexVlx - the 1.6 tests failed, see the Jenkins results. Here is a snippet of the error:

  • make build_tests -i -j8
    [ 0%] Building HIPCC object CMakeFiles/directed_tests.kernel.hipTestConstant.dir/tests/src/directed_tests.kernel.hipTestConstant_generated_test_common.cpp.o
    [ 0%] Building HIPCC object CMakeFiles/directed_tests.kernel.hipGridLaunch.dir/tests/src/directed_tests.kernel.hipGridLaunch_generated_test_common.cpp.o
    [ 0%] Building HIPCC object CMakeFiles/directed_tests.kernel.hipPrintfKernel.dir/tests/src/directed_tests.kernel.hipPrintfKernel_generated_test_common.cpp.o
    [ 0%] Building HIPCC object CMakeFiles/directed_tests.kernel.hipTestMemKernel.dir/tests/src/directed_tests.kernel.hipTestMemKernel_generated_test_common.cpp.o
    [ 0%] Building HIPCC object CMakeFiles/directed_tests.kernel.hipLanguageExtensions.dir/tests/src/directed_tests.kernel.hipLanguageExtensions_generated_test_common.cpp.o
    [ 0%] Building HIPCC object CMakeFiles/directed_tests.kernel.hipLaunchParm.dir/tests/src/directed_tests.kernel.hipLaunchParm_generated_test_common.cpp.o
    [ 0%] Building HIPCC object CMakeFiles/directed_tests.kernel.inline_asm_vadd.dir/tests/src/kernel/directed_tests.kernel.inline_asm_vadd_generated_inline_asm_vadd.cpp.o
    [ 0%] Building HIPCC object CMakeFiles/directed_tests.kernel.hipEmptyKernel.dir/tests/src/directed_tests.kernel.hipEmptyKernel_generated_test_common.cpp.o
    :1:2: error: invalid instruction mnemonic 'v_add_f32_e32'
    v_add_f32_e32 %xmm0, %xmm0, %xmm1
    ^~~~~~~~~~~~~
    error: cannot compile inline asm
    1 error generated.
    [ 0%] Building HIPCC object CMakeFiles/directed_tests.kernel.hipLaunchParm.dir/tests/src/kernel/directed_tests.kernel.hipLaunchParm_generated_hipLaunchParm.cpp.o
    [ 0%] Building HIPCC object CMakeFiles/directed_tests.kernel.hipTestMemKernel.dir/tests/src/kernel/directed_tests.kernel.hipTestMemKernel_generated_hipTestMemKernel.cpp.o
    [ 0%] Building HIPCC object CMakeFiles/directed_tests.kernel.hipLanguageExtensions.dir/tests/src/kernel/directed_tests.kernel.hipLanguageExtensions_generated_hipLanguageExtensions.cpp.o
    [ 0%] Building HIPCC object CMakeFiles/directed_tests.kernel.hipPrintfKernel.dir/tests/src/kernel/directed_tests.kernel.hipPrintfKernel_generated_hipPrintfKernel.cpp.o
    [ 0%] Building HIPCC object CMakeFiles/directed_tests.kernel.hipEmptyKernel.dir/tests/src/kernel/directed_tests.kernel.hipEmptyKernel_generated_hipEmptyKernel.cpp.o
    [ 0%] Building HIPCC object CMakeFiles/directed_tests.kernel.hipGridLaunch.dir/tests/src/kernel/directed_tests.kernel.hipGridLaunch_generated_hipGridLaunch.cpp.o
    [ 0%] Building HIPCC object CMakeFiles/directed_tests.kernel.hipTestConstant.dir/tests/src/kernel/directed_tests.kernel.hipTestConstant_generated_hipTestConstant.cpp.o
    Scanning dependencies of target directed_tests.kernel.inline_asm_vadd
    [ 0%] Linking HIP executable directed_tests/kernel/inline_asm_vadd
    /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../x86_64-linux-gnu/crt1.o: In function _start': (.text+0x20): undefined reference to main'
    /var/jenkins/workspace/-Developer-Tools_HIP_PR-246-QRTDCR73DQXLTD7NCFVSTVKTUIIWTE7UK2UTWTRZVJCQUECR3I3Q/build/release/staging/lib/libhip_hcc.so: undefined reference to Coordinates<&hc_get_workitem_id>::x' /var/jenkins/workspace/-Developer-Tools_HIP_PR-246-QRTDCR73DQXLTD7NCFVSTVKTUIIWTE7UK2UTWTRZVJCQUECR3I3Q/build/release/staging/lib/libhip_hcc.so: undefined reference to Coordinates<&hc_get_group_size>::X::operator unsigned int() const'
    /var/jenkins/workspace/-Developer-Tools_HIP_PR-246-QRTDCR73DQXLTD7NCFVSTVKTUIIWTE7UK2UTWTRZVJCQUECR3I3Q/build/release/staging/lib/libhip_hcc.so: undefined reference to Coordinates<&hc_get_num_groups>::x' /var/jenkins/workspace/-Developer-Tools_HIP_PR-246-QRTDCR73DQXLTD7NCFVSTVKTUIIWTE7UK2UTWTRZVJCQUECR3I3Q/build/release/staging/lib/libhip_hcc.so: undefined reference to Coordinates<&hc_get_num_groups>::X::operator unsigned int() const'
    /var/jenkins/workspace/-Developer-Tools_HIP_PR-246-QRTDCR73DQXLTD7NCFVSTVKTUIIWTE7UK2UTWTRZVJCQUECR3I3Q/build/release/staging/lib/libhip_hcc.so: undefined reference to Coordinates<&hc_get_group_id>::x' /var/jenkins/workspace/-Developer-Tools_HIP_PR-246-QRTDCR73DQXLTD7NCFVSTVKTUIIWTE7UK2UTWTRZVJCQUECR3I3Q/build/release/staging/lib/libhip_hcc.so: undefined reference to Coordinates<&hc_get_workitem_id>::X::operator unsigned int() const'
    /var/jenkins/workspace/-Developer-Tools_HIP_PR-246-QRTDCR73DQXLTD7NCFVSTVKTUIIWTE7UK2UTWTRZVJCQUECR3I3Q/build/release/staging/lib/libhip_hcc.so: undefined reference to Coordinates<&hc_get_group_id>::X::operator unsigned int() const' /var/jenkins/workspace/-Developer-Tools_HIP_PR-246-QRTDCR73DQXLTD7NCFVSTVKTUIIWTE7UK2UTWTRZVJCQUECR3I3Q/build/release/staging/lib/libhip_hcc.so: undefined reference to Coordinates<&hc_get_group_size>::x'
    clang-6.0: error: linker command failed with exit code 1 (use -v to see invocation)
    Died at /var/jenkins/workspace/-Developer-Tools_HIP_PR-246-QRTDCR73DQXLTD7NCFVSTVKTUIIWTE7UK2UTWTRZVJCQUECR3I3Q/build/release/staging/bin/hipcc line 500.
    [ 0%] Built target directed_tests.kernel.inline_asm_vadd

… later versions of the compiler, just like module based dispatch, and thus must be guarded against usage in earlier (e.g. 1.6) versions.
@bensander
Copy link
Contributor

LGTM. @kknox - how do we get the CI results to run again?

@AlexVlx
Copy link
Contributor Author

AlexVlx commented Nov 29, 2017

@bensander I''d wait up a bit since I think I have a better solution that the one embodied in the latest commit (definitely less noisy). As for CI I think it automatically runs when the PR is updated.

…ork with later versions of the compiler, just like module based dispatch, and thus must be guarded against usage in earlier (e.g. 1.6) versions."

This reverts commit d2fd1f5
…ork with later versions of the compiler, just like module based dispatch, and thus must be guarded against usage in earlier (e.g. 1.6) versions."

This reverts commit d2fd1f5
…le. In this mode, there exist two executables per each code object, one created by HCC and one created by HIP. Since we dispatch through HCC in legacy mode, we should obtain the address for an agent allocated variable from the latter's executable. Also add two omitted validity checks, whose absence could lead to segfaults when the current process had no .kernel section and / or when an invalid or empty blob was extracted from the latter.
@bensander bensander merged commit dac57ba into ROCm:master Dec 1, 2017
kzhuravl pushed a commit to ROCm/llvm-project that referenced this pull request Oct 11, 2019
is emitted with full knowledge of its status as a kernel entry-point.
More specifically, we need the function to have the AMDGPU_KERNEL
calling convention. Unfortunately, only FunctionDecls with the
OpenCLKernel attribute are emitted accordingly, but we do not want (and
cannot handle) all of the overhead of the OpenCL specification (e.g.
adding explicit address space qualifiers to the kernel signature). As
such, we use this workaround which marks a __global__ function as an
OpenCL kernel as late as possible, after OpenCL semantic checks, but
before emitting the llvm::Function. This is rather unpleasant and
unlikely to ever be upstreamed - the right solution is to have
AMDGPU_KERNEL as its own calling convention, a la __stdcall, which can
be used orthogonally to OpenCL. We need this change for correct
code generation when using
ROCm/HIP#246.

(cherry picked from commit 7dd467d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants