
Add __launch_bounds__ to kernels based on Executor properties #853

Merged: 32 commits into celeritas-project:develop from launch-bounds, Aug 2, 2023

Conversation

esseivaju (Contributor):

Add template parameters to ActionLauncher to specify __launch_bounds__ for each kernel launch. It probably makes sense to add a compile-time option to enable/disable launch bounds, but ideally we'd enable it independently for each kernel, since it might improve performance only for a subset. I'm not sure what the most ergonomic way would be; maybe a CMake variable with a list of action labels?
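
As a rough illustration of the approach (a minimal sketch with hypothetical names, not the actual ActionLauncher code), the bounds become compile-time template parameters so that each kernel instantiation carries its own __launch_bounds__:

template<class F, int ThreadsPerBlock, int MinBlocksPerSm>
__global__ void __launch_bounds__(ThreadsPerBlock, MinBlocksPerSm)
launch_action_impl(F execute_thread, unsigned int num_threads)
{
    // One thread per work item; grid-stride looping omitted for brevity
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < num_threads)
    {
        execute_thread(tid);
    }
}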

esseivaju requested a review from sethrj, July 13, 2023 00:20
esseivaju added the "enhancement (New feature or request)" label, Jul 13, 2023
Review thread on src/celeritas/global/ActionLauncher.device.hh (outdated, resolved):
__global__ void
__launch_bounds__(threadsPerBlock, blocksPerSm)
sethrj (Member):

Remember that these have different meanings for CUDA and HIP: "minimum blocks per CU" vs "minimum warps per EU" (AMD has 4 EU per CU, and NVIDIA has 1). I think we should choose the latter as our argument since it's more general and convert in the CUDA case.
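
A sketch of that conversion (assumed constants and a hypothetical helper, not code from this PR): keep minimum warps per EU as the argument, pass it straight to HIP, and derive CUDA's minimum blocks per SM from it:

constexpr int warp_size = 32;  // threads per warp on NVIDIA
constexpr int eu_per_cu = 1;   // per the accounting above, NVIDIA has 1 EU per CU

constexpr int min_blocks_per_sm(int min_warps_per_eu, int threads_per_block)
{
    // e.g. 16 warps/EU * 32 threads/warp * 1 EU/CU / 256 threads/block = 2
    return (min_warps_per_eu * warp_size * eu_per_cu) / threads_per_block;
}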

esseivaju (Contributor, author):

Right, it's not exactly the same semantics. I'm pretty sure CUDA has 4 EUs (warp schedulers) per SM on almost all architectures except CC 6.0, which has 2.

sethrj (Member):

I think the difference is that with NVIDIA, warps can move around execution units within an SM while the block is being executed (it's the block that is pinned to the SM), whereas with AMD warps are pinned to a specific EU. It's another layer of granularity.

esseivaju (Contributor, author):

Is that so? I'll need to double-check, but I thought a warp was bound to a scheduler in CUDA as well.

esseivaju (Contributor, author):

I don't like having to duplicate the kernel and constructors this much, but I haven't found another way short of using preprocessor directives.

Also, we'd need to compile VecGeom with limited register usage as well; otherwise the compiled device code would use more registers than the launch bounds specified by the ActionLauncher allow.

esseivaju marked this pull request as ready for review, July 19, 2023 21:43
esseivaju (Contributor, author):

@sethrj I'm not sure what is triggering the warning

Warning: Function too large, generated debug information may not be accurate.

in the VecGeom tests. 6b74aeb passed even though that commit only updated comments; also, the files that fail to compile change between CI runs.

Resolved review threads (outdated): src/corecel/sys/Device.hh, src/celeritas/global/ActionLauncher.device.hh (×2)
esseivaju (Contributor, author) commented Jul 27, 2023:

@sethrj If we limit the number of registers for propagation in a uniform field, e.g. by specifying min_warps_per_eu=16 (== 2 blocks per SM in CUDA semantics), the number of registers per thread is correctly limited to 128; however, device functions compiled in VecGeom require more, resulting in link-time errors:

[7/30] Linking CUDA shared library CMakeFiles/celeritas_final.dir/cmake_device_link.o
FAILED: CMakeFiles/celeritas_final.dir/cmake_device_link.o 
/home/esseivaj/bld4/devel/spack/var/spack/environments/celeritas/.spack-env/view/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/cvmfs/sft.cern.ch/lcg/releases/gcc/11.3.0-ad0f5/x86_64-centos7/bin/g++ -Xcompiler -Wno-psabi -O3 -DNDEBUG --generate-code=arch=compute_80,code=[compute_80,sm_80] /home/esseivaj/bld4/devel/celeritas/build-ndebug/lib64/libceleritas_static.a /home/esseivaj/bld4/devel/celeritas/build-ndebug/lib64/libcorecel_static.a /home/esseivaj/bld4/devel/spack/var/spack/environments/celeritas/.spack-env/view/lib64/libvecgeomcuda_static.a -Xnvlink --suppress-stack-size-warning -Xcompiler=-fPIC -Wno-deprecated-gpu-targets -shared -dlink src/celeritas/CMakeFiles/celeritas_final.dir/CMakeFiles/celeritas_emptyfile.cu.o -o CMakeFiles/celeritas_final.dir/cmake_device_link.o  -lrt -lpthread -lcudadevrt -lcudart
nvlink error   : entry function '__nv_static_57__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_180862__ZN9celeritas68_GLOBAL__N__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_18086226launch_bounded_action_implINS_24ConditionalTrackExecutorINS_6detail22IsAlongStepActionEqualENS3_18PropagationApplierINS3_29UniformFieldPropagatorFactoryEvEEEELi256ELi16ELi2EEEvNS_5RangeINS_8OpaqueIdINS_6ThreadEjEEEET_' with max regcount of 128 calls function '_ZNK7vecgeom4cuda22VSafetyEstimatorHelperINS0_18BVHSafetyEstimatorEE13ComputeSafetyERKNS0_8Vector3DIdEERKNS0_13NavStateIndexE' with regcount of 139
nvlink error   : entry function '__nv_static_57__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_180862__ZN9celeritas68_GLOBAL__N__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_18086226launch_bounded_action_implINS_24ConditionalTrackExecutorINS_6detail22IsAlongStepActionEqualENS3_18PropagationApplierINS3_29UniformFieldPropagatorFactoryEvEEEELi256ELi16ELi2EEEvNS_5RangeINS_8OpaqueIdINS_6ThreadEjEEEET_' with max regcount of 128 calls function '_ZNK7vecgeom4cuda30CommonUnplacedVolumeImplHelperINS0_22PolyconeImplementationINS0_9ConeTypes13UniversalConeEEENS0_16UnplacedPolyconeEE13DistanceToOutERKNS0_8Vector3DIdEESB_d' with regcount of 156
nvlink error   : entry function '__nv_static_57__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_180862__ZN9celeritas68_GLOBAL__N__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_18086226launch_bounded_action_implINS_24ConditionalTrackExecutorINS_6detail22IsAlongStepActionEqualENS3_18PropagationApplierINS3_29UniformFieldPropagatorFactoryEvEEEELi256ELi16ELi2EEEvNS_5RangeINS_8OpaqueIdINS_6ThreadEjEEEET_' with max regcount of 128 calls function '_ZNK7vecgeom4cuda30CommonUnplacedVolumeImplHelperINS0_22PolyconeImplementationINS0_9ConeTypes13UniversalConeEEENS0_16UnplacedPolyconeEE12DistanceToInERKNS0_8Vector3DIdEESB_d' with regcount of 158
nvlink error   : entry function '__nv_static_57__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_180862__ZN9celeritas68_GLOBAL__N__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_18086226launch_bounded_action_implINS_24ConditionalTrackExecutorINS_6detail22IsAlongStepActionEqualENS3_18PropagationApplierINS3_29UniformFieldPropagatorFactoryEvEEEELi256ELi16ELi2EEEvNS_5RangeINS_8OpaqueIdINS_6ThreadEjEEEET_' with max regcount of 128 calls function '_ZNK7vecgeom4cuda30CommonUnplacedVolumeImplHelperINS0_24PolyhedronImplementationIL11EInnerRadii0EL10EPhiCutout0EEENS0_15VUnplacedVolumeEE11SafetyToOutERKNS0_8Vector3DIdEE' with regcount of 220
nvlink error   : entry function '__nv_static_57__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_180862__ZN9celeritas68_GLOBAL__N__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_18086226launch_bounded_action_implINS_24ConditionalTrackExecutorINS_6detail22IsAlongStepActionEqualENS3_18PropagationApplierINS3_29UniformFieldPropagatorFactoryEvEEEELi256ELi16ELi2EEEvNS_5RangeINS_8OpaqueIdINS_6ThreadEjEEEET_' with max regcount of 128 calls function '_ZNK7vecgeom4cuda30CommonUnplacedVolumeImplHelperINS0_21GenTrapImplementationENS0_15VUnplacedVolumeEE13DistanceToOutERKNS0_8Vector3DIdEES8_d' with regcount of 254
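
For context on the 128-register cap, a worked check (assuming the CC 8.0 target from the compute_80 build above, with 65536 registers per SM):

// __launch_bounds__(256, 2) promises two resident 256-thread blocks per SM
constexpr int regs_per_sm = 65536;                            // CC 8.0
constexpr int max_regs_per_thread = regs_per_sm / (2 * 256);  // == 128
// nvlink then rejects calls into VecGeom device functions compiled to use
// more (139-254 registers above); linking would require building VecGeom
// itself with a matching cap, e.g. via nvcc's --maxrregcount.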

esseivaju (Contributor, author):

@sethrj I think this is ready, save for default values for max_block_size, but I wouldn't use any: either we don't specify a launch bound, or we use optimal bounds after profiling.

sethrj (Member) left a review comment:

This is great stuff: clever but neat and simple. I've got just a couple of comments.

Resolved review threads (outdated): src/celeritas/global/ApplierInterface.hh (×2), src/celeritas/global/ActionLauncher.device.hh (×4), src/celeritas/global/ApplierTrait.hh (×3)
sethrj self-requested a review, August 1, 2023 23:32
sethrj changed the title from "Templated __launch_bounds__ kernel launch in ActionLauncher" to "Add __launch_bounds__ to kernels based on Executor properties", Aug 2, 2023
sethrj merged commit 0649a62 into celeritas-project:develop, Aug 2, 2023
sethrj added the "core (Software engineering infrastructure)" label, Aug 2, 2023
sethrj (Member) commented Oct 5, 2023:

@esseivaju it looks like the min_warps_per_eu in UniformFieldPropagatorFactory caused a serious performance regression (overall a factor of 3 slower!) on AMD by forcing the occupancy too high:

 {
 "const_mem": 0,
 "heap_size": 68702699520,
-"local_mem": 424,
-"max_blocks_per_cu": 4,
-"max_threads_per_block": 1024,
-"max_warps_per_eu": 4,
+"local_mem": 744,
+"max_blocks_per_cu": 8,
+"max_threads_per_block": 256,
+"max_warps_per_eu": 8,
 "name": "along-step-uniform-msc-propagate",
-"num_regs": 128,
-"occupancy": 0.5,
+"num_regs": 64,
+"occupancy": 1.0,
 "print_buffer_size": 0,
 "threads_per_block": 256
 },

The equivalent kernel in CUDA (V100) is

{
"const_mem": 0,
"heap_size": 8388608,
"local_mem": 0,
"max_blocks_per_cu": 1,
"max_threads_per_block": 256,
"max_warps_per_eu": 8,
"name": "along-step-uniform-msc-propagate",
"num_regs": 184,
"occupancy": 0.125,
"print_buffer_size": 5242880,
"stack_size": 1024,
"threads_per_block": 256
}
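
For reference, the 0.125 occupancy above is consistent with V100 limits (a worked check, assuming 65536 registers and at most 64 resident warps per SM):

constexpr int regs_per_block = 184 * 256;                     // 47104 registers
constexpr int resident_blocks = 65536 / regs_per_block;       // == 1 per SM
constexpr int resident_warps = resident_blocks * (256 / 32);  // == 8
// occupancy = 8 resident warps / 64 maximum = 0.125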

Maybe we should be using "blocks per CU" as the criterion, as you originally suggested, and scale it for HIP...
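
A sketch of that alternative (hypothetical helper, not code from this PR): take minimum blocks per CU as the user-facing parameter, pass it straight to CUDA's __launch_bounds__, and convert to warps per EU for HIP (4 EU per CU and wavefront size 64 on AMD):

constexpr int min_warps_per_eu(int min_blocks_per_cu, int threads_per_block)
{
    constexpr int eu_per_cu = 4;        // an AMD CU contains 4 EUs (SIMDs)
    constexpr int wavefront_size = 64;  // threads per wavefront on AMD
    return (min_blocks_per_cu * threads_per_block)
           / (eu_per_cu * wavefront_size);
}

// e.g. min_blocks_per_cu=2 with 256-thread blocks gives 2 warps per EU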

esseivaju deleted the launch-bounds branch, October 26, 2023 16:15
sethrj added the "performance (Changes for performance optimization)" label and removed the "core (Software engineering infrastructure)" label, Nov 14, 2023