
Add __launch_bounds__ to kernels based on Executor properties #853

Merged: 32 commits into celeritas-project:develop from launch-bounds, Aug 2, 2023

Conversation

esseivaju (Contributor):

Add template parameters to ActionLauncher to specify __launch_bounds__ for each kernel launch. It probably makes sense to add a compile-time option to enable/disable launch bounds, but ideally we'd enable it independently for each kernel, since it might improve performance only for a subset. I'm not sure what the most ergonomic way would be; maybe a CMake variable with a list of action labels?
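
As a rough illustration of the approach (a minimal sketch with hypothetical names, not the actual ActionLauncher code), the bounds become compile-time template parameters so that each kernel instantiation carries its own __launch_bounds__:

template<class F, int ThreadsPerBlock, int MinBlocksPerSm>
__global__ void __launch_bounds__(ThreadsPerBlock, MinBlocksPerSm)
launch_action_impl(F execute_thread, unsigned int num_threads)
{
    // One thread per work item; grid-stride looping omitted for brevity
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < num_threads)
    {
        execute_thread(tid);
    }
}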

esseivaju requested a review from sethrj, July 13, 2023 00:20
esseivaju added the "enhancement (New feature or request)" label, Jul 13, 2023
Review thread on src/celeritas/global/ActionLauncher.device.hh (outdated, resolved):
__global__ void
__launch_bounds__(threadsPerBlock, blocksPerSm)
sethrj (Member):

Remember that these have different meanings for CUDA and HIP: "minimum blocks per CU" vs "minimum warps per EU" (AMD has 4 EU per CU, and NVIDIA has 1). I think we should choose the latter as our argument since it's more general and convert in the CUDA case.
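
A sketch of that conversion (assumed constants and a hypothetical helper, not code from this PR): keep minimum warps per EU as the argument, pass it straight to HIP, and derive CUDA's minimum blocks per SM from it:

constexpr int warp_size = 32;  // threads per warp on NVIDIA
constexpr int eu_per_cu = 1;   // per the accounting above, NVIDIA has 1 EU per CU

constexpr int min_blocks_per_sm(int min_warps_per_eu, int threads_per_block)
{
    // e.g. 16 warps/EU * 32 threads/warp * 1 EU/CU / 256 threads/block = 2
    return (min_warps_per_eu * warp_size * eu_per_cu) / threads_per_block;
}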

esseivaju (Contributor, author):

Right, it's not exactly the same semantics. I'm pretty sure CUDA has 4 EUs (warp schedulers) per SM on almost all architectures except CC 6.0, which has 2.

sethrj (Member):

I think the difference is that with NVIDIA, warps can move around execution units within an SM while the block is being executed (it's the block that is pinned to the SM), whereas with AMD warps are pinned to a specific EU. It's another layer of granularity.

esseivaju (Contributor, author):

Is that so? I'll need to double-check, but I thought a warp was bound to a scheduler in CUDA as well.

esseivaju (Contributor, author):

I don't like having to duplicate the kernel and constructors this much, but I haven't found another way short of using preprocessor directives.

Also, we'd need to compile VecGeom with limited register usage as well; otherwise the compiled device code would use more registers than the launch bounds specified by the ActionLauncher allow.

esseivaju marked this pull request as ready for review, July 19, 2023 21:43
esseivaju (Contributor, author):

@sethrj I'm not sure what is triggering the warning

Warning: Function too large, generated debug information may not be accurate.

in the VecGeom tests. 6b74aeb passed even though that commit only updated comments; also, the files that fail to compile change between CI runs.

Resolved review threads (outdated): src/corecel/sys/Device.hh, src/celeritas/global/ActionLauncher.device.hh (×2)
esseivaju (Contributor, author) commented Jul 27, 2023:

@sethrj If we limit the number of registers for propagation in a uniform field, e.g. by specifying min_warps_per_eu=16 (== 2 blocks per SM in CUDA semantics), the number of registers per thread is correctly limited to 128; however, device functions compiled in VecGeom require more, resulting in link-time errors:

[7/30] Linking CUDA shared library CMakeFiles/celeritas_final.dir/cmake_device_link.o
FAILED: CMakeFiles/celeritas_final.dir/cmake_device_link.o 
/home/esseivaj/bld4/devel/spack/var/spack/environments/celeritas/.spack-env/view/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/cvmfs/sft.cern.ch/lcg/releases/gcc/11.3.0-ad0f5/x86_64-centos7/bin/g++ -Xcompiler -Wno-psabi -O3 -DNDEBUG --generate-code=arch=compute_80,code=[compute_80,sm_80] /home/esseivaj/bld4/devel/celeritas/build-ndebug/lib64/libceleritas_static.a /home/esseivaj/bld4/devel/celeritas/build-ndebug/lib64/libcorecel_static.a /home/esseivaj/bld4/devel/spack/var/spack/environments/celeritas/.spack-env/view/lib64/libvecgeomcuda_static.a -Xnvlink --suppress-stack-size-warning -Xcompiler=-fPIC -Wno-deprecated-gpu-targets -shared -dlink src/celeritas/CMakeFiles/celeritas_final.dir/CMakeFiles/celeritas_emptyfile.cu.o -o CMakeFiles/celeritas_final.dir/cmake_device_link.o  -lrt -lpthread -lcudadevrt -lcudart
nvlink error   : entry function '__nv_static_57__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_180862__ZN9celeritas68_GLOBAL__N__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_18086226launch_bounded_action_implINS_24ConditionalTrackExecutorINS_6detail22IsAlongStepActionEqualENS3_18PropagationApplierINS3_29UniformFieldPropagatorFactoryEvEEEELi256ELi16ELi2EEEvNS_5RangeINS_8OpaqueIdINS_6ThreadEjEEEET_' with max regcount of 128 calls function '_ZNK7vecgeom4cuda22VSafetyEstimatorHelperINS0_18BVHSafetyEstimatorEE13ComputeSafetyERKNS0_8Vector3DIdEERKNS0_13NavStateIndexE' with regcount of 139
nvlink error   : entry function '__nv_static_57__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_180862__ZN9celeritas68_GLOBAL__N__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_18086226launch_bounded_action_implINS_24ConditionalTrackExecutorINS_6detail22IsAlongStepActionEqualENS3_18PropagationApplierINS3_29UniformFieldPropagatorFactoryEvEEEELi256ELi16ELi2EEEvNS_5RangeINS_8OpaqueIdINS_6ThreadEjEEEET_' with max regcount of 128 calls function '_ZNK7vecgeom4cuda30CommonUnplacedVolumeImplHelperINS0_22PolyconeImplementationINS0_9ConeTypes13UniversalConeEEENS0_16UnplacedPolyconeEE13DistanceToOutERKNS0_8Vector3DIdEESB_d' with regcount of 156
nvlink error   : entry function '__nv_static_57__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_180862__ZN9celeritas68_GLOBAL__N__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_18086226launch_bounded_action_implINS_24ConditionalTrackExecutorINS_6detail22IsAlongStepActionEqualENS3_18PropagationApplierINS3_29UniformFieldPropagatorFactoryEvEEEELi256ELi16ELi2EEEvNS_5RangeINS_8OpaqueIdINS_6ThreadEjEEEET_' with max regcount of 128 calls function '_ZNK7vecgeom4cuda30CommonUnplacedVolumeImplHelperINS0_22PolyconeImplementationINS0_9ConeTypes13UniversalConeEEENS0_16UnplacedPolyconeEE12DistanceToInERKNS0_8Vector3DIdEESB_d' with regcount of 158
nvlink error   : entry function '__nv_static_57__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_180862__ZN9celeritas68_GLOBAL__N__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_18086226launch_bounded_action_implINS_24ConditionalTrackExecutorINS_6detail22IsAlongStepActionEqualENS3_18PropagationApplierINS3_29UniformFieldPropagatorFactoryEvEEEELi256ELi16ELi2EEEvNS_5RangeINS_8OpaqueIdINS_6ThreadEjEEEET_' with max regcount of 128 calls function '_ZNK7vecgeom4cuda30CommonUnplacedVolumeImplHelperINS0_24PolyhedronImplementationIL11EInnerRadii0EL10EPhiCutout0EEENS0_15VUnplacedVolumeEE11SafetyToOutERKNS0_8Vector3DIdEE' with regcount of 220
nvlink error   : entry function '__nv_static_57__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_180862__ZN9celeritas68_GLOBAL__N__9ea30712_28_AlongStepUniformMscAction_cu_ad79c574_18086226launch_bounded_action_implINS_24ConditionalTrackExecutorINS_6detail22IsAlongStepActionEqualENS3_18PropagationApplierINS3_29UniformFieldPropagatorFactoryEvEEEELi256ELi16ELi2EEEvNS_5RangeINS_8OpaqueIdINS_6ThreadEjEEEET_' with max regcount of 128 calls function '_ZNK7vecgeom4cuda30CommonUnplacedVolumeImplHelperINS0_21GenTrapImplementationENS0_15VUnplacedVolumeEE13DistanceToOutERKNS0_8Vector3DIdEES8_d' with regcount of 254
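
For context on the 128-register cap, a worked check (assuming the CC 8.0 target from the compute_80 build above, with 65536 registers per SM):

// __launch_bounds__(256, 2) promises two resident 256-thread blocks per SM
constexpr int regs_per_sm = 65536;                            // CC 8.0
constexpr int max_regs_per_thread = regs_per_sm / (2 * 256);  // == 128
// nvlink then rejects calls into VecGeom device functions compiled to use
// more (139-254 registers above); linking would require building VecGeom
// itself with a matching cap, e.g. via nvcc's --maxrregcount.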

esseivaju (Contributor, author):

@sethrj I think this is ready, save for default values for max_block_size, but I wouldn't use any: either we don't specify a launch bound, or we use optimal bounds after profiling.

sethrj (Member) left a review comment:

This is great stuff: clever but neat and simple. I've got just a couple of comments.

Resolved review threads (outdated): src/celeritas/global/ApplierInterface.hh (×2), src/celeritas/global/ActionLauncher.device.hh (×4), src/celeritas/global/ApplierTrait.hh (×3)
sethrj self-requested a review, August 1, 2023 23:32
sethrj changed the title from "Templated __launch_bounds__ kernel launch in ActionLauncher" to "Add __launch_bounds__ to kernels based on Executor properties", Aug 2, 2023
sethrj merged commit 0649a62 into celeritas-project:develop, Aug 2, 2023
sethrj added the "core (Software engineering infrastructure)" label, Aug 2, 2023
sethrj (Member) commented Oct 5, 2023:

@esseivaju it looks like the min_warps_per_eu in UniformFieldPropagatorFactory caused a serious performance regression (overall a factor of 3 slower!) on AMD by forcing the occupancy too high:

 {
 "const_mem": 0,
 "heap_size": 68702699520,
-"local_mem": 424,
-"max_blocks_per_cu": 4,
-"max_threads_per_block": 1024,
-"max_warps_per_eu": 4,
+"local_mem": 744,
+"max_blocks_per_cu": 8,
+"max_threads_per_block": 256,
+"max_warps_per_eu": 8,
 "name": "along-step-uniform-msc-propagate",
-"num_regs": 128,
-"occupancy": 0.5,
+"num_regs": 64,
+"occupancy": 1.0,
 "print_buffer_size": 0,
 "threads_per_block": 256
 },

The equivalent kernel in CUDA (V100) is

{
"const_mem": 0,
"heap_size": 8388608,
"local_mem": 0,
"max_blocks_per_cu": 1,
"max_threads_per_block": 256,
"max_warps_per_eu": 8,
"name": "along-step-uniform-msc-propagate",
"num_regs": 184,
"occupancy": 0.125,
"print_buffer_size": 5242880,
"stack_size": 1024,
"threads_per_block": 256
}
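
For reference, the 0.125 occupancy above is consistent with V100 limits (a worked check, assuming 65536 registers and at most 64 resident warps per SM):

constexpr int regs_per_block = 184 * 256;                     // 47104 registers
constexpr int resident_blocks = 65536 / regs_per_block;       // == 1 per SM
constexpr int resident_warps = resident_blocks * (256 / 32);  // == 8
// occupancy = 8 resident warps / 64 maximum = 0.125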

Maybe we should be using "blocks per CU" as the criterion, as you originally suggested, and scale it for HIP...
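
A sketch of that alternative (hypothetical helper, not code from this PR): take minimum blocks per CU as the user-facing parameter, pass it straight to CUDA's __launch_bounds__, and convert to warps per EU for HIP (4 EU per CU and wavefront size 64 on AMD):

constexpr int min_warps_per_eu(int min_blocks_per_cu, int threads_per_block)
{
    constexpr int eu_per_cu = 4;        // an AMD CU contains 4 EUs (SIMDs)
    constexpr int wavefront_size = 64;  // threads per wavefront on AMD
    return (min_blocks_per_cu * threads_per_block)
           / (eu_per_cu * wavefront_size);
}

// e.g. min_blocks_per_cu=2 with 256-thread blocks gives 2 warps per EU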

esseivaju deleted the launch-bounds branch, October 26, 2023 16:15
sethrj added the "performance (Changes for performance optimization)" label and removed the "core (Software engineering infrastructure)" label, Nov 14, 2023