
adds dynamic shared mem allocation to cuda kernels #413

Open · wants to merge 4 commits into base: main
Conversation

kings177
Member

with this, hvm.cu should work on virtually every NVIDIA GPU out there (assuming compute capability 5.0 and above). it dynamically allocates shared memory based on the GPU's capabilities, specifically 3072 fewer bytes than the max opt-in shared mem available, as some shared arrays use roughly (a little less than) that amount of shared mem.

since the shared mem allocation needs to be known at compile time, get_shared_mem.cu calculates the available shared mem at build time; it is run by build.rs, which then generates a header file shared_mem_config.h with the correct hex value for the local net.
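
A minimal sketch of what get_shared_mem.cu amounts to under this scheme (the exact file may differ; the 3072-byte reserve and the hex output follow the description above):

```cuda
// get_shared_mem.cu -- sketch, assuming the behavior described above.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int dev = 0, maxSharedMem = 0;
  // Query the maximum opt-in shared memory per block for this device.
  cudaError_t err = cudaDeviceGetAttribute(
      &maxSharedMem, cudaDevAttrMaxSharedMemoryPerBlockOptin, dev);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaDeviceGetAttribute failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  // Reserve 3072 bytes for the statically allocated shared arrays.
  maxSharedMem -= 3072;
  // Emit the value as hex; build.rs captures stdout and writes the header.
  printf("0x%X\n", maxSharedMem);
  return 0;
}
```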

Closes: #283 and #314 (supposedly)

@kings177 kings177 requested review from enricozb, developedby and VictorTaelin and removed request for enricozb August 16, 2024 22:02
Comment on lines +15 to +16
```cuda
// Subtract 3KB (3072 bytes) from the max shared memory, as that amount is allocated elsewhere
maxSharedMem -= 3072;
```
Contributor

How did you compute this? I think this is from the automatic variables in some of the kernels. I think overall this is a fine approach for now, but if we change the 3KB alloc in the runtime, this will have to change too.

Member Author

the sum of the alloc sizes of some local shared arrays is roughly ~3KB. i know this is not the perfect way to do it, but it works for now.
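
For context, the underlying CUDA pattern looks roughly like the sketch below (illustrative only, not HVM's actual kernel; the array names and sizes are hypothetical): statically declared `__shared__` arrays are carved out of the per-block budget at compile time, so the dynamic `extern __shared__` region has to be sized to whatever remains after them.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: the static arrays below stand in for the ~3KB of
// local shared arrays mentioned above (names and sizes are hypothetical).
__global__ void interact(unsigned int* out) {
  __shared__ unsigned int stat_a[512];  // 2048 bytes, static
  __shared__ unsigned int stat_b[256];  // 1024 bytes, static -> 3072 bytes total
  extern __shared__ unsigned int dyn[]; // dynamic region, sized at launch
  stat_a[threadIdx.x] = threadIdx.x;
  stat_b[threadIdx.x] = threadIdx.x;
  __syncthreads();
  dyn[threadIdx.x] = stat_a[threadIdx.x] + stat_b[threadIdx.x];
  if (threadIdx.x == 0) *out = dyn[0];
}

int main() {
  int maxSharedMem = 0;
  cudaDeviceGetAttribute(&maxSharedMem, cudaDevAttrMaxSharedMemoryPerBlockOptin, 0);
  int dynBytes = maxSharedMem - 3072; // leave room for the static arrays
  // Opt in to more than the default 48KB of dynamic shared memory.
  cudaFuncSetAttribute(interact, cudaFuncAttributeMaxDynamicSharedMemorySize, dynBytes);
  unsigned int* out;
  cudaMalloc(&out, sizeof(unsigned int));
  interact<<<1, 256, dynBytes>>>(out);
  cudaDeviceSynchronize();
  cudaFree(out);
  return 0;
}
```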

```diff
 // Local Net
-const u32 L_NODE_LEN = 0x2000;
+const u32 L_NODE_LEN = HVM_SHARED_MEM;
 const u32 L_VARS_LEN = 0x2000;
```
Member

Aren't there places in the code that expect L_NODE_LEN to have this exact value?

If you change it, I think we'll start to get either memory leaks, out-of-bounds accesses, or non-local memory use.

Member

Also, if this number is smaller than the default, the programs compiled by Bend won't run.

If it's smaller, the performance will also tank incredibly fast (see the issue where someone halved this number and got worse performance than on the CPU).

Member Author

> Aren't there places in the code that expect L_NODE_LEN to have this exact value?

i wouldn't say that it expects this exact value, but rather that it expects the numbers of a device with compute capability 8.9. then again, this is not a general optimization of the CUDA version; it's just meant to make the shared mem allocation for the local net dynamic, so that users can install the hvm without having to manually change it in hvm.cu. the rest of the code is still the same, written with the numbers of a 4090 in mind. we can slowly improve the code to make it device-based, instead of hard-coded to the specs of a 4090.

> Also, if this number is smaller than the default, the programs compiled by Bend won't run.

can you show me an example of what was run, and what device/numbers were used when this happened?

> If it's smaller, the performance will also tank incredibly fast (see the issue where someone halved this number and got worse performance than on the CPU).

i mean, that's expected: if you give it less memory, it will have less memory, besides the fact that the rest of the code is 'optimized' for a 4090.

we could, whenever someone installs it on a GPU with <99KB of max shared mem per block, emit a warning saying something along the lines of:
"HVM is currently optimized to run on devices with >=96KB of max shared mem, please be aware that your GPU performance will be reduced dramatically"

Member

> > Also, if this number is smaller than the default, the programs compiled by Bend won't run.
>
> can you show me an example of what was run, and what device/numbers were used when this happened?

The number of nodes in the local buffer determines the maximum size of HVM definitions. If we decrease L_NODE_LEN, then Bend also needs to generate programs with a smaller maximum definition size.

build.rs Outdated
```rust
.and_then(|_| Command::new("./get_shared_mem").output()) {
    if output.status.success() {
        let shared_mem_str = String::from_utf8_lossy(&output.stdout).trim().to_string();
        std::fs::write("src/shared_mem_config.h", format!("#define HVM_SHARED_MEM {}", shared_mem_str))
```
Member

Instead of writing it to a build-generated file, can't we just set a define flag, like we do for the number of cores? It's probably possible with nvcc as well.

Member Author

true, that sounds cleaner indeed. will change it to that
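
For illustration, the consuming side in hvm.cu might then look like the sketch below, assuming the value arrives via an nvcc define flag (e.g. `-DHVM_SHARED_MEM=0x...`); the fallback shown is the previous hard-coded value:

```cuda
// Sketch: consume a compile-time define instead of a generated header.
#ifndef HVM_SHARED_MEM
#define HVM_SHARED_MEM 0x2000 // fall back to the old hard-coded value
#endif

const u32 L_NODE_LEN = HVM_SHARED_MEM;
```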

Successfully merging this pull request may close these issues: CUDA not available or Failed to Launch Kernels (error code invalid argument)