
Cuda performance #2 #266

Closed

adam-ce opened this issue Nov 6, 2014 · 7 comments

adam-ce commented Nov 6, 2014

Hi,
I experimented a bit more with my tests of GLM's CUDA performance. My testing environment is again Linux, CUDA 6.5 and a GeForce GTX 550 Ti. GLM's released version 0.9.5.4 was used, as the current trunk doesn't compile on CUDA.

The result in short

GLM is still 12% behind CUDA's native types and helper_math.h in certain cases.

About the tests made

In the last bug report (issue #257) I used only test1 (watch out, the link contains the revision number). It was quite synthetic and therefore not very representative. In my own code GLM still turned out to be slower.

So I made a sort of minimal example just for testing GLM, based on my code: test 2a. This is the example showing that GLM is 12% behind CUDA.

One interesting thing is that when the "early exit" is removed from the for loop in glm/cudaKernel, the difference between GLM and native CUDA is much smaller (and the overall performance is better): test 2b. Note that the only difference compared to test 2a is in lines 265ff.
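For illustration, a minimal sketch of the kind of loop meant here (not the actual test kernel; the names, the iteration count and the break condition are made up):

#include <glm/glm.hpp>

// Hypothetical kernel showing the "early exit" pattern of test 2a.
// Test 2b corresponds to removing the break, so every thread runs the full loop.
__global__ void accumulate(const glm::vec4* data, float* out, int n, float limit)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float sum = 0.0f;
    for (int k = 0; k < 64; ++k)
    {
        sum += glm::dot(data[i], data[(i + k) % n]);
        if (sum > limit)   // the "early exit"; dropping this line gives the test 2b variant
            break;
    }
    out[i] = sum;
}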

Test results

I already found some ways to improve GLM's performance (see below); here are the results after the best/fastest changes:

#test 1
CUDA kernel launch with 19532 blocks of 256 threads
time for cuda glm (matrix):         546 milliseconds
time for cuda helper math (matrix): 660 milliseconds
time for cuda glm (dot):            471 milliseconds
time for cuda helper math (dot):    491 milliseconds
time for cuda glm (cross):          246 milliseconds
time for cuda helper math (cross):  246 milliseconds
#test 2a
time for glm:   468 milliseconds
time for cuda:  417 milliseconds
#test 2b
time for glm:   373 milliseconds
time for cuda:  370 milliseconds

I made a file containing all test results.

Code changes

  • One important change is aligning vec4 to 16 bytes; this is the same as in issue Bad matrix-vector multiplication performance in Cuda #257 (see the sketch after this list).
  • I also aligned vec2 to 8 bytes, but didn't test it.
  • Aligning vec3 to 16 bytes gives improvements in the synthetic test, but in test 2 it gives worse performance. The reason is probably different loads and register/memory usage.
  • Removing all const references from the base classes' (mat4, vec3, vec4 etc.) methods gave an improvement in some cases. Surprisingly, test 2a didn't change but test 2b did; this might be an error in my testing method, though I checked it several times and couldn't find anything. It seems like passing by value is faster on the GPU, but in test 2a this is shadowed by some other bottleneck. The usage of const references is controlled by the "#define GLM_REFERENCE const &" in setup.hpp in my version of GLM.
  • I experimented with operator*(mat4, vec4), which changed the performance only when const references were used.
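As a rough illustration of what the alignment change amounts to, a minimal standalone sketch (not the actual GLM patch; the macro and struct names are made up):

// 16-byte aligned vec4 and 8-byte aligned vec2, mirroring CUDA's float4/float2.
#if defined(__CUDACC__)
#   define MY_ALIGN(x) __align__(x)
#elif defined(__GNUC__)
#   define MY_ALIGN(x) __attribute__((aligned(x)))
#else
#   define MY_ALIGN(x) __declspec(align(x))
#endif

struct MY_ALIGN(16) vec4_aligned_sketch { float x, y, z, w; };  // sizeof == 16, alignof == 16
struct MY_ALIGN(8)  vec2_aligned_sketch { float x, y; };        // sizeof == 8,  alignof == 8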

Conclusion

There is still some issue with GLM's CUDA performance. I would be happy to continue testing if you give me some ideas on what to test. I just hope it's not GLM's elaborate usage of templates.

Something that could lead us to the solution is the difference between test 2a and 2b. It could mean that loading something (instructions?) from memory is more expensive in GLM, and the early exit causes cache misses or something like that. But I'm just speculating.

@Groovounet Groovounet added this to the GLM 0.9.6 milestone Nov 23, 2014
@Groovounet Groovounet self-assigned this Nov 23, 2014
Groovounet pushed a commit that referenced this issue Nov 23, 2014

adam-ce commented Nov 23, 2014

Groovounet wrote:

I created a new extension exposing aligned types. Alignment is definitely not what we always want even if it's extremely useful.

If you want an aligned flavor of a vec4, include the aligned type extension header and you can use aligned_vec4.

You can also define your own aligned type in a cross platform manner using:
GLM_ALIGNED_TYPEDEF(vec3, my_vec3, 16);

Where my_vec3 is a vec3 aligned to 16 bytes.

Somebody could easily miss that when starting to use GLM with CUDA. Besides that, writing glm::aligned_vec4 would be cumbersome, and in CUDA you would always use the aligned version anyway.

I think something like

#ifndef __NVCC__
    typedef tvec4<...> vec4;                    // plain, unaligned vec4 outside of CUDA
#else
    GLM_ALIGNED_TYPEDEF(tvec4<...>, vec4, 16);  // 16-byte aligned vec4 when compiling with nvcc
#endif

And the same for vec2 would be better.

Thanks for your work anyway :)

Groovounet (Member) commented:

I don't think this is true. Actually, if you pack your structures correctly and use an aligned memory allocator such as _aligned_malloc (with GCC; I don't know about CUDA), you probably don't even need aligned types.

So anyway, I think both are somewhat useful, hence the solution to expose more types. :p

On the contrary, when requiring specific alignment, data won't be as well packed, which might consume more bandwidth and cause additional cache line fetches.


adam-ce commented Nov 23, 2014

Hehe, this issue is only about CUDA, so GCC doesn't matter here IMHO. Neither was issue #257 about GCC.

And I agree that both data types can be useful with GCC, Intel or MS compilers, but an unaligned vec4 is not useful in CUDA. Performance will always be worse, as shown in the fairly extensive benchmarks. The NVIDIA engineers certainly know what they are doing, and they aligned their native float4 and float2.

If you want to expose the unaligned vec4 in CUDA, an unaligned_vec4 datatype would make more sense, but the default should really be an aligned datatype. Oh, and the same also goes for ivec4, ivec2 etc.

Moreover, this issue is not closed by any means :) Performance is still behind CUDA's native types. Maybe you mistook it for issue #257? I should have commented there, but I'm unsure whether you see comments on closed issues.

Groovounet (Member) commented:

I strongly disagree with your point of view. Alignment is compiler independent, apart from the fact that compilers require a minimum global alignment.

Alignment is a real topic that should not be handled lightly. Sure, in your test cases it always shows better performance, but you are looking at a single scenario that is SoA based. With AoS cases, you will be very happy to use an unaligned vec3 + a float if that layout makes sense in your real-life scenario, and such scenarios happen all the time: processing vertices in a compute shader, for example.

Thanks,
Christophe


adam-ce commented Nov 23, 2014

Right now I was talking only about vec4 and vec2. I agree vec3 shouldn't be aligned to 16 bytes :)

Moreover, I didn't provide only one test case, I provided 3 different ones: one of them taken from real life, 2 of them having AoS (granted, small ones). All of them are clearly faster with an aligned vec4.

And granted, it's not about the compiler, it's about the architecture. Unaligned fetches are quite expensive on the GPU. An unaligned vec4 can need two fetches instead of one; multiply that by the 1024 threads that can be in one group and you have a serious bottleneck. There is a reason the native float2 and float4 are aligned in CUDA :)
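To illustrate the fetch argument (a minimal sketch, not taken from the tests; the struct and kernel names are made up): a 16-byte aligned element can be read with a single 128-bit vectorized load, while a plain 4-float struct only guarantees 4-byte alignment and is generally read with several narrower loads.

struct vec4_unaligned { float x, y, z, w; };              // alignof == 4
struct __align__(16) vec4_aligned { float x, y, z, w; };  // alignof == 16, like CUDA's float4

// With vec4_aligned the compiler may emit one 128-bit load per element;
// with vec4_unaligned it typically emits several 32-bit loads, i.e. more
// memory transactions per warp.
template <typename V>
__global__ void sum_components(const V* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        V v = in[i];
        out[i] = v.x + v.y + v.z + v.w;
    }
}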

With the situation of having to use aligned_vec4 explicitly there are two issues in my opinion. First, it'll be hard to find. And second, it is a lot to write, code gets longer, etc.

With the proposed solution nothing would change for non-CUDA projects, and CUDA projects would benefit from a faster GLM.

Additionally, it hurts a bit that you closed this issue/enhancement without even commenting on the performance issue. I put a lot of work into developing the tests and documenting the results; right now it seems a bit as if you didn't even read it. I mean, I like GLM a lot and I would understand if you said that you don't have time or something, but then it shouldn't be closed; instead it should wait for later, IMO.

Groovounet (Member) commented:

Another example: struct{vec3, vec2, vec3};
I have seen nothing like this in your tests.
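For concreteness, a sketch of the padding cost such a layout would pay if every vector type were force-aligned (plain C++, sizes assume 4-byte floats; the aligned stand-ins below are illustrative, not GLM types):

struct vec3_plain { float x, y, z; };                 // sizeof == 12
struct vec2_plain { float x, y; };                    // sizeof == 8
struct alignas(16) vec3_aligned { float x, y, z; };   // sizeof == 16 (4 bytes of padding)
struct alignas(8)  vec2_aligned { float x, y; };      // sizeof == 8

struct vertex_plain   { vec3_plain   a; vec2_plain   b; vec3_plain   c; };  // 32 bytes, tightly packed
struct vertex_aligned { vec3_aligned a; vec2_aligned b; vec3_aligned c; };  // 48 bytes due to alignment padding

static_assert(sizeof(vertex_plain)   == 32, "packed layout");
static_assert(sizeof(vertex_aligned) == 48, "padding from forced alignment");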

I think it's a terrible idea to expose different behaviors on different platforms, especially since alignment is not a platform-specific issue. Furthermore, systematically aligning vec4 is not consistent with how the rest of the world handles this; it's not what people would expect, so it's not a good idea.

aligned_vec4 resolves the misaligned case of SoA, and I am confident the current resolution is effective for all data structure scenarios.

I won't be commenting further; this case is closed to me.

Thanks,
Christophe


fhoenig commented Feb 3, 2015

Hey, I know you guys closed this, but I just looked at some PTX output using GLM and it seems rather verbose. Has anybody done more in-depth tests on CUDA/GLM?
