
Bad matrix-vector multiplication performance in Cuda #257

Closed
adam-ce opened this issue Oct 28, 2014 · 15 comments

@adam-ce

adam-ce commented Oct 28, 2014

During development I noticed that glm slowed down my project quite a bit, so I made some simple tests with CUDA + glm.

I basically do

result[i] = matrix * vectors[i];

several times in a CUDA kernel, once using glm and once using CUDA's float4 plus a very simple custom mat4.
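
For context, a rough sketch of what such a benchmark kernel looks like (illustrative only; the actual test is in the repository linked below, and the grid/block setup is omitted):

    #include <glm/glm.hpp>

    // One thread per vector: result[i] = matrix * vectors[i]
    __global__ void transform(const glm::mat4 matrix,
                              const glm::vec4* __restrict__ vectors,
                              glm::vec4* __restrict__ result,
                              int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            result[i] = matrix * vectors[i];  // the line being timed
    }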

time for cuda glm (matrix): 859 milliseconds
time for cuda helper math (matrix): 225 milliseconds

Interestingly, dot and cross product are faster with glm in a similar test:

time for cuda glm (dot and cross): 175 milliseconds
time for cuda helper math (dot and cross): 248 milliseconds

I used a GeForce GTX 550 Ti, CUDA 6.5, and GLM 0.9.5.4 on Linux for the test.

I've put the full test on Bitbucket (https://bitbucket.org/adamce/cuda-glm-performance-test/src/).

  • the test is in src/main.cu
  • the custom mat4 and its multiplication operator are at the very bottom of helper_math.h
@Groovounet
Member

Very interesting.

It's possible that the strategies in place are a better fit for SSE register-size allocations than for scalar GPUs. Actually, the implementation is really based on SSE registers.

It seems very possible to come up with a better strategy for scalar GPUs. I'll investigate when I find the time for this.

Thanks,
Christophe

@adam-ce
Author

adam-ce commented Oct 28, 2014

Could you give a rough estimate so that I can decide whether to use glm for my project?

Thanks, adam

@Groovounet
Member

Sadly I can't.

@adam-ce
Author

adam-ce commented Oct 28, 2014

OK : )

I made one additional test by replacing operator*(..) with my own function:

 __device__ __host__ glm::vec4 mul(glm::mat4 m, glm::vec4 v) {
     return glm::vec4(m[0].x*v.x + m[1].x*v.y + m[2].x*v.z + m[3].x*v.w,
                      m[0].y*v.x + m[1].y*v.y + m[2].y*v.z + m[3].y*v.w,
                      m[0].z*v.x + m[1].z*v.y + m[2].z*v.z + m[3].z*v.w,
                      m[0].w*v.x + m[1].w*v.y + m[2].w*v.z + m[3].w*v.w);
 }

The performance didn't change; it's still 4 times slower. Therefore I think it might have something to do with how the matrices are stored: my custom code uses row-major storage, while glm uses column-major as far as I know. On the other hand, I also tried storing row-major and flipping my custom mul, with no success.

Bottom line: I think it's neither the row-major storage nor the multiplication function, but something else.

@chbaker0

Looking in glm/detail/type_vec4.hpp, I saw that the storage type used by vec4 is this:

    template <typename T>
    struct simd
    {
        typedef T type[4];
    };

#   if(GLM_ARCH & GLM_ARCH_SSE2)
        template <>
        struct simd<float>
        {
            typedef __m128 type;
        };
#   endif

#   if(GLM_ARCH & GLM_ARCH_AVX)
        template <>
        struct simd<double>
        {
            typedef __m256d type;
        };
#   endif

And mat4 is just an array of 4 vec4. It's reasonable to believe that CUDA's built-in float4 has a specific (16-byte) alignment, while the plain simd struct doesn't. Just speculation of course, but you might want to try applying alignment and see how that affects performance.
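
One quick way to check that speculation (illustrative sketch, not part of the original comment; assumes C++11 alignof and the CUDA headers):

    #include <cuda_runtime.h>   // defines float4 with __align__(16)
    #include <glm/glm.hpp>

    static_assert(alignof(float4) == 16, "CUDA's float4 is 16-byte aligned");
    // In a CUDA build (no SSE path), stock GLM 0.9.5.4 leaves glm::vec4 with the
    // natural 4-byte float alignment, so the equivalent assert fails until
    // alignment is applied:
    // static_assert(alignof(glm::vec4) == 16, "fails without the alignment patch");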

@adam-ce
Author

adam-ce commented Oct 28, 2014

Yay : D

Those are the new benchmarks 8)

time for cuda glm (matrix): 233 milliseconds
time for cuda helper math (matrix): 225 milliseconds
time for cuda glm (dot): 187 milliseconds
time for cuda helper math (dot): 307 milliseconds
time for cuda glm (cross): 45 milliseconds
time for cuda helper math (cross): 163 milliseconds

The matrix multiplication is not totally there yet, but very close. The other two are even faster than before.

This is a diff against version 0.9.5.4:

diff -r 653704ab2abc src/glm/detail/type_vec4.hpp
--- a/src/glm/detail/type_vec4.hpp  Tue Oct 28 19:35:36 2014 +0100
+++ b/src/glm/detail/type_vec4.hpp  Tue Oct 28 23:24:52 2014 +0100
@@ -29,6 +29,11 @@
#ifndef glm_core_type_gentype4
#define glm_core_type_gentype4

+#if !defined(__CUDA_LIBDEVICE__)
+#include "builtin_types.h"
+#endif /* !__CUDA_LIBDEVICE__ */
+#include "host_defines.h"
+
//#include "../fwd.hpp"
#include "setup.hpp"
#include "type_vec.hpp"
@@ -45,7 +50,7 @@
namespace detail
{
    template <typename T, precision P>
-   struct tvec4
+    struct __builtin_align__(16) tvec4
    {
        //////////////////////////////////////
        // Implementation detail

But I'm afraid it can't go into master like that. Anyway, it fixes glm for me now, and I can use it until the change goes in. Thanks, I'm happy.

@Groovounet
Member

Interesting results.

fvec4SIMD does have this alignment code in place, and yes, it's pretty important for CPU SSE code paths.

Do you actually need:
+#if !defined(__CUDA_LIBDEVICE__)
+#include "builtin_types.h"
+#endif /* !__CUDA_LIBDEVICE__ */
+#include "host_defines.h"
?

@adam-ce
Author

adam-ce commented Oct 28, 2014

Hm, it seems that I don't need them. I originally put them in because Qt Creator showed me an error.

But at the moment I'm not sure of anything, because after messing around with glm's mat4*vec4 operator, CUDA's helper math matrix operations suddenly take 333 milliseconds (100 more). So I'll reboot and recheck. (edit: after the reboot it's back to 225)

Is it possible to enable the alignment with some define or something, so that I can use vanilla glm?

One more bit of info:
I don't know if it's significant, but using

 return typename tmat4x4<T, P>::col_type(
        m[0][0] * v[0] + m[1][0] * v[1] + m[2][0] * v[2] + m[3][0] * v[3],
        m[0][1] * v[0] + m[1][1] * v[1] + m[2][1] * v[2] + m[3][1] * v[3],
        m[0][2] * v[0] + m[1][2] * v[1] + m[2][2] * v[2] + m[3][2] * v[3],
        m[0][3] * v[0] + m[1][3] * v[1] + m[2][3] * v[2] + m[3][3] * v[3]);

instead of the code with Mov0(v[0]) gives an improvement from 233 to 230 msec.

@Groovounet
Member

Use GLM_ALIGN(16) instead of __builtin_align__(16) and you might be good to go...
Erm, actually I probably need to update setup.hpp that defines GLM_ALIGN to support CUDA...

But that's probably all that is needed here.

@adam-ce
Author

adam-ce commented Oct 28, 2014

diff -r b492363e33cf src/glm/detail/setup.hpp
--- a/src/glm/detail/setup.hpp  Tue Oct 28 23:28:14 2014 +0100
+++ b/src/glm/detail/setup.hpp  Wed Oct 29 00:23:18 2014 +0100
@@ -755,7 +755,7 @@
#  define GLM_ALIGNED_STRUCT(x) __declspec(align(x)) struct
#  define GLM_RESTRICT
#  define GLM_RESTRICT_VAR __restrict
-#elif(GLM_COMPILER & (GLM_COMPILER_GCC | GLM_COMPILER_CLANG))
+#elif(GLM_COMPILER & (GLM_COMPILER_GCC | GLM_COMPILER_CLANG | GLM_COMPILER_CUDA))
#  define GLM_DEPRECATED __attribute__((__deprecated__))
#  define GLM_ALIGN(x) __attribute__((aligned(x)))
#  define GLM_ALIGNED_STRUCT(x) struct __attribute__((aligned(x)))
diff -r b492363e33cf src/glm/detail/type_vec4.hpp
--- a/src/glm/detail/type_vec4.hpp  Tue Oct 28 23:28:14 2014 +0100
+++ b/src/glm/detail/type_vec4.hpp  Wed Oct 29 00:23:18 2014 +0100
@@ -50,7 +45,7 @@
namespace detail
{
    template <typename T, precision P>
-    struct tvec4
+    GLM_ALIGNED_STRUCT(16) tvec4
    {
        //////////////////////////////////////
        // Implementation detail

This works for me.

@Groovounet Groovounet added the bug label Oct 28, 2014
@Groovounet Groovounet added this to the GLM 0.9.6 milestone Oct 28, 2014
@Groovounet Groovounet self-assigned this Oct 28, 2014
@Groovounet
Member

These changes made it to master for the GLM 0.9.6 release.

Thanks for contributing!
Christophe

@adam-ce
Author

adam-ce commented Oct 29, 2014

Do you think it would also be possible to add the alignment rules for vec2 and vec3?

It would be align(8) for vec2 and align(16) for vec3.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses
edit: cross that for vec3, it's not aligned in CUDA.

I did some tests with the cross kernel:
float4
time for cuda glm (cross): 44 milliseconds
time for cuda helper math (cross): 44 milliseconds

float3
time for cuda glm (cross): 87 milliseconds (this is with align(16); before adding it, glm was also at 98 msec)
time for cuda helper math (cross): 98 milliseconds

I don't understand why vec3 aligned to 16 bytes is slower than vec4. It might be the testing method. But the thing is that the CUDA docs recommend align(16), so it should be better.

@Groovounet
Member

The fix for this bug turns out not to be valid, as it would force all vec4 types to be aligned to 16 bytes, including i8vec4 for example.

The fix has been reverted for the moment.

Groovounet pushed a commit that referenced this issue Nov 23, 2014
@Groovounet
Member

I created a new extension exposing aligned types. Alignment is definitely not something we always want, even if it's extremely useful.

If you want an aligned flavor of a vec4, include <glm/gtc/type_aligned.hpp> and you can use aligned_vec4.

You can also define your own aligned type in a cross-platform manner using:
GLM_ALIGNED_TYPEDEF(vec3, my_vec3, 16);

where my_vec3 is a vec3 aligned to 16 bytes.
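
A minimal usage sketch based on the description above (only aligned_vec4 and GLM_ALIGNED_TYPEDEF come from the extension; the variable names are just for illustration):

    #include <glm/glm.hpp>
    #include <glm/gtc/type_aligned.hpp>

    using namespace glm;

    // Ready-made aligned flavor shipped by the extension:
    aligned_vec4 position(0.0f, 0.0f, 0.0f, 1.0f);

    // Custom aligned flavor, defined in a cross-platform manner:
    GLM_ALIGNED_TYPEDEF(vec3, my_vec3, 16);  // my_vec3 is a vec3 aligned to 16 bytes
    my_vec3 normal(0.0f, 1.0f, 0.0f);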

Thanks for contributing,
Christophe

@Groovounet
Member

Hi,

What's left to fix here?
Thanks,
Christophe
