
Bad matrix-vector multiplication performance in Cuda #257

Closed
adam-ce opened this issue Oct 28, 2014 · 15 comments

@adam-ce

adam-ce commented Oct 28, 2014

During development I noticed that glm slowed down my project quite a bit, so I made some simple tests with CUDA + glm.

I basically do

result[i] = matrix * vectors[i];

several times in a CUDA kernel, once using glm and once using CUDA's float4 plus a very simple custom mat4.
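
For context, a rough sketch of what such a benchmark kernel looks like (illustrative only; the actual test is in the repository linked below, and the grid/block setup is omitted):

    #include <glm/glm.hpp>

    // One thread per vector: result[i] = matrix * vectors[i]
    __global__ void transform(const glm::mat4 matrix,
                              const glm::vec4* __restrict__ vectors,
                              glm::vec4* __restrict__ result,
                              int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            result[i] = matrix * vectors[i];  // the line being timed
    }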

time for cuda glm (matrix): 859 milliseconds
time for cuda helper math (matrix): 225 milliseconds

Interestingly, dot and cross product are faster with glm in a similar test:

time for cuda glm (dot and cross): 175 milliseconds
time for cuda helper math (dot and cross): 248 milliseconds

I used a GeForce GTX 550 Ti, CUDA 6.5, and GLM 0.9.5.4 on Linux for the test.

I've put the full test on Bitbucket (https://bitbucket.org/adamce/cuda-glm-performance-test/src/).

  • the test is in src/main.cu
  • the custom mat4 and its multiplication operator are at the very bottom of helper_math.h
@Groovounet
Member

Very interesting.

It's possible that the strategies in place are a better fit for SSE register-size allocations than for scalar GPUs. Actually, the implementation is really based on SSE registers.

It seems very possible to come up with a better strategy for scalar GPUs. I'll investigate when I find the time for this.

Thanks,
Christophe

@adam-ce
Author

adam-ce commented Oct 28, 2014

Could you give a rough estimate so that I can decide whether to use glm for my project?

Thanks, adam

@Groovounet
Member

Sadly I can't.

@adam-ce
Author

adam-ce commented Oct 28, 2014

OK : )

I made one additional test by replacing operator*(..) with my own function:

 __device__ __host__ glm::vec4 mul(glm::mat4 m, glm::vec4 v) {
     return glm::vec4(m[0].x*v.x + m[1].x*v.y + m[2].x*v.z + m[3].x*v.w,
                      m[0].y*v.x + m[1].y*v.y + m[2].y*v.z + m[3].y*v.w,
                      m[0].z*v.x + m[1].z*v.y + m[2].z*v.z + m[3].z*v.w,
                      m[0].w*v.x + m[1].w*v.y + m[2].w*v.z + m[3].w*v.w);
 }

The performance didn't change; it's still 4 times slower. Therefore I think it might have something to do with how the matrices are stored: my custom code uses row-major storage, while glm uses column-major as far as I know. On the other hand, I also tried storing row-major and flipping my custom mul, with no success.

Bottom line: I think it's neither the row-major storage nor the multiplication function, but something else.

@chbaker0

Looking in glm/detail/type_vec4.hpp, I saw that the storage type used by vec4 is this:

    template <typename T>
    struct simd
    {
        typedef T type[4];
    };

#   if(GLM_ARCH & GLM_ARCH_SSE2)
        template <>
        struct simd<float>
        {
            typedef __m128 type;
        };
#   endif

#   if(GLM_ARCH & GLM_ARCH_AVX)
        template <>
        struct simd<double>
        {
            typedef __m256d type;
        };
#   endif

And mat4 is just an array of 4 vec4. It's reasonable to believe that CUDA's built-in float4 has a specific (16-byte) alignment, while the plain simd struct doesn't. Just speculation of course, but you might want to try applying alignment and see how that affects performance.
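
One quick way to check that speculation (illustrative sketch, not part of the original comment; assumes C++11 alignof and the CUDA headers):

    #include <cuda_runtime.h>   // defines float4 with __align__(16)
    #include <glm/glm.hpp>

    static_assert(alignof(float4) == 16, "CUDA's float4 is 16-byte aligned");
    // In a CUDA build (no SSE path), stock GLM 0.9.5.4 leaves glm::vec4 with the
    // natural 4-byte float alignment, so the equivalent assert fails until
    // alignment is applied:
    // static_assert(alignof(glm::vec4) == 16, "fails without the alignment patch");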

@adam-ce
Author

adam-ce commented Oct 28, 2014

Yay : D

Those are the new benchmarks 8)

time for cuda glm (matrix): 233 milliseconds
time for cuda helper math (matrix): 225 milliseconds
time for cuda glm (dot): 187 milliseconds
time for cuda helper math (dot): 307 milliseconds
time for cuda glm (cross): 45 milliseconds
time for cuda helper math (cross): 163 milliseconds

The matrix multiplication is not totally there yet, but very close. The other two are even faster than before.

This is a diff against version 0.9.5.4:

diff -r 653704ab2abc src/glm/detail/type_vec4.hpp
--- a/src/glm/detail/type_vec4.hpp  Tue Oct 28 19:35:36 2014 +0100
+++ b/src/glm/detail/type_vec4.hpp  Tue Oct 28 23:24:52 2014 +0100
@@ -29,6 +29,11 @@
#ifndef glm_core_type_gentype4
#define glm_core_type_gentype4

+#if !defined(__CUDA_LIBDEVICE__)
+#include "builtin_types.h"
+#endif /* !__CUDA_LIBDEVICE__ */
+#include "host_defines.h"
+
//#include "../fwd.hpp"
#include "setup.hpp"
#include "type_vec.hpp"
@@ -45,7 +50,7 @@
namespace detail
{
    template <typename T, precision P>
-   struct tvec4
+    struct __builtin_align__(16) tvec4
    {
        //////////////////////////////////////
        // Implementation detail

But I'm afraid it can't go into master like that. Anyway, it fixes glm for me now, and I can use it until the change goes in. Thanks, I'm happy.

@Groovounet
Member

Interesting results.

fvec4SIMD does have this alignment code in place, and yes, it's pretty important for CPU SSE code paths.

Do you actually need:
+#if !defined(__CUDA_LIBDEVICE__)
+#include "builtin_types.h"
+#endif /* !__CUDA_LIBDEVICE__ */
+#include "host_defines.h"
?

@adam-ce
Author

adam-ce commented Oct 28, 2014

Hm, it seems that I don't need them. I originally put them in because Qt Creator showed me an error.

But at the moment I'm not sure of anything, because after messing around with glm's mat4*vec4 operator, CUDA's helper math matrix operations suddenly take 333 milliseconds (100 more). So I'll reboot and recheck. (edit: after the reboot it's back to 225)

Is it possible to enable the alignment with some define or something, so that I can use vanilla glm?

One more bit of info:
I don't know if it's significant, but using

 return typename tmat4x4<T, P>::col_type(
        m[0][0] * v[0] + m[1][0] * v[1] + m[2][0] * v[2] + m[3][0] * v[3],
        m[0][1] * v[0] + m[1][1] * v[1] + m[2][1] * v[2] + m[3][1] * v[3],
        m[0][2] * v[0] + m[1][2] * v[1] + m[2][2] * v[2] + m[3][2] * v[3],
        m[0][3] * v[0] + m[1][3] * v[1] + m[2][3] * v[2] + m[3][3] * v[3]);

instead of the code with Mov0(v[0]) gives an improvement from 233 to 230 msec.

@Groovounet
Member

Use GLM_ALIGN(16) instead of __builtin_align__(16) and you might be good to go...
Erm, actually I probably need to update setup.hpp that defines GLM_ALIGN to support CUDA...

But that's probably all that is needed here.

@adam-ce
Author

adam-ce commented Oct 28, 2014

diff -r b492363e33cf src/glm/detail/setup.hpp
--- a/src/glm/detail/setup.hpp  Tue Oct 28 23:28:14 2014 +0100
+++ b/src/glm/detail/setup.hpp  Wed Oct 29 00:23:18 2014 +0100
@@ -755,7 +755,7 @@
#  define GLM_ALIGNED_STRUCT(x) __declspec(align(x)) struct
#  define GLM_RESTRICT
#  define GLM_RESTRICT_VAR __restrict
-#elif(GLM_COMPILER & (GLM_COMPILER_GCC | GLM_COMPILER_CLANG))
+#elif(GLM_COMPILER & (GLM_COMPILER_GCC | GLM_COMPILER_CLANG | GLM_COMPILER_CUDA))
#  define GLM_DEPRECATED __attribute__((__deprecated__))
#  define GLM_ALIGN(x) __attribute__((aligned(x)))
#  define GLM_ALIGNED_STRUCT(x) struct __attribute__((aligned(x)))
diff -r b492363e33cf src/glm/detail/type_vec4.hpp
--- a/src/glm/detail/type_vec4.hpp  Tue Oct 28 23:28:14 2014 +0100
+++ b/src/glm/detail/type_vec4.hpp  Wed Oct 29 00:23:18 2014 +0100
@@ -50,7 +45,7 @@
namespace detail
{
    template <typename T, precision P>
-    struct tvec4
+    GLM_ALIGNED_STRUCT(16) tvec4
    {
        //////////////////////////////////////
        // Implementation detail

This works for me.

@Groovounet Groovounet added the bug label Oct 28, 2014
@Groovounet Groovounet added this to the GLM 0.9.6 milestone Oct 28, 2014
@Groovounet Groovounet self-assigned this Oct 28, 2014
@Groovounet
Member

These changes made it to master for the GLM 0.9.6 release.

Thanks for contributing!
Christophe

@adam-ce
Author

adam-ce commented Oct 29, 2014

Do you think it would also be possible to add the alignment rules for vec2 and vec3?

It would be align(8) for vec2 and align(16) for vec3.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses
edit: cross that for vec3, it's not aligned in CUDA.

I did some tests with the cross kernel:
float4
time for cuda glm (cross): 44 milliseconds
time for cuda helper math (cross): 44 milliseconds

float3
time for cuda glm (cross): 87 milliseconds (this is with align(16); before adding it, glm was also at 98 msec)
time for cuda helper math (cross): 98 milliseconds

I don't understand why vec3 aligned to 16 bytes is slower than vec4. It might be the testing method. But the thing is that the CUDA docs recommend align(16), so it should be better.

@Groovounet
Member

The fix for this bug turns out not to be valid, as it would force all vec4 types to be aligned to 16 bytes, including i8vec4 for example.

The fix has been reverted for the moment.

Groovounet pushed a commit that referenced this issue Nov 23, 2014
@Groovounet
Member

I created a new extension exposing aligned types. Alignment is definitely not something we always want, even if it's extremely useful.

If you want an aligned flavor of a vec4, include <glm/gtc/type_aligned.hpp> and you can use aligned_vec4.

You can also define your own aligned type in a cross-platform manner using:
GLM_ALIGNED_TYPEDEF(vec3, my_vec3, 16);

where my_vec3 is a vec3 aligned to 16 bytes.
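
A minimal usage sketch based on the description above (only aligned_vec4 and GLM_ALIGNED_TYPEDEF come from the extension; the variable names are just for illustration):

    #include <glm/glm.hpp>
    #include <glm/gtc/type_aligned.hpp>

    using namespace glm;

    // Ready-made aligned flavor shipped by the extension:
    aligned_vec4 position(0.0f, 0.0f, 0.0f, 1.0f);

    // Custom aligned flavor, defined in a cross-platform manner:
    GLM_ALIGNED_TYPEDEF(vec3, my_vec3, 16);  // my_vec3 is a vec3 aligned to 16 bytes
    my_vec3 normal(0.0f, 1.0f, 0.0f);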

Thanks for contributing,
Christophe

@Groovounet
Member

Hi,

What's left to fix here?
Thanks,
Christophe
