What is best practice for small-vector linear algebra, especially in GPU scheduling? My particular use case is sampling a 3D buffer positioned in space via 4x4 homogeneous transform matrices. It would be nice if there were some way to get a … I'm aware of … I've had some success using … I've tried moving my matrices into 4x4 and 4x1 … Curious if there's some silver bullet in the Halide library I just haven't found yet, or if there's some generally recommended approach to this.
Here's my usual approach: https://github.com/halide/Halide/blob/main/apps/bgu/bgu_generator.cpp

In the solves there are serial dependencies between matrix elements, so it's pointless to try to actually use SIMD to represent columns of the matrix or something (plus only 4 elements is way too small for SIMD on x86). I just vectorize across a different axis, one that's truly data-parallel.

I think there might be 4x4 matrix acceleration on some mobile GPUs, but I don't think that has been a thing on desktop GPUs for a while now. CUDA supports low-precision 16x16 matrix multiplies on the tensor cores, but that's probably not good enough for the classic graphics 4x4 homogeneous transform matrices. The only SIMD CUDA instructions I'm aware of are the loads and stores (https://docs.nvidia.com/cuda/archive/9.0/parallel-thread-execution/#data-movement-and-conversion-instructions-ld), which you can vectorize to reduce the number of memory transactions. There are also the SIMD fixed-point video instructions, but last time I tried to use those I found they were basically deprecated (they compiled to non-vectorized integer code that emulated the instruction).

Halide's attitude is that SIMD/vectorization has to do with control flow, not data. SIMD is a way of running a data-parallel loop, not a data type.