What is best practice for small-vector linear algebra, especially in GPU scheduling? My particular use case is sampling a 3D buffer positioned in space via 4x4 homogeneous transform matrices. It would be nice if there were some way to get a … I'm aware of … I've had some success using … I've tried moving my matrices into 4x4 and 4x1 … Curious if there's some silver bullet in the Halide library I just haven't found yet, or if there's some generally recommended approach to this.
Here's my usual approach: https://github.com/halide/Halide/blob/main/apps/bgu/bgu_generator.cpp

In the solves there are serial dependencies between matrix elements, so it's pointless to try to actually use SIMD to represent columns of the matrix or something (plus only 4 elements is way too small for SIMD on x86). I just vectorize across a different axis, one that's truly data-parallel.

I think there might be 4x4 matrix acceleration on some mobile GPUs, but I don't think that has been a thing on desktop GPUs for a while now. CUDA supports low-precision 16x16 matrix multiplies on the tensor cores, but that's probably not good enough for the classic graphics 4x4 homogeneous transform matrices. The only SIMD CUDA instructions I'm aware of are the loads and stores (https://docs.nvidia.com/cuda/archive/9.0/parallel-thread-execution/#data-movement-and-conversion-instructions-ld), which you can vectorize to reduce the number of memory transactions. There are also the SIMD fixed-point video instructions, but last time I tried to use those I found they were basically deprecated (they compiled to non-vectorized integer code that emulated the instruction).

Halide's attitude is that SIMD/vectorization has to do with control flow, not data. SIMD is a way of running a data-parallel loop, not a data type.