Matrix Multiplication


Previous: Reduction

Parallel matrix arithmetic is one of the foundational operations of GPU computing, with roots in the earliest days of graphics hardware, when the sole task of a GPU was to transform 3D shapes into 2D triangles to render on a screen.
In this example, we are going to use CUDA to multiply two matrices A (which will be an i x m matrix) and B (which will be an m x j matrix). The resulting matrix C will be an i x j matrix.

How Matrix Multiplication Works

If you haven't thought about matrices in a while and need a refresher on how matrix multiplication works, take a look at the image to the right.

As you can see, each cell in C (C[i][j]) is the dot product of row A[i] and column B[j]. If we wanted to perform matrix multiplication in serial code, the pseudocode for that operation might look something like this:

// Iterate over rows of A
for (int i = 0; i < A.rows; i++) {
    // Iterate over columns of B
    for (int j = 0; j < B.cols; j++) {
        // Iterate over columns of A / rows of B
        for (int k = 0; k < A.cols; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
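If you'd like to try the serial version yourself, here is a minimal, self-contained C program based on the pseudocode above; the matrix dimensions and values are made up purely for illustration:

#include <stdio.h>

#define ROWS 2    // rows of A and C
#define INNER 3   // columns of A / rows of B
#define COLS 2    // columns of B and C

int main(void) {
    float A[ROWS][INNER] = {{1, 2, 3}, {4, 5, 6}};
    float B[INNER][COLS] = {{7, 8}, {9, 10}, {11, 12}};
    float C[ROWS][COLS] = {0};

    // Same triple loop as the pseudocode above
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            for (int k = 0; k < INNER; k++)
                C[i][j] += A[i][k] * B[k][j];

    // Print the 2x2 result
    for (int i = 0; i < ROWS; i++)
        printf("%6.1f %6.1f\n", C[i][0], C[i][1]);

    return 0;
}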

Matrix Multiplication Using CUDA

While this method works, it is considerably slower since it has to sequentially loop over every individual element in both of the input matrices.
Instead, let's use the fact that each cell in C can be calculated independently of the others to perform the multiplication in parallel.

Essentially, we are going to have each thread represent the intersection of a single row from A and a single column from B, so that each thread performs only the innermost loop of our earlier pseudocode:

for (int k = 0; k < A.cols; k++) {
    C[i][j] += A[i][k] * B[k][j];
}

"But how do we determine i and j?" you might be asking (k is simply the loop counter). The answer is that we will use our kernel configuration to solve this for us!
When launching the multiplication kernel, we will set our block and grid dimensions so that the total number of threads along the x-axis matches the number of columns in the output matrix C, and the total number of threads along the y-axis matches the number of rows in C.
If the kernel is configured in this way, we can simply calculate each thread's value for i and j like so:

// The row of the output cell
int i = blockDim.y * blockIdx.y + threadIdx.y;

// The column of the output cell
int j = blockDim.x * blockIdx.x + threadIdx.x;

for (int k = 0; k < A.cols; k++) {
    C[i][j] += A[i][k] * B[k][j];
}
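To make that configuration concrete, here is a sketch of the host-side setup that might precede the kernel launch; the names cRows and cCols (the dimensions of C) and the 16x16 block shape are illustrative choices, not requirements:

// Dimensions of the output matrix C, chosen here as multiples of 16
// so the grid covers C exactly with no leftover threads
int cRows = 64, cCols = 128;

// A 16x16 block (256 threads) is a common, arbitrary choice
dim3 blockDims(16, 16);

// Enough blocks that x spans C's columns and y spans C's rows
dim3 gridDims(cCols / 16, cRows / 16);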

The keen CUDA programmer may notice that the multiplication loop is still pseudocode; there are no built-in matrices in CUDA that we can access using double-bracket ([][]) notation, nor is there a .cols property we can use to determine the length of the for loop.

Instead, the best way to approximate a matrix in CUDA is to linearize it: store its two-dimensional contents in a one-dimensional array, row by row. To index into a linearized matrix, we multiply the row number by the width of one row (the matrix's column count) and add the column number, which skips ahead one row's worth of values for each row. Applied to our pseudocode-turned-actual-code example, that looks like this:

// The row of the output cell
int i = blockDim.y * blockIdx.y + threadIdx.y;

// The column of the output cell
int j = blockDim.x * blockIdx.x + threadIdx.x;

// Total number of threads in one row of the kernel grid,
// which is also the width of one row of B and C
int rowSize = gridDim.x * blockDim.x;

// Each matrix is indexed by its own row width: A's rows are
// A.cols elements wide, while B's and C's rows are rowSize wide
for (int k = 0; k < A.cols; k++) {
    C[rowSize*i + j] += A[A.cols*i + k] * B[rowSize*k + j];
}
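To see why this indexing works, consider a matrix with 4 columns stored row by row: element (1, 2) sits at flat index 1*4 + 2 = 6, because one full row of 4 values precedes it, plus 2 more. A tiny helper function (hypothetical, shown only for illustration) makes the rule explicit:

// Element (row, col) of a row-major matrix whose rows are 'width' elements long
float get(const float *M, int width, int row, int col) {
    return M[width * row + col];
}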

The last thing we need to make our example code runnable is the value for A.cols. Since this dimension is the only one that doesn't correspond to a value in our kernel configuration, the easiest thing to do is pass it directly to our kernel along with the arrays A, B, and C. I've decided to call it hiddenDim since it is the dimension of A and B that isn't part of the resultant matrix C. With that decision made, here is our completed matrix multiplication kernel:

__global__ void matrix_mul(float *A, float *B, float *C, int hiddenDim) {
    // The row of the output cell
    int i = blockDim.y * blockIdx.y + threadIdx.y;

    // The column of the output cell
    int j = blockDim.x * blockIdx.x + threadIdx.x;

    // Total number of threads in one row of the kernel grid,
    // which is also the width of one row of B and C
    int rowSize = gridDim.x * blockDim.x;

    // Dot product of row i of A and column j of B;
    // A's rows are hiddenDim elements wide
    float sum = 0;
    for (int k = 0; k < hiddenDim; k++)
        sum += A[hiddenDim*i + k] * B[rowSize*k + j];

    C[rowSize*i + j] = sum;
}

From there, all you need to do to test out this kernel is initialize some matrices (flattened into one-dimensional arrays, that is) and configure the kernel with as many threads along the x-axis as there are columns in C and as many threads along the y-axis as there are rows in C. After you've had a go at it, take a look at the full example code and see how your code stacks up.
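If you want a starting point, here is a minimal host-side sketch that pairs with the kernel above in a single .cu file; the dimensions, fill values, and 8x8 block shape are all arbitrary choices for illustration:

#include <stdio.h>

// C is ROWS x COLS; the shared (hidden) dimension of A and B is HIDDEN.
// All are multiples of the 8x8 block shape so the grid covers C exactly.
#define ROWS 16
#define HIDDEN 32
#define COLS 16

int main(void) {
    float hA[ROWS * HIDDEN], hB[HIDDEN * COLS], hC[ROWS * COLS];

    // Fill A with 1s and B with 2s so every cell of C should be 2 * HIDDEN
    for (int n = 0; n < ROWS * HIDDEN; n++) hA[n] = 1.0f;
    for (int n = 0; n < HIDDEN * COLS; n++) hB[n] = 2.0f;

    // Allocate device copies of the three matrices
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dC, sizeof(hC));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    // One thread per output cell: x spans C's columns, y spans C's rows
    dim3 blockDims(8, 8);
    dim3 gridDims(COLS / 8, ROWS / 8);
    matrix_mul<<<gridDims, blockDims>>>(dA, dB, dC, HIDDEN);

    // Copy the result back and spot-check one cell
    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    printf("C[0] = %.1f (expected %.1f)\n", hC[0], 2.0f * HIDDEN);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

Compiled with nvcc, every cell of C should come out to 2 * HIDDEN = 64.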

Next: Intro to Asynchronous Computing