Performance Experiment: Synchronization Methods
Previous: Events and Dependencies
In this experiment, we will pit three synchronization methods against each other to determine which offers the best performance.
To do this, we will modify the quadratic equation solver we created in the last tutorial so that it contains three functions that solve the quadratic formula: one which uses events (the existing method), one which uses stream synchronization, and one which uses device synchronization.
The main idea is to run each version and time how long it takes. Since the quadratic equation solver completes on the order of microseconds, we are going to run each version many times and record how long it takes to complete all of the runs, which makes any difference between the methods easier to detect.
We will also repeat each test several times so that we can ensure our time readings are consistent and aren't one-off flukes.
We haven't actually covered device synchronization in this series, because it is not widely used.
A call to cudaDeviceSynchronize() blocks the calling host thread until all work previously issued to the GPU has completed, regardless of which stream it was launched in. Because of this, it can be very cumbersome to use in applications that have a lot of streams, especially when those streams have unbalanced workloads. The silver lining is that, in the case where you do want all of the streams synchronized, synchronizing the device in a single API call can potentially be faster than using multiple stream synchronization calls.
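To make the difference concrete, here is a minimal sketch contrasting per-stream synchronization with a single device-wide synchronization. The kernel dummyKernel and the use of three generic streams are purely illustrative and not part of the solver.
#include <cuda_runtime.h>

__global__ void dummyKernel() { /* placeholder work */ }

void syncExample() {
    cudaStream_t s[3];
    for (int i = 0; i < 3; i++)
        cudaStreamCreate(&s[i]);

    // Launch some independent work into each stream
    for (int i = 0; i < 3; i++)
        dummyKernel<<<1, 32, 0, s[i]>>>();

    // Option 1: wait on each stream individually (one API call per stream)
    for (int i = 0; i < 3; i++)
        cudaStreamSynchronize(s[i]);

    // Option 2: a single call that waits for ALL outstanding work on the
    // device, including work in streams we may not care about
    cudaDeviceSynchronize();

    for (int i = 0; i < 3; i++)
        cudaStreamDestroy(s[i]);
}
In practice you would use one option or the other; both appear here only to show the two call patterns side by side.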
To run this experiment, all we really need to do is copy the code from our quadratic_solver program in the last tutorial into three new functions, each of which uses a different type of synchronization. For ease of reference, these functions are named quadraticUsingEvents, quadraticUsingStreamSync, and quadraticUsingDeviceSync.
Next, we need to modify all three of these functions to take the number of elements they need to use as a parameter, since it was previously just a constant in our main function.
float quadraticUsingEvents(int numElements) {
    int gridSize = numElements / BLOCK_SIZE + 1;
    float *a, *b, *c, *sol1, *sol2, *tmp;
    cudaMalloc(&a, sizeof(float) * numElements);
    cudaMalloc(&b, sizeof(float) * numElements);
    // ...
}
int main(int argc, char *argv[]) {
    const int NUM_ELEMENTS = 16382;
    quadraticUsingDeviceSync(NUM_ELEMENTS);
    quadraticUsingStreamSync(NUM_ELEMENTS);
    quadraticUsingEvents(NUM_ELEMENTS);
}
In quadraticUsingDeviceSync and quadraticUsingStreamSync, we need to remove the allocation and deallocation of the bMinusComplete and bPlusComplete events, and replace the event synchronization calls with the appropriate replacement synchronization method.
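For comparison, the event-based pattern being replaced looks roughly like this. This is a sketch reconstructed from the workflow shown below and the event names mentioned above, so the previous tutorial's exact placement of the record and wait calls may differ slightly.
elementwiseProduct<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(b, b, sol1, numElements);
cudaEventRecord(bMinusComplete, bMinus);        // b*b is ready
elementScalarProduct<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(a, c, -4, sol2, numElements);
cudaStreamWaitEvent(bPlus, bMinusComplete, 0);  // bPlus waits for b*b
elementwiseSum<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(sol1, sol2, sol1, numElements);
elementwiseSqrt<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(sol1, tmp, numElements);
cudaEventRecord(bPlusComplete, bPlus);          // square root is ready
cudaStreamWaitEvent(bMinus, bPlusComplete, 0);  // bMinus waits for tmp
elementwiseDifference<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(b, tmp, sol1, numElements);
elementwiseSum<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(b, tmp, sol2, numElements);
elementwiseQuotient<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(sol1, a, 0.5, sol1, numElements);
elementwiseQuotient<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(sol2, a, 0.5, sol2, numElements);
// The host still synchronizes both streams at the very end of the workflow.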
In the stream sync function, the part of the function doing the work now looks like this:
elementwiseProduct<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(b, b, sol1, numElements);
elementScalarProduct<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(a, c, -4, sol2, numElements);
// Sync streams to ensure these complete before next step
cudaStreamSynchronize(bMinus);
cudaStreamSynchronize(bPlus);
elementwiseSum<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(sol1, sol2, sol1, numElements);
elementwiseSqrt<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(sol1, tmp, numElements);
// Sync again - must have determinant before proceeding
cudaStreamSynchronize(bPlus);
elementwiseDifference<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(b, tmp, sol1, numElements);
elementwiseSum<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(b, tmp, sol2, numElements);
elementwiseQuotient<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(sol1, a, 0.5, sol1, numElements);
elementwiseQuotient<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(sol2, a, 0.5, sol2, numElements);
// Make sure that both streams are done before stopping timer
cudaStreamSynchronize(bPlus);
cudaStreamSynchronize(bMinus);
And in the device sync function, any calls to record and wait on events are replaced with a single call to cudaDeviceSynchronize:
elementwiseProduct<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(b, b, sol1, numElements);
elementScalarProduct<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(a, c, -4, sol2, numElements);
// Sync streams to ensure these complete before next step
cudaDeviceSynchronize();
elementwiseSum<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(sol1, sol2, sol1, numElements);
elementwiseSqrt<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(sol1, tmp, numElements);
// Sync again - must have determinant before proceeding
cudaDeviceSynchronize();
elementwiseDifference<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(b, tmp, sol1, numElements);
elementwiseSum<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(b, tmp, sol2, numElements);
elementwiseQuotient<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(sol1, a, 0.5, sol1, numElements);
elementwiseQuotient<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(sol2, a, 0.5, sol2, numElements);
// Make sure that both streams are done before stopping timer
cudaDeviceSynchronize();
Now that we have established that each function completes the correct workflow, we can wrap the working part of each function in a for loop so that it is repeated, making our timing results more distinct. For the number of iterations, we will add a parameter to each of the three functions that specifies how many times the work loop should be repeated.
When modifying your functions, make sure the for loop begins before the first elementwiseProduct kernel and ends after the last call to synchronize. Here's the stream sync version for reference:
float quadraticUsingStreamSync(int numElements, int iterations) {
    // Allocate streams and memory...
    for (int i = 0; i < iterations; i++) {
        elementwiseProduct<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(b, b, sol1, numElements);
        // Continue running solver tasks...
        // Synchronize for the last time to end the workflow
        cudaStreamSynchronize(bPlus);
        cudaStreamSynchronize(bMinus);
    }
    // Deallocate and clean up...
}
The last thing we have to do in each of our quadratic solvers is add the timing elements, as well as a warm-up task for each stream to do before starting the timer. First, allocate the events as we have done many times at this point, then record clockStart before the for loop and clockStop after it. After that, we read the elapsed time from the events into a new float, which is returned at the end of the function after everything we used has been deallocated.
Don't forget to warm up the streams with a task before recording! I chose to copy the first operation of each stream from the solver workflow above the timed section, but which operations you choose doesn't matter. The important thing is that you synchronize and ensure those operations have completed before beginning the timing. If you've done it all right, each function should look something like this, but with the appropriate synchronization method:
float quadraticUsingStreamSync(int numElements, int iterations) {
    int gridSize = numElements / BLOCK_SIZE + 1;
    float *a, *b, *c, *sol1, *sol2, *tmp;
    // Allocate and fill memory...
    // Allocate streams...
    cudaEvent_t clockStart, clockStop;
    cudaEventCreate(&clockStart);
    cudaEventCreate(&clockStop);
    // Warm up both streams before beginning timing
    elementwiseProduct<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(b, b, sol1, numElements);
    elementScalarProduct<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(a, c, -4, sol2, numElements);
    cudaStreamSynchronize(bMinus);
    cudaStreamSynchronize(bPlus);
    cudaEventRecord(clockStart);
    for (int i = 0; i < iterations; i++) {
        elementwiseProduct<<<gridSize, BLOCK_SIZE, 0, bMinus>>>(b, b, sol1, numElements);
        elementScalarProduct<<<gridSize, BLOCK_SIZE, 0, bPlus>>>(a, c, -4, sol2, numElements);
        // Complete solver workflow...
        // Make sure that both streams are done before stopping the timer
        cudaStreamSynchronize(bPlus);
        cudaStreamSynchronize(bMinus);
    }
    cudaEventRecord(clockStop);
    cudaEventSynchronize(clockStop);
    float timeElapsed;
    cudaEventElapsedTime(&timeElapsed, clockStart, clockStop);
    cudaEventDestroy(clockStart);
    cudaEventDestroy(clockStop);
    // Deallocate streams and memory
    return timeElapsed;
}
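For completeness, the deallocation that the comment near the end of that listing stands in for might look roughly like this, assuming the same pointer and stream names used throughout:
// Possible cleanup for the streams and device memory used above (sketch)
cudaStreamDestroy(bMinus);
cudaStreamDestroy(bPlus);
cudaFree(a);
cudaFree(b);
cudaFree(c);
cudaFree(sol1);
cudaFree(sol2);
cudaFree(tmp);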
From here, we can write a simple driver program that runs each function a number of times to generate multiple trials per sync type and outputs them in CSV format:
int main(int argc, char *argv[]) {
    const int NUM_ELEMENTS = 16382;
    const int ITERATIONS = 1024;
    const int TRIALS = 20;
    printf("Device,Stream,Event\n");
    for (int i = 0; i < TRIALS; i++) {
        printf("%.4f,", quadraticUsingDeviceSync(NUM_ELEMENTS, ITERATIONS));
        printf("%.4f,", quadraticUsingStreamSync(NUM_ELEMENTS, ITERATIONS));
        printf("%.4f\n", quadraticUsingEvents(NUM_ELEMENTS, ITERATIONS));
    }
}
Running this I get:
Device,Stream,Event
40.1521,41.0563,37.6412
40.2596,41.2600,37.1261
...
These results are then saved into results.csv (for example, by redirecting the program's output to a file) so they can be visualized using Python, which we'll take a look at next.
When analyzing a range of continuous values that fall into discrete categories, a box plot is often a good choice because it shows the full spread of each category's values while still allowing easy comparison between categories. Box plots can also be annotated with summary statistics for each category, such as the mean and median, which can be very helpful.
In this case I have used a short Python script to create a plot (right) that includes a box plot for each type of synchronization and places them all on a shared axis, making it easy to compare values between them. I have also marked the median completion time for each synchronization type in cyan to give an idea of the distribution of values within each box plot.
Clearly events are the superior form of synchronization for this workflow, which matches their purpose: they let streams synchronize with each other in a lighter-weight fashion than the stream synchronize function. Stream synchronization is more cumbersome because it waits for the entire stream to complete (rather than just sending a signal or waiting to receive one) and because the calling host thread blocks until the stream has finished its work.
What may be more surprising is that device synchronization is more efficient than stream synchronization. Although both approaches block the host, stream synchronization here requires two separate calls to the CUDA API, each with its own overhead of being issued to the GPU, while device synchronization only incurs this cost once.
Don't let this fool you into thinking that device synchronization will always be faster, though: this is likely only the case because we are synchronizing the same three streams regardless of which method we use in this example. If you were to perform the same experiment in an environment with more unrelated streams doing other work, device synchronization would almost certainly be slower, since it would wait on the irrelevant streams as well as the relevant ones.