
Commit

Merge pull request #11 from efwright/updating_graphs
Updating images
jefflarkin authored Jun 3, 2021
2 parents b55c62d + 9eb561e commit dc18a50
Showing 6 changed files with 19 additions and 18 deletions.
04-Parallelize.markdown: 25 changes (13 additions & 12 deletions)
@@ -626,25 +626,26 @@ discussed in a later chapter.
At this point we have expressed all of the parallelism in the example code and
the compiler has parallelized it for an accelerator device. Analyzing the
performance of this code may yield surprising results on some accelerators,
-however. The results below demonstrate the performance of this code on 1 - 8
-CPU threads on a modern CPU at the time of publication and an NVIDIA Tesla K40
+however. The results below demonstrate the performance of this code on 1 - 16
+CPU threads on an AMD Threadripper CPU and an NVIDIA Volta V100
GPU using both implementations above. The *y axis* for figure 3.1 is execution
time in seconds, so smaller is better. For the two OpenACC versions, the bar is
-divided by time transferring data between the host and device, time executing
-on the device, and other time.
+divided by time transferring data between the host and device and time executing
+on the device.

![Jacobi Iteration Performance - Step 1](images/jacobi_step1_graph.png)

-Notice that the performance of this code improves as CPU threads are added to
-the calcuation, but the OpenACC versions perform poorly compared to the CPU
-baseline. The OpenACC `kernels` version performs slightly better than the
-serial version, but the `parallel loop` case performs dramaticaly worse than
-even the slowest CPU version. Further performance analysis is necessary to
+The performance of this code improves as more CPU threads are added to the calculation;
+however, since the code is memory-bound, the performance benefit of adding
+additional threads quickly diminishes. Also, the OpenACC versions perform poorly
+compared to the CPU
+baseline. Both the OpenACC `kernels` and `parallel loop` versions perform
+worse than the serial CPU baseline. It is also clear that the `parallel loop` version
+spends significantly more time in data transfer than the `kernels` version.
+Further performance analysis is necessary to
identify the source of this slowdown. This analysis has already been applied to
the graph above, which breaks down time spent
-computing the solution, copying data to and from the accelerator, and
-miscelaneous time, which includes various overheads involved in scheduling data
-transfers and computation.
+computing the solution and copying data to and from the accelerator.

A variety of tools are available for performing this analysis, but since this
case study was compiled for an NVIDIA GPU, NVIDIA Nsight Systems will be
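For context, the `parallel loop` variant whose timings appear in figure 3.1 looks roughly like the sketch below. The names (`A`, `Anew`, `n`, `m`) and the exact loop bounds are assumptions drawn from the surrounding chapter rather than from this commit; the point is that, with no enclosing `data` region, the compiler copies the arrays to and from the device around each parallel region, which is the transfer time the updated graph breaks out.

~~~~ {.c}
#include <math.h>

// A minimal sketch of the "parallel loop" Jacobi step discussed above.
// Names (A, Anew, n, m) are assumptions for illustration. With no explicit
// data region, the compiler copies A and Anew to and from the device for
// each parallel region, which shows up as transfer time in figure 3.1.
double jacobi_step(int n, int m, double A[n][m], double Anew[n][m])
{
  double error = 0.0;
  #pragma acc parallel loop reduction(max:error)
  for (int j = 1; j < n - 1; j++) {
    #pragma acc loop reduction(max:error)
    for (int i = 1; i < m - 1; i++) {
      // Average the four nearest neighbors and track the largest change.
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
      error = fmax(error, fabs(Anew[j][i] - A[j][i]));
    }
  }
  return error;
}
~~~~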
06-Loops.markdown: 12 changes (6 additions & 6 deletions)
@@ -486,7 +486,7 @@ parallelism to fill each *gang* with more of these short vectors. Below is the
modified code.

~~~~ {.c .numberLines}
-#pragma acc parallel loop gang worker num_workers(32) vector_length(32)
+#pragma acc parallel loop gang worker num_workers(4) vector_length(32)
for(int i=0;i<num_rows;i++) {
double sum=0;
int row_start=row_offsets[i];
@@ -521,19 +521,19 @@ modified code.
enddo
~~~~

-In this version of the code, I've explicitly mapped the outermost look to both
+In this version of the code, I've explicitly mapped the outermost loop to both
gang and worker parallelism and will vary the number of workers using the
`num_workers` clause. The results follow.

![Speed-up from varying number of workers for a vector length of
32.](images/spmv_speedup_num_workers.png)

On this particular hardware, the best performance comes from a vector length of
-32 and 32 workers. This turns out to be the maximum amount of parallelism that
-the particular accelerator being used supports within a gang. In this case, we
-observed a 1.3X speed-up from decreasing the vector length and another 2.1X
+32 and 4 workers, which is similar to the simpler loop with a default vector length of 128.
+In this case, we
+observed a 2.5X speed-up from decreasing the vector length and another 1.26X
speed-up from varying the number of workers within each gang, resulting in an
-overall 2.9X performance improvement from the untuned OpenACC code.
+overall 3.15X performance improvement from the untuned OpenACC code.

***Best Practice:*** Although not shown in order to save space, it's generally
best to use the `device_type` clause whenever specifying the sorts of
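For context, the tuned sparse matrix-vector product these numbers refer to looks roughly like the sketch below. Only `num_rows`, `row_offsets`, and `row_start` appear in the hunk itself; the remaining names (`cols`, `Acoefs`, `xcoefs`, `ycoefs`), the function wrapper, and the inner `loop vector` directive are assumptions for illustration, and the `device_type(nvidia)` clause is one way to follow the best-practice note above so the worker/vector tuning applies only to the intended target.

~~~~ {.c}
// Sketch of the tuned CSR SpMV loop; data is assumed already present on the
// device (e.g. via an enclosing data region or managed memory). Array names
// below are illustrative, not taken from this commit.
void spmv_csr(int num_rows, const int *restrict row_offsets,
              const int *restrict cols, const double *restrict Acoefs,
              const double *restrict xcoefs, double *restrict ycoefs)
{
  // Worker/vector tuning is guarded by device_type so it only applies when
  // targeting NVIDIA GPUs; other targets keep their default mapping.
  #pragma acc parallel loop gang worker device_type(nvidia) num_workers(4) vector_length(32)
  for (int i = 0; i < num_rows; i++) {
    double sum = 0;
    int row_start = row_offsets[i];
    int row_end = row_offsets[i + 1];
    // Each short row is handled by one 32-wide vector within a worker.
    #pragma acc loop vector reduction(+:sum)
    for (int j = row_start; j < row_end; j++) {
      sum += Acoefs[j] * xcoefs[cols[j]];
    }
    ycoefs[i] = sum;
  }
}
~~~~

Packing several 32-wide rows into each gang via workers restores some of the parallelism lost by shrinking the vector length from its default, which is consistent with the speed-ups reported in the text above.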
Binary file modified images/jacobi_step1_graph.png
Binary file modified images/jacobi_step2_graph.png
Binary file modified images/spmv_speedup_num_workers.png
Binary file modified images/spmv_speedup_vector_length.png
