GitHub - nayakajay/micro-benchmarks: Experiments with CUDA to reverse engineer micro-architectural details

nearestNeighbor
newReport
p40-pdfs
src
Makefile
README


Tool  : Nsight Compute CLI (nv-nsight-cu-cli), version 2019.3.0
GPU   : Tesla P40
NVCC  : release 10.1, V10.1.168
Driver: 418.67
OS    : Ubuntu 16.04.4 LTS

Nsight Command: nv-nsight-cu-cli --section MemoryWorkloadAnalysis.* -f -o <output-file> <executable>

===================================================================================================
The CUDA source files are included as well, in case you need them.
Source file: fine_grain_tlb.cu

This program takes 3 input arguments.
First : Starting size of the array in MB
Second: End size of the array in MB
Third : Stride size in KB

For each array size between the start and end sizes, the program walks the
array at the given stride and records the time taken by each access.
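
The traversal described above can be sketched as follows. This is a minimal
illustration, not the code in fine_grain_tlb.cu: the kernel names init_chain and
timed_walk, the helper num_accesses, the use of clock64(), and the doubling of
the array size between rounds are all assumptions made for the example. The
per-access timing relies on the store of the loaded value being issued before
the second clock read, which is worth confirming in the generated SASS.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

typedef unsigned long long u64;

// Number of strided accesses needed to cover size_mb MB at stride_kb KB.
static size_t num_accesses(size_t size_mb, size_t stride_kb) {
    return (size_mb * 1024) / stride_kb;
}

// Element i*stride holds the index of the next element to visit (circular).
__global__ void init_chain(u64 *arr, size_t steps, size_t stride_elems) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < steps) arr[i * stride_elems] = ((i + 1) % steps) * stride_elems;
}

// One thread walks the chain; each load depends on the previous one.
__global__ void timed_walk(const u64 *arr, size_t steps, u64 *cycles,
                           u64 *visited) {
    u64 idx = 0;
    for (size_t i = 0; i < steps; ++i) {
        long long start = clock64();
        idx = arr[idx];
        visited[i] = idx;  // consumes idx before the second clock read
        long long stop = clock64();
        cycles[i] = (u64)(stop - start);
    }
}

int main(int argc, char **argv) {
    if (argc != 4) {
        fprintf(stderr, "usage: %s <start MB> <end MB> <stride KB>\n", argv[0]);
        return 1;
    }
    size_t start_mb = strtoull(argv[1], nullptr, 10);
    size_t end_mb = strtoull(argv[2], nullptr, 10);
    size_t stride_kb = strtoull(argv[3], nullptr, 10);
    if (start_mb == 0 || stride_kb == 0) return 1;
    size_t stride_elems = stride_kb * 1024 / sizeof(u64);

    // Doubling the array size each round is an assumption of this sketch.
    for (size_t mb = start_mb; mb <= end_mb; mb *= 2) {
        size_t steps = num_accesses(mb, stride_kb);
        if (steps == 0) continue;  // stride larger than the array
        u64 *arr, *cycles, *visited;
        CHECK(cudaMalloc(&arr, mb << 20));
        CHECK(cudaMalloc(&cycles, steps * sizeof(u64)));
        CHECK(cudaMalloc(&visited, steps * sizeof(u64)));
        init_chain<<<(unsigned)((steps + 255) / 256), 256>>>(arr, steps,
                                                             stride_elems);
        timed_walk<<<1, 1>>>(arr, steps, cycles, visited);
        CHECK(cudaGetLastError());
        CHECK(cudaDeviceSynchronize());
        std::vector<u64> host(steps);
        CHECK(cudaMemcpy(host.data(), cycles, steps * sizeof(u64),
                         cudaMemcpyDeviceToHost));
        for (size_t i = 0; i < steps; ++i)
            printf("%zu MB, access %zu: %llu cycles\n", mb, i, host[i]);
        CHECK(cudaFree(arr));
        CHECK(cudaFree(cycles));
        CHECK(cudaFree(visited));
    }
    return 0;
}
```

With a 512 MB array and a 32768 KB stride, num_accesses gives 512 * 1024 / 32768
accesses per walk, so the sketch prints one timing line per access for that size.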

Compilation command:
$> nvcc -Xptxas -dlcm=cs -arch=sm_61 -lineinfo src/fine_grain_tlb.cu -o cs_original
(-dlcm=cs sets the default cache modifier for global loads to cache-streaming. We are hoping
that the L2 level of cache is also disabled, so that all requests go to device memory.)

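Profiling command:
$> nv-nsight-cu-cli --section MemoryWorkloadAnalysis_Chart -f -o <output-file> <executable>
The output files are named cs_xx_yy.nsight-cuprof-report, where, going by the example below,
xx is the array size in MB and yy is the stride in MB.
e.g. $> ./cs_original 512 512 32768 => cs_all_512_32.pdf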

Discrepancies observed:
1. The writes are most probably being cached in L2. Nsight does not distinguish cached from
uncached stores; it reports only 'Global Stores'.
2. When only the Chart section is collected, we see a 100% L2 hit rate (surprising). When all of
MemoryWorkloadAnalysis.* is collected, the hit rate decreases but is still quite high.

===================================================================================================
We conducted another experiment in which the pages are pinned in host memory (DRAM).

Source file: host_alloc.cu

This program takes 4 input arguments.
First  : Starting array size in MB
Second : End array size in MB
Third  : Stride size in KB
Fourth : Device number to set (if multiple devices are available)

It follows the same principle as the program above, but allocates the array in pinned host
memory with cudaHostAlloc, passing a single flag, 'cudaHostAllocMapped', to the function.
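
The allocation step can be sketched as below. This is a hedged illustration
rather than the contents of host_alloc.cu: the kernel write_strided, the helper
elems_per_stride, and the fixed 64 MB / 4096 KB sizes are assumptions chosen for
the example. It allocates mapped pinned memory with cudaHostAlloc, obtains the
matching device pointer with cudaHostGetDevicePointer, lets a kernel write one
element per stride, and then checks from the host that the writes are visible.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

typedef unsigned long long u64;

// Number of 8-byte elements in one stride of stride_kb KB.
static size_t elems_per_stride(size_t stride_kb) {
    return stride_kb * 1024 / sizeof(u64);
}

// A single thread writes one element per stride through the mapped pointer.
__global__ void write_strided(u64 *arr, size_t steps, size_t stride_elems) {
    for (size_t i = 0; i < steps; ++i) arr[i * stride_elems] = i + 1;
}

int main(int argc, char **argv) {
    int device = (argc > 1) ? atoi(argv[1]) : 0;  // device number to set
    CHECK(cudaSetDevice(device));
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, device));
    if (!prop.canMapHostMemory) {
        fprintf(stderr, "device %d cannot map host memory\n", device);
        return 1;
    }

    const size_t size_mb = 64, stride_kb = 4096;  // illustrative values only
    const size_t bytes = size_mb << 20;
    const size_t stride_elems = elems_per_stride(stride_kb);
    const size_t steps = bytes / (stride_kb * 1024);

    // Pinned host allocation that is also mapped into the device address space.
    u64 *h_arr = nullptr, *d_arr = nullptr;
    CHECK(cudaHostAlloc((void **)&h_arr, bytes, cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void **)&d_arr, h_arr, 0));

    write_strided<<<1, 1>>>(d_arr, steps, stride_elems);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // The kernel wrote through d_arr; the values should be readable via h_arr.
    size_t bad = 0;
    for (size_t i = 0; i < steps; ++i)
        if (h_arr[i * stride_elems] != i + 1) ++bad;
    printf("%zu of %zu strided writes visible on the host\n", steps - bad, steps);

    CHECK(cudaFreeHost(h_arr));
    return bad != 0;
}
```

Because the kernel only ever receives the mapped device pointer, no cudaMalloc
or cudaMemcpy appears anywhere in the sketch.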

Compilation command:
$> nvcc -Xptxas -dlcm=cs -arch=sm_61 -lineinfo src/host_alloc.cu -o host_pinned
(Again hoping that L2 cache is disabled)

The output files are named cs_host_xx_yy.nsight-cuprof-report.
The code contains 2 kernels, so each run generates 2 files.

e.g. $> ./host_pinned 512 512 32768 0 => cs_all_host_initk_512_32.pdf
                                         cs_all_host_global_512_32.pdf

The Nsight report shows the memory accesses going to device DRAM rather than to host memory, and
it displays a (!) mark over host memory. We do not understand why.

Discrepancies observed:
1. The init kernel shows only 1 instruction, but there are at least 16 loads (in the case of a
512 MB array with 32 MB strides).
2. If the data is pinned in host memory, device memory should not be used at all, but we see
quite a bit of device memory usage.

===================================================================================================

For the code corresponding to Nearest Neighbor, see ./nearestNeighbor
For the data corresponding to Nearest Neighbor, see ./newReport