nayakajay/micro-benchmarks
Tool    : Nsight
Version : 2019.3.0
GPU     : Tesla P40
NVCC    : release 10.1, V10.1.168
Driver  : 418.67
OS      : Ubuntu 16.04.4 LTS

Nsight command:
$> nv-nsight-cu-cli --section MemoryWorkloadAnalysis.* -f -o <output-file> <executable>

===================================================================================================

We have also added the CUDA source files in case you need them.

Source file: fine_grain_tlb.cu

This program takes 3 input arguments:
First  : Starting size of the array in MB
Second : End size of the array in MB
Third  : Stride size in KB

For each array size between the start and end sizes, the program traverses the array at the given stride and records the time taken by each access.

Compilation command:
$> nvcc -Xptxas -dlcm=cs -arch=sm_61 -lineinfo src/fine_grain_tlb.cu -o cs_original
(-dlcm=cs sets the default cache modifier for global loads to cache-streaming. We hope this also disables the L2 cache so that all requests go to device memory, but we have not verified it.)

Nsight command used for this program:
$> nv-nsight-cu-cli --section MemoryWorkloadAnalysis_Chart -f -o <output-file> <executable>

The output files for this are: cs_xx_yy.nsight-cuprof-report
e.g. $> ./cs_original 512 512 32768 => cs_all_512_32.pdf

Discrepancies observed:
1. The writes are most probably being cached in L2. Nsight does not distinguish cached from uncached stores; it only reports 'Global Stores'.
2. When only the Chart section is collected, we see a 100% L2 hit rate (surprising). When all of MemoryWorkloadAnalysis.* is collected, the hit rate decreases but is still quite high.

===================================================================================================

We conducted another experiment that pins the pages in host memory (DRAM).

Source file: host_alloc.cu

This program takes 4 arguments:
First  : Starting array size in MB
Second : End array size in MB
Third  : Stride size in KB
Fourth : Device number to set (if multiple devices are available)

It follows the same principle as the program above, but allocates the array in pinned host memory with cudaHostAlloc, passing the single flag 'cudaHostAllocMapped'.
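The allocation and traversal described above can be sketched roughly as follows. This is not the code in host_alloc.cu; it is a minimal illustration under several assumptions: the names `strided_probe`, `run_mapped_probe`, `count_accesses` and `CHECK` are hypothetical, the array holds `unsigned int` elements, a single thread does all the reads, and each read is timed with `clock64()` by storing the loaded value to shared memory before the second clock read. Only `cudaHostAlloc`, `cudaHostGetDevicePointer` and the other CUDA runtime calls are taken from the real API.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Number of strided reads needed to cover n_elems elements
// (ceiling division); 0 if the stride is 0.
static size_t count_accesses(size_t n_elems, size_t stride_elems)
{
    if (stride_elems == 0) return 0;
    return (n_elems + stride_elems - 1) / stride_elems;
}

// Hypothetical kernel: a single thread reads one element per stride and
// records the cycles spent on each read.
__global__ void strided_probe(const unsigned int *data, size_t n_elems,
                              size_t stride_elems, long long *cycles)
{
    __shared__ volatile unsigned int s_val;
    size_t k = 0;
    for (size_t i = 0; i < n_elems; i += stride_elems, ++k) {
        long long start = clock64();
        // The shared-memory store needs the loaded value, which is intended
        // to keep the second clock read from being issued before the load returns.
        s_val = data[i];
        long long stop = clock64();
        cycles[k] = stop - start;
    }
}

// Allocate mapped pinned host memory, probe it from the GPU, print the
// per-access cycle counts, and return the number of accesses made.
static size_t run_mapped_probe(size_t size_mb, size_t stride_kb, int device)
{
    const size_t bytes = size_mb << 20;
    const size_t n_elems = bytes / sizeof(unsigned int);
    const size_t stride_elems = (stride_kb << 10) / sizeof(unsigned int);
    const size_t n_acc = count_accesses(n_elems, stride_elems);
    if (n_acc == 0) return 0;

    CHECK(cudaSetDevice(device));

    // Pinned host allocation that is also mapped into the device address space.
    unsigned int *h_data = nullptr;
    CHECK(cudaHostAlloc((void **)&h_data, bytes, cudaHostAllocMapped));
    assert(h_data != nullptr);
    for (size_t i = 0; i < n_elems; i += stride_elems)
        h_data[i] = (unsigned int)i;  // initialise only the elements that are read

    // Device-side alias of the same host pages.
    unsigned int *d_alias = nullptr;
    CHECK(cudaHostGetDevicePointer((void **)&d_alias, h_data, 0));

    // Note: this timing buffer lives in device memory, so the kernel's stores
    // to it are device-memory traffic even though the probed array is on the host.
    long long *d_cycles = nullptr;
    CHECK(cudaMalloc((void **)&d_cycles, n_acc * sizeof(long long)));

    strided_probe<<<1, 1>>>(d_alias, n_elems, stride_elems, d_cycles);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    long long *h_cycles = (long long *)std::malloc(n_acc * sizeof(long long));
    if (h_cycles == nullptr) std::exit(EXIT_FAILURE);
    CHECK(cudaMemcpy(h_cycles, d_cycles, n_acc * sizeof(long long),
                     cudaMemcpyDeviceToHost));
    for (size_t k = 0; k < n_acc; ++k)
        std::printf("access %zu: %lld cycles\n", k, h_cycles[k]);

    std::free(h_cycles);
    CHECK(cudaFree(d_cycles));
    CHECK(cudaFreeHost(h_data));
    return n_acc;
}
```

Calling `run_mapped_probe(512, 32768, 0)` from a `main` would correspond to one array size of 512 MB with a 32768 KB stride on device 0. The cycle counts from a sketch like this are approximate, since the compiler and hardware may still overlap work around the clock reads.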
Compilation command:
$> nvcc -Xptxas -dlcm=cs -arch=sm_61 -lineinfo src/host_alloc.cu -o host_pinned
(Again hoping that the L2 cache is disabled)

The output files for this are: cs_host_xx_yy.nsight-cuprof-report
There are 2 kernels in this code, so each run generates 2 files.
e.g. $> ./host_pinned 512 512 32768 0 => cs_all_host_initk_512_32.pdf
                                        cs_all_host_global_512_32.pdf

In the Nsight report we see memory accesses to device memory (DRAM) and not to host memory. The report shows a (!) mark over host memory, which we cannot explain.

Discrepancies observed:
1. The init kernel shows only 1 instruction, but there are at least 16 loads (in the case of a 512 MB array with a 32 MB stride).
2. If the data is pinned in host memory, device memory should not be used, but we see quite a bit of device memory usage.

===================================================================================================

For code corresponding to Nearest Neighbor, see ./nearestNeighbor
For data corresponding to Nearest Neighbor, see ./newReport
About
Experiments with CUDA to reverse engineer micro-architectural details