Improve helicity filtering #24

valassi · 2020-08-17T16:16:40Z

In studying issue #16, I just realised that the 274 requests to access global memory from each warp reduce to 16 (!) if I disable helicity filtering. There is no other indication that this is an issue (and actually the present implementation saves a factor 2 in CUDA and a factor 4 in CPP), but it's worth having a look...

Disabling helicity filtering decreases throughput by a factor 2 on GPU (to 3E8) and by a factor 4 on CPP (1.5 sec becomes 6 sec). However, it drastically reduces the access to global memory on GPU:

./profile.sh -nogui -p 1 32 1
  gProc::sigmaKin(double const*, double*), 2020-Aug-17 18:13:50, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                request                             16
    l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                  sector                            128
    ---------------------------------------------------------------------- --------------- ------------------------------

./profile.sh -nogui -p 1 4 1
  gProc::sigmaKin(double const*, double*), 2020-Aug-17 18:14:03, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                request                             16
    l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                  sector                             16
    ---------------------------------------------------------------------- --------------- ------------------------------
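One way to read the two sector counts is the back-of-the-envelope sketch below. It assumes that each of the 16 load requests fetches one 8-byte double per thread, that the addresses touched by each group of 4 consecutive threads fall in one aligned 32-byte sector (which is what an AOSOA[4] layout of doubles would give), and that no sector is shared between requests. These are assumptions, not something the profiler output states, and `expected_sectors` is a name made up for this illustration.

```python
import math

SECTOR_BYTES = 32  # size of one global-memory sector on NVIDIA GPUs
DOUBLE_BYTES = 8   # assumption: every load fetches one double per thread


def expected_sectors(threads, requests=16):
    """Sectors touched if every request is fully coalesced (assumed model)."""
    bytes_per_request = threads * DOUBLE_BYTES
    sectors_per_request = math.ceil(bytes_per_request / SECTOR_BYTES)
    return requests * sectors_per_request


print(expected_sectors(32))  # 128, as in the "-p 1 32 1" run
print(expected_sectors(4))   # 16, as in the "-p 1 4 1" run
```

Under these assumptions both profiler values are reproduced, which is consistent with the loads being fully coalesced once helicity filtering is out of the way.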

valassi · 2020-08-18T07:32:22Z

OK, this is done in 194ffc1, though the code is ugly, can be improved, and should certainly be repackaged.

Ugly but effective.

The change gets back the speedup of 2 on GPU (and keeps the old implementation on CPP), so throughput is again above 6E8.
It also reduces the number of requests for global memory. This may not be a performance improvement, but it makes the analysis much simpler.
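For readers unfamiliar with the idea, helicity filtering can be sketched roughly as follows: determine once which helicity combinations give a non-zero contribution, then sum only over those for every event. This is a toy illustration in Python, not the code in 194ffc1; `amplitude_sq`, `find_good_helicities`, `matrix_element`, and the toy selection rule are all hypothetical.

```python
from itertools import product


def amplitude_sq(momenta, helicities):
    """Toy stand-in for a squared helicity amplitude (hypothetical rule):
    zero unless h1 != h2 and h3 != h4."""
    h1, h2, h3, h4 = helicities
    if h1 == h2 or h3 == h4:
        return 0.0
    return 1.0 + 0.1 * momenta[0]


def find_good_helicities(sample_events, all_helicities):
    """Keep the combinations that are non-zero for at least one sample event."""
    return [hel for hel in all_helicities
            if any(amplitude_sq(ev, hel) != 0.0 for ev in sample_events)]


def matrix_element(event, helicities):
    """Sum the toy squared amplitudes over the given helicity combinations."""
    return sum(amplitude_sq(event, hel) for hel in helicities)


all_hel = list(product((-1, +1), repeat=4))  # 16 combinations for 4 particles
events = [(1.0, 2.0, 3.0, 4.0), (0.5, 0.5, 0.5, 0.5)]

# Filtering step: done once, here on the first event only
good_hel = find_good_helicities(events[:1], all_hel)

# Every event then loops over the good combinations only
for ev in events:
    assert matrix_element(ev, good_hel) == matrix_element(ev, all_hel)

print(len(all_hel), len(good_hel))  # 16 4
```

In this toy the filtered sum gives the same result as the full sum while evaluating a quarter of the combinations; the real saving depends on the physics process and on how the list of good helicities is stored and accessed.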

time ./gcheck.exe -p 16384 32 12
***************************************
NumIterations             = 12
NumThreadsPerBlock        = 32
NumBlocksPerGrid          = 16384
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = THRUST::COMPLEX
RanNumb memory layout     = AOSOA[4]
Momenta memory layout     = AOSOA[4]
Wavefunction GPU memory   = LOCAL
Curand generation         = DEVICE (CUDA code)
---------------------------------------
NumberOfEntries           = 12
TotalTimeInWaveFuncs      = 1.039935e-02 sec
MeanTimeInWaveFuncs       = 8.666127e-04 sec
StdDevTimeInWaveFuncs     = 4.685819e-06 sec
MinTimeInWaveFuncs        = 8.582820e-04 sec
MaxTimeInWaveFuncs        = 8.754520e-04 sec
---------------------------------------
TotalEventsComputed       = 6291456
RamboEventsPerSec         = 6.196162e+07 sec^-1
MatrixElemEventsPerSec    = 6.049854e+08 sec^-1
***************************************
NumMatrixElements(notNan) = 6291456
MeanMatrixElemValue       = 1.372152e-02 GeV^0
StdErrMatrixElemValue     = 3.269516e-06 GeV^0
StdDevMatrixElemValue     = 8.200854e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
00 CudaFree : 0.144857 sec
0a ProcInit : 0.000547 sec
0b MemAlloc : 0.081633 sec
0c GenCreat : 0.015051 sec
1a GenSeed  : 0.000014 sec
1b GenRnGen : 0.007956 sec
2a RamboIni : 0.000133 sec
2b RamboFin : 0.000083 sec
2c CpDTHwgt : 0.008282 sec
2d CpDTHmom : 0.093040 sec
3a SGoodHel : 0.001629 sec
3b SigmaKin : 0.000088 sec
3c CpDTHmes : 0.010311 sec
4a DumpLoop : 0.023784 sec
9a DumpAll  : 0.023715 sec
9b GenDestr : 0.000061 sec
9c MemFree  : 0.025542 sec
9d CudReset : 0.043790 sec
TOTAL       : 0.480517 sec
TOTAL(n-2)  : 0.291870 sec
***************************************
real    0m0.491s
user    0m0.182s
sys     0m0.308s

./profile.sh -nogui -p 1 4 1
  gProc::sigmaKin(double const*, double*), 2020-Aug-18 09:25:37, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                request                             16
    l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                  sector                             16
    ---------------------------------------------------------------------- --------------- ------------------------------
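
As a quick consistency check on the gcheck.exe numbers above, the event count and the matrix-element throughput can be recomputed from the run parameters and `TotalTimeInWaveFuncs`. Reading `-p 16384 32 12` as blocks, threads per block, and iterations follows the printed `NumBlocksPerGrid`, `NumThreadsPerBlock`, and `NumIterations`. The throughput formula is inferred from the printed numbers, not taken from the source code.

```python
# Values copied from the gcheck.exe output above
blocks, threads, iterations = 16384, 32, 12
total_time_wavefuncs = 1.039935e-02  # sec, TotalTimeInWaveFuncs

# One event per GPU thread per iteration (assumption)
total_events = blocks * threads * iterations

# Assumed definition: events divided by the total time in sigmaKin
throughput = total_events / total_time_wavefuncs

print(total_events)          # 6291456, matches TotalEventsComputed
print(f"{throughput:.3e}")   # compare with MatrixElemEventsPerSec
```

The recomputed throughput agrees with the printed `MatrixElemEventsPerSec = 6.049854e+08` to within rounding of the printed time.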

valassi added the enhancement label (A feature we want to develop) Aug 17, 2020

valassi self-assigned this Aug 18, 2020

valassi closed this as completed Aug 18, 2020

valassi mentioned this issue Aug 18, 2020

AOS/SOA for input particle 4-momenta (and random numbers) #16

Closed

valassi mentioned this issue Nov 23, 2020

good helicity #60

Closed
