Improve helicity filtering #24
Labels: enhancement (a feature we want to develop)
Comments
valassi added a commit that referenced this issue on Aug 18, 2020:
This decreases throughput by a factor of 2 on GPU (to 3E8) and by a factor of 4 on CPP (1.5 sec becomes 6 sec). However, it drastically reduces the access to global memory on GPU:

```
./profile.sh -nogui -p 1 32 1
gProc::sigmaKin(double const*, double*), 2020-Aug-17 18:13:50, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                             request                                 16
l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                              sector                                 128
---------------------------------------------------------------------- --------------- ------------------------------

./profile.sh -nogui -p 1 4 1
gProc::sigmaKin(double const*, double*), 2020-Aug-17 18:14:03, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                             request                                 16
l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                              sector                                  16
---------------------------------------------------------------------- --------------- ------------------------------
```
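For context, the two metrics above count the global-memory load requests and 32-byte sectors issued by the profiled `sigmaKin` launch. The sketch below is purely illustrative and not the actual kernel code (the names `sigmaKinWithGlobalMask`, `isGoodHel`, `ncomb` and `helContributionSketch` are assumptions): it shows how re-reading a per-helicity flag array from global memory inside the helicity loop adds load requests on every iteration, on top of the loads needed for the momenta themselves, which is the kind of pattern that inflates the request count per warp.

```cuda
#include <cuda_runtime.h>

// Stand-in for the real per-helicity matrix-element computation.
__device__ double helContributionSketch( const double* /*momenta*/, int /*ievt*/, int /*ihel*/ )
{
  return 1.;
}

// Illustrative sketch only (assumed names, not the actual sigmaKin): the
// "good helicity" mask lives in global memory and is re-read inside the
// hot loop, so every helicity iteration issues an extra global load per warp.
__global__ void sigmaKinWithGlobalMask( const double* momenta, double* MEs,
                                        const bool* isGoodHel, int ncomb )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  double meSum = 0.;
  for ( int ihel = 0; ihel < ncomb; ihel++ )
  {
    if ( !isGoodHel[ihel] ) continue; // global-memory load in the hot loop
    meSum += helContributionSketch( momenta, ievt, ihel );
  }
  MEs[ievt] = meSum;
}
```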
valassi added a commit that referenced this issue on Aug 18, 2020:
This gets back the factor 2 speedup on GPU (and keeps the old implementation on CPP), so the throughput is again above 6E8. It also reduces the number of requests to global memory; this may not be a performance improvement in itself, but it makes the analysis much simpler.

```
time ./gcheck.exe -p 16384 32 12
***************************************
NumIterations             = 12
NumThreadsPerBlock        = 32
NumBlocksPerGrid          = 16384
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = THRUST::COMPLEX
RanNumb memory layout     = AOSOA[4]
Momenta memory layout     = AOSOA[4]
Wavefunction GPU memory   = LOCAL
Curand generation         = DEVICE (CUDA code)
---------------------------------------
NumberOfEntries           = 12
TotalTimeInWaveFuncs      = 1.039935e-02 sec
MeanTimeInWaveFuncs       = 8.666127e-04 sec
StdDevTimeInWaveFuncs     = 4.685819e-06 sec
MinTimeInWaveFuncs        = 8.582820e-04 sec
MaxTimeInWaveFuncs        = 8.754520e-04 sec
---------------------------------------
TotalEventsComputed       = 6291456
RamboEventsPerSec         = 6.196162e+07 sec^-1
MatrixElemEventsPerSec    = 6.049854e+08 sec^-1
***************************************
NumMatrixElements(notNan) = 6291456
MeanMatrixElemValue       = 1.372152e-02 GeV^0
StdErrMatrixElemValue     = 3.269516e-06 GeV^0
StdDevMatrixElemValue     = 8.200854e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
00 CudaFree : 0.144857 sec
0a ProcInit : 0.000547 sec
0b MemAlloc : 0.081633 sec
0c GenCreat : 0.015051 sec
1a GenSeed  : 0.000014 sec
1b GenRnGen : 0.007956 sec
2a RamboIni : 0.000133 sec
2b RamboFin : 0.000083 sec
2c CpDTHwgt : 0.008282 sec
2d CpDTHmom : 0.093040 sec
3a SGoodHel : 0.001629 sec
3b SigmaKin : 0.000088 sec
3c CpDTHmes : 0.010311 sec
4a DumpLoop : 0.023784 sec
9a DumpAll  : 0.023715 sec
9b GenDestr : 0.000061 sec
9c MemFree  : 0.025542 sec
9d CudReset : 0.043790 sec
TOTAL      : 0.480517 sec
TOTAL(n-2) : 0.291870 sec
***************************************

real    0m0.491s
user    0m0.182s
sys     0m0.308s

./profile.sh -nogui -p 1 4 1
gProc::sigmaKin(double const*, double*), 2020-Aug-18 09:25:37, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                             request                                 16
l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                              sector                                  16
---------------------------------------------------------------------- --------------- ------------------------------
```
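A pattern consistent with the "3a SGoodHel" step in the timings above is to determine the contributing helicities once and cache their indices in constant memory, so that the main kernel loops only over those and never reads a helicity mask from global memory. The following is a minimal sketch under that assumption; all names (`cGoodHel`, `cNGoodHel`, `setGoodHelicities`, `sigmaKinWithConstantMask`, `helContributionSketch`, and the bound of 64 helicity combinations) are illustrative, not the repository's actual API.

```cuda
#include <cuda_runtime.h>

// Stand-in for the real per-helicity matrix-element computation.
__device__ double helContributionSketch( const double* /*momenta*/, int /*ievt*/, int /*ihel*/ )
{
  return 1.;
}

// Number and indices of the contributing ("good") helicities, cached once in
// constant memory (64 is an assumed upper bound on the helicity combinations).
__constant__ int cNGoodHel;
__constant__ int cGoodHel[64];

// Host-side step (cf. "3a SGoodHel" in the timings): after a first pass has
// flagged which helicities give a non-zero contribution, copy the surviving
// indices to constant memory once.
void setGoodHelicities( const bool* isGoodHel, int ncomb )
{
  int goodHel[64];
  int nGoodHel = 0;
  for ( int ihel = 0; ihel < ncomb; ihel++ )
    if ( isGoodHel[ihel] ) goodHel[nGoodHel++] = ihel;
  cudaMemcpyToSymbol( cGoodHel, goodHel, nGoodHel * sizeof( int ) );
  cudaMemcpyToSymbol( cNGoodHel, &nGoodHel, sizeof( int ) );
}

// The hot loop now iterates only over the good helicities and reads their
// indices from constant memory instead of global memory.
__global__ void sigmaKinWithConstantMask( const double* momenta, double* MEs )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  double meSum = 0.;
  for ( int ighel = 0; ighel < cNGoodHel; ighel++ )
  {
    const int ihel = cGoodHel[ighel]; // constant-cache read, broadcast to the warp
    meSum += helContributionSketch( momenta, ievt, ihel );
  }
  MEs[ievt] = meSum;
}
```

Constant-memory reads go through the constant cache and are broadcast within a warp, so they should not appear in the l1tex global-load request count; that would be consistent with the drop from 274 to 16 requests per warp reported in this issue.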
Ok, this is done in 194ffc1 - though it is ugly and can be improved, and the code should certainly be repackaged. Ugly but effective.
Closed
Original issue description: In studying issue #16, I just realised that the 274 requests to access global memory from each warp reduce to 16 (!) if I disable helicity filtering. There is no other indication that this is an issue (and actually the present implementation saves a factor 2 in CUDA and a factor 4 in CPP), but it is worth having a look.