Improve helicity filtering #24

Closed
valassi opened this issue Aug 17, 2020 · 1 comment
Comments

valassi commented Aug 17, 2020

While studying issue #16, I just realised that the 274 requests to access global memory from each warp reduce to 16 (!) if I disable helicity filtering. There is no other indication that this is a problem (and actually the present implementation saves a factor 2 in CUDA and a factor 4 in CPP), but it's worth having a look...
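
For readers unfamiliar with the term, here is a minimal host-side C++ sketch of what helicity filtering could look like. It assumes, without confirmation from this issue, that the filter flags the helicity combinations that contribute for at least one sample event and then sums only over those. The names `Amp2Fn`, `findGoodHelicities` and `sumOverHelicities` are hypothetical and are not taken from the project's code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical callback: |amplitude|^2 for one event and one helicity combination.
using Amp2Fn = std::function<double(std::size_t ievt, int ihel)>;

// Pass 1: keep the helicity combinations that are non-zero for at least one sample event.
std::vector<int> findGoodHelicities(const Amp2Fn& amp2, int ncomb, std::size_t nSampleEvents)
{
  assert(ncomb > 0);
  std::vector<int> goodHel;
  for (int ihel = 0; ihel < ncomb; ++ihel)
  {
    for (std::size_t ievt = 0; ievt < nSampleEvents; ++ievt)
    {
      if (amp2(ievt, ihel) != 0.)
      {
        goodHel.push_back(ihel); // this combination contributes: keep it
        break;
      }
    }
  }
  return goodHel;
}

// Pass 2: for each event, sum only over the combinations kept in pass 1.
double sumOverHelicities(const Amp2Fn& amp2, const std::vector<int>& goodHel, std::size_t ievt)
{
  double sum = 0.;
  for (int ihel : goodHel) sum += amp2(ievt, ihel);
  return sum;
}
```

In a GPU version of this sketch, the list of good helicities would have to be stored somewhere every thread can read it, and how it is stored and accessed is the kind of detail that can change the number of global memory requests.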

@valassi valassi added the enhancement A feature we want to develop label Aug 17, 2020
valassi added a commit that referenced this issue Aug 18, 2020
This decreases throughput by a factor 2 on GPU (to 3E8) and by a factor 4 on CPP (1.5 sec becomes 6 sec).

However, this reduces drastically the access to global memory on GPU:

./profile.sh -nogui -p 1 32 1
  gProc::sigmaKin(double const*, double*), 2020-Aug-17 18:13:50, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                request                             16
    l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                  sector                            128
    ---------------------------------------------------------------------- --------------- ------------------------------

./profile.sh -nogui -p 1 4 1
  gProc::sigmaKin(double const*, double*), 2020-Aug-17 18:14:03, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                request                             16
    l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                  sector                             16
    ---------------------------------------------------------------------- --------------- ------------------------------
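
The sector counts in the two profiles above are what fully coalesced loads of doubles would produce. This assumes that each of the 16 requests loads one 8-byte double per active thread from contiguous, aligned addresses, and that a sector is 32 bytes (the sector size on current NVIDIA GPUs). The helper name below is hypothetical, and the sketch only reproduces the arithmetic.

```cpp
#include <cassert>
#include <cstddef>

// Size of one memory sector in bytes (assumption: 32 bytes, as on current NVIDIA GPUs).
constexpr std::size_t kSectorBytes = 32;

// Minimum number of sectors touched by one load request when nThreads active
// threads each read one contiguous, aligned element of bytesPerElement bytes.
constexpr std::size_t sectorsPerCoalescedRequest(std::size_t nThreads, std::size_t bytesPerElement)
{
  return (nThreads * bytesPerElement + kSectorBytes - 1) / kSectorBytes; // round up
}

// "-p 1 32 1": 32 threads x 8 bytes = 256 bytes = 8 sectors per request, 16 requests -> 128 sectors
static_assert(16 * sectorsPerCoalescedRequest(32, 8) == 128, "matches the 32-thread profile");
// "-p 1 4 1": 4 threads x 8 bytes = 32 bytes = 1 sector per request, 16 requests -> 16 sectors
static_assert(16 * sectorsPerCoalescedRequest(4, 8) == 16, "matches the 4-thread profile");
```

Under these assumptions, the remaining 16 requests touch no more sectors than strictly necessary in either configuration.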
valassi added a commit that referenced this issue Aug 18, 2020
This restores the factor 2 speedup on GPU (the CPP version keeps the old implementation).
So throughput is again above 6E8.

Also, it reduces the number of requests for global memory.
This may not be a performance improvement but makes the analysis much simpler.

time ./gcheck.exe -p 16384 32 12
***************************************
NumIterations             = 12
NumThreadsPerBlock        = 32
NumBlocksPerGrid          = 16384
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = THRUST::COMPLEX
RanNumb memory layout     = AOSOA[4]
Momenta memory layout     = AOSOA[4]
Wavefunction GPU memory   = LOCAL
Curand generation         = DEVICE (CUDA code)
---------------------------------------
NumberOfEntries           = 12
TotalTimeInWaveFuncs      = 1.039935e-02 sec
MeanTimeInWaveFuncs       = 8.666127e-04 sec
StdDevTimeInWaveFuncs     = 4.685819e-06 sec
MinTimeInWaveFuncs        = 8.582820e-04 sec
MaxTimeInWaveFuncs        = 8.754520e-04 sec
---------------------------------------
TotalEventsComputed       = 6291456
RamboEventsPerSec         = 6.196162e+07 sec^-1
MatrixElemEventsPerSec    = 6.049854e+08 sec^-1
***************************************
NumMatrixElements(notNan) = 6291456
MeanMatrixElemValue       = 1.372152e-02 GeV^0
StdErrMatrixElemValue     = 3.269516e-06 GeV^0
StdDevMatrixElemValue     = 8.200854e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
00 CudaFree : 0.144857 sec
0a ProcInit : 0.000547 sec
0b MemAlloc : 0.081633 sec
0c GenCreat : 0.015051 sec
1a GenSeed  : 0.000014 sec
1b GenRnGen : 0.007956 sec
2a RamboIni : 0.000133 sec
2b RamboFin : 0.000083 sec
2c CpDTHwgt : 0.008282 sec
2d CpDTHmom : 0.093040 sec
3a SGoodHel : 0.001629 sec
3b SigmaKin : 0.000088 sec
3c CpDTHmes : 0.010311 sec
4a DumpLoop : 0.023784 sec
9a DumpAll  : 0.023715 sec
9b GenDestr : 0.000061 sec
9c MemFree  : 0.025542 sec
9d CudReset : 0.043790 sec
TOTAL       : 0.480517 sec
TOTAL(n-2)  : 0.291870 sec
***************************************
real    0m0.491s
user    0m0.182s
sys     0m0.308s
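
The headline numbers in this log can be cross-checked against each other. The sketch below assumes that MatrixElemEventsPerSec is TotalEventsComputed divided by TotalTimeInWaveFuncs. This is an inference from the printed values, not something stated in the issue, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Total events = blocks per grid x threads per block x iterations.
constexpr long long totalEvents(long long nBlocks, long long nThreads, long long nIterations)
{
  return nBlocks * nThreads * nIterations;
}

// Throughput in events per second for a given total time in seconds.
inline double eventsPerSec(long long nEvents, double seconds)
{
  assert(seconds > 0.);
  return static_cast<double>(nEvents) / seconds;
}
```

With the values above, `totalEvents(16384, 32, 12)` gives 6291456, and `eventsPerSec(6291456, 1.039935e-02)` agrees with the printed 6.049854e+08 to within the rounding of the printed time.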

./profile.sh -nogui -p 1 4 1
  gProc::sigmaKin(double const*, double*), 2020-Aug-18 09:25:37, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                request                             16
    l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                  sector                             16
    ---------------------------------------------------------------------- --------------- ------------------------------
@valassi valassi self-assigned this Aug 18, 2020
valassi commented Aug 18, 2020

OK, this is done in 194ffc1. The implementation is ugly and can be improved, and the code should certainly be repackaged.

Ugly but effective.

  • It restores the factor 2 speedup on GPU (the CPP version keeps the old implementation). So throughput is again above 6E8.
  • Also, it reduces the number of requests for global memory. This may not be a performance improvement but makes the analysis much simpler.
