Some efficiency savings for pycbc_fit_sngls_over_multiparam #4957

Open: wants to merge 1 commit into master
Conversation

@GarethCabournDavies (Contributor) commented Nov 26, 2024

Add in a few efficiency savings for pycbc_fit_sngls_over_multiparam

Standard information about the request

This is an efficiency update
This change affects the offline search and the live search
This change does not change the output
This change follows style guidelines (see e.g. PEP 8) and has been proposed using the contribution guidelines

Motivation

I was looking at pycbc_fit_sngls_over_multiparam, thinking it would be a good place to learn and implement some GPU efficiency, but then we (mainly Ian) saw that there were some huge efficiency savings that could be made fairly quickly.

Contents

Biggest saving:

  • Move one instance where an array is indexed to re-sort from inside a loop to outside it

Additional savings:

  • Pre-allocate memory for an array whose length is known in advance, rather than appending to a list in a loop
  • Index with a boolean mask in the case that we are using a significant portion of the templates (smooth_distance_weighted)
  • Allow numpy.average() to fall back to the more efficient numpy.mean() path rather than passing explicit equal weights
  • Use the pre-calculated number of templates when reporting progress rather than recalculating it every time
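The boolean-mask saving can be sketched as follows. This is a minimal illustration of the indexing pattern only; the array contents are placeholders, not the real template data. Fancy indexing with an integer array and indexing with a boolean mask give identical results, and the mask form can be cheaper when a large fraction of the array is selected:

```python
import numpy

n = 10000
vals = numpy.arange(n, dtype=float)
idx = numpy.arange(0, n, 2)  # hypothetical: selecting half the templates

# Integer fancy indexing
sel_int = vals[idx]

# Equivalent boolean-mask indexing, which can be cheaper when a
# significant portion of the array is selected
mask = numpy.zeros(n, dtype=bool)
mask[idx] = True
sel_bool = vals[mask]

assert numpy.array_equal(sel_int, sel_bool)
```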

Testing performed

Note that for this testing, I set the loop over templates to break at 10,000 iterations.
This meant that for the "old" testing, I needed to implement the pre-allocated array saving described above in order to get outputs to match

Differences in output files

All equivalent files' datasets match using equality (i.e. numpy.array_equal(dataset1, dataset2)), dtypes match and attributes match.
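A sketch of the dataset-level comparison described above, with in-memory dicts standing in for the HDF5 output files (the dataset names and values here are hypothetical):

```python
import numpy

# Hypothetical stand-ins for the datasets read out of the two output files
file_old = {"nabove": numpy.array([3.0, 1.5]), "ntotal": numpy.array([10, 12])}
file_new = {"nabove": numpy.array([3.0, 1.5]), "ntotal": numpy.array([10, 12])}

def datasets_match(f1, f2):
    """Check both files contain the same datasets, with equal values
    (numpy.array_equal) and matching dtypes."""
    if set(f1) != set(f2):
        return False
    return all(
        numpy.array_equal(f1[k], f2[k]) and f1[k].dtype == f2[k].dtype
        for k in f1
    )

assert datasets_match(file_old, file_new)
```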

Hashes of the files do not match; I am unsure why this is the case.

Profiling

smooth_tophat (default) smoothing:

Summary:

The "old" profiling graph shows that something which is not in a function is the dominant cost. I found that disabling the call to the smooth() function drastically improved performance; the problem turned out to be in the arguments being passed to the function, not the function itself.

time output shows that the 'new' method takes approximately 1/35 of the 'old' time.
The old version also shows a very large number of minor page faults and voluntary context switches, which suggests considerable memory-management overhead.

Profiling graphs:
new and old

time -v output:
Old:

    User time (seconds): 1391.56
    System time (seconds): 612.39
    Percent of CPU this job got: 94%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 35:21.35
    Maximum resident set size (kbytes): 412880
    Major (requiring I/O) page faults: 803
    Minor (reclaiming a frame) page faults: 139735804
    Voluntary context switches: 94621
    Involuntary context switches: 276684

New:

    User time (seconds): 67.32
    System time (seconds): 2.73
    Percent of CPU this job got: 102%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:08.55
    Maximum resident set size (kbytes): 419592
    Major (requiring I/O) page faults: 12
    Minor (reclaiming a frame) page faults: 97534
    Voluntary context switches: 10431
    Involuntary context switches: 702104

distance_weighted smoothing

Summary:
In this case there is the extra cost of generating a normal PDF for every template, so the savings aren't quite as significant, but they are still noteworthy.

Profiling graphs:
new and old

time -v output:

old:

    User time (seconds): 1837.79
    System time (seconds): 580.82
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 40:17.43
    Maximum resident set size (kbytes): 434604
    Minor (reclaiming a frame) page faults: 236214226
    Voluntary context switches: 9713
    Involuntary context switches: 670573

new:

    User time (seconds): 1082.32
    System time (seconds): 3.87
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 18:04.86
    Maximum resident set size (kbytes): 414860
    Minor (reclaiming a frame) page faults: 95858
    Voluntary context switches: 10375
    Involuntary context switches: 487334

n_closest smoothing

This is essentially unchanged, as the major saving in this PR does not affect this code path.

For completeness, profiles can be found at this link under fit_over_n_closest_{new,old}.{txt,png} for new/old and time output/ profiling graph respectively

  • The author of this pull request confirms they will adhere to the code of conduct

@@ -68,7 +69,6 @@ def smooth_templates(nabove, invalphan, ntotal, template_idx,
Third float: the smoothed total count in template value

"""
if weights is None: weights = numpy.ones_like(template_idx)
This change allows numpy.average to revert to numpy.mean, which saves some cost on this operation
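The effect can be illustrated with a small sketch (the values are arbitrary). When weights=None, numpy.average takes the same fast path as numpy.mean, whereas passing explicit equal weights forces the slower weighted computation while producing the same result:

```python
import numpy

vals = numpy.array([1.0, 2.0, 3.0, 4.0])

# Explicit equal weights force the weighted code path
weighted = numpy.average(vals, weights=numpy.ones_like(vals))

# weights=None lets numpy.average fall back to the cheaper mean
unweighted = numpy.average(vals, weights=None)

assert weighted == unweighted == vals.mean()
```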

n_templates = len(nabove)
rang = numpy.arange(0, n_templates)

nabove_smoothed = numpy.zeros_like(parvals[0])

This preallocation saves some time, writing to indices of the array rather than appending to a list.
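A minimal sketch of the two patterns. The values are placeholders (a small arange standing in for parvals[0]), and the per-element computation is a dummy; only the allocation pattern matters:

```python
import numpy

n = 5
parvals0 = numpy.arange(n, dtype=float)  # hypothetical stand-in for parvals[0]

# Append pattern: the list grows inside the loop, then is converted
out_list = []
for i in range(n):
    out_list.append(parvals0[i] * 2)
out_append = numpy.array(out_list)

# Preallocated pattern: write into indices of a fixed-size array
out_prealloc = numpy.zeros_like(parvals0)
for i in range(n):
    out_prealloc[i] = parvals0[i] * 2

assert numpy.array_equal(out_append, out_prealloc)
```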

Comment on lines +464 to +466
nabove_sort = nabove[par_sort]
invalphan_sort = invalphan[par_sort]
ntotal_sort = ntotal[par_sort]

This is the dominant saving: this re-sorting was being done within the loop, making it an $O(n^2)$ operation rather than $O(n)$.
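Using the variable names from the diff above, the hoisting can be sketched like this. The array contents are placeholder random data and the loop body is a dummy reduction; the point is that the $O(n)$ re-sort runs once instead of once per iteration:

```python
import numpy

rng = numpy.random.default_rng(1)
n = 200
nabove = rng.random(n)
par_sort = numpy.argsort(rng.random(n))

# Before: re-sorting inside the loop -> O(n) work done n times = O(n^2)
total_slow = 0.0
for i in range(n):
    nabove_sort = nabove[par_sort]
    total_slow += nabove_sort[i]

# After: hoist the re-sort out of the loop -> O(n) indexing overall
nabove_sort = nabove[par_sort]
total_fast = 0.0
for i in range(n):
    total_fast += nabove_sort[i]

assert numpy.isclose(total_slow, total_fast)
```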

@spxiwh (Contributor) commented Nov 28, 2024

I feel someone else should review this, given my involvement in creating this.

@GarethCabournDavies GarethCabournDavies removed the request for review from spxiwh November 28, 2024 13:50