Some efficiency savings for pycbc_fit_sngls_over_multiparam #4957

Open: wants to merge 1 commit into master
Conversation

@GarethCabournDavies (Contributor) commented Nov 26, 2024

Add in a few efficiency savings for pycbc_fit_sngls_over_multiparam

Standard information about the request

This is an efficiency update
This change affects the offline search and the live search
This change does not change the output
This change follows style guidelines (see e.g. PEP 8) and has been proposed using the contribution guidelines

Motivation

I was looking at pycbc_fit_sngls_over_multiparam, thinking it would be a good place to learn and implement some GPU efficiency, but then we (mainly Ian) saw that there were some huge efficiency savings that could be made fairly quickly.

Contents

Biggest saving:

  • Move one instance where an array is indexed to re-sort from inside a loop to outside it

Additional savings:

  • Pre-allocate memory for an array whose length is known in advance, rather than appending to a list in a loop
  • Index with a boolean mask in the case that we are using a significant portion of the templates (smooth_distance_weighted)
  • Allow numpy.average() to fall back to the more efficient numpy.mean() path rather than passing explicit equal weights
  • Use the pre-calculated number of templates when reporting progress rather than recalculating it every time
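The boolean-mask saving can be sketched as follows. This is a minimal illustration of the indexing pattern only; the array contents are placeholders, not the real template data. Fancy indexing with an integer array and indexing with a boolean mask give identical results, and the mask form can be cheaper when a large fraction of the array is selected:

```python
import numpy

n = 10000
vals = numpy.arange(n, dtype=float)
idx = numpy.arange(0, n, 2)  # hypothetical: selecting half the templates

# Integer fancy indexing
sel_int = vals[idx]

# Equivalent boolean-mask indexing, which can be cheaper when a
# significant portion of the array is selected
mask = numpy.zeros(n, dtype=bool)
mask[idx] = True
sel_bool = vals[mask]

assert numpy.array_equal(sel_int, sel_bool)
```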

Testing performed

Note that for this testing, I set the loop over templates to break at 10,000 iterations.
This meant that for the "old" testing, I needed to implement the pre-allocated array saving described above in order to get outputs to match

Differences in output files

All equivalent files' datasets match using equality (i.e. numpy.array_equal(dataset1, dataset2)), dtypes match and attributes match.
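A sketch of the dataset-level comparison described above, with in-memory dicts standing in for the HDF5 output files (the dataset names and values here are hypothetical):

```python
import numpy

# Hypothetical stand-ins for the datasets read out of the two output files
file_old = {"nabove": numpy.array([3.0, 1.5]), "ntotal": numpy.array([10, 12])}
file_new = {"nabove": numpy.array([3.0, 1.5]), "ntotal": numpy.array([10, 12])}

def datasets_match(f1, f2):
    """Check both files contain the same datasets, with equal values
    (numpy.array_equal) and matching dtypes."""
    if set(f1) != set(f2):
        return False
    return all(
        numpy.array_equal(f1[k], f2[k]) and f1[k].dtype == f2[k].dtype
        for k in f1
    )

assert datasets_match(file_old, file_new)
```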

Hashes of the files do not match; I am unsure why this is the case.

Profiling

smooth_tophat (default) smoothing:

Summary:

The "old" profiling graph shows that something which is not in a function is the dominant cost. I found that disabling the call to the smooth() function drastically improved performance; the problem turned out to be in the arguments being passed to the function, not the function itself.

time output shows that the 'new' method takes approximately 1/35 of the 'old' time.
The old version also shows a very large number of minor page faults and voluntary context switches, which suggests considerable memory-management overhead.

Profiling graphs:
new and old

time -v output:
Old:

    User time (seconds): 1391.56
    System time (seconds): 612.39
    Percent of CPU this job got: 94%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 35:21.35
    Maximum resident set size (kbytes): 412880
    Major (requiring I/O) page faults: 803
    Minor (reclaiming a frame) page faults: 139735804
    Voluntary context switches: 94621
    Involuntary context switches: 276684

New:

    User time (seconds): 67.32
    System time (seconds): 2.73
    Percent of CPU this job got: 102%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:08.55
    Maximum resident set size (kbytes): 419592
    Major (requiring I/O) page faults: 12
    Minor (reclaiming a frame) page faults: 97534
    Voluntary context switches: 10431
    Involuntary context switches: 702104

distance_weighted smoothing

Summary:
In this case there is the extra cost of generating a normal PDF for every template, so the savings aren't quite as significant, but they are still noteworthy.

Profiling graphs:
new and old

time -v output:

old:

    User time (seconds): 1837.79
    System time (seconds): 580.82
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 40:17.43
    Maximum resident set size (kbytes): 434604
    Minor (reclaiming a frame) page faults: 236214226
    Voluntary context switches: 9713
    Involuntary context switches: 670573

new:

    User time (seconds): 1082.32
    System time (seconds): 3.87
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 18:04.86
    Maximum resident set size (kbytes): 414860
    Minor (reclaiming a frame) page faults: 95858
    Voluntary context switches: 10375
    Involuntary context switches: 487334

n_closest smoothing

This is essentially unchanged, as the major saving in this PR does not affect this code path.

For completeness, profiles can be found at this link under fit_over_n_closest_{new,old}.{txt,png} for new/old and time output/ profiling graph respectively

  • The author of this pull request confirms they will adhere to the code of conduct

@@ -68,7 +69,6 @@ def smooth_templates(nabove, invalphan, ntotal, template_idx,
Third float: the smoothed total count in template value

"""
if weights is None: weights = numpy.ones_like(template_idx)
This change allows numpy.average to revert to numpy.mean, which saves some cost on this operation
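The effect can be illustrated with a small sketch (the values are arbitrary). When weights=None, numpy.average takes the same fast path as numpy.mean, whereas passing explicit equal weights forces the slower weighted computation while producing the same result:

```python
import numpy

vals = numpy.array([1.0, 2.0, 3.0, 4.0])

# Explicit equal weights force the weighted code path
weighted = numpy.average(vals, weights=numpy.ones_like(vals))

# weights=None lets numpy.average fall back to the cheaper mean
unweighted = numpy.average(vals, weights=None)

assert weighted == unweighted == vals.mean()
```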

n_templates = len(nabove)
rang = numpy.arange(0, n_templates)

nabove_smoothed = numpy.zeros_like(parvals[0])

This preallocation saves some time, writing to indices of the array rather than appending to a list.
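A minimal sketch of the two patterns. The values are placeholders (a small arange standing in for parvals[0]), and the per-element computation is a dummy; only the allocation pattern matters:

```python
import numpy

n = 5
parvals0 = numpy.arange(n, dtype=float)  # hypothetical stand-in for parvals[0]

# Append pattern: the list grows inside the loop, then is converted
out_list = []
for i in range(n):
    out_list.append(parvals0[i] * 2)
out_append = numpy.array(out_list)

# Preallocated pattern: write into indices of a fixed-size array
out_prealloc = numpy.zeros_like(parvals0)
for i in range(n):
    out_prealloc[i] = parvals0[i] * 2

assert numpy.array_equal(out_append, out_prealloc)
```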

Comment on lines +464 to +466
nabove_sort = nabove[par_sort]
invalphan_sort = invalphan[par_sort]
ntotal_sort = ntotal[par_sort]

This is the dominant saving: this re-sorting was being done within the loop, making it an $O(n^2)$ operation rather than $O(n)$.
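Using the variable names from the diff above, the hoisting can be sketched like this. The array contents are placeholder random data and the loop body is a dummy reduction; the point is that the $O(n)$ re-sort runs once instead of once per iteration:

```python
import numpy

rng = numpy.random.default_rng(1)
n = 200
nabove = rng.random(n)
par_sort = numpy.argsort(rng.random(n))

# Before: re-sorting inside the loop -> O(n) work done n times = O(n^2)
total_slow = 0.0
for i in range(n):
    nabove_sort = nabove[par_sort]
    total_slow += nabove_sort[i]

# After: hoist the re-sort out of the loop -> O(n) indexing overall
nabove_sort = nabove[par_sort]
total_fast = 0.0
for i in range(n):
    total_fast += nabove_sort[i]

assert numpy.isclose(total_slow, total_fast)
```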

@spxiwh (Contributor) commented Nov 28, 2024

I feel someone else should review this, given my involvement in creating this.

@GarethCabournDavies GarethCabournDavies removed the request for review from spxiwh November 28, 2024 13:50