Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hafnian repeated omp #120

Merged
merged 11 commits into from
Jan 17, 2020
Merged

Hafnian repeated omp #120

merged 11 commits into from
Jan 17, 2020

Conversation

trevor-vincent
Copy link
Contributor

@trevor-vincent trevor-vincent commented Jan 13, 2020

Context:

The hafnian repeated moment algorithm was missing OpenMP support.

Description of the Change:

Add OpenMP support to repeated moment hafnian algorithm.

Benefits:

The code will be sped up proportional to the number of cores available.

Possible Drawbacks:

None

Related GitHub Issues:

#46

@codecov
Copy link

codecov bot commented Jan 13, 2020

Codecov Report

Merging #120 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #120      +/-   ##
==========================================
- Coverage   97.75%   97.72%   -0.03%     
==========================================
  Files          12       12              
  Lines         891      881      -10     
==========================================
- Hits          871      861      -10     
  Misses         20       20
Impacted Files Coverage Δ
thewalrus/fock_gradients.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ba11956...f5c7214. Read the comment docs.

@josh146
Copy link
Member

josh146 commented Jan 13, 2020

Thanks @trevor-vincent! It looks like the windows build is failing with the following error:

  include\repeated_hafnian.hpp(125): error C3861: 'omp_get_num_threads': identifier not found
  include\repeated_hafnian.hpp(126): error C3861: 'omp_get_thread_num': identifier not found

I don't think I ever got OpenMP compiling with windows/MSVCC, so I had disabled it in the setup.py.
Unless you know how to enable OpenMP on windows, probably sufficient to have the compiler check if it is using MSVCC, and if so, not use those functions.

@trevor-vincent
Copy link
Contributor Author

Okay thanks Josh. Let me see if I can put some preprocessor code to get around this issue for the windows build.

@josh146
Copy link
Member

josh146 commented Jan 13, 2020

It seems that the _OPENMP flag is defined on Windows for some reason 🤔 Another option is to use _MSC_VER (which should be defined)

@nquesada
Copy link
Collaborator

@trevor-vincent Could you share with us some benchmarking of the implementation in this PR vs the old one? And also some basic idea of how embarrassing is the parallelization, i.e. run a non negligible calculation with OMP_NUM_THREAD = 1,2,4 ?

@trevor-vincent
Copy link
Contributor Author

trevor-vincent commented Jan 13, 2020

@trevor-vincent Could you share with us some benchmarking of the implementation in this PR vs the old one? And also some basic idea of how embarrassing is the parallelization, i.e. run a non negligible calculation with OMP_NUM_THREAD = 1,2,4 ?

It is pretty much embarrassingly parallel, here is some results for n = 30 loop hafnian on a 4-core machine. The speedup would probably get even better as n goes to infinity

loop hafnian / n = 30 / rpt = 11111... / mat = random
cores ----- time(s) ----- speedup
1 ----- 322.51 ----- 1.00
2 ----- 161.42 ----- 1.99
3 ----- 112.95 ----- 2.85
4 ----- 86.32 ----- 3.74

Edit: 4 cores will always be a bit lower than a 4x speedup because the machine needs one core to deal with the operating system. 3 cores is not as good as 2 because the number of sum terms isn't divisible by three so the inefficiency might be due to the uneven division of the sum between cores. Other than that, this is quite good scaling.

@trevor-vincent
Copy link
Contributor Author

trevor-vincent commented Jan 14, 2020

Old non-parallel version runs in a time of 296.973s. Therefore the new version has an 8.7% slow down for n = 30 on 1 core versus old version, but this percentage should decrease as n gets larger because the only major changes are outside of the main O(n*2^n) work loop. This very small slow down is due to the necessary overhead in computing which sections of the sum each of the cores will deal with.

@trevor-vincent
Copy link
Contributor Author

trevor-vincent commented Jan 16, 2020

I can actually (I think) remove the overhead in the n=1 case, so I'll do that tomorrow.

@trevor-vincent
Copy link
Contributor Author

newer version on one core is now slightly faster than the older version on one core, e.g. running it now on 1 core for n = 30 I get:
new version loop hafnian time = 297.621
old version loop hafnian time = 303.325

Copy link
Member

@josh146 josh146 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @trevor-vincent this looks great! I left very minor comments regarding some comments and docstrings, but didn't go through the new code logic.

Small question: is there a significant performance increase for the single threaded version when n < 30? Or is it negligible?

include/repeated_hafnian.hpp Outdated Show resolved Hide resolved
include/repeated_hafnian.hpp Outdated Show resolved Hide resolved
include/repeated_hafnian.hpp Outdated Show resolved Hide resolved
include/repeated_hafnian.hpp Outdated Show resolved Hide resolved
include/repeated_hafnian.hpp Outdated Show resolved Hide resolved
include/repeated_hafnian.hpp Show resolved Hide resolved
@trevor-vincent
Copy link
Contributor Author

Okay great thanks Josh, I'll try to fix these things. Regarding whether the speedup is negligible for n < 30, I'll try to make a plot and hopefully this will help clarify things. From recollection the speedup is pretty meaningful for n>24. For n<24 the hafnian is computed so fast that the extra speedup isn't really needed. However there is actually still a speedup for 4 cores at even n=14, but I doubt it is close to 4x. Probably at some n there will be no speedup because the overhead for the extra cores is too much. All of this of course depends on the rpt vector which can increase the workload at any n. The greater its values the more work there will be and the more useful the speedup is. For the tests I've been setting rpt=111111111..., so it doesn't affect things much and the workload is more dependent on n.

@trevor-vincent
Copy link
Contributor Author

trevor-vincent commented Jan 17, 2020

speedup

speedup over different n's (rpt = 11111.....)
not sure what the cores=4 randomness at small n is due to atm, but these times are really small so timings are probably very inaccurate here anyway

@trevor-vincent
Copy link
Contributor Author

1core_times

1 core times for reference

@nquesada
Copy link
Collaborator

Look like it is ready to be merged!

Copy link
Member

@josh146 josh146 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Trevor! Looks good from my end, will just wait to see if @nquesada has any comments

@nquesada nquesada merged commit f227351 into master Jan 17, 2020
@nquesada nquesada deleted the hafnian_repeated_omp branch January 21, 2020 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants