Profiling for red-mot #9
I'll add inner parallelism to
Introduced precalculation for
Tried removing ECS barriers; it doesn't make much difference (~42s both times).
The benchmark was being limited by writing to file, so I removed file output - 42s -> 37.7s.
More profiling fun. I noticed there are some heap allocations which are taking up a lot of time. They come from the various `Vec`-backed sampler components. It does seem somewhat counter-productive to the general ECS data pattern to use heap-allocated `Vec` components. I am tempted to try a branch where we cap the number of laser beams (eg 16, or some user-defined variable) and instead use fixed-size sampler arrays. It uses more memory, but having that memory not heap-allocated might just be faster. Fairly straightforward to test it.
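For concreteness, here is a minimal sketch of the kind of change described above, written against `specs`-style components; the names (`BeamSample`, `LaserSamplersVec`, `LaserSamplersFixed`, `MAX_BEAMS`) are hypothetical stand-ins, not the crate's actual types.

```rust
use specs::{Component, VecStorage};

/// Hypothetical per-beam sample; stands in for whatever each sampler stores.
#[derive(Copy, Clone, Default)]
pub struct BeamSample {
    pub intensity: f64,
    pub detuning: f64,
}

/// Cap on the number of cooling beams (could be user-configurable).
pub const MAX_BEAMS: usize = 16;

/// Before: one heap allocation per entity, sampler data scattered over the heap.
pub struct LaserSamplersVec {
    pub contents: Vec<BeamSample>,
}
impl Component for LaserSamplersVec {
    type Storage = VecStorage<Self>;
}

/// After: a fixed-size array stored inline in the component, so the sampler
/// data lives contiguously inside the ECS storage with no per-entity allocation.
pub struct LaserSamplersFixed {
    pub contents: [BeamSample; MAX_BEAMS],
    /// Number of valid entries in `contents`.
    pub len: usize,
}
impl Component for LaserSamplersFixed {
    type Storage = VecStorage<Self>;
}
```

The trade-off is that every entity pays for `MAX_BEAMS` slots whether or not they are all used, which is the extra memory cost mentioned above.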
Still more profiling, and it's a big one! Following the above, I removed all heap-allocated vector components (ie, those stored in a `Vec`). The performance increase is staggering! Runtime is now 10.6s, down from 21.5s - another factor of two in speed. The effective CPU utilisation is extremely good. One caveat - I broke something during the changes, so the simulation is not actually giving the correct physics right now. Nonetheless, the same number of systems are running, so I believe the timings are still representative. It's probably worth fixing the red-mot-perf branch and merging it back onto red-mot.
Looking at this again, some of the speedup was because the poisson dist was no longer being calculated, due to a NaN error. In 1c5fd I implemented the fixed-size sampler arrays. There is a performance gain, but not quite as substantial as before - 17.7s compared to 22.5s using the `Vec`-based components.
In 8558d17 I improved system-level parallelism, which shaved the time from 17.7s to 16s. There is still about 25s of total CPU time spent spinning, which probably costs around 2s of wall time.
Other things tested -
Removing sampler initialisation and instead relying on the mask, leaving unused data uninitialised: time goes from 18.3s -> 14.5s (see the sketch below).
Implemented in 7d0f52d.
Note that without
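A rough sketch of one way to read the mask idea above, reusing the hypothetical `BeamSample` and `MAX_BEAMS` from the earlier sketch (the real component and field names will differ): reset only a validity mask each step instead of re-initialising every slot, and never read slots the mask does not cover.

```rust
/// Hypothetical masked sampler: `contents` slots are only meaningful where
/// `mask` is true, so they never need to be zeroed between steps.
pub struct MaskedSamplers {
    pub contents: [BeamSample; MAX_BEAMS],
    pub mask: [bool; MAX_BEAMS],
}

impl MaskedSamplers {
    /// Clearing touches only the mask; the (possibly stale) sample data is
    /// left alone, which is much cheaper than rewriting MAX_BEAMS entries.
    pub fn clear(&mut self) {
        self.mask = [false; MAX_BEAMS];
    }

    pub fn write(&mut self, index: usize, sample: BeamSample) {
        self.contents[index] = sample;
        self.mask[index] = true;
    }

    /// Iterate only over the slots written since the last clear.
    pub fn valid(&self) -> impl Iterator<Item = &BeamSample> + '_ {
        self.contents
            .iter()
            .zip(self.mask.iter())
            .filter(|(_, &written)| written)
            .map(|(sample, _)| sample)
    }
}
```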
Calculating the rate coefficients is currently one of the longest operations in the code, and this particular implementation is back-end bound (both in memory and core).
This code is problematic for various reasons: first, it doesn't get vectorised, so the float operations are scalar. Second is the 'random' access through the sampler arrays using the cooling light index. It's also inefficient how this work is dispatched. An easy way to improve this would be to also pack the required data into a contiguous layout (a sketch of the idea follows).
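For illustration, a sketch of what a packed, vectorisation-friendly layout could look like; both the struct and the saturation-style formula below are stand-ins (reusing the hypothetical `MAX_BEAMS`), not the crate's actual rate-coefficient code.

```rust
/// Hypothetical packed inputs for the rate-coefficient calculation: everything
/// the inner loop needs, laid out as flat arrays so the per-beam loop is a
/// straight linear pass with no indirection through a cooling-light index.
pub struct RateCoefficientInputs {
    pub intensity: [f64; MAX_BEAMS],
    pub detuning: [f64; MAX_BEAMS],
    pub saturation_intensity: f64,
    pub linewidth: f64,
}

/// Stand-in saturation formula, written branch-free so the compiler at least
/// has a chance to unroll and vectorise the loop.
pub fn rate_coefficients(inputs: &RateCoefficientInputs, out: &mut [f64; MAX_BEAMS]) {
    let gamma = inputs.linewidth;
    for i in 0..MAX_BEAMS {
        let s = inputs.intensity[i] / inputs.saturation_intensity;
        let delta = inputs.detuning[i];
        out[i] = 0.5 * gamma * s / (1.0 + s + (2.0 * delta / gamma).powi(2));
    }
}
```

As the next two comments note, this kind of restructuring did not actually coax LLVM into emitting vector instructions here, so treat it as the intent rather than a proven win.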
This was attempted in 4098053, but I wasn't able to get LLVM to use vectorised operations. Reverted for now.
Tried it again in e425a5e, but with no meaningful improvement.
Made the remaining laser systems parallel in c69c0b4. The code is now very evenly distributed over all processors and is DRAM-bandwidth bound ~50% of the time.
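For reference, the pattern used to make a `specs` system parallel is roughly the following; the components and the arithmetic are placeholders, but `par_join` is the real `specs`/rayon mechanism (it requires specs' parallel feature).

```rust
use rayon::prelude::*;
use specs::{Component, ParJoin, ReadStorage, System, VecStorage, WriteStorage};

// Minimal stand-in components (hypothetical, not the crate's real types).
pub struct Detuning(pub f64);
impl Component for Detuning {
    type Storage = VecStorage<Self>;
}

pub struct ScatteringRate(pub f64);
impl Component for ScatteringRate {
    type Storage = VecStorage<Self>;
}

/// Swap `.join()` + for-loop for `.par_join()` + `for_each`, letting rayon
/// split the per-entity work across its thread pool.
pub struct CalcRateSystem;

impl<'a> System<'a> for CalcRateSystem {
    type SystemData = (
        ReadStorage<'a, Detuning>,
        WriteStorage<'a, ScatteringRate>,
    );

    fn run(&mut self, (detunings, mut rates): Self::SystemData) {
        (&detunings, &mut rates)
            .par_join()
            .for_each(|(detuning, rate)| {
                // Placeholder per-entity work; the real laser systems do the
                // sampling / rate calculations here.
                rate.0 = 1.0 / (1.0 + detuning.0 * detuning.0);
            });
    }
}
```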
For general interest, I did a performance test to benchmark the ECS approach against straight matrix multiplication in matlab.
This gives the following: iterating over 1,000,000 entities for 10k steps takes 18s. That is a wall time per step of 1.8ms, and a wall time per step per atom of 1.8ns, which is a factor of ~4 slower than straight multiplication in matlab. The matlab multiplication is compiled using float vector routines, which I still can't get working in rust, so that probably accounts for the difference. All in all, I think the safety and flexibility of the rust code is worth it. This benchmark is also a case that favours matlab - multiplication of matrices is kind of matlab's whole point. I expect the rust will reach the same performance if I manage to get vector routines working, but I think enough is enough for now.
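A minimal sketch of the sort of benchmark described above (not the exact code used): 1,000,000 entities, 10k steps, a trivial per-entity multiply dispatched through `specs`.

```rust
use specs::{
    Builder, Component, DispatcherBuilder, Join, System, VecStorage, World, WorldExt,
    WriteStorage,
};
use std::time::Instant;

// Hypothetical reconstruction of the benchmark shape: a single f64 component
// updated in place each step.
pub struct Value(pub f64);
impl Component for Value {
    type Storage = VecStorage<Self>;
}

struct Multiply;
impl<'a> System<'a> for Multiply {
    type SystemData = WriteStorage<'a, Value>;

    fn run(&mut self, mut values: Self::SystemData) {
        for value in (&mut values).join() {
            value.0 *= 1.000001;
        }
    }
}

fn main() {
    let mut world = World::new();
    world.register::<Value>();
    for _ in 0..1_000_000 {
        world.create_entity().with(Value(1.0)).build();
    }

    let mut dispatcher = DispatcherBuilder::new()
        .with(Multiply, "multiply", &[])
        .build();

    let start = Instant::now();
    for _ in 0..10_000 {
        dispatcher.dispatch(&world);
        world.maintain();
    }
    // ~18s wall time here would correspond to the ~1.8 ns per atom per step quoted above.
    println!("elapsed: {:?}", start.elapsed());
}
```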
Starting to profile the latest changes to red-mot to get a head start on optimisations.
Interestingly, it seems the vast majority of program time was being taken up by rayon implementations - specifically `SwitchToThread` and `SleepConditionVariableSRW`. There was also poor spin time.
This seems likely related to a known issue in rayon, where sleeping threads would burn a lot of cycles/power. For example, see:
rayon-rs/rayon#642
rayon-rs/rayon#795
rayon-rs/rfcs#5
It looks like some of these issues were fixed in rayon 1.4.0, although we are currently on rayon 1.3.0. I'll try updating rayon (the new version is 1.5.0) and see if it improves things.