Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize track vector data layout for particle types #1322

Merged
merged 12 commits into from
Aug 29, 2024

Conversation

amandalund
Copy link
Contributor

I started sketching out a possible implementation of #1312 (still a WIP). This partitiions the last num_new_tracks initializers by charged/neutral and initializes neutral and charged tracks at opposite ends of the track vector.

For now I've disabled the immediate initialization of. a secondary in the parent's track slot (since that results in quite a bit more mixing) and the copying of the parent's geometry state (allowing this would require partitioning the parent IDs as well, which we could do). FWIW I didn't see a ton of speedup from those geometry optimizations anyway (maybe a few percent at most, in cms2018).

@amandalund amandalund added physics Particles, processes, and stepping algorithms performance Changes for performance optimization labels Jul 14, 2024
@amandalund amandalund requested review from sethrj and esseivaju July 14, 2024 23:40
@sethrj
Copy link
Member

sethrj commented Jul 26, 2024

I might test this today... also the test segfaults are caught by clang asan:

[ RUN      ] PartitionDataTest.init_primaries_host
=================================================================
==2683==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7f277607cc58 at pc 0x00000038c3b0 bp 0x7ffc76ad7e70 sp 0x7ffc76ad7e68
READ of size 8 at 0x7f277607cc58 thread T0

@amandalund amandalund marked this pull request as ready for review August 5, 2024 22:49
@amandalund
Copy link
Contributor Author

@sethrj @esseivaju I think this is ready for a look and some testing/profiling now. I haven't done much myself yet, but in some quick initial comparisons saw ~25% improvement on testem3 and ~5% on cms2018 with celer-sim. With merge_events off I saw only a 15% improvement with testem3 and a performance degradation with cms; I haven't looked at the performance of any of the Geant4-integrated apps yet.

@sethrj
Copy link
Member

sethrj commented Aug 6, 2024

throughput
Seems to be barely any bump in performance for Frontier/AMD. @esseivaju if you run the v0.5.0-dev.partitioned branch I'll generate the equivalent plot for perlmutter for the whole run of problems.

@esseivaju
Copy link
Contributor

I'll try to run this tomorrow

@sethrj
Copy link
Member

sethrj commented Aug 12, 2024

rel-throughput

Copying the output from the slack channel. We're obviously not yet copying the track sort parameter into the GPU setup for Celeritas. It's very odd that the CPU "sort" performance dips in a couple of cases.

Copy link
Member

@sethrj sethrj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superb work @amandalund ! We really need to publish some of this 😅

src/celeritas/track/InitializeTracksAction.cc Outdated Show resolved Hide resolved
@sethrj
Copy link
Member

sethrj commented Aug 16, 2024

@amandalund I think this is good to go aside from the minor comments and the conflict 😄 I know it's probably been lost in the pile of pull requests 😅

@amandalund
Copy link
Contributor Author

Thanks @sethrj! I'd modified this to try to reuse the parent's geo state when possible for improved performance, but got sidetracked when I added an assertion that the parent's position match the initializer position and several unit tests started failing. Should be able to wrap this up pretty quickly now that those fixes are in.

@amandalund
Copy link
Contributor Author

Alright, I've updated this so we're now partitioning an array of indices (rather than track initializers) which we can also use to access the parent track slot IDs to copy the geometry state. We'll still have to reinitialize the geometry more often than we would when not sorting (we're not doing the in-place initialization, so we can't reuse the state if the track was killed but produced secondaries), but hopefully it should still help some. @esseivaju if you wouldn't mind rerunning the regression problems with the latest changes and @sethrj updating the performance plot again I'd appreciate it!

@sethrj
Copy link
Member

sethrj commented Aug 17, 2024

Rerunning frontier now...

@sethrj
Copy link
Member

sethrj commented Aug 19, 2024

rel-throughput
Frontier's seeing a solid 10% speedup with this 😄 nice work!

@sethrj
Copy link
Member

sethrj commented Aug 19, 2024

rel-throughput
Perlmutter's also seeing similar speedups for testem3. The Run 2 geometry minus MSC shows that same anomaly...

@sethrj
Copy link
Member

sethrj commented Aug 28, 2024

@esseivaju did you get a chance to re-run develop for the CMS/run2/no-msc case? It would be good to get this in but I'd like to understand the performance regression first.

@amandalund
Copy link
Contributor Author

I also ran it actually (on an A100 + AMD EPYC 7532):
rel-throughput

@sethrj
Copy link
Member

sethrj commented Aug 28, 2024

Weird...

@amandalund
Copy link
Contributor Author

It's possible the worse performance for cms+msc could still be the penalty of having to reinitialize the geometry state more often than on develop...

@sethrj sethrj enabled auto-merge (squash) August 29, 2024 11:45
@sethrj
Copy link
Member

sethrj commented Aug 29, 2024

That makes a lot of sense, good thought @amandalund . The cost of a step iteration without MSC or field is relatively low compared to with MSC and field, and the cost of initialization with vecgeom/CMS is much higher than testem3. We could potentially do a better job of serializing the geometry state for fast reconstruction by storing an additional "unique volume ID" (aka VecGeom navindex, or expanded ORANGE volume ID) on the track and defining a new initializer that reconstructs the logical geometry state more efficiently than searching based on the point.

@sethrj sethrj merged commit 3f6f041 into celeritas-project:develop Aug 29, 2024
29 checks passed
@amandalund
Copy link
Contributor Author

Yeah it definitely seems worthwhile to explore ways of speeding up that initialization.

@amandalund amandalund deleted the track-data-layout branch August 29, 2024 13:15
@amandalund
Copy link
Contributor Author

Sorry @sethrj I only just realized that M means !MSC 😅. Maybe we should invert that?

@sethrj
Copy link
Member

sethrj commented Sep 1, 2024

There's supposed to be a tilde on top; matplotlib didn't render it correctly. Since the results aren't correct without msc I figured it was better to start denoting "not included" rather than "accurate"...

@amandalund
Copy link
Contributor Author

It does seem better to better to mark that some problems are missing key physics than the other way around... though on the other hand $F$ for field and $M$ for msc is much more obvious when glancing at a plot than $F$ for field and $\tilde{M}$ for no msc.

@sethrj
Copy link
Member

sethrj commented Sep 3, 2024

I'm cool with changing this. Maybe better (if we want to go crazy) would be to replace the "no MSC" mode with full coulomb scattering? I have no idea if that'll increase our run times by 1000x though...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance Changes for performance optimization physics Particles, processes, and stepping algorithms
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants