Use separate along-step kernel for neutral particles for 25% performance boost #745
8ea237b to bd372de
Interestingly, the geo propagation limit seems to be fixed? Results are now consistent with ORANGE (!)
Other than the diagnostic test failing, everything looks good to me.
Nice speedup! Looks good to me too once the test is fixed. Interestingly, for cms2018+msc+field this decreased the mean number of steps per gamma track by ~10%.
@amandalund Yeah, that might be related to the change in results for the diagnostic: I'm not sure how, but the
actions are no longer showing up. Maybe it's a lingering issue with positrons being biased to die near geometry boundaries?
I suppose you're referring to Independent Thread Scheduling, introduced with Volta, where each thread has its own PC and SP. If my understanding is correct, execution is still SIMT even then, so lockstep execution still happens; divergent code paths can simply be interleaved because each thread has its own state. I suppose that still helps when threads are stalled on memory transfers or instruction fetches (our two most common causes of stalling), since the other code path can execute in the meantime? I'd still expect AMD to be faster since it has less work to do.
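To make the lockstep argument concrete, here is a toy cost model, written as a minimal sketch rather than a measurement. It assumes that a warp containing both charged and neutral tracks pays for both branches one after the other, and that a warp containing only one kind pays for that branch alone. The names `Kind`, `warp_cost`, `total_cost` and the two cost constants are hypothetical and do not come from the code base.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy SIMT cost model (illustration only): a warp that holds both kinds of
// track executes both branches serially; a uniform warp executes only one.
enum class Kind { neutral, charged };

// Hypothetical per-branch costs in arbitrary units (assumed, not measured).
constexpr int neutral_cost = 1;
constexpr int charged_cost = 10;

// Cost of one warp covering track slots [begin, end).
int warp_cost(std::vector<Kind> const& tracks, std::size_t begin, std::size_t end)
{
    bool has_neutral = false;
    bool has_charged = false;
    for (std::size_t i = begin; i < end; ++i)
    {
        if (tracks[i] == Kind::neutral)
            has_neutral = true;
        else
            has_charged = true;
    }
    return (has_neutral ? neutral_cost : 0) + (has_charged ? charged_cost : 0);
}

// Sum the cost over consecutive warps of the given width.
int total_cost(std::vector<Kind> const& tracks, std::size_t warp_size)
{
    assert(warp_size > 0);
    int total = 0;
    for (std::size_t b = 0; b < tracks.size(); b += warp_size)
    {
        total += warp_cost(tracks, b, std::min(b + warp_size, tracks.size()));
    }
    return total;
}
```

With 64 tracks and a warp width of 32, an alternating neutral/charged layout costs 22 units in this model, while the same tracks grouped by kind cost 11. The model ignores latency hiding and interleaving entirely, so it only illustrates why mixed warps do more total work.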
Interesting, could be... I do see the
This results in a ~25% speedup for GPU tracks on CMS+msc+field. Sorting the along-step tracks gives a slight (5-10% per kernel) performance improvement for the geometry kernels, but it also adds a substantial (~7% of overall execution) increase in pre-step kernel time plus a ~5% overall penalty for the sort itself. Sorting by both action and post-step action further increases the overall time.
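For readers unfamiliar with the idea, sorting tracks by along-step action can be sketched on the host as an index permutation plus a per-action range lookup, so that each kernel only visits the slots that belong to it. This is a minimal CPU-side illustration under stated assumptions, not the implementation in this PR: `ActionId`, `Range`, `sort_by_action` and `action_range` are hypothetical names, and a real GPU version would use a device-side sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using ActionId = int;  // hypothetical stand-in for an action identifier

// Half-open range [begin, end) of positions in the sorted order.
struct Range
{
    std::size_t begin;
    std::size_t end;
};

// Return track slot indices ordered by their along-step action.
// The stable sort keeps the original slot order within each action.
std::vector<std::size_t> sort_by_action(std::vector<ActionId> const& actions)
{
    std::vector<std::size_t> order(actions.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return actions[a] < actions[b];
                     });
    return order;
}

// Find the contiguous block of the sorted order that has the given action.
Range action_range(std::vector<std::size_t> const& order,
                   std::vector<ActionId> const& actions,
                   ActionId id)
{
    assert(order.size() == actions.size());
    auto lo = std::lower_bound(order.begin(), order.end(), id,
                               [&](std::size_t slot, ActionId value) {
                                   return actions[slot] < value;
                               });
    auto hi = std::upper_bound(order.begin(), order.end(), id,
                               [&](ActionId value, std::size_t slot) {
                                   return value < actions[slot];
                               });
    return Range{static_cast<std::size_t>(lo - order.begin()),
                 static_cast<std::size_t>(hi - order.begin())};
}
```

A kernel for a given action would then loop over `order[r.begin]` through `order[r.end - 1]` only. The sort is extra work on every step, which is consistent with the sort penalty reported above.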
NOTE: preliminary testing suggests this is terrible for AMD hardware by default: the along-step-field time does not change for TestEm3, and the along-step-neutral is simply added on top of it. Perhaps because AMD doesn't have separate hardware counters?