
Require one event at a time #1233

Open
sethrj opened this issue May 13, 2024 · 8 comments
Labels
physics Particles, processes, and stepping algorithms

Comments

@sethrj
Member

sethrj commented May 13, 2024

Currently track IDs are constructed within an event based on an atomic counter. Even though the results (track slots, etc.) are reproducible across runs, the individual track IDs are not, so you can't dig into a particular history using an a priori track ID. We could simplify reproducibility by requiring all tracks in flight to be from the same event (i.e., the usual way we integrate into Geant4).

This also would eliminate the need for max_events (#1058). We could store the single event ID on the "core state" object since we don't need access to it on GPU. This would also reduce the memory requirements by sizeof(int) * (initializers.size() + state.size()).

@sethrj sethrj added the physics Particles, processes, and stepping algorithms label May 13, 2024
@sethrj sethrj changed the title Make track IDs reproducible across runs Require one event at a time to make track IDs reproducible across runs Jun 27, 2024
@sethrj sethrj changed the title Require one event at a time to make track IDs reproducible across runs Require one event at a time Jun 27, 2024
@sethrj
Member Author

sethrj commented Jul 9, 2024

From @amandalund :

with our standalone app I’ve typically seen better performance transporting multiple events at a time (e.g. about ~2x faster for the cms2018+field+msc-vecgeom-gpu regression problem with merge_events=true ). Our regression suite merges events for celer-sim on the gpu, right? I don’t think I’ve run those problems in single-event mode in a while, but it would be good to compare. I’d still probably prefer to keep that capability.

My follow up:

fair enough! I think it'd also be good to run with a few more events (to reduce the load imbalance from each core having only a single event) and make sure that the merge_events=false case has optimal track slots

@amandalund
Contributor

amandalund commented Sep 15, 2024

Here's where the speedup running one event at a time stands with all the latest performance improvements (but not partitioning by charge when running with merge_events off, since it doesn't help there):
rel-throughput-merge-events

Still a bit slower when not combining events, but I haven't tried increasing the total number of events or optimizing the number of track slots (which may also be problem dependent).

EDIT: Here's the same plot but with 4x the number of track slots when merge_events is off:
rel-throughput-merge-events-4xts

and the same as above but partitioning by charge with merge_events off as well:
rel-throughput-merge-events-partition-4xts

@sethrj
Member Author

sethrj commented Sep 16, 2024

@amandalund Why doesn't partitioning by charge help? 😕 (I don't think it helps GPU+G4 since we don't yet pass that option through right?) And is this after #1405? And so this is just the regression problems with and without "merge event"?

@amandalund
Contributor

amandalund commented Sep 16, 2024

Right, toggling merge_events and after #1405. And yeah, we don't pass the option to celer-g4 so shouldn't expect any difference there. Here's the speedup partitioning charge with merge_events off (this is with 16 CPU cores and one A100):
rel-throughput-merge-events-partition

It helps a bit in some cases, but more often seems to hurt... I'm not really sure why that is yet.

EDIT: Same plot as above but with 4x the number of track slots:
rel-throughput-nomerge-partition-4xts

@amandalund
Contributor

Ok @sethrj good news: I played around with different numbers of track slots, and increasing this number does improve performance. I added plots above using 4x as many track slots.

With more slots the partitioning does help, and continues to help relatively more as the state size grows (though overall performance gets worse past a certain point), so it could be that with a larger state we get less mixing for longer.

@sethrj
Member Author

sethrj commented Sep 16, 2024

@amandalund To maintain the memory limits (especially with CMS and on V100) I have the runner script reduce the number of track slots when merge_events is off on GPU. Can you verify from the output what are the reported track slot count and number of primaries per thread?

EDIT: and on an unrelated note, do you mind setting transparent=False for the PNGs that you upload to github? They render like this with a dark background 😅
Screenshot 2024-09-16 at 15 17 46

@amandalund
Contributor

2^18 track slots and 1300 primaries per thread (I know for celer-sim we do that division in the Runner class).

@sethrj
Member Author

sethrj commented Sep 16, 2024

Oh ok, that's great; I was afraid that was with the 2^20 number.
