-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure reproducibility when using MT Geant4 with Celeritas offloading #1061
Ensure reproducibility when using MT Geant4 with Celeritas offloading #1061
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should move the CUDA code to the main celeritas library to live with the RNG or global core state/params. That will also allow us to add a test 😉
src/accel/LocalTransporter.hh
Outdated
@@ -66,6 +66,9 @@ class LocalTransporter | |||
// Set the event ID | |||
void SetEventId(int); | |||
|
|||
// Set the event ID and reseed the Celeritasa RNG at the start of an event |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Set the event ID and reseed the Celeritasa RNG at the start of an event | |
// Set the event ID and reseed the Celeritas RNG at the start of an event |
src/accel/LocalTransporter.cc
Outdated
CELER_EXPECT(*this); | ||
CELER_EXPECT(id >= 0); | ||
|
||
this->SetEventId(id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since ensuring reproducibility is a "bug fix", I think we can change the behavior so we always initialize at the start of an event. I like your name change that reflects the initialization cost, so what about having SetEventId
be a [[deprecated]]
inline alias to InitializeEvent
(update the comment to say to be removed in v1.0
), and replacing this call with the SetEventId
code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call.
src/accel/LocalTransporter.cc
Outdated
if (celeritas::device()) | ||
{ | ||
auto const& state | ||
= dynamic_cast<CoreState<MemSpace::device> const&>( | ||
step_->state()); | ||
detail::reseed_rng(rng_->device_ref(), state.ref().rng, id); | ||
} | ||
else | ||
{ | ||
auto const& state = dynamic_cast<CoreState<MemSpace::host> const&>( | ||
step_->state()); | ||
detail::reseed_rng(rng_->host_ref(), state.ref().rng, id); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's perhaps add a reseed
method to the stepper? It'll make this a single line, remove the const casting, and hide the tempating.
src/accel/detail/RngReseed.cu
Outdated
init.seed = params.seed[0]; | ||
#else | ||
init.seed = params.seed; | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the RNG engine initializer should have a seed
that's the same type as params.seed
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good point, I can change the initializer seed to an array. Is this still a plan for xorwow?
// TODO: 256-bit seed used to generate initial states for the RNGs
// For now, just 4 bytes (same as our existing cuda/hip interface)
Array<uint_t, 1> seed;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be? 😅 it would make sense to do, but yeah I don't see any practical use until we start feeding hashed time stamps from a python front end
@sethrj I pushed the changes, but I still need to add a test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Just the test now :)
src/celeritas/global/Stepper.cc
Outdated
if constexpr (M == MemSpace::device) | ||
{ | ||
reseed_rng(params_->rng()->device_ref(), state_.ref().rng, event_id); | ||
} | ||
else | ||
{ | ||
reseed_rng(params_->rng()->host_ref(), state_.ref().rng, event_id); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From corecel/data/Ref.hh
:
if constexpr (M == MemSpace::device) | |
{ | |
reseed_rng(params_->rng()->device_ref(), state_.ref().rng, event_id); | |
} | |
else | |
{ | |
reseed_rng(params_->rng()->host_ref(), state_.ref().rng, event_id); | |
} | |
reseed_rng(get_ref<M>(*params_->rng()), state_.ref().rng, event_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks :)
* Remove assert from constexpr range construction * Fix range constexpr by removing assert * Revert "Fix range constexpr by removing assert" * Remove constexpr from range usage * Use c++ instead of c preprocessor
This reverts commit fb9d322.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Good work @amandalund , this definitely addresses a capability gap that we needed.
This reinitializes the Celeritas RNG states at the start of each event when integrating with MT Geant4 so that results are reproducible. Each state is initialized with the same seed and a unique subsequence so that sequences on different threads will
not be correlated. Using a simple problem (simple CMS with 1024 events and 128 primaries per event) I confirmed for both the cuRAND and XORWOW RNGs that the results were not reproducible before and are with this fix and that there was no change in performance.
Closes #1023.