Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure reproducibility when using MT Geant4 with Celeritas offloading #1061

Merged
merged 14 commits into from
Dec 15, 2023

Conversation

amandalund
Copy link
Contributor

This reinitializes the Celeritas RNG states at the start of each event when integrating with MT Geant4 so that results are reproducible. Each state is initialized with the same seed and a unique subsequence so that sequences on different threads will
not be correlated. Using a simple problem (simple CMS with 1024 events and 128 primaries per event) I confirmed for both the cuRAND and XORWOW RNGs that the results were not reproducible before and are with this fix and that there was no change in performance.

Closes #1023.

@amandalund amandalund added bug Something isn't working external Dependencies and framework-oriented features labels Dec 11, 2023
@amandalund amandalund requested a review from sethrj December 11, 2023 23:46
Copy link
Member

@sethrj sethrj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should move the CUDA code to the main celeritas library to live with the RNG or global core state/params. That will also allow us to add a test 😉

@@ -66,6 +66,9 @@ class LocalTransporter
// Set the event ID
void SetEventId(int);

// Set the event ID and reseed the Celeritasa RNG at the start of an event
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Set the event ID and reseed the Celeritasa RNG at the start of an event
// Set the event ID and reseed the Celeritas RNG at the start of an event

CELER_EXPECT(*this);
CELER_EXPECT(id >= 0);

this->SetEventId(id);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since ensuring reproducibility is a "bug fix", I think we can change the behavior so we always initialize at the start of an event. I like your name change that reflects the initialization cost, so what about having SetEventId be a [[deprecated]] inline alias to InitializeEvent (update the comment to say to be removed in v1.0), and replacing this call with the SetEventId code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call.

Comment on lines 147 to 160
if (celeritas::device())
{
auto const& state
= dynamic_cast<CoreState<MemSpace::device> const&>(
step_->state());
detail::reseed_rng(rng_->device_ref(), state.ref().rng, id);
}
else
{
auto const& state = dynamic_cast<CoreState<MemSpace::host> const&>(
step_->state());
detail::reseed_rng(rng_->host_ref(), state.ref().rng, id);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's perhaps add a reseed method to the stepper? It'll make this a single line, remove the const casting, and hide the tempating.

init.seed = params.seed[0];
#else
init.seed = params.seed;
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the RNG engine initializer should have a seed that's the same type as params.seed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, I can change the initializer seed to an array. Is this still a plan for xorwow?

// TODO: 256-bit seed used to generate initial states for the RNGs
// For now, just 4 bytes (same as our existing cuda/hip interface)
Array<uint_t, 1> seed;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be? 😅 it would make sense to do, but yeah I don't see any practical use until we start feeding hashed time stamps from a python front end

@amandalund
Copy link
Contributor Author

@sethrj I pushed the changes, but I still need to add a test.

Copy link
Member

@sethrj sethrj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Just the test now :)

src/accel/LocalTransporter.cc Show resolved Hide resolved
Comment on lines 119 to 126
if constexpr (M == MemSpace::device)
{
reseed_rng(params_->rng()->device_ref(), state_.ref().rng, event_id);
}
else
{
reseed_rng(params_->rng()->host_ref(), state_.ref().rng, event_id);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From corecel/data/Ref.hh:

Suggested change
if constexpr (M == MemSpace::device)
{
reseed_rng(params_->rng()->device_ref(), state_.ref().rng, event_id);
}
else
{
reseed_rng(params_->rng()->host_ref(), state_.ref().rng, event_id);
}
reseed_rng(get_ref<M>(*params_->rng()), state_.ref().rng, event_id);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks :)

src/celeritas/global/Stepper.hh Outdated Show resolved Hide resolved
amandalund and others added 6 commits December 14, 2023 20:17
* Remove assert from constexpr range construction

* Fix range constexpr by removing assert

* Revert "Fix range constexpr by removing assert"

* Remove constexpr from range usage

* Use c++ instead of c preprocessor
Copy link
Member

@sethrj sethrj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Good work @amandalund , this definitely addresses a capability gap that we needed.

@sethrj sethrj merged commit bc3ccd5 into celeritas-project:develop Dec 15, 2023
20 checks passed
@amandalund amandalund deleted the g4-reproducibility branch December 15, 2023 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working external Dependencies and framework-oriented features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ensure reproducibility when using MT Geant4 with Celeritas offloading
2 participants