Stronger Fenix Integration #75
We want to simplify more of the Fenix+KokkosResilience process, focusing for now on implementing the global recovery approach but keeping future localized recovery flows in mind.

Here's the current basic flow for MiniMD:

I'll leave some thoughts on directions to go as comments.

Comments
Fenix Checkpoints

One very straightforward integration is to make a checkpointing backend in KokkosResilience that uses Fenix's checkpoint API for in-memory checkpoints.
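As a rough illustration (not the actual KokkosResilience backend interface), such a backend could wrap Fenix's data-group calls. The names below (`Fenix_Data_group_create`, `Fenix_Data_member_create`/`store`, `Fenix_Data_commit`, `Fenix_Data_member_restore`) come from the Fenix specification, but exact signatures vary between Fenix versions, and `FenixCheckpointBackend` is a hypothetical class:

```cpp
#include <fenix.h>
#include <mpi.h>
#include <vector>

// Hypothetical sketch of a KR checkpoint backend over Fenix data groups.
class FenixCheckpointBackend {
 public:
  FenixCheckpointBackend(MPI_Comm comm, int group_id)
      : comm_(comm), group_id_(group_id) {
    // Redundant in-memory group; depth 1 keeps one neighbor copy.
    // (Signature per the Fenix spec; newer Fenix versions take extra
    // policy arguments here.)
    Fenix_Data_group_create(group_id_, comm_, /*start_time_stamp=*/0,
                            /*depth=*/1);
  }

  // Register one buffer to protect, e.g. a Kokkos::View allocation.
  void add_member(int member_id, void* data, int bytes) {
    Fenix_Data_member_create(group_id_, member_id, data, bytes, MPI_BYTE);
    members_.push_back(member_id);
  }

  // Store every registered member, then commit a consistent timestamp.
  void checkpoint() {
    for (int id : members_)
      Fenix_Data_member_store(group_id_, id, FENIX_DATA_SUBSET_FULL);
    int time_stamp = 0;
    Fenix_Data_commit(group_id_, &time_stamp);
  }

  // After Fenix repairs the communicator, refill a user buffer from the
  // last committed checkpoint.
  void restore(int member_id, void* data, int max_bytes, int time_stamp) {
    Fenix_Data_member_restore(group_id_, member_id, data, max_bytes,
                              time_stamp);
  }

 private:
  MPI_Comm comm_;
  int group_id_;
  std::vector<int> members_;
};
```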
Fenix Exceptions

One flaw with the current design is the use of …. So instead of this:

We could use something like this:
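The two snippets referenced above did not survive in this copy of the thread, so the following is only an illustration of the exception-based direction, not the comment's actual code. `Fenix_Callback_register` is part of the Fenix API; `kr::rank_failure` and the install function are hypothetical:

```cpp
#include <fenix.h>
#include <mpi.h>
#include <stdexcept>

namespace kr {
// Hypothetical exception type carrying the repaired communicator.
struct rank_failure : std::runtime_error {
  MPI_Comm repaired_comm;
  explicit rank_failure(MPI_Comm c)
      : std::runtime_error("rank failure"), repaired_comm(c) {}
};
}  // namespace kr

// Fenix runs registered callbacks on survivor ranks after it repairs the
// communicator. Throwing here unwinds back into KokkosResilience instead
// of forcing the user to write recovery logic inside the callback.
// (This assumes the frames between Fenix and the catch site unwind
// safely, which the design would itself have to guarantee.)
void kr_fenix_callback(MPI_Comm new_comm, int /*error*/, void* /*data*/) {
  throw kr::rank_failure(new_comm);
}

void kr_install_fenix_callback() {
  Fenix_Callback_register(kr_fenix_callback, /*callback_data=*/nullptr);
}
```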
Wrapping Fenix Init

We could also manage Fenix's initialization for the user, in one of two ways: tying Fenix_Init/Fenix_Finalize to the context object's construction and destruction (Option 1), or wrapping them around a scoped recovery region that KokkosResilience controls (Option 2).
There are pros and cons. For both options we can simplify things for end users, but unless we maintain feature parity with Fenix's initialization options we're limiting what the user can do. Additionally, Fenix currently supports only a single Fenix_Init/Fenix_Finalize per application, so we'd be limited to one kr_ctx or scoped region. Option 1 is simplest, but I worry about not knowing the specific ordering of the Fenix_Finalize call, since it would be driven by the automatic context object's destructor; it also doesn't really give us any new ability in KokkosResilience, it's mostly just hiding the Fenix function calls. Option 2 might give us more room to automate recovery. I'm imagining something like this:
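The snippet the comment pointed to here is not shown above; as a stand-in, this is a minimal sketch of what an Option 2 wrapper could look like. `kr::fenix_region` is hypothetical, and the `Fenix_Init` arguments follow the signature in the Fenix specification:

```cpp
#include <fenix.h>
#include <mpi.h>
#include <functional>

namespace kr {
// Hypothetical wrapper, not the actual KokkosResilience API: owns the
// Fenix_Init/Fenix_Finalize pair around a user-supplied recovery region.
inline void fenix_region(int* argc, char*** argv, int spare_ranks,
                         std::function<void(MPI_Comm, int)> body) {
  MPI_Comm new_comm = MPI_COMM_NULL;
  int role = 0, error = 0;
  // By default survivor ranks jump back to Fenix_Init after a failure,
  // so `body` re-executes with the repaired communicator and a role of
  // FENIX_ROLE_SURVIVOR_RANK (or FENIX_ROLE_RECOVERED_RANK on spares).
  Fenix_Init(&role, MPI_COMM_WORLD, &new_comm, argc, argv, spare_ranks,
             /*spawn=*/0, MPI_INFO_NULL, &error);
  body(new_comm, role);
  Fenix_Finalize();
}
}  // namespace kr

// Usage sketch:
//   kr::fenix_region(&argc, &argv, /*spare_ranks=*/1,
//                    [](MPI_Comm comm, int role) {
//     if (role != FENIX_ROLE_INITIAL_RANK) { /* restore checkpoints */ }
//     /* application body using comm */
//   });
```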
When failures happen in the online recovery lambda, KR has full control over updating the MPI_Comm. When exceptions are caught, KR can just restart the whole region. This limits the ability to localize recovery, which is ultimately the final goal, since all survivor ranks do a full re-init; we could probably mix this with the callbacks from the Fenix Exceptions integration above to restore that functionality, though. In the future we could also integrate with a message logger to help automate localization, since we'd have a scoped region for recovery.
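To make the "catch and restart the whole region" idea concrete, here is a sketch that reuses the hypothetical `kr::rank_failure` exception from the Fenix Exceptions sketch above; `execute_region` and `restore_checkpoints` are stand-ins for whatever KR would call internally:

```cpp
#include <mpi.h>

// Hypothetical stand-ins for internal KokkosResilience steps.
void execute_region(MPI_Comm comm);
void restore_checkpoints(MPI_Comm comm);

// kr::rank_failure is the hypothetical exception type sketched earlier.
void run_resilient(MPI_Comm comm) {
  for (;;) {
    try {
      execute_region(comm);  // the user's scoped region
      return;                // clean completion
    } catch (kr::rank_failure const& f) {
      comm = f.repaired_comm;     // Fenix already rebuilt the comm
      restore_checkpoints(comm);  // global rollback on all survivors,
                                  // then retry the whole region
    }
  }
}
```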
Avoiding Re-init

A lot of the problems above relate to resetting state across the survivor/failed ranks. This is necessary in MiniMD since many non-checkpointed objects contain state variables. For the purposes of global recovery, though, we could just checkpoint those state variables as well. So we might be able to skip all of that complexity by using the Magistrate checkpointing work to expand to checkpointing non-Kokkos::View variables. This limits our automation, though, since we can't automatically identify what to checkpoint/recover: the user needs to define the serializer as well as manually register the objects to checkpoint (as sketched below).

Maybe we could serialize the lambdas themselves? Our view detection works by copying the lambda, since the views are stored inside it. If we could figure out some way of serializing the whole lambda along with all the data stored inside, that would help us out. No idea if that's possible, though, since I think lambdas are each implementation-defined objects.
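For illustration, this is roughly what a user-defined Magistrate serializer for non-View state could look like. `SimState` is a made-up stand-in for MiniMD's state objects; the `s | field` idiom and `checkpoint::serialize`/`deserialize` are Magistrate's documented interface, while the surrounding registration with KR is omitted:

```cpp
#include <checkpoint/checkpoint.h>  // Magistrate (formerly "checkpoint")
#include <vector>

// Made-up stand-in for a non-View state object in an app like MiniMD.
struct SimState {
  double dt = 0.0;
  int step = 0;
  std::vector<double> thermo;

  // Intrusive Magistrate serializer: one function describes sizing,
  // packing, and unpacking.
  template <typename SerializerT>
  void serialize(SerializerT& s) {
    s | dt | step | thermo;
  }
};

void checkpoint_state(SimState& state) {
  // Pack into a byte buffer a checkpoint backend could store...
  auto blob = checkpoint::serialize(state);
  // ...and rebuild the object on recovery.
  auto restored = checkpoint::deserialize<SimState>(blob->getBuffer());
}
```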