Checkpoint/Resume #2081
Conversation
- … fvm_lowered_cell.
- …rlie' into feat/check-point-charlie
- @boeschf doublechecks CUDA 11 avail on piz daint: it is!
- …at/check-point-charlie
@@ -84,6 +89,49 @@ class event_stream : public event_stream_base<Event, typename memory::device_vec
        arb_assert(num_events == base::ev_data_.size());
    }

    friend void serialize(serializer& ser, const std::string& k, const event_stream<Event>& t) {
Wouldn't it be less error-prone if the base class template `event_stream_base` had its own serialization hooks which could be called from here?
Hmm. It's a tradeoff. In the current version I can use the ENABLE macro in the `multicore` subclass and need a custom serializer only in the `gpu` case. Your proposal is slightly more code to write:

- ENABLE in base
- custom one-liner in `multicore`
- custom serializer in `gpu`

minus the base members, plus one line for serializing the base.

The problem arises when writing out the event spans. One is a GPU memory view, aka `(ptr, length)`, and the multicore one is a range, aka `(ptr_beg, ptr_end)`. It seems prudent to maybe merge representations first?
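For illustration, a minimal sketch of the proposed layout, assuming a free-function `serialize` hook in the style of the diff above; the `serializer` stub, the member names, and `gpu_event_stream` are stand-ins, not Arbor's actual code:

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct serializer {};  // stand-in for the opaque key/value sink

// free-function hooks in the style of the diff's
// `friend void serialize(serializer&, const std::string&, ...)`
void serialize(serializer&, const std::string&, const std::vector<double>&) {}
void serialize(serializer&, const std::string&, std::size_t) {}

template <typename Event>
struct event_stream_base {
    std::vector<double> ev_data_;
    std::size_t index_ = 0;

    // the base serializes its own members exactly once...
    friend void serialize(serializer& ser, const std::string& k, const event_stream_base& t) {
        serialize(ser, k + "/ev_data", t.ev_data_);
        serialize(ser, k + "/index",   t.index_);
    }
};

template <typename Event>
struct gpu_event_stream: event_stream_base<Event> {
    // ...and each backend subclass adds only its own parts, e.g. the
    // (ptr, length) device view, stubbed here as the length alone.
    std::size_t device_span_len_ = 0;

    friend void serialize(serializer& ser, const std::string& k, const gpu_event_stream& t) {
        serialize(ser, k, static_cast<const event_stream_base<Event>&>(t));  // one line for the base
        serialize(ser, k + "/span_len", t.device_span_len_);
    }
};

int main() {
    serializer ser;
    gpu_event_stream<int> s;
    serialize(ser, "stream", s);  // ADL picks the subclass hook
}
```

The tradeoff above is then visible directly: the base hook removes duplication of the shared members, but the backend-specific span types still need their own handling either way.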
yeah, I see your point. Let's leave it the way it is.
cscs-ci run daint-gpu
Still an error in the distributed GPU job.

JUWELS disagrees.
What is this?
Implement (partial!) dumping of simulation state through a serialization object that can be used to reset the simulation to an earlier state. The produced snapshot includes just enough information to perform that reset, not all bits.
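To make the intended cycle concrete, here is a hypothetical sketch, subject to the constraints listed below; the `serializer` type and the `serialize`/`resume` signatures are assumptions, not the PR's exact API:

```cpp
// Hypothetical checkpoint/resume flow; all types here are stand-ins
// for the corresponding Arbor objects.
struct context {};               // stand-in for arb::context
struct domain_decomposition {};  // stand-in for arb::domain_decomposition
struct serializer {};            // the serialization object holding the snapshot

struct simulation {
    simulation(const context&, const domain_decomposition&) {}
    void run(double /*t_end*/, double /*dt*/) {}
};

void serialize(serializer&, const simulation&) {}  // dump just enough state
void resume(const serializer&, simulation&) {}     // reset to that state

int main() {
    context ctx;
    domain_decomposition dec;

    simulation sim(ctx, dec);
    sim.run(100.0, 0.025);

    serializer snap;
    serialize(snap, sim);    // checkpoint at t = 100

    sim.run(200.0, 0.025);   // continue...

    // ...then roll back; same context and decomposition, so the reset is valid
    resume(snap, sim);
    sim.run(200.0, 0.025);   // deterministically re-run from the checkpoint
}
```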
For a successful resume, the following constraints apply:

- `context` and `domain_decomposition` must be identical to those used during `serialize`
- the `simulation` we call `resume` on must be …

Not included

…
I needed to bump the CUDA language level to 17 (up from 14). That requires CUDA 11+, released in 2020.
TODO

- At least partially check the constraints. It's largely impossible, see docs. Suggestions welcome, though.

Issues

Closes #560