
Checkpoint/Resume #2081

Merged

Conversation

@thorstenhater (Contributor) commented Jan 23, 2023

What is this?

Implement (partial!) dumping of simulation state through a serialization object that can be used to reset the simulation
to an earlier state. The produced snapshot includes just enough information to perform that reset,
not a bit-for-bit copy of the simulation.

For a successful resume, the following constraints apply (see the sketch after this list):

  • context and domain_decomposition must be identical to those used during serialize
  • the simulation we call resume on must be
    • valid
    • constructed using the same recipe
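A minimal sketch of the intended workflow. Note that take_snapshot, restore_snapshot, and the opaque string snapshot are placeholder names for illustration only, and the simulation constructor argument order shown is an assumption; the PR's actual entry points may differ.

#include <string>

#include <arbor/context.hpp>
#include <arbor/domain_decomposition.hpp>
#include <arbor/load_balance.hpp>
#include <arbor/recipe.hpp>
#include <arbor/simulation.hpp>

// Placeholders, not this PR's API: pretend the snapshot is an opaque string.
std::string take_snapshot(const arb::simulation&);
void restore_snapshot(arb::simulation&, const std::string&);

void checkpoint_resume_sketch(const arb::recipe& rec) {
    auto ctx = arb::make_context();                   // the same context must be reused on resume
    auto dec = arb::partition_load_balance(rec, ctx); // likewise the same domain decomposition

    arb::simulation sim(rec, ctx, dec);               // argument order assumed
    sim.run(100.0, 0.025);                            // advance to t = 100 ms

    auto snap = take_snapshot(sim);                   // placeholder: capture resumable state

    sim.run(200.0, 0.025);                            // keep going past the checkpoint

    // Resuming requires a valid simulation built from the same recipe,
    // context, and domain decomposition as at snapshot time.
    arb::simulation sim2(rec, ctx, dec);
    restore_snapshot(sim2, snap);                     // placeholder: reset to t = 100 ms
    sim2.run(200.0, 0.025);                           // continue from the checkpoint
}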

Not included

  • pointer data that forms internal references.
  • internal state of generators, samplers, and probes; these might need to change between snapshots.

⚠️ Breaking changes

I needed to bump the CUDA language standard to C++17 (up from C++14). This requires CUDA 11+, released in 2020.

TODO

  • GPU.
  • At least partially check the constraints above; this is largely impossible in general (see docs), but suggestions are welcome.
  • Python bindings.
  • Docs.

Issues

Closes #560

@brenthuisman added this to the v0.9 milestone Jul 4, 2023
@brenthuisman (Contributor) commented Aug 2, 2023

@boeschf double-checked CUDA 11 availability on Piz Daint: it is available!

@@ -84,6 +89,49 @@ class event_stream : public event_stream_base<Event, typename memory::device_vec
arb_assert(num_events == base::ev_data_.size());
}

friend void serialize(serializer& ser, const std::string& k, const event_stream<Event>& t) {
Contributor

Wouldn't it be less error-prone if the base class template event_stream_base had its own serialization hooks which could be called from here?

@thorstenhater (Contributor, Author) Aug 4, 2023

Hmm. It's a tradeoff. The current way lets me use the ENABLE macro in the multicore subclass
and only needs a custom serializer in the GPU case. Your proposal is slightly more code to write:

  1. ENABLE in base
  2. a custom one-liner in multicore
  3. a custom serializer in gpu, minus the base members, plus one line for serializing the base

The problem arises when writing out the event spans: the GPU one is a memory view, i.e. (ptr, length),
while the multicore one is a range, i.e. (ptr_beg, ptr_end). It seems prudent to merge the
representations first (see the sketch below)?
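For illustration, a sketch of the two span layouts under discussion, with hypothetical type names (gpu_view, host_range, span_repr are stand-ins, not Arbor types). Merging the representations would mean reducing both to a common (offset, count) form relative to the backing event data before writing anything out, since raw pointers cannot be serialized meaningfully.

#include <cstddef>

// Hypothetical stand-ins for the two representations discussed above.
struct gpu_view   { const char* data;  std::size_t size; }; // GPU: (ptr, length)
struct host_range { const char* begin; const char* end;  }; // multicore: (ptr_beg, ptr_end)

// A common, serializable form: offset and count into the backing event-data
// array instead of raw device or host pointers.
struct span_repr { std::size_t offset; std::size_t count; };

inline span_repr to_repr(const gpu_view& v, const char* base) {
    return {static_cast<std::size_t>(v.data - base), v.size};
}

inline span_repr to_repr(const host_range& r, const char* base) {
    return {static_cast<std::size_t>(r.begin - base),
            static_cast<std::size_t>(r.end - r.begin)};
}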

Contributor

yeah, I see your point. Let's leave it the way it is.

@boeschf (Contributor) commented Aug 7, 2023

cscs-ci run daint-gpu

2 similar comments

@brenthuisman (Contributor)

cscs-ci run daint-gpu

@brenthuisman (Contributor)

cscs-ci run daint-gpu

@brenthuisman (Contributor)

cscs-ci run daint-gpu

@brenthuisman (Contributor)

Still an error in the distributed GPU job:

gap_junctions: srun: error: nid05416: task 1: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=48211655.1
slurmstepd: error: *** STEP 48211655.1 ON nid05415 CANCELLED AT 2023-08-07T17:55:22 ***
gap_junctions: srun: error: nid05415: task 0: Exited with exit code 143

@thorstenhater (Contributor, Author)

JUWELS disagrees:

$ srun --account=XXXX --partition=XXXX bin/gap_junctions
gpu:      yes
threads:  1
mpi:      yes
ranks:    1

Using default parameters.
running simulation

30 spikes generated at rate of 3.33333 ms between spikes

---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                      0.016           1.654
model-run                       1.098           1.409
meter-total                     1.114           3.062


@thorstenhater merged commit c11dd37 into arbor-sim:master Aug 8, 2023
Labels: arbor IO (IO support in arborio), enhancement