
Checkpoint/Resume #2081

Merged

Conversation

@thorstenhater (Contributor) commented Jan 23, 2023

What is this?

Implement (partial!) dumping of simulation state through a serialization object that can be used to reset the simulation
to an earlier state. The produced snapshot includes just enough information to perform that reset,
not a bit-for-bit copy of the simulation.

For a successful resume, the following constraints apply (see the sketch after this list):

  • context and domain_decomposition must be identical to those used during serialize
  • the simulation we call resume on must be
    • valid
    • constructed using the same recipe
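A minimal sketch of the intended workflow. Note that take_snapshot, restore_snapshot, and the opaque string snapshot are placeholder names for illustration only, and the simulation constructor argument order shown is an assumption; the PR's actual entry points may differ.

#include <string>

#include <arbor/context.hpp>
#include <arbor/domain_decomposition.hpp>
#include <arbor/load_balance.hpp>
#include <arbor/recipe.hpp>
#include <arbor/simulation.hpp>

// Placeholders, not this PR's API: pretend the snapshot is an opaque string.
std::string take_snapshot(const arb::simulation&);
void restore_snapshot(arb::simulation&, const std::string&);

void checkpoint_resume_sketch(const arb::recipe& rec) {
    auto ctx = arb::make_context();                   // the same context must be reused on resume
    auto dec = arb::partition_load_balance(rec, ctx); // likewise the same domain decomposition

    arb::simulation sim(rec, ctx, dec);               // argument order assumed
    sim.run(100.0, 0.025);                            // advance to t = 100 ms

    auto snap = take_snapshot(sim);                   // placeholder: capture resumable state

    sim.run(200.0, 0.025);                            // keep going past the checkpoint

    // Resuming requires a valid simulation built from the same recipe,
    // context, and domain decomposition as at snapshot time.
    arb::simulation sim2(rec, ctx, dec);
    restore_snapshot(sim2, snap);                     // placeholder: reset to t = 100 ms
    sim2.run(200.0, 0.025);                           // continue from the checkpoint
}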

Not included

  • pointer data that forms internal references.
  • internal state of generators, samplers, and probes; these might need to change between snapshots.

⚠️ Breaking changes

I needed to bump the CUDA language standard to C++17 (up from C++14). This requires CUDA 11+, released in 2020.

TODO

  • GPU.
  • At least partially check the constraints above; this is largely impossible in general (see docs), but suggestions are welcome.
  • Python bindings.
  • Docs.

Issues

Closes #560

@brenthuisman added this to the v0.9 milestone Jul 4, 2023
@brenthuisman (Contributor) commented Aug 2, 2023

@boeschf double-checked CUDA 11 availability on Piz Daint: it is available!

@@ -84,6 +89,49 @@ class event_stream : public event_stream_base<Event, typename memory::device_vec
arb_assert(num_events == base::ev_data_.size());
}

friend void serialize(serializer& ser, const std::string& k, const event_stream<Event>& t) {
Contributor

Wouldn't it be less error-prone if the base class template event_stream_base had its own serialization hooks which could be called from here?

@thorstenhater (Contributor, Author) Aug 4, 2023

Hmm. It's a tradeoff. The current way lets me use the ENABLE macro in the multicore subclass
and only needs a custom serializer in the GPU case. Your proposal is slightly more code to write:

  1. ENABLE in base
  2. a custom one-liner in multicore
  3. a custom serializer in gpu, minus the base members, plus one line for serializing the base

The problem arises when writing out the event spans: the GPU one is a memory view, i.e. (ptr, length),
while the multicore one is a range, i.e. (ptr_beg, ptr_end). It seems prudent to merge the
representations first (see the sketch below)?
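For illustration, a sketch of the two span layouts under discussion, with hypothetical type names (gpu_view, host_range, span_repr are stand-ins, not Arbor types). Merging the representations would mean reducing both to a common (offset, count) form relative to the backing event data before writing anything out, since raw pointers cannot be serialized meaningfully.

#include <cstddef>

// Hypothetical stand-ins for the two representations discussed above.
struct gpu_view   { const char* data;  std::size_t size; }; // GPU: (ptr, length)
struct host_range { const char* begin; const char* end;  }; // multicore: (ptr_beg, ptr_end)

// A common, serializable form: offset and count into the backing event-data
// array instead of raw device or host pointers.
struct span_repr { std::size_t offset; std::size_t count; };

inline span_repr to_repr(const gpu_view& v, const char* base) {
    return {static_cast<std::size_t>(v.data - base), v.size};
}

inline span_repr to_repr(const host_range& r, const char* base) {
    return {static_cast<std::size_t>(r.begin - base),
            static_cast<std::size_t>(r.end - r.begin)};
}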

Contributor

yeah, I see your point. Let's leave it the way it is.

@boeschf (Contributor) commented Aug 7, 2023

cscs-ci run daint-gpu

2 similar comments

@brenthuisman (Contributor)

cscs-ci run daint-gpu

@brenthuisman (Contributor)

cscs-ci run daint-gpu

@brenthuisman (Contributor)

cscs-ci run daint-gpu

@brenthuisman (Contributor)

Still an error in the distributed GPU job:

gap_junctions: srun: error: nid05416: task 1: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=48211655.1
slurmstepd: error: *** STEP 48211655.1 ON nid05415 CANCELLED AT 2023-08-07T17:55:22 ***
gap_junctions: srun: error: nid05415: task 0: Exited with exit code 143

@thorstenhater (Contributor, Author)

JUWELS disagrees:

$ srun --account=XXXX --partition=XXXX bin/gap_junctions
gpu:      yes
threads:  1
mpi:      yes
ranks:    1

Using default parameters.
running simulation

30 spikes generated at rate of 3.33333 ms between spikes

---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                      0.016           1.654
model-run                       1.098           1.409
meter-total                     1.114           3.062


@thorstenhater merged commit c11dd37 into arbor-sim:master Aug 8, 2023
Labels: arbor IO (IO support in arborio), enhancement