SEAS checkpointing of PETSc TS object #59

hpc4geo · 2023-10-05T21:44:58Z

Summary of changes

Correctly storing RK stages if the RK scheme has the FSAL property;
Better control flow to enable checkpoint and restarts using a single call to TSSolve() and checkpointing via TSSetPostSolve();
Different checkpoint storage modes are possible (e.g. store all, store some, store none);
A restart can be staged using any checkpoint data, not just the last checkpoint file written;
The checkpoint data written is placed in its own directory.
TOML variables ts_checkpoint_prefix and ts_checkpoint_frequency currently have no effect.
All control of checkpointing is defined via PETSc options.
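
As background on the first item above, a short reminder of what FSAL ("first same as last") means may help. This is a generic statement about explicit Runge-Kutta schemes, not a description of the code in this branch. For an s-stage scheme with the FSAL property, the last stage of an accepted step n coincides with the first stage of step n+1:

```latex
k_s^{(n)} = f\left(t_n + h_n,\; y_{n+1}\right) = k_1^{(n+1)}
```

An integrator that exploits this reuses the stored stage instead of re-evaluating the right-hand side, so a restart that only has the solution y_{n+1} available starts the next step with different internal state than an uninterrupted run would. Presumably this is why the stage vectors need to be part of the checkpoint.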

Summary of new options to control checkpointing

-ts_checkpoint_load <str> defines the directory from which a checkpoint is loaded.
-ts_checkpoint_path <str> defines the root location where checkpoint files will be written.
-ts_checkpoint_freq_step <int> defines the time step frequency between checkpoints.
-ts_checkpoint_freq_cputime <float> defines the CPU time (minutes) frequency between checkpoints.
-ts_checkpoint_freq_physical_time <float> defines the physical time frequency between checkpoints.
-ts_checkpoint_storage_type <none, limited, unlimited> defines the type of storage for checkpoints. limited will store a finite number of unique checkpoints on disk. unlimited stores all checkpoints. Use none to completely deactivate checkpointing.
-ts_checkpoint_storage_limited_size <int> defines how many unique checkpoints you want to store on disk. Only has effect with -ts_checkpoint_storage_type limited.
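
The three storage modes can be pictured with the small sketch below. It is only an illustration under assumptions: the names StorageType and CheckpointStore are hypothetical and are not taken from this branch, and the sketch assumes that limited mode discards the oldest checkpoint first.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical names; this is not the implementation in this branch.
enum class StorageType { None, Limited, Unlimited };

class CheckpointStore {
public:
    CheckpointStore(StorageType type, std::size_t limit) : type_(type), limit_(limit) {}

    // Registers a newly written checkpoint directory and returns the
    // directories that should now be removed from disk.
    std::vector<std::string> add(const std::string& dir) {
        std::vector<std::string> to_delete;
        if (type_ == StorageType::None) {
            return to_delete; // checkpointing is off, nothing is tracked
        }
        kept_.push_back(dir);
        if (type_ == StorageType::Limited) {
            // Assumption: the oldest checkpoints are discarded first.
            while (kept_.size() > limit_) {
                to_delete.push_back(kept_.front());
                kept_.pop_front();
            }
        }
        return to_delete;
    }

    // Number of checkpoints currently retained.
    std::size_t size() const { return kept_.size(); }

private:
    StorageType type_;
    std::size_t limit_;
    std::deque<std::string> kept_;
};
```

With StorageType::Limited and a limit of 2, adding step10, step20 and step30 in that order would mark step10 for removal, while StorageType::Unlimited never marks anything for removal.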

Notes

Values can be provided for all of the -ts_checkpoint_freq_xxx options at once. A checkpoint is written whenever any one of them is satisfied.
Checkpointing is always on by default. The default storage mode is limited and the number of unique checkpoints stored is 1 (i.e. -ts_checkpoint_storage_limited_size 1).
Each checkpoint directory contains a file checkpoint_info.csv which contains time, dt, step and some other info.
When a checkpoint is written, a file load_last_checkpoint.opts is also written in the path specified by -ts_checkpoint_path. This file can be used to restart from the last written checkpoint. For example, first run ./app/tandem your_toml_options --petsc your_petsc_options and then restart with ./app/tandem your_toml_options --petsc your_petsc_options -options_file PATH_TO_CHECKPOINT_DATA/load_last_checkpoint.opts
Testing was successful with PETSc v3.13 and v3.16 using either TSRK or TSEULER.
Do not change the TS method between a checkpoint and a restart. This functionality is not supported and it is unlikely to work. Do this at your own peril.
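
The "any trigger fires" rule from the notes above can be sketched as follows. This is a minimal illustration, not the code in this branch: the class CheckpointTriggers and its members are hypothetical, the step trigger is assumed to fire when the step number is a multiple of the frequency, and the two interval triggers are assumed to each keep their own reference point.

```cpp
#include <string>

// Hypothetical sketch of combining the three -ts_checkpoint_freq_* triggers.
class CheckpointTriggers {
public:
    CheckpointTriggers(long freq_step, double freq_cpu_min, double freq_phys)
        : freq_step_(freq_step), freq_cpu_min_(freq_cpu_min), freq_phys_(freq_phys) {}

    // Returns the name of the first satisfied trigger, or "" if none fired.
    // cpu_minutes is the elapsed run time in minutes, time is the physical time.
    std::string check(long step, double cpu_minutes, double time) {
        std::string fired;
        // Assumption: the step trigger fires on multiples of the frequency.
        if (freq_step_ > 0 && step > 0 && step % freq_step_ == 0) {
            fired = "step";
        }
        // Assumption: each interval trigger resets its own reference point.
        if (cpu_minutes - last_cpu_min_ >= freq_cpu_min_) {
            last_cpu_min_ = cpu_minutes;
            if (fired.empty()) fired = "cputime";
        }
        if (time - last_phys_ >= freq_phys_) {
            last_phys_ = time;
            if (fired.empty()) fired = "physical_time";
        }
        return fired;
    }

private:
    long freq_step_;
    double freq_cpu_min_;
    double freq_phys_;
    double last_cpu_min_ = 0.0;
    double last_phys_ = 0.0;
};
```

A caller would invoke check() once per completed time step and write a checkpoint whenever the returned string is non-empty.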

Testing

./app/tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400  -ts_monitor -ts_checkpoint_freq_step 10 -pc_type lu -ts_max_steps 50 -ts_checkpoint_path ckp_ex1

               ___          ___         _____         ___          ___
      ___     /  /\        /__/\       /  /::\       /  /\        /__/\
     /  /\   /  /::\       \  \:\     /  /:/\:\     /  /:/_      |  |::\
    /  /:/  /  /:/\:\       \  \:\   /  /:/  \:\   /  /:/ /\     |  |:|:\
   /  /:/  /  /:/~/::\  _____\__\:\ /__/:/ \__\:| /  /:/ /:/_  __|__|:|\:\
  /  /::\ /__/:/ /:/\:\/__/::::::::\\  \:\ /  /://__/:/ /:/ /\/__/::::| \:\
 /__/:/\:\\  \:\/:/__\/\  \:\~~\~~\/ \  \:\  /:/ \  \:\/:/ /:/\  \:\~~\__\/
 \__\/  \:\\  \::/      \  \:\  ~~~   \  \:\/:/   \  \::/ /:/  \  \:\
      \  \:\\  \:\       \  \:\        \  \::/     \  \:\/:/    \  \:\
       \__\/ \  \:\       \  \:\        \__\/       \  \::/      \  \:\
              \__\/        \__\/                     \__\/        \__\/

                          tandem version 0801af7

                       stack size limit = unlimited

                              Worker affinity
       0123456789|0123456789|0123456789|0123456789|0123456789|012345


TS -ts_checkpoint_path ckp_ex1
TS -ts_checkpoint_freq_step 10
TS -ts_checkpoint_freq_cputime 3.0000e+01
TS -ts_checkpoint_freq_physical_time 1.0000e+10
[ckp_ex1] directory created
DOFs (domain): 2796
DOFs (fault): 240
Mesh size: 200
0 TS dt 0.1 time 0.
1 TS dt 1. time 0.1
2 TS dt 10. time 1.1
3 TS dt 100. time 11.1
4 TS dt 1000. time 111.1
5 TS dt 3496.05 time 1111.1
6 TS dt 4357.85 time 4607.15
7 TS dt 5960.65 time 8965.
8 TS dt 8033.43 time 14925.7
9 TS dt 10851.1 time 22959.1
[TSCheckpoint] Triggered by: "step"
[ckp_ex1] directory exists
[ckp_ex1/step10] directory created
[TSCheckpoint]   writing checkpoint data for step 10 -> path ckp_ex1/step10
10 TS dt 14652.3 time 33810.2
11 TS dt 19786.2 time 48462.6
12 TS dt 26718.9 time 68248.8
13 TS dt 36080.7 time 94967.7
14 TS dt 48722.9 time 131048.
15 TS dt 65795.1 time 179771.
16 TS dt 88849.4 time 245566.
17 TS dt 119982. time 334416.
18 TS dt 162024. time 454398.
19 TS dt 218798. time 616422.
[TSCheckpoint] Triggered by: "step"
[ckp_ex1] directory exists
[ckp_ex1/step20] directory created
[TSCheckpoint]   writing checkpoint data for step 20 -> path ckp_ex1/step20
20 TS dt 295466. time 835219.
21 TS dt 398999. time 1.13069e+06
22 TS dt 538812. time 1.52968e+06
23 TS dt 727617. time 2.0685e+06
24 TS dt 982581. time 2.79611e+06
25 TS dt 1.32689e+06 time 3.77869e+06
26 TS dt 1.79185e+06 time 5.10558e+06
27 TS dt 2.41974e+06 time 6.89743e+06
28 TS dt 3.26765e+06 time 9.31717e+06
29 TS dt 4.41268e+06 time 1.25848e+07
[TSCheckpoint] Triggered by: "step"
[ckp_ex1] directory exists
[ckp_ex1/step30] directory created
[TSCheckpoint]   writing checkpoint data for step 30 -> path ckp_ex1/step30
30 TS dt 5.95895e+06 time 1.69975e+07
31 TS dt 7.98805e+06 time 2.29564e+07
32 TS dt 9.28044e+06 time 3.09445e+07
33 TS dt 1.13786e+07 time 4.02249e+07
34 TS dt 1.24453e+07 time 5.16035e+07
35 TS dt 1.44212e+07 time 6.40488e+07
36 TS dt 1.69361e+07 time 7.847e+07
37 TS dt 1.99241e+07 time 9.54061e+07
38 TS dt 2.3332e+07 time 1.1533e+08
39 TS dt 2.68773e+07 time 1.38662e+08
[TSCheckpoint] Triggered by: "step"
[ckp_ex1] directory exists
[ckp_ex1/step40] directory created
[TSCheckpoint]   writing checkpoint data for step 40 -> path ckp_ex1/step40
40 TS dt 3.11191e+07 time 1.6554e+08
41 TS dt 3.33839e+07 time 1.96659e+08
42 TS dt 3.66182e+07 time 2.30043e+08
43 TS dt 4.14027e+07 time 2.66661e+08
44 TS dt 3.01661e+07 time 2.95497e+08
45 TS dt 2.9463e+07 time 3.25663e+08
46 TS dt 3.17658e+07 time 3.55126e+08
47 TS dt 2.64165e+07 time 3.80683e+08
48 TS dt 3.42318e+07 time 4.07099e+08
49 TS dt 3.33315e+07 time 4.41331e+08
[TSCheckpoint] Triggered by: "step"
[ckp_ex1] directory exists
[ckp_ex1/step50] directory created
[TSCheckpoint]   writing checkpoint data for step 50 -> path ckp_ex1/step50
50 TS dt 2.67389e+07 time 4.66688e+08

./app/tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400  -ts_monitor -ts_checkpoint_freq_step 4 -pc_type lu -ts_max_steps 50 -ts_checkpoint_load ckp_ex1/step20

               ___          ___         _____         ___          ___
      ___     /  /\        /__/\       /  /::\       /  /\        /__/\
     /  /\   /  /::\       \  \:\     /  /:/\:\     /  /:/_      |  |::\
    /  /:/  /  /:/\:\       \  \:\   /  /:/  \:\   /  /:/ /\     |  |:|:\
   /  /:/  /  /:/~/::\  _____\__\:\ /__/:/ \__\:| /  /:/ /:/_  __|__|:|\:\
  /  /::\ /__/:/ /:/\:\/__/::::::::\\  \:\ /  /://__/:/ /:/ /\/__/::::| \:\
 /__/:/\:\\  \:\/:/__\/\  \:\~~\~~\/ \  \:\  /:/ \  \:\/:/ /:/\  \:\~~\__\/
 \__\/  \:\\  \::/      \  \:\  ~~~   \  \:\/:/   \  \::/ /:/  \  \:\
      \  \:\\  \:\       \  \:\        \  \::/     \  \:\/:/    \  \:\
       \__\/ \  \:\       \  \:\        \__\/       \  \::/      \  \:\
              \__\/        \__\/                     \__\/        \__\/

                          tandem version 0801af7

                       stack size limit = unlimited

                              Worker affinity
       0123456789|0123456789|0123456789|0123456789|0123456789|012345


TS -ts_checkpoint_path checkpoint
TS -ts_checkpoint_freq_step 4
TS -ts_checkpoint_freq_cputime 3.0000e+01
TS -ts_checkpoint_freq_physical_time 1.0000e+10
[checkpoint] directory exists
DOFs (domain): 2796
DOFs (fault): 240
Mesh size: 200
TS -ts_checkpoint_load ckp_ex1/step20
[TSCheckpoint]   loading checkpoint data <- path ckp_ex1/step20
20 TS dt 295466. time 835219.
21 TS dt 398999. time 1.13069e+06
22 TS dt 538812. time 1.52968e+06
23 TS dt 727617. time 2.0685e+06
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step24] directory created
[TSCheckpoint]   writing checkpoint data for step 24 -> path checkpoint/step24
24 TS dt 982581. time 2.79611e+06
25 TS dt 1.32689e+06 time 3.77869e+06
26 TS dt 1.79185e+06 time 5.10558e+06
27 TS dt 2.41974e+06 time 6.89743e+06
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step28] directory created
[TSCheckpoint]   writing checkpoint data for step 28 -> path checkpoint/step28
28 TS dt 3.26765e+06 time 9.31717e+06
29 TS dt 4.41268e+06 time 1.25848e+07
30 TS dt 5.95895e+06 time 1.69975e+07
31 TS dt 7.98805e+06 time 2.29564e+07
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step32] directory created
[TSCheckpoint]   writing checkpoint data for step 32 -> path checkpoint/step32
32 TS dt 9.28044e+06 time 3.09445e+07
33 TS dt 1.13786e+07 time 4.02249e+07
34 TS dt 1.24453e+07 time 5.16035e+07
35 TS dt 1.44212e+07 time 6.40488e+07
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step36] directory created
[TSCheckpoint]   writing checkpoint data for step 36 -> path checkpoint/step36
36 TS dt 1.69361e+07 time 7.847e+07
37 TS dt 1.99241e+07 time 9.54061e+07
38 TS dt 2.3332e+07 time 1.1533e+08
39 TS dt 2.68773e+07 time 1.38662e+08
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step40] directory created
[TSCheckpoint]   writing checkpoint data for step 40 -> path checkpoint/step40
40 TS dt 3.11191e+07 time 1.6554e+08
41 TS dt 3.33839e+07 time 1.96659e+08
42 TS dt 3.66182e+07 time 2.30043e+08
43 TS dt 4.14027e+07 time 2.66661e+08
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step44] directory created
[TSCheckpoint]   writing checkpoint data for step 44 -> path checkpoint/step44
44 TS dt 3.01661e+07 time 2.95497e+08
45 TS dt 2.9463e+07 time 3.25663e+08
46 TS dt 3.17658e+07 time 3.55126e+08
47 TS dt 2.64165e+07 time 3.80683e+08
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step48] directory created
[TSCheckpoint]   writing checkpoint data for step 48 -> path checkpoint/step48
48 TS dt 3.42318e+07 time 4.07099e+08
49 TS dt 3.33315e+07 time 4.41331e+08
50 TS dt 2.67389e+07 time 4.66688e+08

…ommit will checkpoint every time step

* Add filename as arg to load(), view()
* Removed unused code
* Error check for VecLoad_Nest

…formed.

…erving as much state as possible when performing a restart.

…grator.
- New tool parameters are
[1] ts_checkpoint_prefix = <str> The value of <str> should be something like "/path/to/dir/will/tscheckpoints/will/be/written". The directory will be created (recursively if required) if it does not exist. This parameter is optional. Checkpointing will ONLY occur if this parameter is defined.
[2] ts_checkpoint_frequency = <integer> The value of <integer> indicates the number of time steps which will occur between checkpoints. If ts_checkpoint_prefix is not specified, this parameter has no effect on the SEAS simulation.

…ing the stage vectors for RK. This is vital for consistent restarts if the RK scheme has the FSAL property. TOML options related to checkpointing are currently unused. All checkpoint control is defined via PETSc args. These options control when a checkpoint is written -ts_checkpoint_freq_step, -ts_checkpoint_freq_cputime, -ts_checkpoint_freq_physical_time. -ts_checkpoint_path defines the directory where all checkpoint files will be written. -ts_checkpoint_load indicates the directory of checkpoint data you want to use to start a new simulation. Also note that all checkpointed data is stored.

hpc4geo · 2023-10-05T21:51:59Z

The example provided above used the attached mesh, along with the standard, unmodified files bp1_sym.toml and bp1.lua from examples/tandem/2d/.
bp1_sym.msh.zip

hpc4geo · 2023-10-05T21:55:13Z

Another example of a restart with two triggers is provided below.

./app/tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400  -ts_monitor -ts_checkpoint_freq_step 40 -ts_checkpoint_freq_cputime 0.1 -pc_type lu -ts_max_steps 500 -ts_checkpoint_load ckp_ex1/step20

               ___          ___         _____         ___          ___
      ___     /  /\        /__/\       /  /::\       /  /\        /__/\
     /  /\   /  /::\       \  \:\     /  /:/\:\     /  /:/_      |  |::\
    /  /:/  /  /:/\:\       \  \:\   /  /:/  \:\   /  /:/ /\     |  |:|:\
   /  /:/  /  /:/~/::\  _____\__\:\ /__/:/ \__\:| /  /:/ /:/_  __|__|:|\:\
  /  /::\ /__/:/ /:/\:\/__/::::::::\\  \:\ /  /://__/:/ /:/ /\/__/::::| \:\
 /__/:/\:\\  \:\/:/__\/\  \:\~~\~~\/ \  \:\  /:/ \  \:\/:/ /:/\  \:\~~\__\/
 \__\/  \:\\  \::/      \  \:\  ~~~   \  \:\/:/   \  \::/ /:/  \  \:\
      \  \:\\  \:\       \  \:\        \  \::/     \  \:\/:/    \  \:\
       \__\/ \  \:\       \  \:\        \__\/       \  \::/      \  \:\
              \__\/        \__\/                     \__\/        \__\/

                          tandem version 0801af7

                       stack size limit = unlimited

                              Worker affinity
       0123456789|0123456789|0123456789|0123456789|0123456789|012345


TS -ts_checkpoint_path checkpoint
TS -ts_checkpoint_freq_step 40
TS -ts_checkpoint_freq_cputime 1.0000e-01
TS -ts_checkpoint_freq_physical_time 1.0000e+10
[checkpoint] directory exists
DOFs (domain): 2796
DOFs (fault): 240
Mesh size: 200
TS -ts_checkpoint_load ckp_ex1/step20
[TSCheckpoint]   loading checkpoint data <- path ckp_ex1/step20
20 TS dt 295466. time 835219.
21 TS dt 398999. time 1.13069e+06
22 TS dt 538812. time 1.52968e+06
23 TS dt 727617. time 2.0685e+06
24 TS dt 982581. time 2.79611e+06
25 TS dt 1.32689e+06 time 3.77869e+06
26 TS dt 1.79185e+06 time 5.10558e+06
27 TS dt 2.41974e+06 time 6.89743e+06
28 TS dt 3.26765e+06 time 9.31717e+06
29 TS dt 4.41268e+06 time 1.25848e+07
30 TS dt 5.95895e+06 time 1.69975e+07
31 TS dt 7.98805e+06 time 2.29564e+07
32 TS dt 9.28044e+06 time 3.09445e+07
33 TS dt 1.13786e+07 time 4.02249e+07
34 TS dt 1.24453e+07 time 5.16035e+07
35 TS dt 1.44212e+07 time 6.40488e+07
36 TS dt 1.69361e+07 time 7.847e+07
37 TS dt 1.99241e+07 time 9.54061e+07
38 TS dt 2.3332e+07 time 1.1533e+08
39 TS dt 2.68773e+07 time 1.38662e+08
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step40] directory exists
[TSCheckpoint]   writing checkpoint data for step 40 -> path checkpoint/step40
40 TS dt 3.11191e+07 time 1.6554e+08
41 TS dt 3.33839e+07 time 1.96659e+08
42 TS dt 3.66182e+07 time 2.30043e+08
43 TS dt 4.14027e+07 time 2.66661e+08
44 TS dt 3.01661e+07 time 2.95497e+08
45 TS dt 2.9463e+07 time 3.25663e+08
46 TS dt 3.17658e+07 time 3.55126e+08
47 TS dt 2.64165e+07 time 3.80683e+08
48 TS dt 3.42318e+07 time 4.07099e+08
49 TS dt 3.33315e+07 time 4.41331e+08
50 TS dt 2.67389e+07 time 4.66688e+08
51 TS dt 3.57412e+07 time 4.93427e+08
52 TS dt 3.14185e+07 time 5.24069e+08
53 TS dt 3.13646e+07 time 5.55488e+08
54 TS dt 2.92616e+07 time 5.86852e+08
55 TS dt 2.45094e+07 time 6.10546e+08
56 TS dt 3.50312e+07 time 6.35056e+08
57 TS dt 3.32866e+07 time 6.70087e+08
58 TS dt 2.88053e+07 time 6.97524e+08
59 TS dt 3.24711e+07 time 7.2633e+08
60 TS dt 2.87486e+07 time 7.54359e+08
61 TS dt 3.51527e+07 time 7.83107e+08
62 TS dt 3.13853e+07 time 8.14036e+08
63 TS dt 2.87462e+07 time 8.42122e+08
64 TS dt 3.516e+07 time 8.70868e+08
65 TS dt 2.98683e+07 time 8.99916e+08
66 TS dt 3.13585e+07 time 9.29784e+08
67 TS dt 2.95721e+07 time 9.61143e+08
68 TS dt 2.6957e+07 time 9.87609e+08
69 TS dt 3.28089e+07 time 1.01457e+09
70 TS dt 3.30834e+07 time 1.04737e+09
71 TS dt 2.9346e+07 time 1.07572e+09
72 TS dt 3.54434e+07 time 1.10507e+09
73 TS dt 2.94051e+07 time 1.13351e+09
74 TS dt 3.41035e+07 time 1.16291e+09
75 TS dt 3.34337e+07 time 1.19702e+09
76 TS dt 2.57716e+07 time 1.22142e+09
77 TS dt 3.63741e+07 time 1.2472e+09
78 TS dt 3.37553e+07 time 1.28357e+09
79 TS dt 2.8562e+07 time 1.31059e+09
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step80] directory exists
[TSCheckpoint]   writing checkpoint data for step 80 -> path checkpoint/step80
80 TS dt 3.37938e+07 time 1.33915e+09
81 TS dt 2.9868e+07 time 1.36825e+09
82 TS dt 3.54748e+07 time 1.39812e+09
83 TS dt 3.13201e+07 time 1.42908e+09
84 TS dt 2.85277e+07 time 1.45695e+09
85 TS dt 3.55476e+07 time 1.48547e+09
86 TS dt 3.09389e+07 time 1.51572e+09
87 TS dt 2.68293e+07 time 1.54208e+09
88 TS dt 2.43823e+07 time 1.56891e+09
89 TS dt 2.37081e+07 time 1.59329e+09
90 TS dt 2.39819e+07 time 1.617e+09
91 TS dt 2.75741e+07 time 1.64098e+09
92 TS dt 2.81586e+07 time 1.66855e+09
93 TS dt 3.0927e+07 time 1.69671e+09
94 TS dt 3.41968e+07 time 1.72764e+09
95 TS dt 3.13265e+07 time 1.76184e+09
96 TS dt 3.43477e+07 time 1.79316e+09
97 TS dt 4.11846e+07 time 1.82751e+09
98 TS dt 4.6323e+07 time 1.86869e+09
99 TS dt 2.98454e+07 time 1.89624e+09
100 TS dt 3.53535e+07 time 1.92608e+09
101 TS dt 3.53026e+07 time 1.96144e+09
102 TS dt 2.56873e+07 time 1.98551e+09
103 TS dt 3.66231e+07 time 2.0112e+09
104 TS dt 3.43716e+07 time 2.04782e+09
105 TS dt 2.86974e+07 time 2.0749e+09
106 TS dt 3.48469e+07 time 2.10359e+09
107 TS dt 3.01841e+07 time 2.13289e+09
108 TS dt 3.5707e+07 time 2.16308e+09
109 TS dt 3.16369e+07 time 2.19441e+09
110 TS dt 2.81204e+07 time 2.22175e+09
111 TS dt 3.58772e+07 time 2.24987e+09
112 TS dt 3.18032e+07 time 2.28107e+09
113 TS dt 3.08361e+07 time 2.31287e+09
114 TS dt 3.26638e+07 time 2.34371e+09
115 TS dt 2.60927e+07 time 2.36879e+09
116 TS dt 3.61226e+07 time 2.39488e+09
117 TS dt 3.51281e+07 time 2.43101e+09
118 TS dt 2.78301e+07 time 2.45717e+09
119 TS dt 3.12168e+07 time 2.485e+09
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step120] directory exists
[TSCheckpoint]   writing checkpoint data for step 120 -> path checkpoint/step120
120 TS dt 2.40731e+07 time 2.50543e+09
121 TS dt 1.67271e+07 time 2.51829e+09
122 TS dt 1.17115e+07 time 2.52816e+09
123 TS dt 8.8341e+06 time 2.53626e+09
124 TS dt 7.60614e+06 time 2.54357e+09
125 TS dt 6.03803e+06 time 2.5493e+09
126 TS dt 4.46701e+06 time 2.55368e+09
127 TS dt 3.84921e+06 time 2.55751e+09
128 TS dt 3.78722e+06 time 2.56135e+09
129 TS dt 3.95784e+06 time 2.56514e+09
130 TS dt 3.97125e+06 time 2.5691e+09
131 TS dt 4.38699e+06 time 2.57307e+09
132 TS dt 4.38842e+06 time 2.57746e+09
133 TS dt 5.03443e+06 time 2.58185e+09
134 TS dt 6.04587e+06 time 2.58688e+09
135 TS dt 6.99945e+06 time 2.59293e+09
136 TS dt 7.99595e+06 time 2.59993e+09
137 TS dt 7.97008e+06 time 2.60792e+09
138 TS dt 8.76227e+06 time 2.61589e+09
139 TS dt 1.02309e+07 time 2.62465e+09
140 TS dt 1.11905e+07 time 2.63489e+09
141 TS dt 1.10837e+07 time 2.64608e+09
142 TS dt 1.11844e+07 time 2.65716e+09
143 TS dt 1.24482e+07 time 2.66834e+09
144 TS dt 1.49095e+07 time 2.68079e+09
145 TS dt 1.71897e+07 time 2.6957e+09
146 TS dt 1.96651e+07 time 2.71289e+09
147 TS dt 2.11836e+07 time 2.73256e+09
148 TS dt 2.35066e+07 time 2.75374e+09
149 TS dt 2.45546e+07 time 2.77725e+09
150 TS dt 2.73177e+07 time 2.8018e+09
151 TS dt 3.31472e+07 time 2.82912e+09
152 TS dt 3.6281e+07 time 2.86227e+09
153 TS dt 4.17783e+07 time 2.89855e+09
154 TS dt 4.49488e+07 time 2.94033e+09
155 TS dt 4.95264e+07 time 2.98527e+09
[TSCheckpoint] Triggered by: "cputime"
[checkpoint] directory exists
[checkpoint/step156] directory exists
[TSCheckpoint]   writing checkpoint data for step 156 -> path checkpoint/step156
156 TS dt 3.30105e+07 time 3.01379e+09
157 TS dt 3.84009e+07 time 3.04681e+09
158 TS dt 2.8359e+07 time 3.07353e+09
159 TS dt 3.70198e+07 time 3.10189e+09
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step160] directory exists
[TSCheckpoint]   writing checkpoint data for step 160 -> path checkpoint/step160
160 TS dt 3.42e+07 time 3.13891e+09
161 TS dt 2.57749e+07 time 3.16301e+09
162 TS dt 3.69121e+07 time 3.18879e+09
163 TS dt 3.37955e+07 time 3.22199e+09

As we can see, step156 was created by our cputime frequency whilst step160 was generated by the time step frequency.

JeenaYun · 2023-10-07T00:27:29Z

Could you tell me what you mean by 'mesh attached'? I don't see any attachment. Did you just create the mesh file using gmsh -2 bp1_sym.geo?

hpc4geo · 2023-10-07T00:31:59Z

Could you tell me what you mean by 'mesh attached'? I don't see any attachment. Did you just create the mesh file using gmsh -2 bp1_sym.geo?

There was supposed to be a file attached to my message. It failed to upload. It is there now as a zip file.

JeenaYun · 2023-10-09T19:19:55Z

Hi, I confirmed that the examples you provided work well. I have a question about the default values. It seems the code uses sufficiently large default values when I don't provide the -ts_checkpoint_freq_xxx options, so that they are not triggered. I am concerned about the default -ts_checkpoint_freq_cputime being 30 (seconds, I assume). Would it be enough even for very large/long simulations?
./app/tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400 -ts_monitor -pc_type lu
returns

TS -ts_checkpoint_freq_step 1000
TS -ts_checkpoint_freq_cputime 3.0000e+01
TS -ts_checkpoint_freq_physical_time 1.0000e+10

hpc4geo · 2023-10-09T19:27:17Z

-ts_checkpoint_freq_cputime Defines the CPU time in minutes between checkpoints. So a default value of 30 means a checkpoint will be written out every 30 minutes the tandem executable is running.

I actually don't understand your question. Can you please re-phrase it?

JeenaYun · 2023-10-09T19:34:38Z

Sorry, I was thinking of a case where someone does not want checkpointing to be triggered at all since we are trying to merge into the main. Even if I don't provide any checkpoint paths or -ts_checkpoint_freq_xxx, seems like the code is setting -ts_checkpoint_freq_cputime to 30 minutes. I wonder if a user is running for several hours (which is usually the case), then the code might automatically write out some checkpoints even though the user did not ask it to do so. Is it a safety feature that you intentionally put in?

hpc4geo · 2023-10-09T20:17:55Z

Okay - I understand your concern now. I can confirm that the default behavior of tandem would change if this branch is merged - specifically tandem would always write out checkpoint files even if they weren't requested. There are pros and cons to every default. With the current defaults in this branch we have

Cons

A directory called checkpoint will always be created.
Checkpoint files will be written if the job runs for longer than 30 minutes, and/or takes more than 1000 time steps, and/or advances the model by more than 1e10 seconds of physical time. The downside is that, unless the directory is set explicitly, these files may end up in your home directory rather than somewhere with fast IO (scratch) and a more generous storage quota (I am thinking of a large run on a cluster).

Pros

Users don't have to think about turning on something which is pretty useful.

We can certainly change the current defaults. I'd like group consensus before twiddling with anything. This could be a topic for the next tandem meeting. @AliceGabriel maybe you want to add this to the agenda and/or weigh in with an opinion.

JeenaYun · 2023-10-10T05:23:35Z

Hi, could you check whether outputs (e.g., fault probe output) are properly generated when loading a checkpoint? For example, run
tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400 -ts_monitor -ts_checkpoint_freq_step 10 -pc_type lu -ts_max_steps 50 -ts_checkpoint_path ckp_ex1
Remove all the output files when it's done, then run
tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400 -ts_monitor -ts_checkpoint_freq_step 4 -pc_type lu -ts_max_steps 50 -ts_checkpoint_load ckp_ex1/step20.
I expect the second line to generate output files that start from the time of the ckp_ex1/step20 file (835219 in this case), but I find that no file is created.

NicoSchlw · 2023-10-10T08:20:21Z

Hi, the checkpoint files can get quite large for 3D simulations. The Green's function checkpoint files of BP5 need 65 GB, this can easily exceed the disk quota of the home directory. I liked the initial approach, where you had to provide a path to enable checkpointing.

hpc4geo · 2023-10-10T17:43:14Z

Hi, the checkpoint files can get quite large for 3D simulations. The Green's function checkpoint files of BP5 need 65 GB, this can easily exceed the disk quota of the home directory. I liked the initial approach, where you had to provide a path to enable checkpointing.

This branch is concerned with checkpointing the time integrator, not the Green's functions. As such the data being written per checkpoint file is just the state variable defined on the fault.

hpc4geo · 2023-10-10T17:54:18Z

Hi, could you check whether outputs (e.g., fault probe output) are properly generated when loading a checkpoint? For example, run
tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400 -ts_monitor -ts_checkpoint_freq_step 10 -pc_type lu -ts_max_steps 50 -ts_checkpoint_path ckp_ex1
Remove all the output files when it's done, then run
tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400 -ts_monitor -ts_checkpoint_freq_step 4 -pc_type lu -ts_max_steps 50 -ts_checkpoint_load ckp_ex1/step20.
I expect the second line to generate output files that start from the time of the ckp_ex1/step20 file (835219 in this case), but I find that no file is created.

The issue is here
https://github.com/TEAR-ERC/tandem/blob/17c42dc9ae0ec519dcc1b5732681b2e4054666f1/src/io/ProbeWriter.cpp#L80C27-L80C27

The logic of this code snippet

        if (time <= 0.0) {
            out_->open(probe.file_name, false);
            write_header(probe, functions);
        } else {
            out_->open(probe.file_name, true);
        }

is to open a new file and insert the header if time <= 0.0, otherwise try to open an existing file.

Does the probe file get updated correctly if you do not delete the output files?

I suggest the following change might be the way to go

if (time <= 0.0) {
  // open new file
  // write header
} else {
  if (file_exists == true) {
    // open existing file
  } else { // below is needed to support checkpointing
    // open new file
    // write header
  }
}
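
The branching above can be written as a small self-contained function. This is a sketch under assumptions rather than the tandem writer itself: open_probe_file and its parameters are hypothetical names, a plain std::ofstream stands in for the writer's output object, and std::filesystem requires C++17.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: opens file_name for writing probe data.
// Returns true if a new file was started and the header was written,
// false if an existing file was opened for appending.
bool open_probe_file(std::ofstream& out, const std::string& file_name, double time,
                     const std::string& header) {
    // Start a fresh file at the beginning of a run, or when restarting from
    // a checkpoint (time > 0) and no earlier output file is present.
    const bool start_fresh = (time <= 0.0) || !fs::exists(file_name);
    if (start_fresh) {
        out.open(file_name, std::ios::out | std::ios::trunc);
        out << header << '\n';
    } else {
        out.open(file_name, std::ios::out | std::ios::app);
    }
    return start_fresh;
}
```

Folding both "open new file" branches into one condition avoids repeating the header logic, at the cost of being a little less explicit than the nested form.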

JeenaYun · 2023-10-10T23:20:38Z

Thanks for the response. I confirmed that the output file does get updated correctly if I do not delete the output files.

This is a fault probe output (fltst_dp025.csv) after running
tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400 -ts_monitor -ts_checkpoint_freq_step 10 -pc_type lu -ts_max_steps 50 -ts_checkpoint_path ckp_ex1

and this is the same file after running
tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400 -ts_monitor -ts_checkpoint_freq_step 10 -pc_type lu -ts_max_steps 100 -ts_checkpoint_load ckp_ex1/step20.

We can see that lines 1-12 are preserved and the rest of the times are added to the file correctly.

And I agree with the suggested update to solve the issue. We may need to apply the same logic for all the outputs. I see the same lines in BoundaryProbeWriter.cpp and ScalarWriter.cpp.

JeenaYun · 2023-10-12T20:35:24Z

Hi, applying the following changes seems to fix the behavior. I don't have permission to commit to this branch, so I am attaching the updated lines here.

With these changes, a new output file is created if it does not exist, and I confirmed that the new output file starts at the loaded time.
For example, time step 20 is written at 349592 s:
20 TS dt 118109. time 349592.
When loading step20, the probe outputs start at 349592 s:
Fault probe

# Station dp025 (x = [-9.40013e-17, -2.5])
"Time","state","slip0","traction0","slip-rate0","normal-stress"
3.495924715086657e+05,6.569953829392793e-01,1.361502614496297e-05,-2.654612236511788e+01,3.347510697585748e-12,4.999999999999999e+01
1.094609221593516e+07,7.083295680707463e-01,1.558101076239341e-05,-2.655272238935397e+01,2.000043070515226e-14,4.999999999999999e+01
2.229201356334821e+07,7.189929196645377e-01,1.571280546578435e-05,-2.655937987746161e+01,6.977799613217902e-15,4.999999999999999e+01
3.471027647277595e+07,7.256332180928498e-01,1.577499426254676e-05,-2.656654287035408e+01,3.643809747980215e-15,4.999999999999999e+01
4.883362447109488e+07,7.307530422201931e-01,1.581498288285030e-05,-2.657455650139471e+01,2.219039884145297e-15,4.999999999999999e+01
6.021635167284335e+07,7.338954906903757e-01,1.583667030268056e-05,-2.658092427895965e+01,1.641425917278126e-15,4.999999999999999e+01
6.984097745359907e+07,7.361193964973215e-01,1.585086901443423e-05,-2.658625156878288e+01,1.328206219517662e-15,4.999999999999999e+01
7.993113815732217e+07,7.381433701238216e-01,1.586303912579679e-05,-2.659178512952237e+01,1.096912136359413e-15,4.999999999999999e+01
8.955819904938963e+07,7.398490737174934e-01,1.587278150097346e-05,-2.659701930757101e+01,9.346315237063199e-16,4.999999999999999e+01
9.926828902046998e+07,7.413930195914766e-01,1.588122290326828e-05,-2.660225688922429e+01,8.093511969077113e-16,4.999999999999999e+01

Domain probe

# Station dp010 (x = [50, -2.60764e-17])
"Time","u"
3.495924715086657e+05,0.000000000000000e+00
1.094609221593516e+07,4.327753506626133e-03
2.229201356334821e+07,8.778128494079909e-03
3.471027647277595e+07,1.362380221792416e-02
4.883362447109488e+07,1.910578557861235e-02
6.021635167284335e+07,2.350327498646150e-02
6.984097745359907e+07,2.720814061880856e-02
7.993113815732217e+07,3.107980981448889e-02
8.955819904938963e+07,3.476261351551272e-02
9.926828902046998e+07,3.846672791282178e-02

ProbeWriter.cpp

--- a/home/jyun/softwares/project-tandem/tandem/src/io/ProbeWriter.cpp
+++ b/ProbeWriter.cpp
@@ -7,6 +7,8 @@
 #include <sstream>
 #include <unordered_map>
 #include <unordered_set>
+#include <filesystem>
+namespace fs = std::filesystem;
 
 namespace tndm {
 
@@ -81,7 +83,17 @@ void ProbeWriter<D>::write(double time, mneme::span<FiniteElementFunction<D>> fu
             out_->open(probe.file_name, false);
             write_header(probe, functions);
         } else {
-            out_->open(probe.file_name, true);
+            fs::path pckp(probe.file_name);
+            bool exists = fs::exists(pckp);
+            if (exists) {
+                // open existing file
+                out_->open(probe.file_name, true);
+            } else { // below is needed to support checkpointing
+                // open new file
+                out_->open(probe.file_name, false);
+                // write header
+                write_header(probe, functions);
+            }
         }

BoundaryProbeWriter.cpp

--- a/home/jyun/softwares/project-tandem/tandem/src/io/BoundaryProbeWriter.cpp
+++ b/BoundaryProbeWriter.cpp
@@ -8,6 +8,8 @@
 #include <sstream>
 #include <unordered_map>
 #include <unordered_set>
+#include <filesystem>
+namespace fs = std::filesystem;
 
 namespace tndm {
 
@@ -83,7 +85,17 @@ void BoundaryProbeWriter<D>::write(double time,
             out_->open(probe.file_name, false);
             write_header(probe, functions);
         } else {
-            out_->open(probe.file_name, true);
+            fs::path pckp(probe.file_name);
+            bool exists = fs::exists(pckp);
+            if (exists) {
+                // open existing file
+                out_->open(probe.file_name, true);
+            } else { // below is needed to support checkpointing
+                // open new file
+                out_->open(probe.file_name, false);
+                // write header
+                write_header(probe, functions);
+            }
         }

ScalarWriter.cpp

--- a/home/jyun/softwares/project-tandem/tandem/src/io/ScalarWriter.cpp
+++ b/ScalarWriter.cpp
@@ -2,6 +2,8 @@
 
 #include <iomanip>
 #include <ios>
+#include <filesystem>
+namespace fs = std::filesystem;
 
 namespace tndm {
 
@@ -19,7 +21,17 @@ void ScalarWriter::write(double time, mneme::span<double> scalars) const {
         out_->open(file_name_, false);
         write_header();
     } else {
-        out_->open(file_name_, true);
+        fs::path pckp(file_name_);
+        bool exists = fs::exists(pckp);
+        if (exists) {
+            // open existing file
+            out_->open(file_name_, true);
+        } else { // below is needed to support checkpointing
+            // open new file
+            out_->open(file_name_, false);
+            // write header
+            write_header();
+        }
     }
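
The three hunks above repeat the same pattern: append if the output file already exists, otherwise create it and write the header. A minimal sketch of that pattern as a standalone helper follows. It assumes a plain `std::ofstream` rather than tandem's writer classes, and `open_for_append` is a hypothetical name, not something in the repository.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper (not part of tandem): open `file_name` for writing.
// If the file does not exist yet, create it and write `header` first;
// otherwise open it in append mode and leave its contents untouched.
// Returns true if a new file was created.
inline bool open_for_append(std::ofstream& out, const std::string& file_name,
                            const std::string& header) {
    // After a restart the run starts at a non-zero time, so the output
    // file may be missing even though this is not the first write.
    const bool fresh = !fs::exists(file_name);
    out.open(file_name, fresh ? std::ios::trunc : std::ios::app);
    if (fresh) {
        out << header << '\n';
    }
    return fresh;
}
```

Factoring the check into one place would avoid keeping three copies of the same branch in sync, at the cost of touching the writer interfaces.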

hpc4geo · 2023-10-12T20:59:41Z

In general we don't want to share code snippets to be merged within PR comments - it's too easy to introduce cut-and-paste mistakes.

Please make a branch of dmay/seas-checkpoint and add your changes to it. I believe you should have write access to push that branch to the remote repo. Once you have done that we can proceed.

JeenaYun · 2023-10-12T22:09:15Z

I don't have permission to push a branch to this repository.
remote: Permission to TEAR-ERC/tandem.git denied to JeenaYun.
Instead, I opened a pull request (#60) from my fork.

hpc4geo · 2023-11-03T16:56:45Z

A traditional way of checkpointing is to keep only the last 2 checkpoints, that is when creating a new checkpoint n, removing checkpoint n-2 (this is what is implemented in moose with option num_file, default-2, https://mooseframework.inl.gov/source/outputs/Checkpoint.html).
This would prevent the number of checkpoint files from growing indefinitely.
Nevertheless, it is a new feature, and I would refrain from asking for new features in a PR.
If that feature was integrated, I would make it the default, and keep the checkpointing parameters as you chose for instance.

This is a good point. This branch now allows the user to specify the storage mode for checkpoints. The default is currently to store only 1 checkpoint on disk. I've updated the PR description at the very top of this page to reflect this.
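
To illustrate what a limited storage mode implies, here is a minimal sketch of the bookkeeping: keep at most N checkpoint directories and report the ones that should be deleted when a new one is written. The class and method names are hypothetical and the sketch does not touch the disk; it is not the code in this branch.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical bookkeeping for a "limited" storage mode: remember at most
// `limit` checkpoint directories and report which ones fall out of the window.
class LimitedCheckpointStore {
  public:
    explicit LimitedCheckpointStore(std::size_t limit) : limit_(limit) {}

    // Register a newly written checkpoint directory. Returns the directories
    // that are no longer retained (oldest first) so the caller can delete them.
    std::vector<std::string> add(const std::string& dir) {
        kept_.push_back(dir);
        std::vector<std::string> evicted;
        while (kept_.size() > limit_) {
            evicted.push_back(kept_.front());
            kept_.pop_front();
        }
        return evicted;
    }

    // Directories currently retained, oldest first.
    const std::deque<std::string>& kept() const { return kept_; }

  private:
    std::size_t limit_;
    std::deque<std::string> kept_;
};
```

With a limit of 1 (the default described above) each new checkpoint evicts the previous one; a limit of 2 reproduces the "remove checkpoint n-2" behaviour mentioned in the review comment.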

hpc4geo · 2023-11-03T23:14:59Z

Sorry, I was thinking of a case where someone does not want checkpointing to be triggered at all since we are trying to merge into the main. Even if I don't provide any checkpoint paths or -ts_checkpoint_freq_xxx, seems like the code is setting -ts_checkpoint_freq_cputime to 30 minutes. I wonder if a user is running for several hours (which is usually the case), then the code might automatically write out some checkpoints even though the user did not ask it to do so. Is it a safety feature that you intentionally put in?

Yes, absolutely. Nothing is worse than hitting an error after 4 hrs without a single checkpoint file written, and then having to redo 4 hrs of computation. It is definitely better to write more often. Note this is less of an issue now that the default mode uses storage_type = limited.
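
As a rough illustration of how the `-ts_checkpoint_freq_xxx` criteria combine (a checkpoint is written if any one of them is satisfied), consider the sketch below. The struct and function names are hypothetical, the 30 minute CPU-time default is taken from the discussion above, and measuring everything "since the last checkpoint" is an assumption of this sketch, not a statement about the implementation.

```cpp
// Hypothetical trigger settings; a value <= 0 disables that criterion.
struct CheckpointTriggers {
    long freq_step = 0;              // steps between checkpoints
    double freq_cputime_min = 30.0;  // CPU minutes between checkpoints
    double freq_physical_time = 0.0; // physical (model) time between checkpoints
};

// Progress accumulated since the last checkpoint was written
// (an assumption of this sketch).
struct SinceLastCheckpoint {
    long steps;
    double cpu_minutes;
    double physical_time;
};

// Returns true if any enabled criterion is satisfied.
inline bool should_checkpoint(const CheckpointTriggers& t, const SinceLastCheckpoint& s) {
    const bool by_step = t.freq_step > 0 && s.steps >= t.freq_step;
    const bool by_cpu = t.freq_cputime_min > 0.0 && s.cpu_minutes >= t.freq_cputime_min;
    const bool by_time =
        t.freq_physical_time > 0.0 && s.physical_time >= t.freq_physical_time;
    return by_step || by_cpu || by_time;
}
```

With the defaults in this sketch only the CPU-time criterion is active, which matches the safety-net behaviour described above: a long run writes a checkpoint even if the user asked for none.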

Thomas-Ulrich

LGTM! Maybe you want to merge #61 into #59 before merging into main.

AliceGabriel · 2024-06-06T15:13:13Z

@hpc4geo should this PR be merged now or after merging 61?

yohaimagen · 2024-08-14T18:18:39Z

I was playing with this pull request and found a behavior which I think is undesired.

If I am starting a model from a specific checkpoint and at the same time the PETSc param -ts_dt is set to some value, PETSc will ignore the time step from the checkpoint and rerun the model with the fault params and time of the checkpoint with the dt specified with -ts_dt.

This could cause misalignments and even crash tandem if -ts_dt is not suitable.

hpc4geo · 2024-08-15T21:42:08Z

Good spot @yohaimagen! -ts_dt has two purposes. When TSAdaptType = none, this flag sets the time step to use. When TSAdaptType = basic, it defines the "initial" time step to use. In theory, when using adapt type basic, -ts_dt should only affect the time step size at time t = 0 (that is what "initial" is supposed to mean). That means that when restarting (and assuming several steps have been performed) the option should not have any effect. However, PETSc currently assumes "initial" means "first time TSStep is called".

I will modify the checkpoint code to raise a warning if -ts_dt is detected when restarting.
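
A minimal sketch of such a check is shown below. It only scans a list of command-line arguments, which is an assumption made to keep the example self-contained; in a PETSc code one would more likely query the options database (for example with `PetscOptionsHasName`), since options can also come from files or the environment. The function names and the warning text are hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical check: true if `name` appears among the command-line arguments.
inline bool has_option(const std::vector<std::string>& args, const std::string& name) {
    for (const auto& a : args) {
        if (a == name) {
            return true;
        }
    }
    return false;
}

// Returns a warning message if -ts_dt is given while restarting from a
// checkpoint, and an empty string otherwise.
inline std::string restart_dt_warning(bool restarting, const std::vector<std::string>& args) {
    if (restarting && has_option(args, "-ts_dt")) {
        return "Warning: -ts_dt is set while restarting from a checkpoint; "
               "it may override the time step stored in the checkpoint.";
    }
    return {};
}
```

The caller would print the returned message (on one rank only) before the first step is taken.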

Alice-Agnes Gabriel and others added 11 commits December 2, 2021 13:45

Added basic functionality required to checkpoing a SEAS model. This commit will checkpoint every time step

e284992

Remove guard on files

b0380fe

Clean utils.

d881b36

* Add filename as arg to load(), view() * Removed unused code * Error check for VecLoad_Nest

SEAS checkpoint TS solver.

741e3c2

solve(): break earlier if the number of requested steps have been performed.

5dc8b70

ts_utils: Output more members from TS and TSAdapt in the hope of preserving as much state as possible when performing a restart.

80b1982

Bugfix: Do not overwrite user vector

f40a8dd

PetscTimeSolver: Add getter/setter for checkpoint filename and frequency

aef2891

Add safety of operator overload

fad1cf9

Dave May added 2 commits October 5, 2023 16:58

Halt execution if a file failed to load when a restart was requested

db8979b

More error checking

3b489de

Allow creating output file when simulataion starts inat non-zero time

c93f97c

Removed unused checkpointing methods and variables

f6587fe

TEAR-ERC deleted a comment from Thomas-Ulrich Nov 3, 2023

Removed useless code copied from PETSc source

67cb17a

Added more safety when loading TSAdapt

3463e0a

Thomas-Ulrich and others added 7 commits November 7, 2023 09:15

Merge remote-tracking branch 'origin/main' into dmay/seas-checkpoint

8458c1d

fix compilation

8a87903

- this small change fixes for me an error that arises when compiling without libxsmm.

64046d5

add missing include

65d2480

fix compilation

4c6f720

ini commit toml

2ac9e62

checkpoint parameters directly on the main toml node

b8dc246

Thomas-Ulrich approved these changes Nov 7, 2023

Thomas-Ulrich and others added 4 commits November 8, 2023 09:59

re-instate last_checkpoint

0590fe7

use the TOML sub-node for checkpointing

43c0b2b

Synchronize using cpu time from rank 0. Failure to synchronize may result in non-determinstic deadlock.

b8ebe24

Force the generate flag to be syncd across comm

c22b346

Merge pull request #61 from Thomas-Ulrich/thomas/test_toml_cp

a4c44d2

move ts checkpointing parameters to toml

AliceGabriel assigned hpc4geo Jun 6, 2024

Merge remote-tracking branch 'origin/main' into dmay/seas-checkpoint

254657f

Thomas-Ulrich mentioned this pull request Jun 25, 2024

Document ts checkpoint + minor change (-> dmay/seas-checkpoint) #72

Open

This was referenced Jul 17, 2024

Slow Green function checkpointing on large setups risks unusable gf file #73

Open

F(a) and F(b) must have different sign on first time step of BP5 #74

Closed

JeenaYun mentioned this pull request Nov 1, 2024

version 1.2 #83

Open
