SEAS checkpointing of PETSc TS object #59

Open
wants to merge 36 commits into base: main

hpc4geo
Collaborator

@hpc4geo hpc4geo commented Oct 5, 2023

Summary of changes

  1. Correctly storing RK stages if the RK scheme has the FSAL property;
  2. Better control flow to enable checkpoints and restarts using a single call to TSSolve(), with checkpointing via TSSetPostSolve();
  3. Different checkpoint storage modes are possible (e.g. store all, store some, store none);
  4. A restart can be staged using any checkpoint data, not just the last checkpoint file written;
  5. The checkpoint data written is placed in its own directory.
  6. TOML variables ts_checkpoint_prefix and ts_checkpoint_frequency currently have no effect.
  7. All control of checkpointing is defined via PETSc options.
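
To illustrate why item 1 matters: in an FSAL ("first same as last") scheme, the last stage of one step is reused as the first stage of the next, so that stage is part of the integrator state. The sketch below shows the property on a scalar ODE using the Bogacki-Shampine 3(2) method as a small stand-in. It is not tandem's or PETSc's implementation, the scheme selected by rk45.cfg is not shown in this thread, and the names Bs23State, Rhs and bs23_step are hypothetical.

```cpp
#include <cassert>
#include <functional>

// Hypothetical integrator state for a scalar ODE y' = f(t, y).
// This is an illustration only, not the layout used by PETSc or tandem.
struct Bs23State {
    double t;
    double y;
    double k_last;   // last stage of the previous step (valid only if has_k_last)
    bool has_k_last; // false on a cold start or when the stage was not stored
};

using Rhs = std::function<double(double, double)>;

// Take one Bogacki-Shampine 3(2) step of size h (error estimate omitted).
// Returns the number of right-hand-side evaluations that were needed.
int bs23_step(const Rhs& f, Bs23State& s, double h) {
    assert(h > 0.0);
    int evals = 0;
    double k1;
    if (s.has_k_last) {
        k1 = s.k_last; // FSAL: reuse the last stage of the previous step
    } else {
        k1 = f(s.t, s.y); // no stored stage, so it has to be recomputed
        ++evals;
    }
    const double k2 = f(s.t + 0.5 * h, s.y + 0.5 * h * k1);
    const double k3 = f(s.t + 0.75 * h, s.y + 0.75 * h * k2);
    const double y_new = s.y + h * (2.0 / 9.0 * k1 + 1.0 / 3.0 * k2 + 4.0 / 9.0 * k3);
    const double k4 = f(s.t + h, y_new); // becomes k1 of the next step
    evals += 3;

    s.t += h;
    s.y = y_new;
    s.k_last = k4;
    s.has_k_last = true;
    return evals;
}
```

In this sketch a state restored together with its stage continues with three evaluations per step, while a state restored without it needs a fourth evaluation to rebuild the stage. In a real solver the rebuilt stage may not match the stored one exactly, which is presumably part of why the stages are written to the checkpoint.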

Summary of new options to control checkpointing

  • -ts_checkpoint_load <str> defines the directory to use to load a checkpoint
  • -ts_checkpoint_path <str> defines the root location where checkpoint files will be written.
  • -ts_checkpoint_freq_step <int> defines the time step frequency between checkpoints.
  • -ts_checkpoint_freq_cputime <float> defines the CPU time (minutes) frequency between checkpoints.
  • -ts_checkpoint_freq_physical_time <float> defines the physical time frequency between checkpoints.
  • -ts_checkpoint_storage_type <none, limited, unlimited> defines the type of storage for checkpoints. limited will store a finite number of unique checkpoints on disk. unlimited stores all checkpoints. Use none to completely deactivate checkpointing.
  • -ts_checkpoint_storage_limited_size <int> defines how many unique checkpoints you want to store on disk. Only has effect with -ts_checkpoint_storage_type limited.
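
As a rough sketch of how the three storage types could behave, the class below keeps a list of written checkpoint directories and reports which one should be deleted once the limit is exceeded. The class name CheckpointRegistry, its methods, and the choice to evict the oldest checkpoint first are assumptions made for illustration; the thread does not say which checkpoint the PR drops.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Mirrors the three values accepted by -ts_checkpoint_storage_type.
enum class StorageType { None, Limited, Unlimited };

// Hypothetical bookkeeping class; not taken from the PR.
class CheckpointRegistry {
public:
    CheckpointRegistry(StorageType type, std::size_t limit) : type_(type), limit_(limit) {
        // A limited store needs room for at least one checkpoint.
        assert(type_ != StorageType::Limited || limit_ > 0);
    }

    // True unless the storage type is "none".
    bool enabled() const { return type_ != StorageType::None; }

    // Record a newly written checkpoint directory. Returns the directory that
    // should now be deleted, or an empty string if nothing has to be removed.
    std::string record(const std::string& dir) {
        if (!enabled()) return "";
        kept_.push_back(dir);
        if (type_ == StorageType::Limited && kept_.size() > limit_) {
            std::string oldest = kept_.front(); // assumption: evict oldest first
            kept_.pop_front();
            return oldest;
        }
        return "";
    }

    // Number of checkpoints currently tracked.
    std::size_t count() const { return kept_.size(); }

private:
    StorageType type_;
    std::size_t limit_;
    std::deque<std::string> kept_;
};
```

With StorageType::Limited and a limit of 1, recording ckp/step10 and then ckp/step20 would return ckp/step10 as the directory to remove, leaving a single checkpoint tracked.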

Notes

  • Values can be provided for all of the -ts_checkpoint_freq_xxx options. They can be combined; a checkpoint is written whenever any one of them is satisfied.
  • Checkpointing is always on by default. The default storage mode is limited and the number of unique checkpoints stored is 1 (i.e. -ts_checkpoint_storage_limited_size 1).
  • Each checkpoint directory contains a file checkpoint_info.csv, which records time, dt, step and some other info.
  • When a checkpoint file is written, a file load_last_checkpoint.opts is also written in the path specified by -ts_checkpoint_path. This can be used to restart from the last written checkpoint. For example, first run ./app/tandem your_toml_options --petsc your_petsc_options and then restart with ./app/tandem your_toml_options --petsc your_petsc_options -options_file PATH_TO_CHECKPOINT_DATA/load_last_checkpoint.opts
  • Testing was successful with PETSc v3.13 and v3.16 using either TSRK or TSEULER.
  • Do not change the TS method between a checkpoint and a restart. This functionality is not supported and it is unlikely to work. Do this at your own peril.
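
The "any trigger satisfied" rule from the first note can be sketched as below. The step rule (checkpoint on multiples of the step frequency) matches the logs in this thread, where a restart from step 20 with a frequency of 40 writes at steps 40, 80, 120 and 160. The elapsed-time rules for CPU time and physical time, the order in which triggers are tested, the trigger name "physical_time", and all type and function names are assumptions for illustration only.

```cpp
#include <cassert>
#include <string>

// Hypothetical container for the three -ts_checkpoint_freq_xxx values.
struct CheckpointFrequency {
    long step;              // -ts_checkpoint_freq_step
    double cputime_minutes; // -ts_checkpoint_freq_cputime
    double physical_time;   // -ts_checkpoint_freq_physical_time
};

// CPU time (minutes) and physical time at which each timed trigger last fired.
struct LastTrigger {
    double cputime_minutes;
    double physical_time;
};

// Returns the name of the first satisfied trigger, or "" if none is satisfied.
// Intended to be called once after every completed time step.
std::string checkpoint_trigger(const CheckpointFrequency& freq, LastTrigger& last,
                               long step, double cputime_minutes, double physical_time) {
    assert(freq.step > 0);
    // Step trigger: fire on positive multiples of the step frequency.
    if (step > 0 && step % freq.step == 0) return "step";
    // CPU-time trigger (assumed rule): enough minutes since it last fired.
    if (cputime_minutes - last.cputime_minutes >= freq.cputime_minutes) {
        last.cputime_minutes = cputime_minutes;
        return "cputime";
    }
    // Physical-time trigger (assumed rule): enough model time since it last fired.
    if (physical_time - last.physical_time >= freq.physical_time) {
        last.physical_time = physical_time;
        return "physical_time";
    }
    return "";
}
```

A caller would write a checkpoint whenever the returned string is non-empty, and could print it in the same way as the "Triggered by" lines in the logs.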

Testing

./app/tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400  -ts_monitor -ts_checkpoint_freq_step 10 -pc_type lu -ts_max_steps 50 -ts_checkpoint_path ckp_ex1

               ___          ___         _____         ___          ___
      ___     /  /\        /__/\       /  /::\       /  /\        /__/\
     /  /\   /  /::\       \  \:\     /  /:/\:\     /  /:/_      |  |::\
    /  /:/  /  /:/\:\       \  \:\   /  /:/  \:\   /  /:/ /\     |  |:|:\
   /  /:/  /  /:/~/::\  _____\__\:\ /__/:/ \__\:| /  /:/ /:/_  __|__|:|\:\
  /  /::\ /__/:/ /:/\:\/__/::::::::\\  \:\ /  /://__/:/ /:/ /\/__/::::| \:\
 /__/:/\:\\  \:\/:/__\/\  \:\~~\~~\/ \  \:\  /:/ \  \:\/:/ /:/\  \:\~~\__\/
 \__\/  \:\\  \::/      \  \:\  ~~~   \  \:\/:/   \  \::/ /:/  \  \:\
      \  \:\\  \:\       \  \:\        \  \::/     \  \:\/:/    \  \:\
       \__\/ \  \:\       \  \:\        \__\/       \  \::/      \  \:\
              \__\/        \__\/                     \__\/        \__\/

                          tandem version 0801af7

                       stack size limit = unlimited

                              Worker affinity
       0123456789|0123456789|0123456789|0123456789|0123456789|012345


TS -ts_checkpoint_path ckp_ex1
TS -ts_checkpoint_freq_step 10
TS -ts_checkpoint_freq_cputime 3.0000e+01
TS -ts_checkpoint_freq_physical_time 1.0000e+10
[ckp_ex1] directory created
DOFs (domain): 2796
DOFs (fault): 240
Mesh size: 200
0 TS dt 0.1 time 0.
1 TS dt 1. time 0.1
2 TS dt 10. time 1.1
3 TS dt 100. time 11.1
4 TS dt 1000. time 111.1
5 TS dt 3496.05 time 1111.1
6 TS dt 4357.85 time 4607.15
7 TS dt 5960.65 time 8965.
8 TS dt 8033.43 time 14925.7
9 TS dt 10851.1 time 22959.1
[TSCheckpoint] Triggered by: "step"
[ckp_ex1] directory exists
[ckp_ex1/step10] directory created
[TSCheckpoint]   writing checkpoint data for step 10 -> path ckp_ex1/step10
10 TS dt 14652.3 time 33810.2
11 TS dt 19786.2 time 48462.6
12 TS dt 26718.9 time 68248.8
13 TS dt 36080.7 time 94967.7
14 TS dt 48722.9 time 131048.
15 TS dt 65795.1 time 179771.
16 TS dt 88849.4 time 245566.
17 TS dt 119982. time 334416.
18 TS dt 162024. time 454398.
19 TS dt 218798. time 616422.
[TSCheckpoint] Triggered by: "step"
[ckp_ex1] directory exists
[ckp_ex1/step20] directory created
[TSCheckpoint]   writing checkpoint data for step 20 -> path ckp_ex1/step20
20 TS dt 295466. time 835219.
21 TS dt 398999. time 1.13069e+06
22 TS dt 538812. time 1.52968e+06
23 TS dt 727617. time 2.0685e+06
24 TS dt 982581. time 2.79611e+06
25 TS dt 1.32689e+06 time 3.77869e+06
26 TS dt 1.79185e+06 time 5.10558e+06
27 TS dt 2.41974e+06 time 6.89743e+06
28 TS dt 3.26765e+06 time 9.31717e+06
29 TS dt 4.41268e+06 time 1.25848e+07
[TSCheckpoint] Triggered by: "step"
[ckp_ex1] directory exists
[ckp_ex1/step30] directory created
[TSCheckpoint]   writing checkpoint data for step 30 -> path ckp_ex1/step30
30 TS dt 5.95895e+06 time 1.69975e+07
31 TS dt 7.98805e+06 time 2.29564e+07
32 TS dt 9.28044e+06 time 3.09445e+07
33 TS dt 1.13786e+07 time 4.02249e+07
34 TS dt 1.24453e+07 time 5.16035e+07
35 TS dt 1.44212e+07 time 6.40488e+07
36 TS dt 1.69361e+07 time 7.847e+07
37 TS dt 1.99241e+07 time 9.54061e+07
38 TS dt 2.3332e+07 time 1.1533e+08
39 TS dt 2.68773e+07 time 1.38662e+08
[TSCheckpoint] Triggered by: "step"
[ckp_ex1] directory exists
[ckp_ex1/step40] directory created
[TSCheckpoint]   writing checkpoint data for step 40 -> path ckp_ex1/step40
40 TS dt 3.11191e+07 time 1.6554e+08
41 TS dt 3.33839e+07 time 1.96659e+08
42 TS dt 3.66182e+07 time 2.30043e+08
43 TS dt 4.14027e+07 time 2.66661e+08
44 TS dt 3.01661e+07 time 2.95497e+08
45 TS dt 2.9463e+07 time 3.25663e+08
46 TS dt 3.17658e+07 time 3.55126e+08
47 TS dt 2.64165e+07 time 3.80683e+08
48 TS dt 3.42318e+07 time 4.07099e+08
49 TS dt 3.33315e+07 time 4.41331e+08
[TSCheckpoint] Triggered by: "step"
[ckp_ex1] directory exists
[ckp_ex1/step50] directory created
[TSCheckpoint]   writing checkpoint data for step 50 -> path ckp_ex1/step50
50 TS dt 2.67389e+07 time 4.66688e+08

./app/tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400  -ts_monitor -ts_checkpoint_freq_step 4 -pc_type lu -ts_max_steps 50 -ts_checkpoint_load ckp_ex1/step20

TS -ts_checkpoint_path checkpoint
TS -ts_checkpoint_freq_step 4
TS -ts_checkpoint_freq_cputime 3.0000e+01
TS -ts_checkpoint_freq_physical_time 1.0000e+10
[checkpoint] directory exists
DOFs (domain): 2796
DOFs (fault): 240
Mesh size: 200
TS -ts_checkpoint_load ckp_ex1/step20
[TSCheckpoint]   loading checkpoint data <- path ckp_ex1/step20
20 TS dt 295466. time 835219.
21 TS dt 398999. time 1.13069e+06
22 TS dt 538812. time 1.52968e+06
23 TS dt 727617. time 2.0685e+06
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step24] directory created
[TSCheckpoint]   writing checkpoint data for step 24 -> path checkpoint/step24
24 TS dt 982581. time 2.79611e+06
25 TS dt 1.32689e+06 time 3.77869e+06
26 TS dt 1.79185e+06 time 5.10558e+06
27 TS dt 2.41974e+06 time 6.89743e+06
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step28] directory created
[TSCheckpoint]   writing checkpoint data for step 28 -> path checkpoint/step28
28 TS dt 3.26765e+06 time 9.31717e+06
29 TS dt 4.41268e+06 time 1.25848e+07
30 TS dt 5.95895e+06 time 1.69975e+07
31 TS dt 7.98805e+06 time 2.29564e+07
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step32] directory created
[TSCheckpoint]   writing checkpoint data for step 32 -> path checkpoint/step32
32 TS dt 9.28044e+06 time 3.09445e+07
33 TS dt 1.13786e+07 time 4.02249e+07
34 TS dt 1.24453e+07 time 5.16035e+07
35 TS dt 1.44212e+07 time 6.40488e+07
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step36] directory created
[TSCheckpoint]   writing checkpoint data for step 36 -> path checkpoint/step36
36 TS dt 1.69361e+07 time 7.847e+07
37 TS dt 1.99241e+07 time 9.54061e+07
38 TS dt 2.3332e+07 time 1.1533e+08
39 TS dt 2.68773e+07 time 1.38662e+08
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step40] directory created
[TSCheckpoint]   writing checkpoint data for step 40 -> path checkpoint/step40
40 TS dt 3.11191e+07 time 1.6554e+08
41 TS dt 3.33839e+07 time 1.96659e+08
42 TS dt 3.66182e+07 time 2.30043e+08
43 TS dt 4.14027e+07 time 2.66661e+08
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step44] directory created
[TSCheckpoint]   writing checkpoint data for step 44 -> path checkpoint/step44
44 TS dt 3.01661e+07 time 2.95497e+08
45 TS dt 2.9463e+07 time 3.25663e+08
46 TS dt 3.17658e+07 time 3.55126e+08
47 TS dt 2.64165e+07 time 3.80683e+08
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step48] directory created
[TSCheckpoint]   writing checkpoint data for step 48 -> path checkpoint/step48
48 TS dt 3.42318e+07 time 4.07099e+08
49 TS dt 3.33315e+07 time 4.41331e+08
50 TS dt 2.67389e+07 time 4.66688e+08

Alice-Agnes Gabriel and others added 11 commits December 2, 2021 13:45
* Add filename as arg to load(), view()
* Removed unused code
* Error check for VecLoad_Nest
…erving as much state as possible when performing a restart.
…grator.

- New tool parameters are
[1] ts_checkpoint_prefix = <str>
  The value of <str> should be something like this
    "/path/to/dir/will/tscheckpoints/will/be/written"
  The directory will be created (recursively if required) if it does not exist.
  This parameter is optional. 
  Checkpointing will ONLY occur if this parameter is defined.
[2] ts_checkpoint_frequency = <integer>
  The value of <integer> indicates the number of time steps which will occur 
  between checkpoints.
  If ts_checkpoint_prefix is not specified, this parameter has no 
  effect on the SEAS simulation.
…ing the stage vectors for RK. This is vital for consistent restarts if the RK scheme has the FSAL property. TOML options related to checkpointing are currently unused. All checkpoint control is defined via PETSc args. These options control when a checkpoint is written -ts_checkpoint_freq_step, -ts_checkpoint_freq_cputime, -ts_checkpoint_freq_physical_time. -ts_checkpoint_path defines the directory where all checkpoint files will be written. -ts_checkpoint_load indicates the directory of checkpoint data you want to use to start a new simulation. Also note that all checkpointed data is stored.
@hpc4geo
Collaborator Author

hpc4geo commented Oct 5, 2023

The example provided above used the attached mesh, along with the standard, unmodified files bp1_sym.toml and bp1.lua from examples/tandem/2d/.
bp1_sym.msh.zip

@hpc4geo
Collaborator Author

hpc4geo commented Oct 5, 2023

Another example of a restart with two triggers is provided below.

./app/tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400  -ts_monitor -ts_checkpoint_freq_step 40 -ts_checkpoint_freq_cputime 0.1 -pc_type lu -ts_max_steps 500 -ts_checkpoint_load ckp_ex1/step20

TS -ts_checkpoint_path checkpoint
TS -ts_checkpoint_freq_step 40
TS -ts_checkpoint_freq_cputime 1.0000e-01
TS -ts_checkpoint_freq_physical_time 1.0000e+10
[checkpoint] directory exists
DOFs (domain): 2796
DOFs (fault): 240
Mesh size: 200
TS -ts_checkpoint_load ckp_ex1/step20
[TSCheckpoint]   loading checkpoint data <- path ckp_ex1/step20
20 TS dt 295466. time 835219.
21 TS dt 398999. time 1.13069e+06
22 TS dt 538812. time 1.52968e+06
23 TS dt 727617. time 2.0685e+06
24 TS dt 982581. time 2.79611e+06
25 TS dt 1.32689e+06 time 3.77869e+06
26 TS dt 1.79185e+06 time 5.10558e+06
27 TS dt 2.41974e+06 time 6.89743e+06
28 TS dt 3.26765e+06 time 9.31717e+06
29 TS dt 4.41268e+06 time 1.25848e+07
30 TS dt 5.95895e+06 time 1.69975e+07
31 TS dt 7.98805e+06 time 2.29564e+07
32 TS dt 9.28044e+06 time 3.09445e+07
33 TS dt 1.13786e+07 time 4.02249e+07
34 TS dt 1.24453e+07 time 5.16035e+07
35 TS dt 1.44212e+07 time 6.40488e+07
36 TS dt 1.69361e+07 time 7.847e+07
37 TS dt 1.99241e+07 time 9.54061e+07
38 TS dt 2.3332e+07 time 1.1533e+08
39 TS dt 2.68773e+07 time 1.38662e+08
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step40] directory exists
[TSCheckpoint]   writing checkpoint data for step 40 -> path checkpoint/step40
40 TS dt 3.11191e+07 time 1.6554e+08
41 TS dt 3.33839e+07 time 1.96659e+08
42 TS dt 3.66182e+07 time 2.30043e+08
43 TS dt 4.14027e+07 time 2.66661e+08
44 TS dt 3.01661e+07 time 2.95497e+08
45 TS dt 2.9463e+07 time 3.25663e+08
46 TS dt 3.17658e+07 time 3.55126e+08
47 TS dt 2.64165e+07 time 3.80683e+08
48 TS dt 3.42318e+07 time 4.07099e+08
49 TS dt 3.33315e+07 time 4.41331e+08
50 TS dt 2.67389e+07 time 4.66688e+08
51 TS dt 3.57412e+07 time 4.93427e+08
52 TS dt 3.14185e+07 time 5.24069e+08
53 TS dt 3.13646e+07 time 5.55488e+08
54 TS dt 2.92616e+07 time 5.86852e+08
55 TS dt 2.45094e+07 time 6.10546e+08
56 TS dt 3.50312e+07 time 6.35056e+08
57 TS dt 3.32866e+07 time 6.70087e+08
58 TS dt 2.88053e+07 time 6.97524e+08
59 TS dt 3.24711e+07 time 7.2633e+08
60 TS dt 2.87486e+07 time 7.54359e+08
61 TS dt 3.51527e+07 time 7.83107e+08
62 TS dt 3.13853e+07 time 8.14036e+08
63 TS dt 2.87462e+07 time 8.42122e+08
64 TS dt 3.516e+07 time 8.70868e+08
65 TS dt 2.98683e+07 time 8.99916e+08
66 TS dt 3.13585e+07 time 9.29784e+08
67 TS dt 2.95721e+07 time 9.61143e+08
68 TS dt 2.6957e+07 time 9.87609e+08
69 TS dt 3.28089e+07 time 1.01457e+09
70 TS dt 3.30834e+07 time 1.04737e+09
71 TS dt 2.9346e+07 time 1.07572e+09
72 TS dt 3.54434e+07 time 1.10507e+09
73 TS dt 2.94051e+07 time 1.13351e+09
74 TS dt 3.41035e+07 time 1.16291e+09
75 TS dt 3.34337e+07 time 1.19702e+09
76 TS dt 2.57716e+07 time 1.22142e+09
77 TS dt 3.63741e+07 time 1.2472e+09
78 TS dt 3.37553e+07 time 1.28357e+09
79 TS dt 2.8562e+07 time 1.31059e+09
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step80] directory exists
[TSCheckpoint]   writing checkpoint data for step 80 -> path checkpoint/step80
80 TS dt 3.37938e+07 time 1.33915e+09
81 TS dt 2.9868e+07 time 1.36825e+09
82 TS dt 3.54748e+07 time 1.39812e+09
83 TS dt 3.13201e+07 time 1.42908e+09
84 TS dt 2.85277e+07 time 1.45695e+09
85 TS dt 3.55476e+07 time 1.48547e+09
86 TS dt 3.09389e+07 time 1.51572e+09
87 TS dt 2.68293e+07 time 1.54208e+09
88 TS dt 2.43823e+07 time 1.56891e+09
89 TS dt 2.37081e+07 time 1.59329e+09
90 TS dt 2.39819e+07 time 1.617e+09
91 TS dt 2.75741e+07 time 1.64098e+09
92 TS dt 2.81586e+07 time 1.66855e+09
93 TS dt 3.0927e+07 time 1.69671e+09
94 TS dt 3.41968e+07 time 1.72764e+09
95 TS dt 3.13265e+07 time 1.76184e+09
96 TS dt 3.43477e+07 time 1.79316e+09
97 TS dt 4.11846e+07 time 1.82751e+09
98 TS dt 4.6323e+07 time 1.86869e+09
99 TS dt 2.98454e+07 time 1.89624e+09
100 TS dt 3.53535e+07 time 1.92608e+09
101 TS dt 3.53026e+07 time 1.96144e+09
102 TS dt 2.56873e+07 time 1.98551e+09
103 TS dt 3.66231e+07 time 2.0112e+09
104 TS dt 3.43716e+07 time 2.04782e+09
105 TS dt 2.86974e+07 time 2.0749e+09
106 TS dt 3.48469e+07 time 2.10359e+09
107 TS dt 3.01841e+07 time 2.13289e+09
108 TS dt 3.5707e+07 time 2.16308e+09
109 TS dt 3.16369e+07 time 2.19441e+09
110 TS dt 2.81204e+07 time 2.22175e+09
111 TS dt 3.58772e+07 time 2.24987e+09
112 TS dt 3.18032e+07 time 2.28107e+09
113 TS dt 3.08361e+07 time 2.31287e+09
114 TS dt 3.26638e+07 time 2.34371e+09
115 TS dt 2.60927e+07 time 2.36879e+09
116 TS dt 3.61226e+07 time 2.39488e+09
117 TS dt 3.51281e+07 time 2.43101e+09
118 TS dt 2.78301e+07 time 2.45717e+09
119 TS dt 3.12168e+07 time 2.485e+09
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step120] directory exists
[TSCheckpoint]   writing checkpoint data for step 120 -> path checkpoint/step120
120 TS dt 2.40731e+07 time 2.50543e+09
121 TS dt 1.67271e+07 time 2.51829e+09
122 TS dt 1.17115e+07 time 2.52816e+09
123 TS dt 8.8341e+06 time 2.53626e+09
124 TS dt 7.60614e+06 time 2.54357e+09
125 TS dt 6.03803e+06 time 2.5493e+09
126 TS dt 4.46701e+06 time 2.55368e+09
127 TS dt 3.84921e+06 time 2.55751e+09
128 TS dt 3.78722e+06 time 2.56135e+09
129 TS dt 3.95784e+06 time 2.56514e+09
130 TS dt 3.97125e+06 time 2.5691e+09
131 TS dt 4.38699e+06 time 2.57307e+09
132 TS dt 4.38842e+06 time 2.57746e+09
133 TS dt 5.03443e+06 time 2.58185e+09
134 TS dt 6.04587e+06 time 2.58688e+09
135 TS dt 6.99945e+06 time 2.59293e+09
136 TS dt 7.99595e+06 time 2.59993e+09
137 TS dt 7.97008e+06 time 2.60792e+09
138 TS dt 8.76227e+06 time 2.61589e+09
139 TS dt 1.02309e+07 time 2.62465e+09
140 TS dt 1.11905e+07 time 2.63489e+09
141 TS dt 1.10837e+07 time 2.64608e+09
142 TS dt 1.11844e+07 time 2.65716e+09
143 TS dt 1.24482e+07 time 2.66834e+09
144 TS dt 1.49095e+07 time 2.68079e+09
145 TS dt 1.71897e+07 time 2.6957e+09
146 TS dt 1.96651e+07 time 2.71289e+09
147 TS dt 2.11836e+07 time 2.73256e+09
148 TS dt 2.35066e+07 time 2.75374e+09
149 TS dt 2.45546e+07 time 2.77725e+09
150 TS dt 2.73177e+07 time 2.8018e+09
151 TS dt 3.31472e+07 time 2.82912e+09
152 TS dt 3.6281e+07 time 2.86227e+09
153 TS dt 4.17783e+07 time 2.89855e+09
154 TS dt 4.49488e+07 time 2.94033e+09
155 TS dt 4.95264e+07 time 2.98527e+09
[TSCheckpoint] Triggered by: "cputime"
[checkpoint] directory exists
[checkpoint/step156] directory exists
[TSCheckpoint]   writing checkpoint data for step 156 -> path checkpoint/step156
156 TS dt 3.30105e+07 time 3.01379e+09
157 TS dt 3.84009e+07 time 3.04681e+09
158 TS dt 2.8359e+07 time 3.07353e+09
159 TS dt 3.70198e+07 time 3.10189e+09
[TSCheckpoint] Triggered by: "step"
[checkpoint] directory exists
[checkpoint/step160] directory exists
[TSCheckpoint]   writing checkpoint data for step 160 -> path checkpoint/step160
160 TS dt 3.42e+07 time 3.13891e+09
161 TS dt 2.57749e+07 time 3.16301e+09
162 TS dt 3.69121e+07 time 3.18879e+09
163 TS dt 3.37955e+07 time 3.22199e+09

As we can see, step156 was created by our cputime frequency whilst step160 was generated by the time step frequency.

@JeenaYun
Collaborator

JeenaYun commented Oct 7, 2023

Could you tell me what you mean by 'mesh attached'? I don't see any attachment. Did you just create the mesh file using gmsh -2 bp1_sym.geo?

@hpc4geo
Collaborator Author

hpc4geo commented Oct 7, 2023

Could you tell me what you mean by 'mesh attached'? I don't see any attachment. Did you just create the mesh file using gmsh -2 bp1_sym.geo?

There was supposed to be a file attached to my message. It failed to upload. It is there now as a zip file.

@JeenaYun
Collaborator

JeenaYun commented Oct 9, 2023

Hi, I confirmed that the examples you provided work well. I have a question about the default values. It seems the code uses large default values when I don't provide the -ts_checkpoint_freq_xxx options, so that they are not triggered. I am concerned about the default -ts_checkpoint_freq_cputime being 30 (seconds, I assume). Would it be enough even for very large/long simulations?
./app/tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400 -ts_monitor -pc_type lu
returns

TS -ts_checkpoint_freq_step 1000
TS -ts_checkpoint_freq_cputime 3.0000e+01
TS -ts_checkpoint_freq_physical_time 1.0000e+10

@hpc4geo
Collaborator Author

hpc4geo commented Oct 9, 2023

-ts_checkpoint_freq_cputime defines the CPU time in minutes between checkpoints. So a default value of 30 means a checkpoint will be written out for every 30 minutes that the tandem executable is running.

I actually don't understand your question. Can you please re-phrase it?

@JeenaYun
Collaborator

JeenaYun commented Oct 9, 2023

Sorry, I was thinking of a case where someone does not want checkpointing to be triggered at all, since we are trying to merge into main. Even if I don't provide any checkpoint path or -ts_checkpoint_freq_xxx option, it seems the code sets -ts_checkpoint_freq_cputime to 30 minutes. If a user runs for several hours (which is usually the case), the code might automatically write out some checkpoints even though the user did not ask for them. Is this a safety feature that you put in intentionally?

@hpc4geo
Collaborator Author

hpc4geo commented Oct 9, 2023

Okay - I understand your concern now. I can confirm that the default behavior of tandem would change if this branch is merged - specifically tandem would always write out checkpoint files even if they weren't requested. There are pros and cons to every default. With the current defaults in this branch we have

Cons

  • A directory called checkpoint will always be created.
  • Checkpoint files will be written if the job runs for longer than 30 minutes, takes more than 1000 time steps, or advances the model by more than 1e10 seconds of physical time. The downside is that, unless the directory is set explicitly, these files may end up in your home directory rather than somewhere with fast I/O (scratch) and more generous storage (I am thinking of a large run on a cluster).

Pros

  • Users don't have to think about turning on something that is pretty useful.

We can certainly change the current defaults. I'd like group consensus before twiddling with anything. This could be a topic for the next tandem meeting. @AliceGabriel maybe you want to add this to the agenda and/or weigh in with an opinion.

@JeenaYun
Collaborator

Hi, could you check whether outputs (e.g., fault probe output) are properly generated when loading a checkpoint? For example, run
tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400 -ts_monitor -ts_checkpoint_freq_step 10 -pc_type lu -ts_max_steps 50 -ts_checkpoint_path ckp_ex1
Remove all the output files when it's done, then run
tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400 -ts_monitor -ts_checkpoint_freq_step 4 -pc_type lu -ts_max_steps 50 -ts_checkpoint_load ckp_ex1/step20.
I expect the second line to generate output files that start from the time of the ckp_ex1/step20 file (835219 in this case), but I find that no file is created.

@NicoSchlw
Collaborator

Hi, the checkpoint files can get quite large for 3D simulations. The Green's function checkpoint files of BP5 need 65 GB, which can easily exceed the disk quota of the home directory. I liked the initial approach, where you had to provide a path to enable checkpointing.

@hpc4geo
Collaborator Author

hpc4geo commented Oct 10, 2023

Hi, the checkpoint files can get quite large for 3D simulations. The Green's function checkpoint files of BP5 need 65 GB, which can easily exceed the disk quota of the home directory. I liked the initial approach, where you had to provide a path to enable checkpointing.

This branch is concerned with checkpointing the time integrator, not the Green's functions. As such the data being written per checkpoint file is just the state variable defined on the fault.

@hpc4geo
Collaborator Author

hpc4geo commented Oct 10, 2023

Hi, could you check whether outputs (e.g., fault probe output) are properly generated when loading a checkpoint? For example, run
tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400 -ts_monitor -ts_checkpoint_freq_step 10 -pc_type lu -ts_max_steps 50 -ts_checkpoint_path ckp_ex1
Remove all the output files when it's done, then run
tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400 -ts_monitor -ts_checkpoint_freq_step 4 -pc_type lu -ts_max_steps 50 -ts_checkpoint_load ckp_ex1/step20.
I expect the second line to generate output files that start from the time of the ckp_ex1/step20 file (835219 in this case), but I find that no file is created.

The issue is here
https://github.com/TEAR-ERC/tandem/blob/17c42dc9ae0ec519dcc1b5732681b2e4054666f1/src/io/ProbeWriter.cpp#L80C27-L80C27

The logic of this code snippet

        if (time <= 0.0) {
            out_->open(probe.file_name, false);
            write_header(probe, functions);
        } else {
            out_->open(probe.file_name, true);
        }

is to open a new file and insert the header if time <= 0.0, and otherwise to try to open an existing file.

Does the probe file get updated correctly if you do not delete the output files?

I suggest the following change might be the way to go

if (time <= 0.0) {
  // open new file
  // write header
} else {
  if (file_exists == true) {
    // open existing file
  } else { // below is needed to support checkpointing
    // open new file
    // write header
  }
}
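
The suggested control flow can be written as a small self-contained helper. This is only a sketch: it uses std::ofstream and std::filesystem in place of tandem's own out_ object and write_header(), and the function name open_probe_output and its parameters are hypothetical.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Opens file_name on `out`. Returns true if a fresh file was started and the
// header was written, false if an existing file was opened for appending.
// Hypothetical helper; tandem's writers use their own out_ object instead.
bool open_probe_output(std::ofstream& out, const std::string& file_name, double time,
                       const std::string& header) {
    // Start a fresh file at the beginning of a run (time <= 0), or when
    // restarting from a checkpoint (time > 0) and no output file exists yet.
    const bool fresh = (time <= 0.0) || !fs::exists(file_name);
    out.open(file_name, fresh ? std::ios::trunc : std::ios::app);
    assert(out.is_open());
    if (fresh) {
        out << header << '\n';
    }
    return fresh;
}
```

Folding both "open new file, write header" branches into a single condition keeps the header logic in one place, which may be convenient if the same check is needed in several writers.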

@JeenaYun
Collaborator

JeenaYun commented Oct 10, 2023

Thanks for the response. I confirmed that the output file does get updated correctly if I do not delete the output files.

This is a fault probe output (fltst_dp025.csv) after running
tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400 -ts_monitor -ts_checkpoint_freq_step 10 -pc_type lu -ts_max_steps 50 -ts_checkpoint_path ckp_ex1
[image: contents of fltst_dp025.csv after the first run]
and this is the same file after running
tandem bp1_sym.toml --petsc -options_file rk45.cfg -ts_max_steps 400 -ts_monitor -ts_checkpoint_freq_step 10 -pc_type lu -ts_max_steps 100 -ts_checkpoint_load ckp_ex1/step20.
[image: contents of fltst_dp025.csv after the restart]
We can see that lines 1-12 are well preserved and the rest of the times are added to the file correctly.

And I agree with the suggested update to solve the issue. We may need to apply the same logic for all the outputs. I see the same lines in BoundaryProbeWriter.cpp and ScalarWriter.cpp.

@JeenaYun
Collaborator

JeenaYun commented Oct 12, 2023

Hi, applying the following changes seems to fix the behavior. I don't have permission to commit to this branch, so I am attaching the updated lines here.

With these changes, a new output file is created if it does not exist, and I confirmed that the new output file starts at the loaded time.
For example, time step 20 is written at 349592 s:
20 TS dt 118109. time 349592.
When loading step20, the probe outputs start at 349592 s:
Fault probe

# Station dp025 (x = [-9.40013e-17, -2.5])
"Time","state","slip0","traction0","slip-rate0","normal-stress"
3.495924715086657e+05,6.569953829392793e-01,1.361502614496297e-05,-2.654612236511788e+01,3.347510697585748e-12,4.999999999999999e+01
1.094609221593516e+07,7.083295680707463e-01,1.558101076239341e-05,-2.655272238935397e+01,2.000043070515226e-14,4.999999999999999e+01
2.229201356334821e+07,7.189929196645377e-01,1.571280546578435e-05,-2.655937987746161e+01,6.977799613217902e-15,4.999999999999999e+01
3.471027647277595e+07,7.256332180928498e-01,1.577499426254676e-05,-2.656654287035408e+01,3.643809747980215e-15,4.999999999999999e+01
4.883362447109488e+07,7.307530422201931e-01,1.581498288285030e-05,-2.657455650139471e+01,2.219039884145297e-15,4.999999999999999e+01
6.021635167284335e+07,7.338954906903757e-01,1.583667030268056e-05,-2.658092427895965e+01,1.641425917278126e-15,4.999999999999999e+01
6.984097745359907e+07,7.361193964973215e-01,1.585086901443423e-05,-2.658625156878288e+01,1.328206219517662e-15,4.999999999999999e+01
7.993113815732217e+07,7.381433701238216e-01,1.586303912579679e-05,-2.659178512952237e+01,1.096912136359413e-15,4.999999999999999e+01
8.955819904938963e+07,7.398490737174934e-01,1.587278150097346e-05,-2.659701930757101e+01,9.346315237063199e-16,4.999999999999999e+01
9.926828902046998e+07,7.413930195914766e-01,1.588122290326828e-05,-2.660225688922429e+01,8.093511969077113e-16,4.999999999999999e+01

Domain probe

# Station dp010 (x = [50, -2.60764e-17])
"Time","u"
3.495924715086657e+05,0.000000000000000e+00
1.094609221593516e+07,4.327753506626133e-03
2.229201356334821e+07,8.778128494079909e-03
3.471027647277595e+07,1.362380221792416e-02
4.883362447109488e+07,1.910578557861235e-02
6.021635167284335e+07,2.350327498646150e-02
6.984097745359907e+07,2.720814061880856e-02
7.993113815732217e+07,3.107980981448889e-02
8.955819904938963e+07,3.476261351551272e-02
9.926828902046998e+07,3.846672791282178e-02

ProbeWriter.cpp

--- a/home/jyun/softwares/project-tandem/tandem/src/io/ProbeWriter.cpp
+++ b/ProbeWriter.cpp
@@ -7,6 +7,8 @@
 #include <sstream>
 #include <unordered_map>
 #include <unordered_set>
+#include <filesystem>
+namespace fs = std::filesystem;
 
 namespace tndm {
 
@@ -81,7 +83,17 @@ void ProbeWriter<D>::write(double time, mneme::span<FiniteElementFunction<D>> fu
             out_->open(probe.file_name, false);
             write_header(probe, functions);
         } else {
-            out_->open(probe.file_name, true);
+            fs::path pckp(probe.file_name);
+            bool exists = fs::exists(pckp);
+            if (exists) {
+                // open existing file
+                out_->open(probe.file_name, true);
+            } else { // below is needed to support checkpointing
+                // open new file
+                out_->open(probe.file_name, false);
+                // write header
+                write_header(probe, functions);
+            }
         }

BoundaryProbeWriter.cpp

--- a/home/jyun/softwares/project-tandem/tandem/src/io/BoundaryProbeWriter.cpp
+++ b/BoundaryProbeWriter.cpp
@@ -8,6 +8,8 @@
 #include <sstream>
 #include <unordered_map>
 #include <unordered_set>
+#include <filesystem>
+namespace fs = std::filesystem;
 
 namespace tndm {
 
@@ -83,7 +85,17 @@ void BoundaryProbeWriter<D>::write(double time,
             out_->open(probe.file_name, false);
             write_header(probe, functions);
         } else {
-            out_->open(probe.file_name, true);
+            fs::path pckp(probe.file_name);
+            bool exists = fs::exists(pckp);
+            if (exists) {
+                // open existing file
+                out_->open(probe.file_name, true);
+            } else { // below is needed to support checkpointing
+                // open new file
+                out_->open(probe.file_name, false);
+                // write header
+                write_header(probe, functions);
+            }
         }

ScalarWriter.cpp

--- a/home/jyun/softwares/project-tandem/tandem/src/io/ScalarWriter.cpp
+++ b/ScalarWriter.cpp
@@ -2,6 +2,8 @@
 
 #include <iomanip>
 #include <ios>
+#include <filesystem>
+namespace fs = std::filesystem;
 
 namespace tndm {
 
@@ -19,7 +21,17 @@ void ScalarWriter::write(double time, mneme::span<double> scalars) const {
         out_->open(file_name_, false);
         write_header();
     } else {
-        out_->open(file_name_, true);
+        fs::path pckp(file_name_);
+        bool exists = fs::exists(pckp);
+        if (exists) {
+            // open existing file
+            out_->open(file_name_, true);
+        } else { // below is needed to support checkpointing
+            // open new file
+            out_->open(file_name_, false);
+            // write header
+            write_header();
+        }
     }
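
The three patches above share one pattern: when a writer would normally append, it first checks whether the output file exists, and if it does not (for example, after a restart into a fresh output directory), it creates the file and writes the header. A minimal, self-contained sketch of that pattern follows. It assumes plain `std::ofstream` rather than tandem's own output stream, and `append_row` is a hypothetical helper name, not a function in the code base.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper (not part of tandem): appends one CSV row to
// `file_name`, writing `header` first if the file does not exist yet.
// Returns true if a new file was created.
bool append_row(const std::string& file_name, const std::string& header,
                const std::string& row) {
    const bool exists = fs::exists(fs::path(file_name));
    // Append to an existing file; otherwise create a fresh one.
    std::ofstream out(file_name, exists ? std::ios::app : std::ios::trunc);
    if (!exists) {
        // A new file needs its header before the first data row.
        out << header << '\n';
    }
    out << row << '\n';
    return !exists;
}
```

Calling `append_row` twice on the same path yields one header line followed by two data rows, which is the behavior the patches aim for when a run is resumed from a checkpoint.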

@hpc4geo
Collaborator Author

hpc4geo commented Oct 12, 2023

In general, we don't want to share code that is meant to be merged via PR comments; it's too easy to introduce cut-and-paste mistakes.

Please make a branch of dmay/seas-checkpoint and add your changes to it. I believe you should have write access to push that branch to the remote repo. Once you have done that we can proceed.

@JeenaYun
Collaborator

I don't have permission to push a branch to this repository.
remote: Permission to TEAR-ERC/tandem.git denied to JeenaYun.
Instead, I opened a pull request (#60) from my forked repository.

@TEAR-ERC TEAR-ERC deleted a comment from Thomas-Ulrich Nov 3, 2023
@hpc4geo
Collaborator Author

hpc4geo commented Nov 3, 2023

A traditional way of checkpointing is to keep only the last 2 checkpoints: when creating a new checkpoint n, checkpoint n-2 is removed (this is what MOOSE implements with the option num_files, default 2, https://mooseframework.inl.gov/source/outputs/Checkpoint.html).
This would prevent the number of checkpoint files from growing indefinitely.
Nevertheless, it is a new feature, and I would refrain from asking for new features in a PR.
If that feature were integrated, I would make it the default and keep the checkpointing parameters as you chose them.

This is a good point. This branch now allows the user to specify the storage mode for checkpoints. The default is currently to store only 1 checkpoint on disk. I've updated the PR description at the very top of this page to reflect this.
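
The "limited" storage mode described here can be pictured as a fixed-size queue of checkpoint directories, where writing a new checkpoint evicts the oldest one. The sketch below illustrates that idea only; `LimitedCheckpointStore`, the `step<N>` directory naming, and the eviction order are assumptions made for illustration and are not taken from this branch.

```cpp
#include <cstddef>
#include <deque>
#include <filesystem>
#include <string>
#include <utility>

namespace fs = std::filesystem;

// Hypothetical class (not tandem's implementation): keeps at most
// `max_kept` checkpoint directories under `root`, deleting the oldest
// whenever a new one pushes the count over the limit.
class LimitedCheckpointStore {
public:
    LimitedCheckpointStore(fs::path root, std::size_t max_kept)
        : root_(std::move(root)), max_kept_(max_kept) {}

    // Creates the directory for checkpoint `step`, then prunes old ones.
    fs::path add(long step) {
        fs::path dir = root_ / ("step" + std::to_string(step));
        fs::create_directories(dir);
        kept_.push_back(dir);
        while (kept_.size() > max_kept_) {
            fs::remove_all(kept_.front()); // delete oldest checkpoint on disk
            kept_.pop_front();
        }
        return dir;
    }

    std::size_t size() const { return kept_.size(); }

private:
    fs::path root_;
    std::size_t max_kept_;
    std::deque<fs::path> kept_; // oldest checkpoint at the front
};
```

With `max_kept = 2`, adding checkpoints for steps 10, 20 and 30 leaves only the directories for steps 20 and 30 on disk, which matches the "remove checkpoint n-2" scheme quoted above.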

@hpc4geo
Collaborator Author

hpc4geo commented Nov 3, 2023

Sorry, I was thinking of a case where someone does not want checkpointing to be triggered at all, since we are trying to merge into main. Even if I don't provide any checkpoint paths or -ts_checkpoint_freq_xxx, it seems the code sets -ts_checkpoint_freq_cputime to 30 minutes. If a user runs for several hours (which is usually the case), the code might automatically write out some checkpoints even though the user did not ask for them. Is this a safety feature that you put in intentionally?

Yes, absolutely. Nothing is worse than hitting an error after 4 hrs without a single checkpoint file written, and thus having to redo 4 hrs of computation. It is definitely better to write more often. Note that this is less of an issue now that the default mode uses storage_type = limited.
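
To make the trigger logic discussed here concrete, the following sketch decides whether to checkpoint based on a step frequency and a CPU-time frequency in minutes. It is an illustration under stated assumptions: `CheckpointTrigger` is a hypothetical type, the criteria are combined with a logical OR, and the timer is reset after every checkpoint. The branch may combine the `-ts_checkpoint_freq_*` options differently.

```cpp
// Hypothetical trigger (not tandem's implementation). A frequency <= 0
// disables the corresponding criterion.
struct CheckpointTrigger {
    long freq_step;            // checkpoint every `freq_step` time steps
    double freq_cpu_min;       // checkpoint every `freq_cpu_min` minutes
    double last_cpu_min = 0.0; // elapsed minutes at the last checkpoint

    // `step` is the current time step index, `cpu_min` the elapsed
    // run time in minutes. Returns true if a checkpoint is due.
    bool should_checkpoint(long step, double cpu_min) {
        const bool by_step =
            freq_step > 0 && step > 0 && step % freq_step == 0;
        const bool by_time =
            freq_cpu_min > 0 && (cpu_min - last_cpu_min) >= freq_cpu_min;
        if (by_step || by_time) {
            last_cpu_min = cpu_min; // assumption: any checkpoint resets the timer
            return true;
        }
        return false;
    }
};
```

With `CheckpointTrigger t{0, 30.0}` (step criterion disabled, 30-minute default as mentioned above), a run of several hours writes a checkpoint roughly every half hour even if the user set no frequency option.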

Collaborator

@Thomas-Ulrich Thomas-Ulrich left a comment

LGTM! Maybe you want to merge #61 into #59 before merging into main.

@AliceGabriel
Contributor

AliceGabriel commented Jun 6, 2024

@hpc4geo should this PR be merged now or after merging #61?

move ts checkpointing parameters to toml
@yohaimagen
Collaborator

I was playing with this pull request and found a behavior which I think is undesired.

If I am starting a model from a specific checkpoint and at the same time the PETSc param -ts_dt is set to some value, PETSc will ignore the time step from the checkpoint and rerun the model with the fault params and time of the checkpoint with the dt specified with -ts_dt.

This could cause misalignments and even cause tandem to crash if -ts_dt is not suitable.

@hpc4geo
Collaborator Author

hpc4geo commented Aug 15, 2024

I was playing with this pull request and found a behavior which I think is undesired.

If I am starting a model from a specific checkpoint and at the same time the PETSc param -ts_dt is set to some value, PETSc will ignore the time step from the checkpoint and rerun the model with the fault params and time of the checkpoint with the dt specified with -ts_dt.

This could cause misalignments and even cause tandem to crash if -ts_dt is not suitable.

Good spot @yohaimagen! -ts_dt has two purposes. When TSAdaptType = none, this flag sets the time step to use. When TSAdaptType = basic, it defines the "initial" time step to use. In theory, when using adapt type basic, -ts_dt should only affect the time step size at time t = 0 (that is what "initial" is supposed to mean). That means that when restarting (and assuming several steps have been performed), the option should not have any effect. However, PETSc currently assumes "initial" means "the first time TSStep is called".

I will modify the checkpoint code to raise a warning if -ts_dt is detected when restarting.

@JeenaYun JeenaYun mentioned this pull request Nov 1, 2024