SEAS checkpointing of PETSc TS object #59
base: main
Conversation
…ommit will checkpoint every time step
* Add filename as arg to load(), view()
* Removed unused code
* Error check for VecLoad_Nest
…erving as much state as possible when performing a restart.
…grator. - New tool parameters are:
[1] ts_checkpoint_prefix = <str>. The value of <str> should be something like "/path/to/dir/will/tscheckpoints/will/be/written". The directory will be created (recursively if required) if it does not exist. This parameter is optional. Checkpointing will ONLY occur if this parameter is defined.
[2] ts_checkpoint_frequency = <integer>. The value of <integer> indicates the number of time steps which will occur between checkpoints. If ts_checkpoint_prefix is not specified, this parameter has no effect on the SEAS simulation.
…ing the stage vectors for RK. This is vital for consistent restarts if the RK scheme has the FSAL property. TOML options related to checkpointing are currently unused; all checkpoint control is defined via PETSc args. These options control when a checkpoint is written: -ts_checkpoint_freq_step, -ts_checkpoint_freq_cputime, -ts_checkpoint_freq_physical_time. -ts_checkpoint_path defines the directory where all checkpoint files will be written. -ts_checkpoint_load indicates the directory of checkpoint data you want to use to start a new simulation. Also note that all checkpointed data is stored.
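For illustration only (the path and values below are made up; the invocation pattern follows the example given later in the PR description), checkpointing every 50 time steps or every 30 CPU minutes could be requested with:

./app/tandem your_toml_options --petsc -ts_checkpoint_path /path/to/checkpoints -ts_checkpoint_freq_step 50 -ts_checkpoint_freq_cputime 30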
The example provided above used the mesh attached, along with the standard, unmodified files from
Another example of a restart with two triggers is provided below.
As we can see, step156 was created by our cputime frequency whilst step160 was generated by the timestep frequency.
Could you inform me what you mean by 'mesh attached'? I don't see any attachment. Did you just create the mesh file using
There was supposed to be a file attached to my message. It failed to upload. It is there now as a zip file.
Hi, I confirmed the examples you provided work well. I have a question about the default values. It seems like the code is using some large enough default values if I don't provide any:
TS -ts_checkpoint_freq_step 1000
TS -ts_checkpoint_freq_cputime 3.0000e+01
TS -ts_checkpoint_freq_physical_time 1.0000e+10
I actually don't understand your question. Can you please re-phrase it?
Sorry, I was thinking of a case where someone does not want checkpointing to be triggered at all, since we are trying to merge into the main branch. Even if I don't provide any checkpoint paths or
Okay - I understand your concern now. I can confirm that the default behavior of tandem would change if this branch is merged - specifically tandem would always write out checkpoint files even if they weren't requested. There are pros and cons to every default. With the current defaults in this branch we have:
Cons
Pros
We can certainly change the current defaults. I'd like group consensus before twiddling with anything. This could be a topic for the next tandem meeting. @AliceGabriel maybe you want to add this to the agenda and/or weigh in with an opinion.
Hi, could you check whether outputs (e.g., fault probe output) are properly generated when loading a checkpoint? For example, run
Hi, the checkpoint files can get quite large for 3D simulations. The Green's function checkpoint files of BP5 need 65 GB, this can easily exceed the disk quota of the home directory. I liked the initial approach, where you had to provide a path to enable checkpointing.
This branch is concerned with checkpointing the time integrator, not the Green's functions. As such the data being written per checkpoint file is just the state variable defined on the fault.
The issue is here. The logic of this code snippet is to open a new file and insert the header if time == 0.0, otherwise try to open an existing file. Does the probe file get updated correctly if you do not delete the output files? I suggest the following change might be the way to go.
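For reference, the snippet in question (paraphrased from ProbeWriter<D>::write() as it appears in the diff further down, not an exact copy) looks roughly like this:

// Paraphrased sketch of the current logic: at time == 0.0 a new file is created and the
// header written; otherwise the existing file is opened for appending. After a checkpoint
// restart the first write has time > 0.0, so a missing output file is never (re)created.
if (time == 0.0) {
    out_->open(probe.file_name, false); // create/truncate a new file
    write_header(probe, functions);
} else {
    out_->open(probe.file_name, true);  // append to an existing file
}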
Thanks for the response. I confirmed that the output file does update correctly if I do not delete the output files. This is a fault probe output (fltst_dp025.csv) after running. And I agree with the suggested update to solve the issue. We may need to apply the same logic for all the outputs. I see the same lines in BoundaryProbeWriter.cpp and ScalarWriter.cpp.
Hi, applying the following changes seems to fix the behavior. I don't have permission to commit to this branch, so I am attaching the updated lines here. With these changes, a new output file is created if it does not exist, and I confirmed that the new output file starts at the loaded time.
Domain probe
ProbeWriter.cpp
--- a/home/jyun/softwares/project-tandem/tandem/src/io/ProbeWriter.cpp
+++ b/ProbeWriter.cpp
@@ -7,6 +7,8 @@
#include <sstream>
#include <unordered_map>
#include <unordered_set>
+#include <filesystem>
+namespace fs = std::filesystem;
namespace tndm {
@@ -81,7 +83,17 @@ void ProbeWriter<D>::write(double time, mneme::span<FiniteElementFunction<D>> fu
out_->open(probe.file_name, false);
write_header(probe, functions);
} else {
- out_->open(probe.file_name, true);
+ fs::path pckp(probe.file_name);
+ bool exists = fs::exists(pckp);
+ if (exists) {
+ // open existing file
+ out_->open(probe.file_name, true);
+ } else { // below is needed to support checkpointing
+ // open new file
+ out_->open(probe.file_name, false);
+ // write header
+ write_header(probe, functions);
+ }
}

BoundaryProbeWriter.cpp
--- a/home/jyun/softwares/project-tandem/tandem/src/io/BoundaryProbeWriter.cpp
+++ b/BoundaryProbeWriter.cpp
@@ -8,6 +8,8 @@
#include <sstream>
#include <unordered_map>
#include <unordered_set>
+#include <filesystem>
+namespace fs = std::filesystem;
namespace tndm {
@@ -83,7 +85,17 @@ void BoundaryProbeWriter<D>::write(double time,
out_->open(probe.file_name, false);
write_header(probe, functions);
} else {
- out_->open(probe.file_name, true);
+ fs::path pckp(probe.file_name);
+ bool exists = fs::exists(pckp);
+ if (exists) {
+ // open existing file
+ out_->open(probe.file_name, true);
+ } else { // below is needed to support checkpointing
+ // open new file
+ out_->open(probe.file_name, false);
+ // write header
+ write_header(probe, functions);
+ }
}

ScalarWriter.cpp
--- a/home/jyun/softwares/project-tandem/tandem/src/io/ScalarWriter.cpp
+++ b/ScalarWriter.cpp
@@ -2,6 +2,8 @@
#include <iomanip>
#include <ios>
+#include <filesystem>
+namespace fs = std::filesystem;
namespace tndm {
@@ -19,7 +21,17 @@ void ScalarWriter::write(double time, mneme::span<double> scalars) const {
out_->open(file_name_, false);
write_header();
} else {
- out_->open(file_name_, true);
+ fs::path pckp(file_name_);
+ bool exists = fs::exists(pckp);
+ if (exists) {
+ // open existing file
+ out_->open(file_name_, true);
+ } else { // below is needed to support checkpointing
+ // open new file
+ out_->open(file_name_, false);
+ // write header
+ write_header();
+ }
}
In general we don't want to share code snippets to be merged within PR comments - it's too easy to introduce cut-and-paste mistakes. Please make a branch of dmay/seas-checkpoint and add your changes to it. I believe you should have write access to push that branch to the remote repo. Once you have done that we can proceed.
I don't have permission to push a branch to this repository.
This is a good point. This branch now allows the user to specify the storage mode for checkpoints. The default is currently to store only 1 checkpoint on disk. I've updated the PR description at the very top of this page to reflect this.
Yes, absolutely. Nothing is worse than hitting an error after 4 hrs without a single checkpoint file having been written, thus requiring you to re-do 4 hrs of computation. Definitely better to write more often. Note this is less of an issue now that the default mode uses storage_type =
@hpc4geo should this PR be merged now or after merging #61?
move ts checkpointing parameters to toml
I was playing with this pull request and found a behavior which I think is undesired. If I am starting a model from a specific checkpoint and at the same time the PETSc param -ts_dt is set to some value, PETSc will ignore the time step from the checkpoint and rerun the model with the fault params and time of the checkpoint, but with the dt specified by -ts_dt. This could cause misalignments and even cause tandem to crash if -ts_dt is not suitable.
Good spot @yohaimagen! I will modify the checkpoint code to raise a warning if
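A minimal sketch of what such a warning could look like (hypothetical helper name and wording, not the branch's actual code; assumes a recent PETSc with the PetscCall/PETSC_SUCCESS macros):

#include <petscsys.h>

// Hypothetical helper: warn when the user passed -ts_dt on the command line while a
// checkpoint is being loaded, since the command-line value will override the time step
// stored in the checkpoint.
static PetscErrorCode WarnIfDtOverridesCheckpoint(PetscBool loading_checkpoint) {
    PetscBool dt_set = PETSC_FALSE;
    PetscFunctionBeginUser;
    PetscCall(PetscOptionsHasName(NULL, NULL, "-ts_dt", &dt_set));
    if (loading_checkpoint && dt_set) {
        PetscCall(PetscPrintf(PETSC_COMM_WORLD,
            "[warning] -ts_dt was provided; it will override the dt stored in the checkpoint\n"));
    }
    PetscFunctionReturn(PETSC_SUCCESS);
}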
Summary of changes

The TOML options ts_checkpoint_prefix and ts_checkpoint_frequency currently have no effect.

Summary of new options to control checkpointing

* -ts_checkpoint_load <str> defines the directory to use to load a checkpoint.
* -ts_checkpoint_path <str> defines the root location where checkpoint files will be written.
* -ts_checkpoint_freq_step <int> defines the time step frequency between checkpoints.
* -ts_checkpoint_freq_cputime <float> defines the CPU time (minutes) frequency between checkpoints.
* -ts_checkpoint_freq_physical_time <float> defines the physical time frequency between checkpoints.
* -ts_checkpoint_storage_type <none, limited, unlimited> defines the type of storage for checkpoints. limited will store a finite number of unique checkpoints on disk. unlimited stores all checkpoints. Use none to completely deactivate checkpointing.
* -ts_checkpoint_storage_limited_size <int> defines how many unique checkpoints you want to store on disk. Only has effect with -ts_checkpoint_storage_type limited.

Notes

* The -ts_checkpoint_freq_xxx options can be combined together; if any are satisfied a checkpoint will be written out (a small sketch of this follows these notes).
* The default storage type is limited and the number of unique checkpoints stored is 1 (e.g. -ts_checkpoint_storage_limited_size 1).
* A file checkpoint_info.csv is written which contains time, dt, step and some other info.
* In the directory defined by -ts_checkpoint_path, a file load_last_checkpoint.opts is written. This can be used to restart from the last written checkpoint. For example, run ./app/tandem your_toml_options --petsc your_petsc_options and then ./app/tandem your_toml_options --petsc your_petsc_options -options_file PATH_TO_CHECKPOINT_DATA/load_last_checkpoint.opts
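To make the note about combining triggers concrete, here is a minimal sketch of the "any trigger satisfied" semantics described above (hypothetical names and structure, not the code in this branch):

// Sketch only: a checkpoint is written if ANY of the three frequency criteria is met.
struct CheckpointTriggers {
    long   freq_step;           // -ts_checkpoint_freq_step
    double freq_cputime;        // -ts_checkpoint_freq_cputime (CPU minutes)
    double freq_physical_time;  // -ts_checkpoint_freq_physical_time
};

static bool should_write_checkpoint(const CheckpointTriggers& t, long steps_since_last,
                                    double cpu_minutes_since_last, double physical_time_since_last) {
    bool by_step = steps_since_last         >= t.freq_step;
    bool by_cpu  = cpu_minutes_since_last   >= t.freq_cputime;
    bool by_time = physical_time_since_last >= t.freq_physical_time;
    return by_step || by_cpu || by_time; // any satisfied criterion triggers a checkpoint
}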
Testing