Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checkpointing option for GAP-C programs #163

Merged
merged 84 commits into from
Mar 3, 2023
Merged

Checkpointing option for GAP-C programs #163

merged 84 commits into from
Mar 3, 2023

Conversation

fymue
Copy link
Collaborator

@fymue fymue commented Dec 18, 2022

GAP-C Checkpointing

Motivation

Certain GAP-C programs/jobs take a very long time to run, some requiring multiple weeks or even months to finish their calculations. For obvious reasons, these jobs are usually scheduled on a server/cluster. But due to the very lengthy execution time of these jobs, they are susceptible to crashing at some point due to some unforeseeable problem related to e.g. the server/cluster. In such a case, a lot of progress and time could be lost.

In order to counteract that, I developed an automated "checkpointing" method, which can be activated via a GAP-C compiler option and allows for periodic "checkpointing" of the program in a user-specified interval. If this option is active during compilation, a slightly modified program will be generated that periodically archives all non-terminal tables to the disk, which can then be read at the next program restart if the program were to crash. This allows the program to more or less "pick up were it left off" and continue calculating from (close to) the point in time where it crashed (i.e. the last "checkpoint") since all table values up until the last checkpoint don't have to be recalculated and can simply be returned, which allows the program to advance much faster through the DP algorithm and in turn saves us a lot of valuable time.

Method

The central idea of this checkpointing method is to periodically archive all non-terminal tables (i.e. write them to the disk) in user-specified intervals, which serve as "checkpoints" of the program since they essentially contain the progress of the DP algorithm. If a checkpointed program were to crash, these archived tables can be read instead of initializing the DP tables empty at the beginning of the restarted program, so the DP algorithm will advance much quicker than normally since all calculated values up to the last checkpoint of the previously (crashed) program are already available. Thus the program essentially continues from the last "checkpoint".

Usage

The checkpointing routine can be integrated with the --checkpoint option when invoking GAP-C (e.g. gapc -p alg_mfe --checkpoint nodangle.gap). This will include all of the code required for the checkpointing to work into the GAPC-generated C++-files. Once the user compiles this code, these user-compiled binaries will get additional options to specify the checkpointing interval as well as the input/output path for the checkpoints. For example, executing the user-compiled binary (e.g. out) with the option --checkpointInterval 0:1:30:0 would archive all non-terminal tables periodically every 90 minutes until the program finishes (or crashes). By default, the table archives will be saved as binary files in the current working directory, unless the user specifies a different output path with the --checkpointOutput PATH/PREFIX option. If the user appends PREFIX to the output path, every generated file will start with PREFIX (note that the output path must end with a / (\ on Windows) if no PREFIX is desired; whatever string comes after the last / or (\) will be parsed as PREFIX). In order to start a program from an existing checkpoint, the user can provide the path to a Logfile that is generated along with the table archives when a checkpointed program reaches its first checkpoint and contains information regarding the user input for that program call as well as the paths of the checkpoints that program generated with the --checkpointInput LOGFILE option. If for some reason this Logfile isn't available anymore, the user can also simply specify the path to a text file containing the absolute paths to the checkpoint archives he/she wishes to load.

Implementation

To easily insert the checkpointing code at the right places during compilation, I wrote a new Checkpoint class (defined in src/checkpoint.hh). This class contains various generic methods handling the reading of the archived tables, the archiving of the tables, the initiation of the thread which handles the periodic archiving and the insertion of the required include statements along with other helper functions, respectively.

@fymue fymue added the enhancement New feature or request label Dec 18, 2022
@fymue fymue self-assigned this Dec 18, 2022
@fymue fymue changed the title Checkpointing ooption for GAP-C programs Checkpointing option for GAP-C programs Dec 18, 2022
@fymue fymue requested a review from sjanssen2 December 18, 2022 13:44
@sjanssen2
Copy link
Member

sjanssen2 commented Dec 19, 2022

This is amazing work. Congratulations @fymue for coming up with this peace of code in just a couple of days!
I am going through it now and will use this comment for taking notes:

  • gapc comes with a logging mechanism, we might want to "log" that checkpointing routines have been integrated, as in e.g.
    Log::instance()->warning(
    ?
  • wouldn't it make more sense for the user to define the checkpointing interval when executing the compile product, not the compiler? We could modify the https://github.com/jlab/gapc/blob/master/rtlib/generic_main.cc for that.
  • how much overhead is the checkpointing at all? Should we maybe activate by default and say it should dump a checkpoint every 8h or so? Thus, inexperienced user would profit automatically, or - if you don't foresee a longer runtime - you also have this advantage
  • am I right that the sleep duration depends on wall time, not on CPU time? Which can be very different in our cluster environment, if a process is hold for a longer duration. I think wall time is what we want here
  • the dump filenames should at least contain the "process ID" if not also the machine name for situations in which a user runs multiple instances of the program at the same time. Think of @kmaibach's ~100 pareto jobs. Currently, checkpoint files would overwrite each other.
  • what if the user calls his/her program like ./out -T 30 UACGUACG, the binary creates a checkpoint and the user than executes the program with different parameters, like ./out -T 99 AAAA. Can the checkpointing mechanism warn the user about this situation or would it silently return wrong results?
  • I think it would be ideal, if checkpoint files would be removes after successful execution, the start in a clear state for the next execution
  • I also want to add the option to save the table archives in a user-specified directory, which is probably more convenient than just saving them in the program directory. Default should be users current working directory, not the directory of the binary, as this might be in a system wide installed directory, to which the user has no write permissions (in productive scenarios)

src/checkpoint.hh Outdated Show resolved Hide resolved
src/checkpoint.hh Outdated Show resolved Hide resolved
src/checkpoint.hh Outdated Show resolved Hide resolved
src/gapc.cc Outdated Show resolved Hide resolved
src/gapc.cc Outdated Show resolved Hide resolved
@fymue
Copy link
Collaborator Author

fymue commented Dec 19, 2022

@sjanssen2

* [ ]  gapc comes with a logging mechanism, we might want to "log" that checkpointing routines have been integrated, as in e.g. https://github.com/jlab/gapc/blob/f333305e013be9280e7cdfd80adc7fcf24b2015e/src/gapc.cc#L587

Done.

* [ ]  wouldn't it make more sense for the user to define the checkpointing interval when executing the compile product, not the compiler? We could modify the https://github.com/jlab/gapc/blob/master/rtlib/generic_main.cc for that.

We could do that. I thought this might be better too, but I didn't want to touch anything related to out_main.cc yet, because I was unsure if you would be OK with me modifying that part of the compilation since it is basically the same for every GAP-C compilation.

* [ ]  how much overhead is the checkpointing at all? Should we maybe activate by default and say it should dump a checkpoint every 8h or so? Thus, inexperienced user would profit automatically, or - if you don't foresee a longer runtime - you also have this advantage

The time the archiving/dumping requires depends on the size of the non-terminal tables. Do you have a rough estimate as to how big these tables realistically can get? In my tests, archiving a table (std::vector<double>) with 100 million elements took about 900ms. Archiving the "real" tables of a GAP-C program only increased the total execution time of the programs I tested by a few hundred milliseconds. These programs ran for about 35 seconds regularly and the checkpointing interval was set to 5s, so I would say the checkpointing overhead adds at most about 5% to the total execution time, probably way less. So as long as the tables aren't gigantic in size, we could just turn it on by default.

* [ ]  am I right that the sleep duration depends on wall time, not on CPU time? Which can be very different in our cluster environment, if a process is hold for a longer duration. I think wall time is what we want here

I'm not sure exactly. The C++ standard doesn't define this clearly, as it is up to the implementation/compiler to use a steady clock or the system clock (cf. Cppreference). How exactly the sleep duration is handled if the process is held I don't know. Do you think I should make the thread explicitly measure the duration with the system clock and sleep for that period instead?

* [ ]  the dump filenames should at least contain the "process ID" if not also the machine name for situations in which a user runs multiple instances of the program at the same time. Think of @kmaibach's ~100 pareto jobs. Currently, checkpoint files would overwrite each other.
* [ ]  I think it would be ideal, if checkpoint files would be removes after successful execution, the start in a clear state for the next execution

Makes sense.

* [ ]  what if the user calls his/her program like `./out -T 30 UACGUACG`, the binary creates a checkpoint and the user than executes the program with different parameters, like `./out -T 99 AAAA`. Can the checkpointing mechanism warn the user about this situation or would it silently return wrong results?

I don't think there is a way to easily check for different user inputs other than to log the input in a file and compare it to the current input. But since you also proposed to adjust the names of the archives to avoid collisions, maybe we can hash the user input somehow and append it to the archive names instead of appending e.g. the PID. That way we could infer if an archive that was created from a different user input is going to be read and warn the user. Thoughts?

* [ ]  `I also want to add the option to save the table archives in a user-specified directory, which is probably more convenient than just saving them in the program directory.` Default should be users current working directory, not the directory of the binary, as this might be in a system wide installed directory, to which the user has no write permissions (in productive scenarios)

Ok, I will adjust the default directory to be the current working directory. Should I still add the option the specify another directory for the archives?

@sjanssen2
Copy link
Member

We could do that. I thought this might be better too, but I didn't want to touch anything related to out_main.cc yet, because I was unsure if you would be OK with me modifying that part of the compilation since it is basically the same for every GAP-C compilation.

I would love to have a more speaking "interface" for the user compiled binaries. Having at least speaking options for checkpointing will definitely be appreciated. It should be independent of user instances or options, thus a general change should be fine. Please go ahead!

@sjanssen2
Copy link
Member

The time the archiving/dumping requires depends on the size of the non-terminal tables. Do you have a rough estimate as to how big these tables realistically can get? In my tests, archiving a table (std::vector<double>) with 100 million elements took about 900ms. Archiving the "real" tables of a GAP-C program only increased the total execution time of the programs I tested by a few hundred milliseconds. These programs ran for about 35 seconds regularly and the checkpointing interval was set to 5s, so I would say the checkpointing overhead adds at most about 5% to the total execution time, probably way less. So as long as the tables aren't gigantic in size, we could just turn it on by default.

that is not easy to say. In principle, we support "only" quadratic tables with respect to input size. Let's say, I haven't seen much inputs larger than 10.000 characters - but it's hard to predict what users will do in the future. Typically, storage << compute, the reason why we do dynamic programming at all.

@sjanssen2
Copy link
Member

I'm not sure exactly. The C++ standard doesn't define this clearly, as it is up to the implementation/compiler to use a steady clock or the system clock (cf. Cppreference). How exactly the sleep duration is handled if the process is held I don't know. Do you think I should make the thread explicitly measure the duration with the system clock and sleep for that period instead?

let's do a quick test. Set interval to 5 minutes, run a large input. Quickly stop the binary call (STRG+Z) wait for ~3 minutes and observe if files will be produced after ~5minutes or after ~2.

@sjanssen2
Copy link
Member

I don't think there is a way to easily check for different user inputs other than to log the input in a file and compare it to the current input. But since you also proposed to adjust the names of the archives to avoid collisions, maybe we can hash the user input somehow and append it to the archive names instead of appending e.g. the PID. That way we could infer if an archive that was created from a different user input is going to be read and warn the user. Thoughts?

as a first shot, we should provide an additional file with similar naming scheme, which tells the user which parameters he/she used. Thus, there is a least a manual way to figure out the correct way of re-executing with these checkpoints. Otherwise, nobody will be able to make sense of these files after say a week.

@sjanssen2
Copy link
Member

Should I still add the option the specify another directory for the archives?

yes please :-)

@fymue
Copy link
Collaborator Author

fymue commented Dec 19, 2022

Alright then, I will let the compiler automatically insert the checkpointing code into the out files and adjust the out_main.cc generation to provide the --checkpoint flag so it can be turned on from there. I will also provide the option to specify a different directory other than the current working directory.

let's do a quick test. Set interval to 5 minutes, run a large input. Quickly stop the binary call (STRG+Z) wait for ~3 minutes and observe if files will be produced after ~5minutes or after ~2.

Interestingly, neither of those things happen. If the interval is set to e.g. 10s and the process (which runs e.g. 30s) is stopped after 1s and resumed 15s after that, the tables are immediately dumped after the resumption. I guess since the archiving thread is detached from the main thread but is still a child process of the parent process of the executable, it can't dump the tables while the parent process is paused, but its own clock "continues to tick" and it will dump the tables ASAP (if the interval/its sleep period is over). This is not necessarily what I intended, but also shouldn't cause any problems on a cluster. What do you think @sjanssen2 ?

…am runs to completion; use cwd as default directory; add backend for possible user-specified directory (still need to move checkpointing options to out for all of this to work)
…ine integration is active by default, can be turned of via compiler option; improved some errors msgs/warnings
…ting/parsing; still need some more stuff for Logging
…arse Logfile to read in archives; add option to directly specify checkpoint input path; checkpointing has to be manually integrated w/ --checkpoint GAP-C option until I figure out the correct answer type inference
@fymue
Copy link
Collaborator Author

fymue commented Dec 22, 2022

@sjanssen2 With the latest commit, I pretty much implemented all of your suggestions/improvement ideas. I will quickly summarize what I've implemented so far:

  • archive names now look like binaryname_pid_tablename(e.g. out_123456_iloop_table) for easier identification
  • a Logfile containing the command line input as well as the paths of all archived tables will be created
  • all table archives (as well as the Logfile) will be deleted after a program runs to completion
  • the checkpointing routine still has to be activated with the --checkpoint GAP-C flag (for now)
  • added checkpointing options to set the output/input path for all checkpoints as well as the interval for the user-compiled binaries (cf. changes to generic_opts.hh)
  • table archives can automatically be read from an existing Logfile (which still exists after a program crash). This file can obviously also be read by the user, who then manually specifies the checkpoint input path. This parsing mechanism also checks if the command line inputs of the run that created the archives and the current run match. If not, the tables archives will not be used.
  • correct answer/table datatype inference
  • automatic serialization of non-primitive types

I didn't change anything regarding the measurement of the sleep duration, since it seems to use Wall time (even though it behaves slightly different than expected, see previous comment).

  • I also just noticed that whatever C++-Library MacOS uses apparently doesn't support std::filesystem yet. That's annoying , but I guess I can probably refactor to using boost::filesystem instead.

Copy link
Member

@sjanssen2 sjanssen2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, great work @fymue ... but with every new feature addition, I crave for more ;-) There are some comment in this review.

  • Now I think that user should opt in to activate checkpointing for his/her binary, as it is right now.
  • The sleep duration mechanism is OK, leave as is

rtlib/generic_opts.hh Outdated Show resolved Hide resolved
rtlib/generic_opts.hh Show resolved Hide resolved
rtlib/generic_opts.hh Show resolved Hide resolved
rtlib/generic_opts.hh Outdated Show resolved Hide resolved
rtlib/generic_opts.hh Outdated Show resolved Hide resolved
src/checkpoint.hh Outdated Show resolved Hide resolved
rtlib/generic_opts.hh Outdated Show resolved Hide resolved
rtlib/generic_opts.hh Outdated Show resolved Hide resolved
src/checkpoint.hh Outdated Show resolved Hide resolved
src/checkpoint.hh Outdated Show resolved Hide resolved
…pointInterval shortcut to '-p'; add '-ldl' flag to makefile; remove redundant <fileystem> include
Copy link
Member

@sjanssen2 sjanssen2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow this is a huge one and fantastic work! For now, I skipped all files in rtlib and therefore chose "request changes". I will continue tomorrow

testdata/regresstest/run.sh Outdated Show resolved Hide resolved
testdata/regresstest/config Outdated Show resolved Hide resolved
testdata/regresstest/config Show resolved Hide resolved
src/gapc.cc Outdated Show resolved Hide resolved
src/gapc.cc Outdated Show resolved Hide resolved
src/checkpoint.hh Outdated Show resolved Hide resolved
src/checkpoint.hh Show resolved Hide resolved
src/checkpoint.hh Show resolved Hide resolved
src/checkpoint.hh Show resolved Hide resolved
src/checkpoint.hh Show resolved Hide resolved
@fymue
Copy link
Collaborator Author

fymue commented Feb 21, 2023

The last 6 commits should cover all of your requested changes. I split them up into 6 commits instead of just 1 big commit so it's a little bit easier to distinguish between the different requested changes.

I also found a bug in my Shape de)serialization methods and managed to fix it. So the two remaining pKiss programs (shapes and probs) that didn't work before now can be checkpointed as well. I will have to open a fold-grammars PR for this a some point.

I also wrote some more checkpointing tests for the String, Rope and Shape datatypes, for which I will open a gapc-test-suite PR next so new the Truth files are online, after which I will push the updated regresstest/config file.

@sjanssen2 Regarding the failing tests:
Since you asked me to store the default checkpointing interval as a variable so it's the same everywhere, I figured I'm just going to store it as a macro and concatentate that macro with the checkpoint help message so it always displays the value of that marco.
However, this method seems to cause some g++ version conflicts: C++11 want a space between a string literal and a macro for concatentation, C++17 wants no space. It looks like MacOS still runs the older version and breaks because of this. Shouldn't MacOS and Linux run the same C++ version?

To fix this bug I could refactor this and simply create a std::string containing the help msg and use that string as the argument for the command line option.

…ific error msg if table contains non-supported external type; refactor help msg so C++11/MacOS is happy; refactor user-specified file prefix parsing to be portable
…int non-supported type names when they are detected
@fymue
Copy link
Collaborator Author

fymue commented Feb 25, 2023

Final update from my end:

  • made checkpointing mechanism compatible with classified/interleaved products
  • refactored default checkpointing interval to be a variable (macro) so whatever gets displayed in the GAPC help message is the default interval everywhere
  • specifying a custom prefix for the generated files now overwrites the default prefix of binaryName_PID_
  • adjusted check for matching user input when loading checkpoint to not care about the order of the individual command line arguments
  • fixed the test function to only delete tables/Logfiles of successful tests/of the same process

Once @sjanssen2 reviewed my gapc-test-suite PR for the new Truth files, I will push the updated config file with the new tests I mentioned in my last comment.

I also managed to make the checkpointing mechanism work in cyk-style programs, but that will be a separate PR which I will open once this one has been merged. For your relief: that PR will be a lot shorter than this one.

Once I've covered all your upcoming requested changes and comments for this PR, I will finally open a fold-grammars PR for the updated rnaoptions.hh file as well as the serialize methods for all supported external types (also going to be a pretty short PR).

That will be everything I still need to add/change. After that, I would consider this feature completed.

Copy link
Member

@sjanssen2 sjanssen2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left additional comments regarding changes in rtlib.
I repeat my self, but I am very impressed with your work!

src/checkpoint.hh Show resolved Hide resolved
rtlib/generic_opts.hh Show resolved Hide resolved
rtlib/generic_opts.hh Show resolved Hide resolved
rtlib/generic_opts.hh Show resolved Hide resolved
rtlib/hashtng.hh Outdated Show resolved Hide resolved
rtlib/rope.hh Show resolved Hide resolved
rtlib/rope.hh Show resolved Hide resolved
@@ -622,8 +701,26 @@ class Ref {
}
};

template <typename O, typename Refcount>
inline O &operator<<(O &o, const Ref<Refcount>& r) { r.put(o); return o; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow. Great find! Must have been a hard time though.

rtlib/shape.hh Outdated Show resolved Hide resolved
rtlib/string.hh Outdated Show resolved Hide resolved
@fymue
Copy link
Collaborator Author

fymue commented Feb 27, 2023

Because I just noticed that MacOS apparently doesn't support the timeout command I use in my test function (and I also didn't really like the mechanism anyways), I just added one more minor feature, which is the additional --keepArchives or -K option for user binaries.

With this option enabled, the checkpoint archives as well as the Logfile won't be deleted after the program finished its calculations. This makes debugging/testing a lot easier and might also make sense for end users if they e.g. want to store the table contents a program created somewhere and use/process that data later on.

@@ -31,7 +31,7 @@ jobs:
run: sudo apt-get install flex bison make libboost-all-dev libgsl-dev python3 python3-pip

- name: Checkout truth
run: git clone --branch master https://github.com/jlab/gapc-test-suite.git $GITHUB_WORKSPACE/../gapc-test-suite
run: git clone --branch more_checkpointing_truth_files https://github.com/jlab/gapc-test-suite.git $GITHUB_WORKSPACE/../gapc-test-suite
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't forget to change back once merged

@sjanssen2
Copy link
Member

sjanssen2 commented Mar 3, 2023

  • I love the progress report, can we print the percentage also into the messages, something like Info: Archived "formula_table" table into "/Daten/Git/jlab/gapc/out_536314_formula_table". 25% of computation done.
  • one last cosmetically issue regarding SEQ and REP size_of refactoring?!
  • I also like the --keepArchives function as it might allow crazy new backtracing functions :-)

@fymue
Copy link
Collaborator Author

fymue commented Mar 3, 2023

I love the progress report, can we print the percentage also into the messages, something like Info: Archived "formula_table" table into "/Daten/Git/jlab/gapc/out_536314_formula_table". 25% of computation done.

Sure, I will add that. I did notice though that most of the time not all tables are 100% filled by the end of the calculations, so this might be a little misleading. Maybe we only print the percentage for the table that contains the answer at the end? Like:

Info: Archived "iloop_table" table into "./out_1234_iloop_table"
Info: Archived "struct_table" table into "./out_1234_struct_table"
Info: 50% of computation done

, where 50% would be the tabulation status of the answer table.

one last cosmetically issue regarding SEQ and REP size_of refactoring?!

See comment in conversation regarding this.

@sjanssen2
Copy link
Member

I love the progress report, can we print the percentage also into the messages, something like Info: Archived "formula_table" table into "/Daten/Git/jlab/gapc/out_536314_formula_table". 25% of computation done.

Sure, I will add that. I did notice though that most of the time not all tables are 100% filled by the end of the calculations, so this might be a little misleading. Maybe we only print the percentage for the table that contains the answer at the end? Like:

Info: Archived "iloop_table" table into "./out_1234_iloop_table"
Info: Archived "struct_table" table into "./out_1234_struct_table"
Info: 50% of computation done

, where 50% would be the tabulation status of the answer table.

one last cosmetically issue regarding SEQ and REP size_of refactoring?!

See comment in conversation regarding this.

That's the difference to CYK. Here, we don't know in advance which / how many cells have to be filled for the final answer. Thus you are right, better not call it computation done, but maybe X% table filled?!

@fymue
Copy link
Collaborator Author

fymue commented Mar 3, 2023

That's the difference to CYK. Here, we don't know in advance which / how many cells have to be filled for the final answer. Thus you are right, better not call it computation done, but maybe X% table filled?!

Ok, makes sense. Sounds good, I will add the filled % for every table to their info message then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants