Checkpointing option for GAP-C programs #163
…adjust obj.run method to execute and stop periodic archiving
…g works (well), but currently only for primitive answer types (double, int etc.)
…ake code printing more generic
This is amazing work. Congratulations @fymue for coming up with this piece of code in just a couple of days!
Done.
We could do that. I thought this might be better too, but I didn't want to touch anything related to
The time the archiving/dumping requires depends on the size of the non-terminal tables. Do you have a rough estimate as to how big these tables realistically can get? In my tests, archiving a table (
I'm not sure exactly. The
Makes sense.
I don't think there is a way to easily check for different user inputs other than to log the input in a file and compare it to the current input. But since you also proposed to adjust the names of the archives to avoid collisions, maybe we can hash the user input somehow and append it to the archive names instead of appending e.g. the PID. That way we could infer if an archive that was created from a different user input is going to be read and warn the user. Thoughts?
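A minimal sketch of the hashing idea from the comment above, not the code in this PR. It assumes the user input is available as a list of argument strings; `fnv1a64`, `input_fingerprint` and `archive_name` are hypothetical helper names, and the naming scheme is only an illustration. FNV-1a is used here instead of `std::hash` because the standard only guarantees that `std::hash` is consistent within a single program execution, while an archive name has to be reproducible across restarts.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// 64-bit FNV-1a: small and deterministic, so the same input always yields
// the same value, independent of the standard library in use.
inline std::uint64_t fnv1a64(const std::string &data) {
  std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
  for (unsigned char c : data) {
    hash ^= c;
    hash *= 1099511628211ULL;  // FNV prime
  }
  return hash;
}

// Hypothetical helper: fingerprint of all user arguments as a hex string.
// Arguments are joined with NUL, which cannot occur inside an argv string,
// so {"ab"} and {"a", "b"} are hashed from different byte sequences.
inline std::string input_fingerprint(const std::vector<std::string> &args) {
  std::string joined;
  for (const std::string &arg : args) {
    joined += arg;
    joined += '\0';
  }
  std::ostringstream out;
  out << std::hex << fnv1a64(joined);
  return out.str();
}

// Hypothetical naming scheme: <prefix><table name>_<input fingerprint>
inline std::string archive_name(const std::string &prefix,
                                const std::string &table,
                                const std::vector<std::string> &args) {
  assert(!table.empty());
  return prefix + table + "_" + input_fingerprint(args);
}
```

On restart, the program could recompute the fingerprint from the current input and warn if no archive with a matching suffix exists. A hash only detects a mismatch; it does not tell the user what the original input was, which is why an additional human-readable file is still useful.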
Ok, I will adjust the default directory to be the current working directory. Should I still add the option to specify another directory for the archives?
I would love to have a more speaking "interface" for the user compiled binaries. Having at least speaking options for checkpointing will definitely be appreciated. It should be independent of user instances or options, thus a general change should be fine. Please go ahead!
that is not easy to say. In principle, we support "only" quadratic tables with respect to input size. Let's say, I haven't seen many inputs larger than 10,000 characters - but it's hard to predict what users will do in the future. Typically, storage << compute, the reason why we do dynamic programming at all.
let's do a quick test. Set interval to 5 minutes, run a large input. Quickly stop the binary call (Ctrl+Z), wait for ~3 minutes and observe whether files are produced after ~5 minutes or after ~2.
as a first shot, we should provide an additional file with a similar naming scheme, which tells the user which parameters he/she used. Thus, there is at least a manual way to figure out the correct way of re-executing with these checkpoints. Otherwise, nobody will be able to make sense of these files after say a week.
yes please :-)
Alright then, I will let the compiler automatically insert the checkpointing code into the
Interestingly, neither of those things happen. If the interval is set to e.g. 10s and the process (which runs e.g. 30s) is stopped after 1s and resumed 15s after that, the tables are immediately dumped after the resumption. I guess since the archiving thread is detached from the main thread but is still a child process of the parent process of the executable, it can't dump the tables while the parent process is paused, but its own clock "continues to tick" and it will dump the tables ASAP (if the interval/its sleep period is over). This is not necessarily what I intended, but also shouldn't cause any problems on a cluster. What do you think @sjanssen2 ?
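For illustration, a periodic archiving thread could be sketched as follows. This is an assumption-laden sketch, not the thread code from this PR: `PeriodicArchiver` and its `dump` callback are hypothetical names, and the worker is joined on `stop()` rather than detached so that the example shuts down cleanly. The wait is an elapsed-time wait (`std::condition_variable::wait_for`), so a pause of the whole process that outlasts the interval would plausibly lead to a dump right after resumption, which matches the behaviour described above.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper: calls dump() once per interval until stop() is called.
class PeriodicArchiver {
 public:
  PeriodicArchiver(std::chrono::milliseconds interval,
                   std::function<void()> dump)
      : interval_(interval), dump_(std::move(dump)),
        worker_([this] { run(); }) {
    assert(interval.count() > 0);
  }

  ~PeriodicArchiver() { stop(); }

  // Wakes the worker immediately and waits for it to finish.
  void stop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_all();
    if (worker_.joinable()) worker_.join();
  }

 private:
  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (!stop_) {
      // Returns true if stop_ was set, false if the interval elapsed.
      if (cv_.wait_for(lock, interval_, [this] { return stop_; })) break;
      lock.unlock();
      dump_();  // archive the tables without holding the lock
      lock.lock();
    }
  }

  std::chrono::milliseconds interval_;
  std::function<void()> dump_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stop_ = false;
  std::thread worker_;  // declared last: starts after all other members exist
};
```

Using a condition variable instead of a plain sleep lets the main thread end the archiving promptly once the calculation is done, instead of waiting for the rest of a possibly very long interval.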
…am runs to completion; use cwd as default directory; add backend for possible user-specified directory (still need to move checkpointing options to out for all of this to work)
…ine integration is active by default, can be turned of via compiler option; improved some errors msgs/warnings
…ting/parsing; still need some more stuff for Logging
…arse Logfile to read in archives; add option to directly specify checkpoint input path; checkpointing has to be manually integrated w/ --checkpoint GAP-C option until I figure out the correct answer type inference
@sjanssen2 With the latest commit, I pretty much implemented all of your suggestions/improvement ideas. I will quickly summarize what I've implemented so far:
I didn't change anything regarding the measurement of the sleep duration, since it seems to use Wall time (even though it behaves slightly different than expected, see previous comment).
…le datatype instead of just axiom
again, great work @fymue ... but with every new feature addition, I crave more ;-) There are some comments in this review.
- Now I think that the user should opt in to activate checkpointing for his/her binary, as it is right now.
- The sleep duration mechanism is OK, leave as is
…pointInterval shortcut to '-p'; add '-ldl' flag to makefile; remove redundant <fileystem> include
Wow this is a huge one and fantastic work! For now, I skipped all files in rtlib and therefore chose "request changes". I will continue tomorrow
…lp msg for default interval; exit gapc when cp integration fails
…tored/fixed is_supported/gen_type_accessor methods to also work with user-defined types containing compound types
The last 6 commits should cover all of your requested changes. I split them up into 6 commits instead of just 1 big commit so it's a little bit easier to distinguish between the different requested changes. I also found a bug in my I also wrote some more checkpointing tests for the @sjanssen2 Regarding the failing tests: To fix this bug I could refactor this and simply create a |
…ific error msg if table contains non-supported external type; refactor help msg so C++11/MacOS is happy; refactor user-specified file prefix parsing to be portable
…int non-supported type names when they are detected
…omparison in parse_checkpoint_log isn't dependent on argument order
Final update from my end:
Once @sjanssen2 reviewed my gapc-test-suite PR for the new Truth files, I will push the updated I also managed to make the checkpointing mechanism work in cyk-style programs, but that will be a separate PR which I will open once this one has been merged. For your relief: that PR will be a lot shorter than this one. Once I've covered all your upcoming requested changes and comments for this PR, I will finally open a fold-grammars PR for the updated That will be everything I still need to add/change. After that, I would consider this feature completed. |
left additional comments regarding changes in rtlib.
I repeat myself, but I am very impressed with your work!
@@ -622,8 +701,26 @@ class Ref {
}
};

template <typename O, typename Refcount>
inline O &operator<<(O &o, const Ref<Refcount>& r) { r.put(o); return o; }
wow. Great find! Must have been a hard time though.
…es; handle case where tables are enumerated as Type::Usage; refactored some stuff/comments
…exit; refactor cp test func
Because I just noticed that MacOS apparently doesn't support the With this option enabled, the checkpoint archives as well as the Logfile won't be deleted after the program finished its calculations. This makes debugging/testing a lot easier and might also make sense for end users if they e.g. want to store the table contents a program created somewhere and use/process that data later on. |
…some code so its C++11 supported (for MacOS)
@@ -31,7 +31,7 @@ jobs:
 run: sudo apt-get install flex bison make libboost-all-dev libgsl-dev python3 python3-pip

 - name: Checkout truth
-run: git clone --branch master https://github.com/jlab/gapc-test-suite.git $GITHUB_WORKSPACE/../gapc-test-suite
+run: git clone --branch more_checkpointing_truth_files https://github.com/jlab/gapc-test-suite.git $GITHUB_WORKSPACE/../gapc-test-suite
don't forget to change back once merged
Sure, I will add that. I did notice though that most of the time not all tables are 100% filled by the end of the calculations, so this might be a little misleading. Maybe we only print the percentage for the table that contains the answer at the end? Like:
, where 50% would be the tabulation status of the answer table.
See comment in conversation regarding this.
That's the difference from CYK. Here, we don't know in advance which / how many cells have to be filled for the final answer. Thus you are right, better not to call it "computation done", but maybe "X% table filled"?!
Ok, makes sense. Sounds good, I will add the filled % for every table to their info message then.
GAP-C Checkpointing
Motivation
Certain GAP-C programs/jobs take a very long time to run, some requiring multiple weeks or even months to finish their calculations. For obvious reasons, these jobs are usually scheduled on a server/cluster. But due to the very lengthy execution time of these jobs, they are susceptible to crashing at some point due to some unforeseeable problem related to e.g. the server/cluster. In such a case, a lot of progress and time could be lost.
To counteract that, I developed an automated "checkpointing" method, which can be activated via a GAP-C compiler option and allows for periodic "checkpointing" of the program in a user-specified interval. If this option is active during compilation, a slightly modified program is generated that periodically archives all non-terminal tables to disk. If the program crashes, these archives can be read at the next program start. This allows the program to more or less "pick up where it left off" and continue calculating from (close to) the point in time where it crashed (i.e. the last "checkpoint"): all table values computed up to the last checkpoint don't have to be recalculated and can simply be returned, so the program advances much faster through the DP algorithm, which saves a lot of valuable time.
Method
The central idea of this checkpointing method is to periodically archive all non-terminal tables (i.e. write them to disk) in user-specified intervals. These archives serve as "checkpoints" of the program since they essentially contain the progress of the DP algorithm. If a checkpointed program crashes, the archived tables can be read at the beginning of the restarted program instead of initializing the DP tables empty. The DP algorithm then advances much quicker than usual, since all values calculated up to the last checkpoint of the previous (crashed) run are already available. Thus the program essentially continues from the last "checkpoint".
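The archive/restore cycle described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions and not the serialization used in this PR: the `Table` struct (cell values plus a per-cell "already tabulated" flag), the file layout and the function names `archive_table` / `restore_table` are all hypothetical, and the raw byte layout is only readable on the same kind of machine that wrote it.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a non-terminal table.
struct Table {
  std::vector<double> values;           // DP cell values
  std::vector<std::uint8_t> tabulated;  // 1 if the cell has been computed
};

// Writes the table to a binary file: cell count, values, tabulated flags.
inline bool archive_table(const Table &t, const std::string &path) {
  assert(t.values.size() == t.tabulated.size());
  std::ofstream out(path, std::ios::binary | std::ios::trunc);
  if (!out) return false;
  const std::uint64_t n = t.values.size();
  out.write(reinterpret_cast<const char *>(&n), sizeof n);
  out.write(reinterpret_cast<const char *>(t.values.data()),
            static_cast<std::streamsize>(n * sizeof(double)));
  out.write(reinterpret_cast<const char *>(t.tabulated.data()),
            static_cast<std::streamsize>(n));
  return static_cast<bool>(out.flush());
}

// Reads a table back; leaves t untouched if the file is missing or truncated.
// A real implementation would also validate n before allocating.
inline bool restore_table(Table &t, const std::string &path) {
  std::ifstream in(path, std::ios::binary);
  if (!in) return false;
  std::uint64_t n = 0;
  if (!in.read(reinterpret_cast<char *>(&n), sizeof n)) return false;
  Table loaded;
  loaded.values.resize(n);
  loaded.tabulated.resize(n);
  in.read(reinterpret_cast<char *>(loaded.values.data()),
          static_cast<std::streamsize>(n * sizeof(double)));
  in.read(reinterpret_cast<char *>(loaded.tabulated.data()),
          static_cast<std::streamsize>(n));
  if (!in) return false;
  t = std::move(loaded);
  return true;
}
```

After a successful `restore_table`, every cell whose flag is set can be returned directly instead of being recomputed, which is what lets the restarted run skip ahead to the last checkpoint.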
Usage
The checkpointing routine can be integrated with the `--checkpoint` option when invoking GAP-C (e.g. `gapc -p alg_mfe --checkpoint nodangle.gap`). This includes all of the code required for checkpointing in the GAPC-generated C++ files. Once the user compiles this code, the resulting binary gets additional options to specify the checkpointing interval as well as the input/output paths for the checkpoints.
For example, executing the user-compiled binary (e.g. `out`) with the option `--checkpointInterval 0:1:30:0` would archive all non-terminal tables every 90 minutes until the program finishes (or crashes). By default, the table archives are saved as binary files in the current working directory, unless the user specifies a different output path with the `--checkpointOutput PATH/PREFIX` option. If the user appends PREFIX to the output path, the name of every generated file will start with PREFIX. Note that the output path must end with a / (\ on Windows) if no PREFIX is desired; whatever string comes after the last / (or \) is parsed as PREFIX.
To start a program from an existing checkpoint, the user can pass the path to a Logfile with the `--checkpointInput LOGFILE` option. This Logfile is generated along with the table archives when a checkpointed program reaches its first checkpoint; it contains information about the user input for that program call as well as the paths of the checkpoints that the program generated. If for some reason this Logfile isn't available anymore, the user can instead specify the path to a text file containing the absolute paths of the checkpoint archives he/she wishes to load.
Implementation
To easily insert the checkpointing code at the right places during compilation, I wrote a new `Checkpoint` class (defined in `src/checkpoint.hh`). This class contains generic methods for reading the archived tables, archiving the tables, starting the thread that handles the periodic archiving and inserting the required include statements, along with other helper functions.
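To make the option formats from the Usage section concrete, here is a small parsing sketch. It is not the parser from `src/checkpoint.hh`: `parse_interval`, `split_output` and `OutputTarget` are hypothetical names, and the field order days:hours:minutes:seconds is an assumption inferred from the example `0:1:30:0` meaning 90 minutes. Treating an argument without any separator as a bare PREFIX is likewise an assumption.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// Parses "d:h:m:s" (assumed field order) into a total number of seconds.
// Returns false for malformed input or a zero-length interval.
inline bool parse_interval(const std::string &text, long long &seconds) {
  static const long long factor[4] = {86400, 3600, 60, 1};
  if (text.empty() || text.back() == ':') return false;
  std::istringstream in(text);
  std::string field;
  long long total = 0;
  int count = 0;
  while (std::getline(in, field, ':')) {
    if (count == 4 || field.empty() || field.size() > 9) return false;
    for (char c : field) {
      if (!std::isdigit(static_cast<unsigned char>(c))) return false;
    }
    total += std::stoll(field) * factor[count];
    ++count;
  }
  if (count != 4 || total == 0) return false;
  seconds = total;
  return true;
}

// Result of splitting a PATH/PREFIX argument.
struct OutputTarget {
  std::string directory;  // up to and including the last separator
  std::string prefix;     // everything after the last separator
};

// Splits at the last separator ('/' by default, '\\' on Windows).
inline OutputTarget split_output(const std::string &arg, char separator = '/') {
  assert(separator == '/' || separator == '\\');
  const std::size_t pos = arg.rfind(separator);
  if (pos == std::string::npos) return {"", arg};  // assumption: bare PREFIX
  return {arg.substr(0, pos + 1), arg.substr(pos + 1)};
}
```

With these helpers, `0:1:30:0` yields 5400 seconds, `/tmp/run1/mfe_` splits into the directory `/tmp/run1/` and the prefix `mfe_`, and `/tmp/run1/` yields an empty prefix, in line with the rule that the path must end with a separator if no PREFIX is desired.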