-
Notifications
You must be signed in to change notification settings - Fork 18.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Snapshot on signal #2253
Snapshot on signal #2253
Conversation
Looks cool, but I don't think you can safely lock a mutex in a signal handler, nor try to perform I/O. |
Thanks @flx42, I should have researched signals a bit more before I implemented this. |
@flx42 do you see any problem with this new implementation? |
The type should be "sig_atomic_t volatile". After that, I think it will be fine :) |
Thanks for the reply.
Hrmm.. all the examples I see, the qualifier precedes the type name, same as with "const" or "mutable". Please let me know if I'm missing something.
true and false are promoted to 1 and 0 per the language spec I'm pretty sure. Since they better convey the meaning, I'm inclined to leave the true and false in there. |
On Sat, Apr 18, 2015 at 12:48 PM, jyegerlehner notifications@github.com
No, you're right. That's what I meant.
It was really a nitpicking comment about style. But both ways will be fine. |
00fabeb
to
c0df909
Compare
Another nitpick: you should avoid unnecessary empty lines changes in your patch. |
Any update on this? I want it. |
Hi Luke, I'm not aware of anything deficient about it. |
Well there is one thing about this that is debatable. Solver doesn't check to see if it should exit when it is doing test, only when it is training. This means if you send it SIGINT and it happens to be in the middle of test (not train), caffe keeps going until the test is finished, and only then does it exit (or snapshot). So you might have to wait a bit if you try to request it to stop when it's testing. I thought that was a good thing since it allows the test to run to completion and produce a valid test result. However, one might prefer that it be more responsive and stop right away regardless of whether it is testing, and throw away the test that is in-progress. We could make it behave that way with a bit of extra complexity. |
That sounds better to me. I would expect Ctrl+C to kill the process within a second or two. If you wait for testing to finish, you might have to wait several minutes. |
If anyone objects to Luke's preferred behaviour please speak up. Otherwise I'll plan to make that change. |
I can see an argument for either finishing the testing or quitting Thanks for the signal handling!
|
It would be nice to have both, if that's not too much complexity for you. |
Behaviour modified to respond to signals during testing or training, whereas before it just responded during training. And rebased off master. I tested these latest changes on a couple scenarios manually. But some of this code is so new if anyone else can test it that could give more confidence. |
32c4a7b
to
17baa33
Compare
Makes sense to me.
Anything for the cause, comrade! |
If we find we need or want that extra sophistication, I think my preference is to add it in a separate PR so that we may thereby proceed more incrementally. The changes to Solver started out very simple, and got more complex with the latest change and feels to me like pushing the limits of added risk in one PR. That said, if everyone really really wants it right now, I can do it. |
No, I agree, this should be done in a separate patch. |
I think this could use a "Ready for Review" label, if those make any difference. |
17baa33
to
c946a00
Compare
@@ -126,6 +133,20 @@ void CopyLayers(caffe::Solver<float>* solver, const std::string& model_list) { | |||
} | |||
} | |||
|
|||
caffe::SolverParameter_Action GetRequestedAction( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line break not needed for 80 cols? Remove if not needed.
LGTM. I shall review this PR within next week. |
OK thanks. I will squash the commits once the review is done and we know there aren't any more changes required. |
@jyegerlehner I just did a quick review today, and tested on my machine. Looks good to me! I'll finish reviewing this PR within tomorrow. (As far as I know, this PR won't apply to windows. But since we are not officially supporting windows at the moment, I'm not too worried about that.) One another potential enhancement related to this PR is that we may also support learning rate adjustment on signaling in the future (in a separate PR), so that one may adjust it during training based on e.g. learning curve from log, similar to some other deep learning tools. @jeffdonahue @longjon what do you think? |
SolverParameter_Action SIGHUP_action); | ||
ActionCallback GetActionFunction(); | ||
private: | ||
SignalHandler(); // Not implemented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sort of confused here... Why do we need a private constructor declaration in SignalHandler class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the default constructor is not declared private, the compiler will provide it implicitly. So by preventing default construction we force the use of the constructor taking explicitly-specified signal actions. We could provide a default constructor that, say, initializes the actions to sensible defaults. But it would be unused as it is and thus unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the default constructor is not declared private, the compiler will provide it implicitly.
C++ does generate a default constructor but only if you don't provide one of your own. Since you've already provided one: SignalHandler(SolverParameter_Action SIGINT_action, SolverParameter_Action SIGHUP_action);
, a default constructor won't be generated by compiler implicitly. So a private constructor declaration here is unnecessary.
You may verify with this simple program:
class C {
public:
C(int) {}
};
int main() { C c; } // compilation fails because there's no default constructor in class C.
Anyway, this is only a small issue. The PR seems great to me :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK thanks for the correction and explanation. I will remove the private ctor dec.
@willyd How this will impact with your windows plans? |
@bhack On windows we would need to call SetConsoleCtrlHandler to handle SIGINT but don't think there is an equivalent to SIGHUP. A cross-plaform implementation is available in boost.asio. |
If there is still interest in #2537 I will avoid asio solution. |
if (sigaction(SIGINT, &sa, NULL) == -1) { | ||
LOG(FATAL) << "Cannot install SIGINT handler."; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "unhook" these signals in SignalHandler::~SignalHandler()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK will do.
Completed a thorough pass today. This PR seems in good shape to me, handles signals via POSIX sigaction and address actions from solver's train & test loop in a call-back fashion. A side effect: this PR is platform-specific and may impact community windows ports. |
@ronghanghu The latest commits are intended to resolve the review issues you raised. Please let us know if they are not satisfactory. As far as the Travis build failing: this looks like it happened due to an error installing a package. Does anyone know: should I make a dummy commit to provoke it to try building again? Or am I missing something that's an actual problem I need to fix?. |
I restarted Travis CI and now all tests pass. I'll try to take a look today. |
Seems ready to me :) Please squash into one commit. |
@jeffdonahue @longjon I would like to merge this PR if you don't have other concerns. Community windows ports can perhaps simply strip this feature with |
OK, do you prefer commits to be squashed? |
Yes, please squash into one commit, so that I can merge in this weekend. |
Add signal handler and early exit/snapshot to Solver. Add signal handler and early exit/snapshot to Solver. Also check for exit and snapshot when testing. Skip running test after early exit. Fix more lint. Rebase on master. Finish rebase on master. Fixups per review comments. Redress review comments. Lint. Correct error message wording.
11f501a
to
ff19d5f
Compare
this is great. thanks for the effort @jyegerlehner |
Sure thing @erogol. Glad to hear it's helpful. |
Can you add an option to the solver so that users can take snap-shots at any given time by pressing a key combination like Ctrl-S for example? |
@Coderx7 I think that'd add to much complexity to the feature. If you want to snapshot at an arbitrary time without stopping training, just send SIGHUP ( |
@ajtulloch yes, but at the same time it provides a very convenient and good feature to have, |
This an implementation of the feature discussed in Issue 2012.
When you hit Ctrl-C to kill caffe (while training), it will now save a snapshot before exiting. Actually, the Solver just stops training, and a snapshot is only saved if
snapshot_after_train
is true.This is the default behavior which is configurable via the sigint_effect and sighup_effect command line options. Also by default, SIGHUP signal causes caffe to save a snapshot and continue training. So you can make caffe save a snapshot by sending it SIGHUP signal, e.g.:
kill -SIGHUP PID
where PID is the process id of caffe, which you can find by doing
ps -ef | grep caffe
.The design has two pieces to it. 1.
Solver
is modified slightly so that after each iteration it checks to see if its client wants it to either snapshot or exit. It does this via a callback function that a client can set on the Solver instance. If the callback hasn't been set, it just carries on as usual. So there is no breaking change to the Solver interface and the behavior of existing code shouldn't change. 2. The caffe executable provides the callback function to the Solver, and the callback is implemented on a SignalHandler that intercepts SIGINT and SIGHUP.