-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JVM] Support overriding RPCWatchdog termination behavior on Android and other platforms #6216
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
the intended behavior of the watchdog is actually to kill the process when a timeout Happens. The other running process. We will need to configure the app to auto restart, which will restart an RPC session. So the original exit behavior is intended. |
This is used to get around the fact that android does not support fork of a process |
I'm not an expert on JVM in Android, from what I can tell it is recommended against using System.exit in almost cases. It sounds like you are expecting System.exit to kill the TVMRPC app (the way a force stop would) but this does not seem to be the functionality of System.exit, as exit will resume any paused activities left on the activity stack after exiting [ref]. Overall, I am not claiming that the use of finish() instead is necessarily the correct implementation, but empirically the system is much more stable. |
The main question though is whether finish will actually kill the running job on the other thread -- thus achieving the functionality of timeout, or if the stablity comes from the fact that we simply disables the timeout. That would boils down to the semantics of the finish itself, and would be great to confirm |
@csullivan can you run some quick experiments to confirm whether the finish does kill the other running thread ? |
Your concern is valid. What I observed that motivated this PR: After the timeout is reached** the RPCActivity view (with the StopRPC button) will close and return to the MainActivity view where the connection fields are listed. After a short time the RPCActivity will automatically restart and the pending measurements will again decrease. This seems to be the desired behavior. I can run a few explicit experiments to be sure the thread is terminating, however, unfortunately while this helped stability, after about an hour or two I did see a system reset of similar characteristics as before. The resets are much less frequent than before, but persist. I noted two more uses of System.exit() in the RPCTracker APK which may or may not be causing this behavior: ** Approximating here as I counted from when I saw the pending measurements stall |
One quick way to confirm the behavior it is to write the main loop as an infinite loop(that sleeps periodically) and increases the counter, while the watchdog thread calls the finish function. |
@tqchen I added a counter to the RPCProcessor loop:
In the logs I see that after the watchdog wakes up and calls finish, the counter is reset. Seems to give evidence that the thread is exiting as expected.
|
Again though, this doesn't appear to have entirely fixed the issue. Less frequent instabilities, but they still exist with finish() and still seem to have something to do with the timeout behavior. Unfortunately, any instabilities that cause resets or crashes to the lock screen are enough to make long unmonitored autotuning runs untenable. For this reason I’ve moved to cross compiling the C++ RPC app (#6229) and running it from the android shell which is much more stable. After battle testing it I'd like to update our android docs to recommend using it. |
Great, Thanks @csullivan , seems thatw we can go ahead and merge it then. please help to fix the CI error |
the activity stack, finish the RPCActivity and return cleanly to the MainActivity where the RPCActivity can be restarted automatically.
@tmoreau89, @tqchen, if you have time please review the updates. Thanks! |
…and other platforms (apache#6216) * Instead of performing a system exit and leaving unhandled items on the activity stack, finish the RPCActivity and return cleanly to the MainActivity where the RPCActivity can be restarted automatically. * Update doc. string for checkstyle.
…and other platforms (apache#6216) * Instead of performing a system exit and leaving unhandled items on the activity stack, finish the RPCActivity and return cleanly to the MainActivity where the RPCActivity can be restarted automatically. * Update doc. string for checkstyle.
…and other platforms (apache#6216) * Instead of performing a system exit and leaving unhandled items on the activity stack, finish the RPCActivity and return cleanly to the MainActivity where the RPCActivity can be restarted automatically. * Update doc. string for checkstyle.
…and other platforms (apache#6216) * Instead of performing a system exit and leaving unhandled items on the activity stack, finish the RPCActivity and return cleanly to the MainActivity where the RPCActivity can be restarted automatically. * Update doc. string for checkstyle.
…and other platforms (apache#6216) * Instead of performing a system exit and leaving unhandled items on the activity stack, finish the RPCActivity and return cleanly to the MainActivity where the RPCActivity can be restarted automatically. * Update doc. string for checkstyle.
…and other platforms (apache#6216) * Instead of performing a system exit and leaving unhandled items on the activity stack, finish the RPCActivity and return cleanly to the MainActivity where the RPCActivity can be restarted automatically. * Update doc. string for checkstyle.
Calling System.exit(0) from the watchdog (RPCActivity) emits an interrupt request that goes uncaught in some Android environments resulting in a system crash. I believe the correct non-destructive behavior should be to finish the RPCActivity, thereby popping the activity stack and returning control to the MainActivity, from which the RPCActivity can be restarted using the normal auto-reboot sequence.