-
Notifications
You must be signed in to change notification settings - Fork 914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid uncatchable exception in bagthread #2139
Conversation
is there any way we can reproduce this issue with simple procedure? i'd like to see the stack trace on this. |
Hi, I can't think of any simple way of reproducing this right off the top of my head (the error is seen at the end of a 10-hour SubT virtual simulation, which is a pretty complex project involving Ignition Gazebo). This problem is spurious. I'll give it a try and replay the rosbag created when the exception was thrown, and record it again. But it might really be problematic. Nevertheless, do you think the suggested change could break something, or that it would not fix the problem we see? I'm pretty sure the exception is thrown after the |
tools/rosbag/src/recorder.cpp
Outdated
@@ -403,8 +403,16 @@ void Recorder::startWriting() { | |||
|
|||
void Recorder::stopWriting() { | |||
ROS_INFO("Closing '%s'.", target_filename_.c_str()); | |||
bag_.close(); | |||
rename(write_filename_.c_str(), target_filename_.c_str()); | |||
try |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
introducing try&catch seems to be reasonable.
but I'd do check the callstack where this Recorder::stopWriting
is called, and then add try&catch statement on the caller of Recorder::stopWriting
to minimize the influence to other stack.
if i am not mitaken, this changes behavior of Recorder::checkDuration
and Recorder::checkSize
. with this fix, they will move on process with error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, it seems like it would be better to handle this exception somewhere else where we can properly stop the process instead of letting the recorder continue. I presume it doesn't make sense to keep recording if there is an IO exception.
I've tried an experiment with dm-flakey creating a faulty block device which starts failing after the recording starts. I couldn't get a full stack trace with debug symbols, even though This is the non-debug stack trace:
And core file of a test app (maybe you'll be more luck with debug symbols). I'm using binary melodic install, 1.14.10-1bionic.20201017.082648. Anyways, it's apparent that segfault isn't the best reaction to faulty block devices. And as all this code is run inside a thread that is hidden to the outer world, the users have no opportunity of catching the exceptions. And this is the test program I use: #include <thread>
#include <memory>
#include <ros/ros.h>
#include <rosbag/recorder.h>
int main(int argc, char** argv)
{
ros::init(argc, argv, "bagtest");
std::unique_ptr<std::thread> bagThread = nullptr;
std::unique_ptr<rosbag::Recorder> rosRecorder;
// Setup a ros bag recorder.
rosbag::RecorderOptions recorderOptions;
recorderOptions.append_date = false;
recorderOptions.prefix = "/mnt/disk/bagtest";
recorderOptions.regex = true;
recorderOptions.topics.push_back("/subt/.*");
// Spawn thread for recording /subt/ data to rosbag.
rosRecorder.reset(new rosbag::Recorder(recorderOptions));
bagThread.reset(new std::thread([&]()
{
rosRecorder->run();
}));
ros::WallDuration(3).sleep();
ROS_INFO("shutdown");
ros::shutdown();
std::cout << "join" << std::endl;
bagThread->join();
std::cout << "end" << std::endl;
} |
The dm-crypt command I used to create the flaky IO device:
Where sde1 is an unmounted physical flash disk I don't much care about. The flaky device will work flawlessly for 10 seconds and then it will yield random errors for the following 10 seconds. Then mount it with
After a few failures, Linux remounts the disk readonly. To re-enable it,
|
@peci1 appreciate for sharing the detailed debug information! how about adding the try&catch on here then? ros_comm/tools/rosbag/src/recorder.cpp Line 573 in b6c23e1
|
You're right that adding the try/catch to One off-topic question, but I'll ask it here before opening a bug report. I've noticed that the recorder spins an AsyncSpinner. When I tried to call normal |
I am inclined to add try&catch where needs to be, but not else where. ros_comm/tools/rosbag/src/recorder.cpp Line 573 in b6c23e1
i was thinking here is just to finalize the recorder process, so it would not be a problem to catch exception, print error message and set error code then return. and i also think it would be nice to minimize the influence to other places for the maintenance. i think exception is meant to be generated for unexpected cases, so that application will get the exception. on the other hand, generating exception in child thread cannot be caught by main thread unless transfer the exception to main thread with |
@fujitatomoya it's already been some time and nobody else expressed any opinion... would please ping some relevant maintainers? |
@sloretz @mjcarroll @jacobperron friendly ping. |
Can you clarify why you want to catch the exception from the thread? Is it to try and make the recorder process exit more cleanly? IIUC, the recorder can't recover from this kind of exception and should produce any error (exception or otherwise). I'm okay with catching the exception and setting an appropriate exit code, I'm just trying to understand the full picture. |
The full picture is easy. Once an exception is raised in the recording thread, there is no way for the caller to catch it (because the caller interacts with the main thread and has no knowledge of the recording thread). When an uncaught exception shots down a thread, it takes down the whole program: https://stackoverflow.com/q/7272093/1076564 . This is definitely not the right way to react to IO errors (the caller would probably just want to be notified of this, not shot down, too). That's why I suggest that there cannot be any unhandled exceptions in the recording thread. |
Understood, thanks. I tend to agree with your suggestion of wrapping |
Okay, I changed the PR to catch all exceptions in the outernmost scope of the thread function. This should avoid any undesired side-effects while providing 100% safety for the calling code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New iteration LGTM
@peci1 If you don't mind, could you retarget the change to the I can backport to Melodic after that. Thanks for iterating! |
@jacobperron I don't mind retargetting to noetic-devel, but I wanted this to land in melodic ASAP :) Is there a difference between merging this PR and a second one for noetic-devel? I'll prepare it right now. |
See #2151 for noetic version. |
Our typical process is to target changes to the latest distro, allow some time to see if any regressions arise, and then backport to previous distros. |
Okay. The noetic PR is there. |
This change is included in #2152 |
Thanks! |
We saw these exceptions in SubT virtual challenge where a bag file recorder is being stopped by explicitly calling
ros::shutdown()
from the code that started the bag recorder.This is what shows in console:
I was looking through the code of rosbag recorder and it seems to me that what this PR fixes is the only place where the
bag_
is accessed with a write operation which is not protected with a try/catch block.