Avoid uncatchable exception in bagthread #2139

Closed
wants to merge 2 commits

Conversation

peci1
Contributor

@peci1 peci1 commented Mar 1, 2021

We saw these exceptions in the SubT Virtual Challenge, where a bag file recorder is stopped by explicitly calling ros::shutdown() from the code that started the bag recorder.

This is what shows in console:

terminate called after throwing an instance of 'rosbag::BagIOException'
  what():  Error seeking

I was looking through the code of the rosbag recorder, and it seems to me that what this PR fixes is the only place where bag_ is accessed with a write operation that is not protected by a try/catch block.

@fujitatomoya
Contributor

@peci1

is there any way we can reproduce this issue with a simple procedure? I'd like to see the stack trace for this.

@peci1
Contributor Author

peci1 commented Mar 1, 2021

Hi, I can't think of any simple way of reproducing this off the top of my head (the error is seen at the end of a 10-hour SubT virtual simulation, which is a pretty complex project involving Ignition Gazebo), and the problem is spurious. I'll try replaying the rosbag that was created when the exception was thrown and recording it again, but that might really be problematic.

Nevertheless, do you think the suggested change could break something, or that it would not fix the problem we see? I'm pretty sure the exception is thrown after the ros::shutdown() call (based on what the logs say).

@@ -403,8 +403,16 @@ void Recorder::startWriting() {

void Recorder::stopWriting() {
    ROS_INFO("Closing '%s'.", target_filename_.c_str());
    bag_.close();
    rename(write_filename_.c_str(), target_filename_.c_str());
    try
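(For context, a minimal sketch of what this first iteration appears to do, reconstructed from the truncated hunk above: wrapping the close/rename in stopWriting in a try/catch that logs instead of throwing. The catch body is an assumption, not the literal diff.)

// Hypothetical reconstruction of the first iteration -- not the literal diff.
void Recorder::stopWriting() {
    ROS_INFO("Closing '%s'.", target_filename_.c_str());
    try
    {
        bag_.close();
        rename(write_filename_.c_str(), target_filename_.c_str());
    }
    catch (const rosbag::BagException& e)
    {
        // Keep the exception inside the recorder instead of letting it escape.
        ROS_ERROR("Error closing '%s': %s", target_filename_.c_str(), e.what());
    }
}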
Contributor

Introducing a try/catch seems reasonable.

But I'd check the call stack where this Recorder::stopWriting is called, and then add the try/catch statement in the caller of Recorder::stopWriting to minimize the influence on the rest of the stack.

If I am not mistaken, this changes the behavior of Recorder::checkDuration and Recorder::checkSize: with this fix, they will continue processing after printing an error message.

Contributor

I agree; it seems like it would be better to handle this exception somewhere else, where we can properly stop the process instead of letting the recorder continue. I presume it doesn't make sense to keep recording if there is an IO exception.

@peci1
Contributor Author

peci1 commented Mar 2, 2021

I've tried an experiment with dm-flakey, creating a faulty block device that starts failing after the recording starts. I couldn't get a full stack trace with debug symbols, even though ros-melodic-rosbag-storage-dbgsym and similar packages are installed; I don't really know why the symbols did not load.

This is the non-debug stack trace:

[Screenshot taken 2021-03-03 00-15-51: the non-debug stack trace]

[ERROR] [1614726484.921607894, 3582.647585711] [ros.rosbag]: Error writing: Error writing to file: writing 4027 bytes, wrote 4006 bytes
terminate called after throwing an instance of 'rosbag::BagIOException'
  what():  Error writing to file: writing 4027 bytes, wrote 4019 bytes
Signal: SIGABRT (Aborted)

And here is a core file of a test app (maybe you'll have more luck with the debug symbols). I'm using the binary Melodic install, 1.14.10-1bionic.20201017.082648.
core.tar.bz2.zip

Anyway, it's apparent that crashing isn't the best reaction to a faulty block device. And since all this code runs inside a thread that is hidden from the outer world, users have no opportunity to catch the exceptions.

And this is the test program I use:

#include <iostream>
#include <memory>
#include <thread>

#include <ros/ros.h>
#include <rosbag/recorder.h>

int main(int argc, char** argv)
{
  ros::init(argc, argv, "bagtest");

  std::unique_ptr<std::thread> bagThread = nullptr;
  std::unique_ptr<rosbag::Recorder> rosRecorder;

  // Set up a ros bag recorder pointed at the flaky mount.
  rosbag::RecorderOptions recorderOptions;
  recorderOptions.append_date = false;
  recorderOptions.prefix = "/mnt/disk/bagtest";
  recorderOptions.regex = true;
  recorderOptions.topics.push_back("/subt/.*");

  // Spawn a thread for recording /subt/ data to the rosbag.
  rosRecorder.reset(new rosbag::Recorder(recorderOptions));
  bagThread.reset(new std::thread([&]()
                                  {
                                    rosRecorder->run();
                                  }));

  ros::WallDuration(3).sleep();

  ROS_INFO("shutdown");
  ros::shutdown();
  std::cout << "join" << std::endl;
  bagThread->join();
  std::cout << "end" << std::endl;
}

@peci1
Contributor Author

peci1 commented Mar 2, 2021

The dmsetup command I used to create the flaky IO device (using the dm-flakey target):

echo "0 204791 flakey /dev/sde1 0 10 10" | sudo dmsetup create bad_disk

Here, /dev/sde1 is an unmounted physical flash disk I don't much care about. The flaky device will work flawlessly for 10 seconds and then yield random errors for the following 10 seconds.

Then mount it with

sudo mkdir /mnt/disk
sudo mount -o uid=1000 /dev/disk/by-id/dm-name-bad_disk /mnt/disk/

After a few failures, Linux remounts the disk read-only. To re-enable it:

sudo umount /mnt/disk; sudo mount -o uid=1000 /dev/disk/by-id/dm-name-bad_disk /mnt/disk/

@fujitatomoya
Contributor

@peci1 Thanks for sharing the detailed debug information!

How about adding the try/catch here, then?

stopWriting();

@peci1
Contributor Author

peci1 commented Mar 3, 2021

You're right that adding the try/catch to stopWriting() would actually let checkDuration() and other functions continue even in case of a disk error, and the code isn't ready for that. Given this fact, I'd suggest just wrapping the whole contents of doRecord() and doRecordSnapshotter() in a try/catch block catching all exceptions, writing them out, and stopping the recording thread. Would that be an acceptable solution?


One off-topic question, which I'll ask here before opening a bug report: I've noticed that the recorder spins an AsyncSpinner. When I tried to call a normal ros::spin() from the main thread, it said I can't do that because there is already an async spinner running. Wouldn't that be solved if the recorder thread used a custom callback queue? I think since #867, spinners spinning different callback queues should not conflict.
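(For illustration, a minimal sketch of the custom-callback-queue pattern referred to above; the node name and queue usage are made up for the example and are not the recorder's actual code.)

#include <ros/ros.h>
#include <ros/callback_queue.h>

int main(int argc, char** argv)
{
  ros::init(argc, argv, "custom_queue_demo");

  // Give one part of the code its own callback queue...
  ros::CallbackQueue recorderQueue;
  ros::NodeHandle nh;
  nh.setCallbackQueue(&recorderQueue);

  // ...and spin it with a dedicated AsyncSpinner.
  ros::AsyncSpinner recorderSpinner(1, &recorderQueue);
  recorderSpinner.start();

  // The global queue can still be spun from the main thread without
  // conflicting with the spinner above (see ros_comm#867).
  ros::spin();
  return 0;
}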

@fujitatomoya
Contributor

I am inclined to add the try/catch where it needs to be, but not elsewhere.

stopWriting();

I was thinking this place just finalizes the recorder process, so it would not be a problem to catch the exception, print an error message, set an error code and then return. I also think it would be nice to minimize the influence on other places for maintenance reasons. I think an exception is meant to be generated for unexpected cases, so that the application gets the exception.

On the other hand, an exception generated in a child thread cannot be caught by the main thread unless it is transferred there with std::exception_ptr, so the process eventually gets aborted. To avoid that, I guess your proposal does make sense. I'd like to hear from others too.
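(For reference, a minimal sketch of the std::exception_ptr handoff mentioned above; this is generic C++, not recorder code.)

#include <exception>
#include <iostream>
#include <stdexcept>
#include <thread>

int main()
{
  std::exception_ptr workerError;

  std::thread worker([&workerError]()
  {
    try
    {
      throw std::runtime_error("Error seeking");  // stand-in for a BagIOException
    }
    catch (...)
    {
      // Capture the in-flight exception instead of letting it escape the thread.
      workerError = std::current_exception();
    }
  });
  worker.join();

  if (workerError)
  {
    // Re-raise the worker's exception on the main thread, where it can be handled.
    try { std::rethrow_exception(workerError); }
    catch (const std::exception& e) { std::cerr << "worker failed: " << e.what() << "\n"; }
  }
}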

@peci1
Contributor Author

peci1 commented Mar 13, 2021

@fujitatomoya it's already been some time and nobody else has expressed any opinion... would you please ping some relevant maintainers?

@fujitatomoya
Contributor

@sloretz @mjcarroll @jacobperron friendly ping.

@jacobperron
Contributor

Can you clarify why you want to catch the exception from the thread? Is it to try and make the recorder process exit more cleanly? IIUC, the recorder can't recover from this kind of exception and should produce an error (exception or otherwise).
If the error is caused by a race with ros::shutdown, then it sounds like the usage should be fixed (e.g. make sure to shut down the recorder process before calling ros::shutdown).

I'm okay with catching the exception and setting an appropriate exit code, I'm just trying to understand the full picture.

@peci1
Contributor Author

peci1 commented Mar 18, 2021

The full picture is easy. Once an exception is raised in the recording thread, there is no way for the caller to catch it (because the caller interacts with the main thread and has no knowledge of the recording thread). When an uncaught exception shoots down a thread, it takes down the whole program: https://stackoverflow.com/q/7272093/1076564 . This is definitely not the right way to react to IO errors (the caller would probably just want to be notified of them, not be shot down too). That's why I suggest that there cannot be any unhandled exceptions in the recording thread.
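(A minimal demonstration of that behavior in generic C++, unrelated to rosbag: the uncaught exception in the worker thread terminates the whole process.)

#include <stdexcept>
#include <thread>

int main()
{
  std::thread worker([]()
  {
    // Nothing on this thread's stack catches the exception,
    // so std::terminate() is called and the whole process aborts.
    throw std::runtime_error("Error seeking");
  });
  worker.join();  // in practice, never reached
}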

@jacobperron
Contributor

Understood, thanks. I tend to agree with your suggestion of wrapping doRecord and doRecordSnapshotter in a try-catch to completely avoid exceptions in the child thread.

@peci1
Contributor Author

peci1 commented Apr 5, 2021

Okay, I changed the PR to catch all exceptions in the outermost scope of the thread function. This should avoid any undesired side-effects while providing 100% safety for the calling code.
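(For readers following along, a rough sketch of the shape of that change; the function body and the exact logging are assumptions, not the merged diff.)

// Hypothetical sketch of wrapping the recording thread's entry point.
void Recorder::doRecord()
{
  try
  {
    // ... the original doRecord() body: open the bag, drain the message
    //     queue, check duration/size limits, and finally stopWriting() ...
  }
  catch (const rosbag::BagException& e)
  {
    // Never let the exception escape the thread; report it and stop instead.
    ROS_ERROR("Error recording to bag: %s", e.what());
  }
  catch (const std::exception& e)
  {
    ROS_ERROR("Unexpected error in recording thread: %s", e.what());
  }
}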

Member

@mjcarroll mjcarroll left a comment

New iteration LGTM

@jacobperron
Contributor

@peci1 If you don't mind, could you retarget the change to the noetic-devel branch?

I can backport to Melodic after that. Thanks for iterating!

@peci1
Contributor Author

peci1 commented Apr 5, 2021

@jacobperron I don't mind retargeting to noetic-devel, but I wanted this to land in Melodic ASAP :) Is there a difference between merging this PR and a second one for noetic-devel? I'll prepare it right now.

@peci1
Contributor Author

peci1 commented Apr 5, 2021

See #2151 for the Noetic version.

@jacobperron
Contributor

Our typical process is to target changes to the latest distro, allow some time to see if any regressions arise, and then backport to previous distros.

@peci1
Contributor Author

peci1 commented Apr 5, 2021

Okay. The noetic PR is there.

@jacobperron
Contributor

I'm going to backport #2151 to Melodic along with other changes from Noetic shortly; separate from this PR.

@peci1 Thanks for the improvement!

@jacobperron jacobperron closed this Apr 6, 2021
@jacobperron
Contributor

This change is included in #2152

@peci1
Contributor Author

peci1 commented Apr 6, 2021

Thanks!
