Avoid uncatchable exception in bagthread #2139

Closed
wants to merge 2 commits

Conversation

peci1
Contributor

@peci1 peci1 commented Mar 1, 2021

We saw these exceptions in the SubT Virtual Challenge, where a bag file recorder is stopped by explicitly calling ros::shutdown() from the code that started the bag recorder.

This is what shows in console:

terminate called after throwing an instance of 'rosbag::BagIOException'
  what():  Error seeking

I was looking through the code of the rosbag recorder, and it seems to me that what this PR fixes is the only place where bag_ is accessed with a write operation that is not protected by a try/catch block.

@fujitatomoya
Contributor

@peci1

is there any way we can reproduce this issue with a simple procedure? I'd like to see the stack trace for this.

@peci1
Contributor Author

peci1 commented Mar 1, 2021

Hi, I can't think of any simple way of reproducing this off the top of my head (the error is seen at the end of a 10-hour SubT virtual simulation, which is a pretty complex project involving Ignition Gazebo), and the problem is spurious. I'll try replaying the rosbag that was created when the exception was thrown and recording it again, but that might really be problematic.

Nevertheless, do you think the suggested change could break something, or that it would not fix the problem we see? I'm pretty sure the exception is thrown after the ros::shutdown() call (based on what the logs say).

@@ -403,8 +403,16 @@ void Recorder::startWriting() {

void Recorder::stopWriting() {
    ROS_INFO("Closing '%s'.", target_filename_.c_str());
    bag_.close();
    rename(write_filename_.c_str(), target_filename_.c_str());
    try
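(For context, a minimal sketch of what this first iteration appears to do, reconstructed from the truncated hunk above: wrapping the close/rename in stopWriting in a try/catch that logs instead of throwing. The catch body is an assumption, not the literal diff.)

// Hypothetical reconstruction of the first iteration -- not the literal diff.
void Recorder::stopWriting() {
    ROS_INFO("Closing '%s'.", target_filename_.c_str());
    try
    {
        bag_.close();
        rename(write_filename_.c_str(), target_filename_.c_str());
    }
    catch (const rosbag::BagException& e)
    {
        // Keep the exception inside the recorder instead of letting it escape.
        ROS_ERROR("Error closing '%s': %s", target_filename_.c_str(), e.what());
    }
}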
Contributor

Introducing a try/catch seems reasonable.

But I'd check the call stack where this Recorder::stopWriting is called, and then add the try/catch statement in the caller of Recorder::stopWriting to minimize the influence on the rest of the stack.

If I am not mistaken, this changes the behavior of Recorder::checkDuration and Recorder::checkSize: with this fix, they will continue processing after printing an error message.

Contributor

I agree; it seems like it would be better to handle this exception somewhere else, where we can properly stop the process instead of letting the recorder continue. I presume it doesn't make sense to keep recording if there is an IO exception.

@peci1
Contributor Author

peci1 commented Mar 2, 2021

I've tried an experiment with dm-flakey, creating a faulty block device that starts failing after the recording starts. I couldn't get a full stack trace with debug symbols, even though ros-melodic-rosbag-storage-dbgsym and similar packages are installed; I don't really know why the symbols did not load.

This is the non-debug stack trace:

[Screenshot taken 2021-03-03 00-15-51: the non-debug stack trace]

[ERROR] [1614726484.921607894, 3582.647585711] [ros.rosbag]: Error writing: Error writing to file: writing 4027 bytes, wrote 4006 bytes
terminate called after throwing an instance of 'rosbag::BagIOException'
  what():  Error writing to file: writing 4027 bytes, wrote 4019 bytes
Signal: SIGABRT (Aborted)

And here is a core file of a test app (maybe you'll have more luck with the debug symbols). I'm using the binary Melodic install, 1.14.10-1bionic.20201017.082648.
core.tar.bz2.zip

Anyway, it's apparent that crashing isn't the best reaction to a faulty block device. And since all this code runs inside a thread that is hidden from the outer world, users have no opportunity to catch the exceptions.

And this is the test program I use:

#include <iostream>
#include <memory>
#include <thread>

#include <ros/ros.h>
#include <rosbag/recorder.h>

int main(int argc, char** argv)
{
  ros::init(argc, argv, "bagtest");

  std::unique_ptr<std::thread> bagThread = nullptr;
  std::unique_ptr<rosbag::Recorder> rosRecorder;

  // Set up a ros bag recorder pointed at the flaky mount.
  rosbag::RecorderOptions recorderOptions;
  recorderOptions.append_date = false;
  recorderOptions.prefix = "/mnt/disk/bagtest";
  recorderOptions.regex = true;
  recorderOptions.topics.push_back("/subt/.*");

  // Spawn a thread for recording /subt/ data to the rosbag.
  rosRecorder.reset(new rosbag::Recorder(recorderOptions));
  bagThread.reset(new std::thread([&]()
                                  {
                                    rosRecorder->run();
                                  }));

  ros::WallDuration(3).sleep();

  ROS_INFO("shutdown");
  ros::shutdown();
  std::cout << "join" << std::endl;
  bagThread->join();
  std::cout << "end" << std::endl;
}

@peci1
Contributor Author

peci1 commented Mar 2, 2021

The dmsetup command I used to create the flaky IO device (using the dm-flakey target):

echo "0 204791 flakey /dev/sde1 0 10 10" | sudo dmsetup create bad_disk

Here, /dev/sde1 is an unmounted physical flash disk I don't much care about. The flaky device will work flawlessly for 10 seconds and then yield random errors for the following 10 seconds.

Then mount it with

sudo mkdir /mnt/disk
sudo mount -o uid=1000 /dev/disk/by-id/dm-name-bad_disk /mnt/disk/

After a few failures, Linux remounts the disk read-only. To re-enable it:

sudo umount /mnt/disk; sudo mount -o uid=1000 /dev/disk/by-id/dm-name-bad_disk /mnt/disk/

@fujitatomoya
Contributor

@peci1 Thanks for sharing the detailed debug information!

How about adding the try/catch here, then?

stopWriting();

@peci1
Contributor Author

peci1 commented Mar 3, 2021

You're right that adding the try/catch to stopWriting() would actually let checkDuration() and other functions continue even in case of a disk error, and the code isn't ready for that. Given this fact, I'd suggest just wrapping the whole contents of doRecord() and doRecordSnapshotter() in a try/catch block catching all exceptions, writing them out, and stopping the recording thread. Would that be an acceptable solution?


One off-topic question, which I'll ask here before opening a bug report: I've noticed that the recorder spins an AsyncSpinner. When I tried to call a normal ros::spin() from the main thread, it said I can't do that because there is already an async spinner running. Wouldn't that be solved if the recorder thread used a custom callback queue? I think since #867, spinners spinning different callback queues should not conflict.
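(For illustration, a minimal sketch of the custom-callback-queue pattern referred to above; the node name and queue usage are made up for the example and are not the recorder's actual code.)

#include <ros/ros.h>
#include <ros/callback_queue.h>

int main(int argc, char** argv)
{
  ros::init(argc, argv, "custom_queue_demo");

  // Give one part of the code its own callback queue...
  ros::CallbackQueue recorderQueue;
  ros::NodeHandle nh;
  nh.setCallbackQueue(&recorderQueue);

  // ...and spin it with a dedicated AsyncSpinner.
  ros::AsyncSpinner recorderSpinner(1, &recorderQueue);
  recorderSpinner.start();

  // The global queue can still be spun from the main thread without
  // conflicting with the spinner above (see ros_comm#867).
  ros::spin();
  return 0;
}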

@fujitatomoya
Contributor

I am inclined to add the try/catch where it needs to be, but not elsewhere.

stopWriting();

I was thinking this place just finalizes the recorder process, so it would not be a problem to catch the exception, print an error message, set an error code and then return. I also think it would be nice to minimize the influence on other places for maintenance reasons. I think an exception is meant to be generated for unexpected cases, so that the application gets the exception.

On the other hand, an exception generated in a child thread cannot be caught by the main thread unless it is transferred there with std::exception_ptr, so the process eventually gets aborted. To avoid that, I guess your proposal does make sense. I'd like to hear from others too.
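(For reference, a minimal sketch of the std::exception_ptr handoff mentioned above; this is generic C++, not recorder code.)

#include <exception>
#include <iostream>
#include <stdexcept>
#include <thread>

int main()
{
  std::exception_ptr workerError;

  std::thread worker([&workerError]()
  {
    try
    {
      throw std::runtime_error("Error seeking");  // stand-in for a BagIOException
    }
    catch (...)
    {
      // Capture the in-flight exception instead of letting it escape the thread.
      workerError = std::current_exception();
    }
  });
  worker.join();

  if (workerError)
  {
    // Re-raise the worker's exception on the main thread, where it can be handled.
    try { std::rethrow_exception(workerError); }
    catch (const std::exception& e) { std::cerr << "worker failed: " << e.what() << "\n"; }
  }
}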

@peci1
Contributor Author

peci1 commented Mar 13, 2021

@fujitatomoya it's already been some time and nobody else has expressed any opinion... would you please ping some relevant maintainers?

@fujitatomoya
Contributor

@sloretz @mjcarroll @jacobperron friendly ping.

@jacobperron
Contributor

Can you clarify why you want to catch the exception from the thread? Is it to try and make the recorder process exit more cleanly? IIUC, the recorder can't recover from this kind of exception and should produce an error (exception or otherwise).
If the error is caused by a race with ros::shutdown, then it sounds like the usage should be fixed (e.g. make sure to shut down the recorder process before calling ros::shutdown).

I'm okay with catching the exception and setting an appropriate exit code, I'm just trying to understand the full picture.

@peci1
Contributor Author

peci1 commented Mar 18, 2021

The full picture is easy. Once an exception is raised in the recording thread, there is no way for the caller to catch it (because the caller interacts with the main thread and has no knowledge of the recording thread). When an uncaught exception shoots down a thread, it takes down the whole program: https://stackoverflow.com/q/7272093/1076564 . This is definitely not the right way to react to IO errors (the caller would probably just want to be notified of them, not be shot down too). That's why I suggest that there cannot be any unhandled exceptions in the recording thread.
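(A minimal demonstration of that behavior in generic C++, unrelated to rosbag: the uncaught exception in the worker thread terminates the whole process.)

#include <stdexcept>
#include <thread>

int main()
{
  std::thread worker([]()
  {
    // Nothing on this thread's stack catches the exception,
    // so std::terminate() is called and the whole process aborts.
    throw std::runtime_error("Error seeking");
  });
  worker.join();  // in practice, never reached
}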

@jacobperron
Contributor

Understood, thanks. I tend to agree with your suggestion of wrapping doRecord and doRecordSnapshotter in a try-catch to completely avoid exceptions in the child thread.

@peci1
Contributor Author

peci1 commented Apr 5, 2021

Okay, I changed the PR to catch all exceptions in the outermost scope of the thread function. This should avoid any undesired side-effects while providing 100% safety for the calling code.
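(For readers following along, a rough sketch of the shape of that change; the function body and the exact logging are assumptions, not the merged diff.)

// Hypothetical sketch of wrapping the recording thread's entry point.
void Recorder::doRecord()
{
  try
  {
    // ... the original doRecord() body: open the bag, drain the message
    //     queue, check duration/size limits, and finally stopWriting() ...
  }
  catch (const rosbag::BagException& e)
  {
    // Never let the exception escape the thread; report it and stop instead.
    ROS_ERROR("Error recording to bag: %s", e.what());
  }
  catch (const std::exception& e)
  {
    ROS_ERROR("Unexpected error in recording thread: %s", e.what());
  }
}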

Member

@mjcarroll mjcarroll left a comment

New iteration LGTM

@jacobperron
Contributor

@peci1 If you don't mind, could you retarget the change to the noetic-devel branch?

I can backport to Melodic after that. Thanks for iterating!

@peci1
Contributor Author

peci1 commented Apr 5, 2021

@jacobperron I don't mind retargeting to noetic-devel, but I wanted this to land in Melodic ASAP :) Is there a difference between merging this PR and a second one for noetic-devel? I'll prepare it right now.

@peci1
Contributor Author

peci1 commented Apr 5, 2021

See #2151 for the Noetic version.

@jacobperron
Contributor

Our typical process is to target changes to the latest distro, allow some time to see if any regressions arise, and then backport to previous distros.

@peci1
Contributor Author

peci1 commented Apr 5, 2021

Okay. The noetic PR is there.

@jacobperron
Contributor

I'm going to backport #2151 to Melodic along with other changes from Noetic shortly; separate from this PR.

@peci1 Thanks for the improvement!

@jacobperron jacobperron closed this Apr 6, 2021
@jacobperron
Contributor

This change is included in #2152

@peci1
Contributor Author

peci1 commented Apr 6, 2021

Thanks!
