RViz segfaults when adding MotionPlanning plugin #1456

Closed
mvieth opened this issue Dec 12, 2019 · 15 comments

Comments

@mvieth
Contributor

mvieth commented Dec 12, 2019

When adding moveit's MotionPlanning plugin, rviz crashes with a segmentation fault. This seems to happen only if certain other types of displays are active (e.g. Polygon and PoseArray).
Here is an example of valgrind's output:

==24091== Invalid read of size 8
==24091==    at 0xF9361C9: ??? (in /usr/lib/x86_64-linux-gnu/libGLdispatch.so.0.0.0)
==24091==    by 0x33FAD60E: Ogre::GLHardwareBufferManagerBase::createVertexBuffer(unsigned long, unsigned long, Ogre::HardwareBuffer::Usage, bool) (in /usr/lib/x86_64-linux-gnu/OGRE-1.9.0/RenderSystem_GL.so.1.9.0)
==24091==    by 0x33FC7825: ??? (in /usr/lib/x86_64-linux-gnu/OGRE-1.9.0/RenderSystem_GL.so.1.9.0)
==24091==    by 0x98353D2: Ogre::ManualObject::end() (in /usr/lib/x86_64-linux-gnu/libOgreMain.so.1.9.0)
==24091==    by 0x3A25FA9E: rviz::PoseArrayDisplay::updateArrows2d() (in /opt/ros/melodic/lib/librviz_default_plugin.so)
==24091==    by 0x3A26056F: rviz::PoseArrayDisplay::updateDisplay() (in /opt/ros/melodic/lib/librviz_default_plugin.so)
==24091==    by 0x3A260BEB: rviz::PoseArrayDisplay::processMessage(boost::shared_ptr<geometry_msgs::PoseArray_<std::allocator<void> > const> const&) (in /opt/ros/melodic/lib/librviz_default_plugin.so)
==24091==    by 0x3A262BD2: rviz::MessageFilterDisplay<geometry_msgs::PoseArray_<std::allocator<void> > >::incomingMessage(boost::shared_ptr<geometry_msgs::PoseArray_<std::allocator<void> > const> const&) (in /opt/ros/melodic/lib/librviz_default_plugin.so)
==24091==    by 0x3A263D9D: boost::detail::function::void_function_obj_invoker1<boost::function<void (boost::shared_ptr<geometry_msgs::PoseArray_<std::allocator<void> > const> const&)>, void, boost::shared_ptr<geometry_msgs::PoseArray_<std::allocator<void> > const> >::invoke(boost::detail::function::function_buffer&, boost::shared_ptr<geometry_msgs::PoseArray_<std::allocator<void> > const>) (in /opt/ros/melodic/lib/librviz_default_plugin.so)
==24091==    by 0x3A268AEF: message_filters::CallbackHelper1T<boost::shared_ptr<geometry_msgs::PoseArray_<std::allocator<void> > const> const&, geometry_msgs::PoseArray_<std::allocator<void> > >::call(ros::MessageEvent<geometry_msgs::PoseArray_<std::allocator<void> > const> const&, bool) (in /opt/ros/melodic/lib/librviz_default_plugin.so)
==24091==    by 0x3A267BCE: message_filters::Signal1<geometry_msgs::PoseArray_<std::allocator<void> > >::call(ros::MessageEvent<geometry_msgs::PoseArray_<std::allocator<void> > const> const&) (in /opt/ros/melodic/lib/librviz_default_plugin.so)
==24091==    by 0x3A26CFC0: tf2_ros::MessageFilter<geometry_msgs::PoseArray_<std::allocator<void> > >::CBQueueCallback::call() (in /opt/ros/melodic/lib/librviz_default_plugin.so)
==24091==  Address 0x1068 is not stack'd, malloc'd or (recently) free'd

And another one:

==19183== Thread 36:
==19183== Invalid read of size 8
==19183==    at 0xFBC01C9: ??? (in /usr/lib/x86_64-linux-gnu/libGLdispatch.so.0.0.0)
==19183==    by 0x33E1060E: Ogre::GLHardwareBufferManagerBase::createVertexBuffer(unsigned long, unsigned long, Ogre::HardwareBuffer::Usage, bool) (in /usr/lib/x86_64-linux-gnu/OGRE-1.9.0/RenderSystem_GL.so.1.9.0)
==19183==    by 0x33E2A825: ??? (in /usr/lib/x86_64-linux-gnu/OGRE-1.9.0/RenderSystem_GL.so.1.9.0)
==19183==    by 0x9ABF3D2: Ogre::ManualObject::end() (in /usr/lib/x86_64-linux-gnu/libOgreMain.so.1.9.0)
==19183==    by 0x3CD3711D: rviz::PolygonDisplay::processMessage(boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&) (in /home/mvieth/robocup/rviz_ws/devel/.private/rviz/lib/librviz_default_plugin.so)
==19183==    by 0x3CD38BAE: rviz::MessageFilterDisplay<geometry_msgs::PolygonStamped_<std::allocator<void> > >::incomingMessage(boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&) (in /home/mvieth/robocup/rviz_ws/devel/.private/rviz/lib/librviz_default_plugin.so)
==19183==    by 0x3CD47372: boost::_mfi::mf1<void, rviz::MessageFilterDisplay<geometry_msgs::PolygonStamped_<std::allocator<void> > >, boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&>::operator()(rviz::MessageFilterDisplay<geometry_msgs::PolygonStamped_<std::allocator<void> > >*, boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&) const (in /home/mvieth/robocup/rviz_ws/devel/.private/rviz/lib/librviz_default_plugin.so)
==19183==    by 0x3CD44F86: void boost::_bi::list2<boost::_bi::value<rviz::MessageFilterDisplay<geometry_msgs::PolygonStamped_<std::allocator<void> > >*>, boost::arg<1> >::operator()<boost::_mfi::mf1<void, rviz::MessageFilterDisplay<geometry_msgs::PolygonStamped_<std::allocator<void> > >, boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&>, boost::_bi::rrlist1<boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&> >(boost::_bi::type<void>, boost::_mfi::mf1<void, rviz::MessageFilterDisplay<geometry_msgs::PolygonStamped_<std::allocator<void> > >, boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&>&, boost::_bi::rrlist1<boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&>&, int) (in /home/mvieth/robocup/rviz_ws/devel/.private/rviz/lib/librviz_default_plugin.so)
==19183==    by 0x3CD42D20: void boost::_bi::bind_t<void, boost::_mfi::mf1<void, rviz::MessageFilterDisplay<geometry_msgs::PolygonStamped_<std::allocator<void> > >, boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&>, boost::_bi::list2<boost::_bi::value<rviz::MessageFilterDisplay<geometry_msgs::PolygonStamped_<std::allocator<void> > >*>, boost::arg<1> > >::operator()<boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&>(boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&) (in /home/mvieth/robocup/rviz_ws/devel/.private/rviz/lib/librviz_default_plugin.so)
==19183==    by 0x3CD40C56: boost::detail::function::void_function_obj_invoker1<boost::_bi::bind_t<void, boost::_mfi::mf1<void, rviz::MessageFilterDisplay<geometry_msgs::PolygonStamped_<std::allocator<void> > >, boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&>, boost::_bi::list2<boost::_bi::value<rviz::MessageFilterDisplay<geometry_msgs::PolygonStamped_<std::allocator<void> > >*>, boost::arg<1> > >, void, boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&>::invoke(boost::detail::function::function_buffer&, boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&) (in /home/mvieth/robocup/rviz_ws/devel/.private/rviz/lib/librviz_default_plugin.so)
==19183==    by 0x3CD450AD: boost::function1<void, boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&>::operator()(boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&) const (in /home/mvieth/robocup/rviz_ws/devel/.private/rviz/lib/librviz_default_plugin.so)
==19183==    by 0x3CD42E99: boost::detail::function::void_function_obj_invoker1<boost::function<void (boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> const&)>, void, boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const> >::invoke(boost::detail::function::function_buffer&, boost::shared_ptr<geometry_msgs::PolygonStamped_<std::allocator<void> > const>) (in /home/mvieth/robocup/rviz_ws/devel/.private/rviz/lib/librviz_default_plugin.so)
==19183==  Address 0x1068 is not stack'd, malloc'd or (recently) free'd

From what I can tell, the problem seems to be in the ManualObject of OGRE. More displays that use this may be affected. Maybe this is a concurrency problem and a lock is needed?
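
To make the "lock" idea concrete, here is a minimal sketch, assuming the crash really is a data race between a message callback thread and the render thread. `SharedPolyline` and `run_lock_demo` are hypothetical names used only for illustration; they are not rviz or OGRE classes, and nothing in this thread confirms that rviz is structured this way.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a scene object that is rebuilt by a subscriber
// callback and read by a render loop. Not an rviz or OGRE class.
class SharedPolyline
{
public:
  // Called from the (hypothetical) message callback thread.
  void rebuild(std::size_t num_points)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    points_.clear();
    for (std::size_t i = 0; i < num_points; ++i)
      points_.push_back(static_cast<float>(i));
  }

  // Called from the (hypothetical) render thread. Because of the lock, the
  // reader never sees a half-rebuilt vector.
  std::size_t render() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return points_.size();
  }

private:
  mutable std::mutex mutex_;
  std::vector<float> points_;
};

// Runs one writer and one reader concurrently. Returns true if the reader
// only ever observed an empty or a fully rebuilt object.
bool run_lock_demo()
{
  SharedPolyline line;
  bool consistent = true;
  std::thread writer([&line] {
    for (int i = 0; i < 1000; ++i)
      line.rebuild(3);
  });
  std::thread reader([&line, &consistent] {
    for (int i = 0; i < 1000; ++i)
    {
      const std::size_t n = line.render();
      if (n != 0 && n != 3)
        consistent = false;
    }
  });
  writer.join();
  reader.join();
  return consistent;
}
```

Note that a mutex only serialises access to shared data. If the underlying problem were that rendering calls are issued from a thread that is not allowed to issue them, a lock alone would probably not be enough.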

Your environment

  • OS Version: Ubuntu 18.04
  • ROS Distro: Melodic
  • RViz, Qt, OGRE, OpenGl version as printed by rviz:
rviz version 1.13.6
compiled against Qt version 5.9.5
compiled against OGRE version 1.9.0 (Ghadamon)
Forcing OpenGl version 0.
Stereo is NOT SUPPORTED
OpenGl version: 4.6 (GLSL 4.6).
  • If source build, which git commit? 11cc9ee
@rhaschke
Contributor

rhaschke commented Dec 12, 2019

Thanks for reporting this issue.

  • Did you observe this issue before, i.e. in older releases, as well?
  • Can you always reproduce the issue? Under what circumstances?

I don't expect Ogre to be at fault here, because we have been using this Ogre version for more than 6 years.
I rather suspect that an issue was introduced in the OpenGL pipeline.

  • To exclude this (and narrow down the issue), it would be great if you could check whether older rviz releases exhibit the same issue.

@mvieth
Contributor Author

mvieth commented Dec 12, 2019

Did you observe this issue before, i.e. in older releases, as well?

I just tested it in ROS kinetic, Ubuntu 16.04:

rviz version 1.13.6
compiled against Qt version 5.5.1
compiled against OGRE version 1.9.0 (Ghadamon)
Forcing OpenGl version 0.
Stereo is NOT SUPPORTED
OpenGl version: 4.6 (GLSL 4.6).

The issue appeared there as well.

Can you always reproduce the issue? Under what circumstances?

It always happens if certain types of displays are active and subscribed to a topic (so far I reproduced it with Polygon and PoseArray). Then it appears every time.

To exclude this (and narrow down the issue), it would be great if you could check whether older rviz releases exhibit the same issue.

Any specific version or commit you would like me to test?

@rhaschke
Contributor

  • The "ROS Kinetic" version you report is the same as in Melodic (1.13.6). I meant to build an older release, e.g. 1.13.5 or even older.
  • Do you display a large number of polygons / pose arrays?
  • Is the order of activation of those displays relevant, i.e. what happens when you permute the order of adding displays?
  • Is it always the same address 0x1068 = 4200 that is (wrongly) dereferenced?
  • Finally, you could try building against a more recent Ogre release as this seems to be an Ogre issue.
    The branch noetic-devel is compatible with Ogre 1.12.2.
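
As background for the address question above: a very small faulting address is often what results when a member or table slot is read through a null base pointer, because the address is then simply the slot's byte offset. The sketch below only illustrates that arithmetic. `HypotheticalTable` is an invented layout and does not describe any real library structure, and the thread does not confirm that this is what happens here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical table of 8-byte slots, chosen only so that some slots sit at
// the byte offsets seen in the valgrind output. Not a real library layout.
struct HypotheticalTable
{
  std::uint64_t slots[1024];
};

// Byte offset of slot `index` from the start of the table.
constexpr std::size_t slotOffset(std::size_t index)
{
  return offsetof(HypotheticalTable, slots) + index * sizeof(std::uint64_t);
}

// Address that would be read when accessing slot `index` through a table
// pointer whose numeric value is `base` (0 for a null pointer).
constexpr std::uintptr_t faultingAddress(std::uintptr_t base, std::size_t index)
{
  return base + slotOffset(index);
}

// With a null base, slot 525 lies at 0x1068 (decimal 4200).
static_assert(faultingAddress(0, 525) == 0x1068, "slot 525 is at offset 0x1068");
```

Under this reading, an address that differs by exactly 8 would correspond to the neighbouring 8-byte slot.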

@mvieth
Contributor Author

mvieth commented Dec 13, 2019

I meant to build an older release, e.g. 1.13.5 or even older

I have now tested 1.13.5 and 1.13.3 (on the Melodic system, checked out with git and built from source). Both versions have the bug.

Do you display a large number of polygons / pose arrays?

I wouldn't say so. This publish command can be used to reproduce the bug (one triangle at 10 Hz):

rostopic pub /polygon geometry_msgs/PolygonStamped "header:
  seq: 0
  stamp:
    secs: 0
    nsecs: 0
  frame_id: 'base_footprint'
polygon:
  points:
  - x: 0.0
    y: 0.0
    z: 0.0
  - x: 1.0
    y: 0.0
    z: 0.0
  - x: 0.0
    y: 1.0
    z: 0.0" -r 10

Is the order of activation of those displays relevant, i.e. what happens when you permute the order of adding displays?

If I add the polygon display first and then the MotionPlanning plugin, the segfault happens. If I add the MotionPlanning plugin first and then the polygon display, everything works fine.

Is it always the same address 0x1068 = 4200 that is (wrongly) dereferenced?

It also appeared with address 0x1060. Do you think this is more than a coincidence?

Finally, you could try building against a more recent Ogre release as this seems to be an Ogre issue

I will see if I can do that later.

@rhaschke
Contributor

I cannot reproduce the issue locally with a TF, Polygon, and MotionPlanning display added in this order.
By the way, which graphics card (and opengl library) do you use?

@mvieth
Contributor Author

mvieth commented Dec 16, 2019

Sorry for the delay, I didn't have access to the computers over the weekend.
It seems like there has to be a move group running when adding the MotionPlanning display (e.g. the panda group from the moveit tutorials). If I add the MotionPlanning display without a move group running, there is no segmentation fault. Maybe the delay when connecting to the move group causes the issue?

By the way, which graphics card (and opengl library) do you use?

The command glxinfo | grep -i opengl on the melodic system reports the following:

OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: GeForce GTX 1060/PCIe/SSE2
OpenGL core profile version string: 4.6.0 NVIDIA 435.21
OpenGL core profile shading language version string: 4.60 NVIDIA
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 4.6.0 NVIDIA 435.21
OpenGL shading language version string: 4.60 NVIDIA
OpenGL context flags: (none)
OpenGL profile mask: (none)
OpenGL extensions:
OpenGL ES profile version string: OpenGL ES 3.2 NVIDIA 435.21
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
OpenGL ES profile extensions:

@rhaschke
Contributor

It seems like there has to be a move group running when adding the MotionPlanning display (e.g. the panda group from the moveit tutorials). If I add the MotionPlanning display without a move group running, there is no segmentation fault.

This is normal. The MotionPlanning display receives all its information from topics published by the move_group node. If there is nothing published (w/o a move_group), the corresponding OpenGL commands to draw something are not issued.

Maybe the delay when connecting to the move group causes the issue?

There are some background threads running during display setup, but we didn't observe any issues in the past with them.
It would be most helpful to build Ogre v1-9-0 or v1.12.2 from source (with debug symbols) to better understand where and why rviz/Ogre segfaults exactly and where the mysterious 0x1060/0x1068 numbers come from.

@rhaschke
Contributor

We have seen some issues with nvidia cards and Ogre 1.9.0 that were resolved with Ogre 1.9.1.
Please try the Ogre source builds.

@mvieth
Contributor Author

mvieth commented Dec 17, 2019

I tried to compile Ogre and then compile rviz against it, but I can't find versions that fit together. For example, I tried noetic-devel with Ogre 1.12.2, but that did not work. Exactly which versions/branches of Ogre and rviz are compatible?

@rhaschke
Contributor

Sorry, looks like I broke the public noetic-devel branch recently. Could you try https://github.com/rhaschke/rviz/tree/noetic-devel instead? This should work with v1.12.2.
Ogre 1.9.0 and 1.9.1 should work with the melodic-devel branch.

@mvieth
Contributor Author

mvieth commented Dec 18, 2019

Ok, it was a real pain and needed some hacks, but I finally managed to get OGRE 1.12.2 and the noetic-devel branch to work together. Assuming everything linked correctly, I can report that the bug still occurs. Here is valgrind's output:

==30050== Thread 14:
==30050== Invalid read of size 8
==30050==    at 0xF0958A9: ??? (in /usr/lib/x86_64-linux-gnu/libGLdispatch.so.0.0.0)
==30050==    by 0x30E25CF8: Ogre::GLHardwareVertexBuffer::~GLHardwareVertexBuffer() (OgreGLHardwareVertexBuffer.cpp:63)
==30050==    by 0x30E25D1F: Ogre::GLHardwareVertexBuffer::~GLHardwareVertexBuffer() (OgreGLHardwareVertexBuffer.cpp:64)
==30050==    by 0x30E1F4DB: std::_Sp_counted_ptr<Ogre::GLHardwareVertexBuffer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:376)
==30050==    by 0x98059B9: _M_release (shared_ptr_base.h:154)
==30050==    by 0x98059B9: ~__shared_count (shared_ptr_base.h:684)
==30050==    by 0x98059B9: ~__shared_ptr (shared_ptr_base.h:1123)
==30050==    by 0x98059B9: ~shared_ptr (shared_ptr.h:93)
==30050==    by 0x98059B9: ~SharedPtr (OgreSharedPtr.h:57)
==30050==    by 0x98059B9: ~pair (stl_pair.h:208)
==30050==    by 0x98059B9: destroy<std::pair<short unsigned int const, Ogre::SharedPtr<Ogre::HardwareVertexBuffer> > > (new_allocator.h:140)
==30050==    by 0x98059B9: destroy<std::pair<short unsigned int const, Ogre::SharedPtr<Ogre::HardwareVertexBuffer> > > (alloc_traits.h:487)
==30050==    by 0x98059B9: _M_destroy_node (stl_tree.h:650)
==30050==    by 0x98059B9: _M_drop_node (stl_tree.h:658)
==30050==    by 0x98059B9: std::_Rb_tree<unsigned short, std::pair<unsigned short const, Ogre::SharedPtr<Ogre::HardwareVertexBuffer> >, std::_Select1st<std::pair<unsigned short const, Ogre::SharedPtr<Ogre::HardwareVertexBuffer> > >, std::less<unsigned short>, std::allocator<std::pair<unsigned short const, Ogre::SharedPtr<Ogre::HardwareVertexBuffer> > > >::_M_erase(std::_Rb_tree_node<std::pair<unsigned short const, Ogre::SharedPtr<Ogre::HardwareVertexBuffer> > >*) (stl_tree.h:1858)
==30050==    by 0x9804DFC: clear (stl_tree.h:1171)
==30050==    by 0x9804DFC: clear (stl_map.h:1127)
==30050==    by 0x9804DFC: Ogre::VertexBufferBinding::unsetAllBindings() (OgreHardwareVertexBuffer.cpp:776)
==30050==    by 0x9804E38: Ogre::VertexBufferBinding::~VertexBufferBinding() (OgreHardwareVertexBuffer.cpp:751)
==30050==    by 0x97FCE30: Ogre::HardwareBufferManagerBase::destroyVertexBufferBindingImpl(Ogre::VertexBufferBinding*) (OgreHardwareBufferManager.cpp:126)
==30050==    by 0x99D9BAB: Ogre::VertexData::~VertexData() (OgreVertexIndexData.cpp:63)
==30050==    by 0x984FE78: Ogre::ManualObject::ManualObjectSection::~ManualObjectSection() (OgreManualObject.cpp:1075)
==30050==    by 0x9850038: Ogre::ManualObject::ManualObjectSection::~ManualObjectSection() (OgreManualObject.cpp:1077)
==30050==    by 0x984B80D: Ogre::ManualObject::clear() (OgreManualObject.cpp:60)
==30050==  Address 0x1060 is not stack'd, malloc'd or (recently) free'd

And the versions:

rviz version 1.13.6
compiled against Qt version 5.9.5
compiled against OGRE version 1.12.2 (Rhagorthua)
Forcing OpenGl version 0.
Stereo is NOT SUPPORTED
OpenGl version: 4.6 (GLSL 4.6).

So here the problem occurs in clear() instead of end(), but still in ManualObject.
As a side note, is it possible that there is a find_package missing for OGRE in moveit_ros/visualization? The variable OGRE_LIBRARIES is used several times in that package, but that is set by find_package, isn't it?

@rhaschke
Contributor

Thanks for pointing out OGRE_LIBRARIES in MoveIt's CMake files. The variable is not required (and is unset), because the Ogre libraries are pulled in via rviz.
I still cannot reproduce the issue, so I cannot really help yet.
Could you provide your .rviz config file, please? Maybe this helps to reproduce the issue.

@mvieth
Contributor Author

mvieth commented Dec 19, 2019

I start with this config (please change the file extension back to .rviz; GitHub wouldn't allow that extension):
raw_config.txt
Then I add the polygon display, then the MotionPlanning display.

@mvieth
Contributor Author

mvieth commented Dec 19, 2019

The bug does not appear if I reduce the publishing rate of the polygon (e.g. to 0.2 Hz, i.e. once every 5 seconds). That could hint at a concurrency issue.
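
If rate-dependent behaviour like this does come from two threads touching the same rendering state, one general way to rule that out is to let the receiving thread only enqueue work and to execute it later on a single designated thread. The following is a minimal sketch of that pattern under that assumption. `MainThreadQueue` and `run_handoff_demo` are hypothetical names, and this is not presented as the change that was eventually made in rviz.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical hand-off queue: any thread may post work, but the work only
// runs on the thread that calls drain().
class MainThreadQueue
{
public:
  // Safe to call from any thread, e.g. a subscriber callback thread.
  void post(std::function<void()> task)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push(std::move(task));
  }

  // Runs all pending tasks on the calling thread and returns how many ran.
  std::size_t drain()
  {
    std::queue<std::function<void()>> pending;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      std::swap(pending, tasks_);  // take the batch, release the lock early
    }
    std::size_t executed = 0;
    while (!pending.empty())
    {
      pending.front()();
      pending.pop();
      ++executed;
    }
    return executed;
  }

private:
  std::mutex mutex_;
  std::queue<std::function<void()>> tasks_;
};

// Posts ten tasks from a second thread and drains them on the calling thread.
// Returns true if every task ran on the calling thread.
bool run_handoff_demo()
{
  MainThreadQueue queue;
  const std::thread::id main_id = std::this_thread::get_id();
  bool all_on_main = true;
  std::thread subscriber([&] {
    for (int i = 0; i < 10; ++i)
      queue.post([&] {
        if (std::this_thread::get_id() != main_id)
          all_on_main = false;
      });
  });
  subscriber.join();
  const std::size_t executed = queue.drain();
  return all_on_main && executed == 10;
}
```

In a Qt application the same idea is usually expressed with a queued signal/slot connection rather than a hand-written queue.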

@rhaschke
Contributor

Fixed via #1560 in the upcoming Noetic release. Thanks to @hidmic for linking the ROS2 PR!
