
Fetch navigation performs poorly in Melodic simulation #36

Open · nickswalker opened this issue Jan 15, 2019 · 44 comments
Labels: help wanted, More Info Needed

@nickswalker (Contributor)

Steps
With up-to-date versions of fetch_ros and fetch_gazebo:

roslaunch fetch_gazebo playground.launch

And

roslaunch fetch_gazebo_demo fetch_nav.launch

Behavior
When given a nav goal, the robot's localization drifts quickly (seems like it happens during rotation). The robot is never able to reach the goal.

https://youtu.be/w1y0b5aI3o8

Nothing jumps out from the standard move_base configurations so I'm not sure what's going on.

@moriarty (Contributor)

Thanks, we'll take a look.

FYI:
@velveteenrobot, @cjds @erelson & @narora1

@moriarty (Contributor) commented Feb 13, 2019

@nickswalker sorry for the delay; everyone I originally tagged has been busy.

I've just spoken with @safrimus and he'll investigate.
I've also created an internal JIRA ticket in hopes of not losing track of this issue again.
https://fetchrobotics.atlassian.net/browse/OPEN-31

@cjds (Contributor) commented Feb 13, 2019

@nickswalker to clarify, does this only happen with Melodic and Gazebo 9?

@nickswalker (Contributor, Author)

Yes, I have only observed this happening in Melodic with Gazebo 9.

@dbking77 commented Feb 14, 2019 via email

@nickswalker (Contributor, Author)

Here are some clips with AMCL and the localization transforms visualized:

https://www.youtube.com/watch?v=uNb0pJbObHA

https://www.youtube.com/watch?v=sk4ANbCywUk

I pulled in all the recent Fetch changes and bumped to the latest Melodic sync.

It doesn't seem like the AMCL config, the Fetch Gazebo model, or any other component that could obviously cause localization to drift this quickly was changed between the Indigo and Melodic releases. But it's eminently reproducible for me: I have a couple of machines now where I can start a fresh workspace, clone everything, run the launch files, and observe this behavior.

Let me know if bags would help.

@moriarty (Contributor)

@nickswalker thanks, @safrimus was also able to reproduce immediately in the simulator following your steps in the original issue.

I haven't seen this on the actual hardware; can you confirm that navigation is working on your Fetch running Melodic?

@nickswalker (Contributor, Author)

Yes, navigation has been working fine on the real robot.

@moriarty moriarty transferred this issue from ZebraDevs/fetch_ros Feb 27, 2019

@moriarty (Contributor)

@nickswalker can you test this again? And should we close this ticket as a duplicate of #30?

I tagged and released 0.9.0 of this package for Melodic. It was "good enough" but still not perfect; we needed at least one released version in Melodic in order to set up the ros-pull-request-build jobs on the build farm.

@moriarty added the help wanted and More Info Needed labels Apr 5, 2019
@moriarty (Contributor) commented Apr 5, 2019

@nickswalker I'll add More Info Needed and Help Wanted to this ticket.

More Info Needed: because I'd like to know how it's performing now.
Help wanted: because we'll need help doing any further tuning on this.

@umhan35 commented Jun 21, 2019

@moriarty We also have this issue on Ubuntu 18.04 / Gazebo 9. I pulled the latest master, which is the same as 0.9.0.

Here is the video: https://youtu.be/lLUQtOjqFnM. After I recorded this, it took about 15 seconds for the robot to reach the last goal I set.

To reproduce:

roslaunch fetch_gazebo playground.launch
roslaunch fetch_gazebo_demo fetch_nav.launch
roscd fetch_navigation && rviz -d config/navigation.rviz

We also tested in 14.04 and Gazebo 2, and it works very well.

@nickswalker (Contributor, Author)

I was able to reproduce this issue using the code in #101 and the same steps as before. I don't think the problem is the inflation radius. Something about the simulation is going wrong, causing drift during rotation. Given this, no amount of tuning navigation parameters is going to make it localize well enough to go through doors.

@moriarty (Contributor)

@nickswalker check #101, not for the code but for this comment from @mikeferguson:

> So, have you always been building from source? If so, I'd recommend setting CMAKE_BUILD_TYPE=Release. The change to TF2, and associated use of tf2_sensor_msgs::PointCloudIterator is very sensitive to compilation being Release (it's about 300x faster in Release mode than Debug). I've found several times that issues with timing go away when switching to Release build.

@moriarty (Contributor)

The changes in ZebraDevs/fetch_ros@09db2ce to fetch_depth_layer/src/depth_layer.cpp are likely causing the difference :(

Unfortunately, switching CMAKE_BUILD_TYPE to Release did not seem to fix it.

@moriarty (Contributor)

@@ -143,8 +144,8 @@ void FetchDepthLayer::onInitialize()
     camera_info_topic, 10, &FetchDepthLayer::cameraInfoCallback, this);
 
   depth_image_sub_.reset(new message_filters::Subscriber<sensor_msgs::Image>(private_nh, camera_depth_topic, 10));
-  depth_image_filter_ = boost::shared_ptr< tf::MessageFilter<sensor_msgs::Image> >(
-    new tf::MessageFilter<sensor_msgs::Image>(*depth_image_sub_, *tf_, global_frame_, 10));
+  depth_image_filter_ = boost::shared_ptr< tf2_ros::MessageFilter<sensor_msgs::Image> >(
+    new tf2_ros::MessageFilter<sensor_msgs::Image>(*depth_image_sub_, *tf_, global_frame_, 10, private_nh));
   depth_image_filter_->registerCallback(boost::bind(&FetchDepthLayer::depthImageCallback, this, _1));
   observation_subscribers_.push_back(depth_image_sub_);
   observation_notifiers_.push_back(depth_image_filter_);
@@ -275,16 +276,26 @@ void FetchDepthLayer::depthImageCallback(
   {
     // find ground plane in camera coordinates using tf
     // transform normal axis
-    tf::Stamped<tf::Vector3> vector(tf::Vector3(0, 0, 1), ros::Time(0), "base_link");
-    tf_->transformVector(msg->header.frame_id, vector, vector);
-    ground_plane[0] = vector.getX();
-    ground_plane[1] = vector.getY();
-    ground_plane[2] = vector.getZ();
+    geometry_msgs::Vector3Stamped vector;
+    vector.vector.x = 0;
+    vector.vector.y = 0;
+    vector.vector.z = 1;
+    vector.header.frame_id = "base_link";
+    vector.header.stamp = ros::Time();
+    tf_->transform(vector, vector, msg->header.frame_id);
+    ground_plane[0] = vector.vector.x;
+    ground_plane[1] = vector.vector.y;
+    ground_plane[2] = vector.vector.z;
 
     // find offset
-    tf::StampedTransform transform;
-    tf_->lookupTransform("base_link", msg->header.frame_id, ros::Time(0), transform);
-    ground_plane[3] = transform.getOrigin().getZ();
+    geometry_msgs::TransformStamped transform;
+    try {
+      transform = tf_->lookupTransform("base_link", msg->header.frame_id, msg->header.stamp);
+      ground_plane[3] = transform.transform.translation.z;
+    } catch (tf2::TransformException){
+      ROS_WARN("Failed to lookup transform!");
+      return;
+    }
   }
 
   // check that ground plane actually exists, so it doesn't count as marking observations
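
The key behavioral difference for the ground-plane offset in the hunk above is the lookup time: the old tf code asked for the latest available transform (ros::Time(0)), while the tf2 version asks for the transform at the exact image stamp, which is much more sensitive to TF timing while the robot is rotating. A minimal sketch of the two lookup styles, for illustration only (not a proposed patch; it assumes a tf2_ros::Buffer is available, and the 0.1 s timeout is an assumption, not part of the original code):

#include <geometry_msgs/TransformStamped.h>
#include <ros/ros.h>
#include <std_msgs/Header.h>
#include <tf2_ros/buffer.h>

// Illustration of the two lookup styles used before and after the tf2 migration.
geometry_msgs::TransformStamped lookupGroundOffset(const tf2_ros::Buffer& buffer,
                                                   const std_msgs::Header& header)
{
  // Old tf behavior: latest available transform, never waits.
  //   return buffer.lookupTransform("base_link", header.frame_id, ros::Time(0));

  // New tf2 behavior: transform at the image timestamp. The short timeout
  // (an assumption here, not in the patch above) keeps the lookup from
  // failing when the stamp is slightly ahead of the TF data received so far,
  // e.g. while the robot is rotating quickly.
  return buffer.lookupTransform("base_link", header.frame_id, header.stamp,
                                ros::Duration(0.1));
}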

@nickswalker (Contributor, Author)

I confirmed that doing a release build had no impact. I looked at reverting FetchDepthLayer to tf but stopped when I realized it would've required also changing the upstream DepthLayer code back as well.

I tried bypassing localization using fake_localization (added a ground truth odometry plugin to our robot model, tweaked our navigation launch file) and this is the behavior now:
https://youtu.be/bF_NOWKgx5A

The local costmap still streaks on rotation, so it definitely seems related to the depth layer somehow not catching the correct transform. As soon as the robot starts rotating, the extra noise in the costmap makes it impossible to navigate through doorways.

@mkhansenbot

@nickswalker - did you ever resolve this? I am still seeing it on the latest release. I'd be interested in knowing whether you root-caused this or have any other updates.

@nickswalker (Contributor, Author)

No resolution and no updates since the previous comment.

@mkhansenbot

OK, thanks for the update. I'm looking into it.

@mkhansenbot

So I see the same issue when using fake_localization instead of AMCL, and it appears the "odom->base_link" TF is moving around quite a bit. So I suspect it's either a problem with the libfetch plugin or the friction of the wheels. The wheel friction was increased by #59; did you ever see the problem before then? I can try reverting that change to see if it makes a difference.

@mkhansenbot commented Feb 3, 2021

Here's what I mean by the transforms being off.

[screenshot]

@cmcollander

This is still an issue. Ubuntu 18.04.5, with all of my Fetch and ROS packages up to date. The odom transform actually reaches points where it is so far off that it's off the map. So something is wrong with the odometry.

@mkhansenbot commented Feb 6, 2021

I'm not sure what the root cause of this is yet; it may have more than one. However, here's what I think. I still see this problem when using fake_localization, so I don't think the odometry, wheel friction, or localization are the cause, although it is strange how much the odom transform drifts. When using fake_localization, the odom drift shouldn't matter, which is why I don't think that's the problem.

I'm more concerned with the local_costmap, which seems to be getting cleared incorrectly. Maybe @mikeferguson, @DLu, @SteveMacenski or someone with a deeper knowledge of the costmap clearing can take a look at that. If you see my screenshot above, you'll see that as the robot rotates, it seems to cause the costmap to 'smear' previous and current observations. I think that is causing the local planner to get "trapped" and unable to find a path forward. I observe that sometimes after the "clear costmap" recovery, it's able to move again, but not every time, as the doorways are also very narrow compared to the inflation radius of 0.7m.
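
For anyone poking at this, move_base also exposes a service to wipe the costmaps on demand, which can help separate a clearing problem from a sensor problem. A minimal sketch of a helper that calls it (assuming move_base is running under its default name, so the service is /move_base/clear_costmaps):

#include <ros/ros.h>
#include <std_srvs/Empty.h>

// Hypothetical helper: ask move_base to wipe its costmaps once, so stale
// "smeared" obstacles can be ruled in or out as the reason the robot gets stuck.
int main(int argc, char** argv)
{
  ros::init(argc, argv, "clear_costmaps_once");
  ros::NodeHandle nh;
  ros::ServiceClient client =
      nh.serviceClient<std_srvs::Empty>("/move_base/clear_costmaps");
  std_srvs::Empty srv;
  if (client.waitForExistence(ros::Duration(5.0)) && client.call(srv))
    ROS_INFO("Costmaps cleared");
  else
    ROS_WARN("Failed to call /move_base/clear_costmaps");
  return 0;
}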

So, I have experimented with a few parameter changes and have a few that seem to at least work around this issue. With these changes I can navigate room to room mostly fine, occasionally getting stuck temporarily before proceeding. Not perfect, but much better (at least for me).

In the fetch_navigation/config/costmap_local.yaml file, change:
- global_frame from 'odom' to 'map': prevents the local_costmap from rotating, which seems to help with the smearing above
- update_frequency to 5.0: clears / updates the costmap more often
- publish_frequency to 5.0: publishes for observation in RViz
- inflater/inflation_radius to 0.1: gives the local_planner more room to navigate through the doorways

global_frame: map

rolling_window: true
update_frequency: 5.0
publish_frequency: 5.0
inflater:
  inflation_radius: 0.1

Also, in the fetch_navigation/config/move_base.yaml file I set planner_frequency: 1.0. That tells the global_planner to re-plan every second, and seems to also help the local planner get unstuck.

I started digging into the local_costmap clearing code but didn't see anything that seemed to be causing the problem. I might look at this some more, but wanted to pass along my findings so far to see if others have ideas or suggestions.

@mkhansenbot

So I also tried switching out the Fetch depth layer for the standard navigation obstacle layer, and I don't see any noticeable improvement. I also tried changing the AMCL alpha1 param to 0.5 per this comment from @mikeferguson: #101 (comment), and don't see much difference there either. I can navigate pretty well between the two tables, but navigating into the empty room is sometimes unsuccessful. The robot often gets stuck in the doorway.

One thing I may try, per the comment mentioned above, is switching to the DWA planner to see if that improves things. But right now I'm guessing a little bit, which isn't a good debugging strategy. If anyone else has time to look into this and has ideas about what could be wrong, I'm open to collaborating.

@mkhansenbot

I also forgot to mention: I have also tried setting the conservative and aggressive reset distances to 0.0 to clear the local costmaps as cleanly as possible.

@mkhansenbot

I also tried running on an Ubuntu 16 / Kinetic system to see how well that works, hoping to use git bisect to find the change that broke this, but I can't get it to run at all. If I run the simulation using playground.launch and then start the navigation with fetch_nav.launch, Gazebo crashes:

gzserver: /build/ogre-1.9-mqY1wq/ogre-1.9-1.9.0+dfsg1/OgreMain/src/OgreRenderSystem.cpp:546: virtual void Ogre::RenderSystem::setDepthBufferFor(Ogre::RenderTarget*): Assertion `bAttached && "A new DepthBuffer for a RenderTarget was created, but after creation" "it says it's incompatible with that RT"' failed.
Aborted (core dumped)
[gazebo-2] process has died [pid 27345, exit code 134, cmd /opt/ros/kinetic/lib/gazebo_ros/gzserver -e ode /opt/ros/kinetic/share/fetch_gazebo/worlds/test_zone.sdf __name:=gazebo __log:=/home/ubuntu/.ros/log/efbfd982-6bff-11eb-ac51-0ed9018625d7/gazebo-2.log].
log file: /home/ubuntu/.ros/log/efbfd982-6bff-11eb-ac51-0ed9018625d7/gazebo-2*.log

Does anyone else see this issue using Kinetic? If anyone has a 'working' version with Kinetic, can you post a video of the Rviz view with the map, laserscan, robot and local costmap? I'd like to see this working as a point of comparison against the current behavior.

@moriarty (Contributor)

> I also tried running on a Ubuntu 16 / Kinetic system to see how well that works [...] Does anyone else see this issue using Kinetic?

I was using the Dockerfile from #46 to quickly switch versions... but it's out of date: the OSRF base images have changed locations, and the Nvidia Docker setup is different / no longer required. But as I recall, it was possible to see this stop working when switching back and forth.

@mkhansenbot

@moriarty - thanks for the reply. I was just now able to get this same thing running on an Ubuntu 16 system. It turned out the problem above was a Gazebo 7.0.0 bug that was later fixed; I upgraded to 7.16.1 (the latest) and that fixed it.

However, I still see the same problems in Ubuntu 16 using the 'apt' released fetch packages. Here's a screenshot where the robot is stuck trying to get through the door to table 2.
[screenshot]

@mkhansenbot

@moriarty or anyone really, can someone point me to a version that worked, preferably a release tag (like 0.7.0)? I'm now able to build and test on an Ubuntu 16 system, but some dependencies have since been upgraded, so I'm not sure how far back I can go.

@umhan35 commented Feb 11, 2021

@mkhansenbot if you don't mind using Ubuntu 14.04, navigation works there, though 14.04 is end-of-life and may have security issues.

Fetch Robotics should really try to solve this issue, but the research platform is a low priority from what I can tell.

@velveteenrobot (Contributor)

Unfortunately all my systems are 18.04, so I don't know off the top of my head whether there's a version of Ubuntu 16 + Gazebo 7 that doesn't have this issue. Like @umhan35 said, I believe it does work on 14.04, but that might be too far back to easily compare changes.

@DLu commented Feb 12, 2021

FWIW, I tried it out and, as far as I can tell, it's something wonky with the odometry/localization, not the costmaps/recovery behaviors, with a very small chance of it being the local planner. Tested with Melodic/18.04/Gazebo 9. I also tried fake_localization but oddly had the same problem.

@moriarty (Contributor)

16.04 & Kinetic was skipped on Fetch Hardware, I only released Kinetic quietly after releasing 18.04 & Melodic... because of many requests from users who wanted it.

@mkhansenbot

Thanks everyone for the replies. I can confirm that it doesn't work on the released binaries for Ubuntu 16. I haven't tested on an Ubuntu 14 system; I'd have to pull a Docker image and install ROS on it, if that's even possible anymore. I'm not sure the apt package servers are still alive.

@velveteenrobot (Contributor) commented Feb 12, 2021

@mkhansenbot I just tested this in an Indigo Docker container.
Docker image to pull:

docker pull ros:indigo-robot

Start a docker (my insanely overkill command is probably not necessary):

docker run -d -it --network=host --privileged -v /etc/network/interfaces:/etc/network/host_interfaces -v /dev/bus/usb/:/dev/bus/usb/  -v /dev/input:/dev/input -v /etc/ros/indigo:/etc/ros/indigo -v ~/.ssh/:/root/.ssh/ -v ~/.aws/:/root/.aws/ -v /etc/fetchcore:/etc/fetchcore --env="DISPLAY" --env="QT_X11_NO_MITSHM=1"  --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" --name indigo_robot ros:indigo-robot

I also do the totally unsafe thing as described in http://wiki.ros.org/docker/Tutorials/GUI#The_simple_way:

xhost +local:root

Install fetch packages in the docker (had to add the keys and stuff as described here):
http://wiki.ros.org/indigo/Installation/Ubuntu
Then:

sudo apt-get install ros-indigo-fetch*

Then launch playground and fetch_nav:

roslaunch fetch_gazebo playground.launch 
roslaunch fetch_gazebo_demo fetch_nav.launch

Then I run rviz outside the docker.
Nav works fine:

[video: 1404_nav.mp4]

@mkhansenbot

@velveteenrobot - thanks Sarah, I'll try that too!

@mkhansenbot

Update - I was able to get the Ubuntu 14 / Indigo container running with simulation, and it does work better (not perfect, but noticeably better). The package versions being used are fetch_navigation: 0.7.15, fetch_gazebo: 0.7.3, robot_controllers: 0.5.4, control_toolbox: 1.13.3.

On Ubuntu 16, where the robot is failing, the versions are: fetch_navigation: 0.7.15, fetch_gazebo: 0.8.2, robot_controllers: 0.5.2, control_toolbox: 1.17.0.

Based on that, I'm able to find a combination of versions that works on Ubuntu 14 but fails on Ubuntu 16: fetch_navigation: 0.7.15, fetch_gazebo: 0.8.2, robot_controllers: 0.5.4, control_toolbox: 1.13.3.

So I don't think the problem is any change that has occurred in any of those packages, which means some dependency changed, such as gazebo_plugins or the Gazebo physics, between Gazebo 5 / Indigo and Gazebo 7 / Kinetic. So many other things changed between those versions that it's hard to know where to look next; I'm open to suggestions.

@DLu commented Feb 17, 2021

I'd be curious to see what happens if you play the same sequence of velocity commands in each and see what the resulting odometry looks like.

@mkhansenbot

> I'd be curious to see what happens if you play the same sequence of velocity commands in each and see what the resulting odometry looks like.

I haven't done that, but I did use rostopic pub -r 10 /teleop/cmd_vel geometry_msgs/Twist -- '[0.5, 0, 0]' '[0, 0, 0]' and just watched the robot in RViz. In Indigo and Melodic, it seems to steer straight, but in Kinetic it drives drunk and veers left until it eventually crashes into the left wall! 😆 So I think Kinetic is really borked with the versions I am running, which isn't surprising; I think some of the fixes made since then for Melodic have improved things. It still isn't as good as it was on Indigo, though. When I run in Kinetic with "show trail" selected for the wheels, I can see them slip sideways as they roll. On Melodic, I don't see that nearly as badly. I can post screenshots if that helps.
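
To make that comparison repeatable across distros, something like this hypothetical test node could replay the same scripted velocity sequence and log the resulting odometry. It assumes the robot listens on /teleop/cmd_vel (as in the command above) and publishes odometry on /odom; both topic names are assumptions, adjust as needed:

#include <geometry_msgs/Twist.h>
#include <nav_msgs/Odometry.h>
#include <ros/ros.h>

// Hypothetical test node: drive straight, then rotate in place, while logging
// odometry, so the same command sequence can be compared on Indigo/Kinetic/Melodic.
void odomCallback(const nav_msgs::Odometry::ConstPtr& msg)
{
  ROS_INFO_THROTTLE(1.0, "odom x=%.3f y=%.3f", msg->pose.pose.position.x,
                    msg->pose.pose.position.y);
}

int main(int argc, char** argv)
{
  ros::init(argc, argv, "odom_drift_test");
  ros::NodeHandle nh;
  ros::Publisher cmd_pub = nh.advertise<geometry_msgs::Twist>("/teleop/cmd_vel", 1);
  ros::Subscriber odom_sub = nh.subscribe("/odom", 10, odomCallback);

  // Wait for sim time to start ticking before timing the sequence.
  while (ros::ok() && ros::Time::now().isZero())
    ros::Duration(0.1).sleep();

  ros::Rate rate(10);  // match the 10 Hz rostopic pub used above
  ros::Time start = ros::Time::now();
  while (ros::ok())
  {
    double t = (ros::Time::now() - start).toSec();
    geometry_msgs::Twist cmd;
    if (t < 5.0)
      cmd.linear.x = 0.5;   // drive straight for 5 s
    else if (t < 10.0)
      cmd.angular.z = 0.5;  // then rotate in place for 5 s
    else
      break;                // stop publishing; the base halts on command timeout
    cmd_pub.publish(cmd);
    ros::spinOnce();
    rate.sleep();
  }
  return 0;
}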
