Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gazebo9 - transport failure during ConnectPubToSub call #2875

Closed
ojasjoshi opened this issue Nov 7, 2020 · 3 comments · Fixed by #2978
Closed

Gazebo9 - transport failure during ConnectPubToSub call #2875

ojasjoshi opened this issue Nov 7, 2020 · 3 comments · Fixed by #2978
Labels
bug Something isn't working transport

Comments

@ojasjoshi
Copy link

ojasjoshi commented Nov 7, 2020

Exception:

request_publisher: /usr/include/boost/smart_ptr/shared_ptr.hpp:734: typename boost::detail::sp_member_access<T>::type boost::shared_ptr<T>::operator->() const [with T = gazebo::transport::Publication; typename boost::detail::sp_member_access<T>::type = gazebo::transport::Publication*]: Assertion `px != 0' failed.

Stack Trace:

Process terminating with default action of signal 6 (SIGABRT): dumping core
==27141==    at 0x6AD4FB7: raise (raise.c:51)
==27141==    by 0x6AD6920: abort (abort.c:79)
==27141==    by 0x6AC6489: __assert_fail_base (assert.c:92)
==27141==    by 0x6AC6501: __assert_fail (assert.c:101)
==27141==    by 0x52D76E2: boost::shared_ptr<gazebo::transport::Publication>::operator->() const [clone .isra.61] [clone .part.62] (shared_ptr.hpp:734)
==27141==    by 0x52D7A2A: operator-> (TopicManager.cc:283)
==27141==    by 0x52D7A2A: gazebo::transport::TopicManager::ConnectPubToSub(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::shared_ptr<gazebo::transport::SubscriptionTransport>) (TopicManager.cc:282)
==27141==    by 0x52C33A5: gazebo::transport::ConnectionManager::OnRead(boost::shared_ptr<gazebo::transport::Connection>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (ConnectionManager.cc:496)
==27141==    by 0x52C434C: operator() (mem_fn_template.hpp:280)
==27141==    by 0x52C434C: operator()<boost::_mfi::mf2<void, gazebo::transport::ConnectionManager, boost::shared_ptr<gazebo::transport::Connection>, const std::__cxx11::basic_string<char>&>, boost::_bi::rrlist1<const std::__cxx11::basic_string<char>&> > (bind.hpp:398)
==27141==    by 0x52C434C: operator()<const std::__cxx11::basic_string<char>&> (bind.hpp:1306)
==27141==    by 0x52C434C: boost::detail::function::void_function_obj_invoker1<boost::_bi::bind_t<void, boost::_mfi::mf2<void, gazebo::transport::ConnectionManager, boost::shared_ptr<gazebo::transport::Connection>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>, boost::_bi::list3<boost::_bi::value<gazebo::transport::ConnectionManager*>, boost::_bi::value<boost::shared_ptr<gazebo::transport::Connection> >, boost::arg<1> > >, void, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>::invoke(boost::detail::function::function_buffer&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (function_template.hpp:159)
==27141==    by 0x52C4F4F: operator() (function_template.hpp:759)
==27141==    by 0x52C4F4F: gazebo::transport::ConnectionReadTask::execute() (Connection.hh:74)

Description

Gazebo transport is failing while setting up a new connection between a new subscriber to an existing publisher on certain topic.

  • This is occurring within ConnectPubToSub due to PublicationPtr publication = this->FindPublication(_topic) being an uninitliazed shared_ptr<Publication>.

  • A possible reason being that TopicManager::FindPublication cannot find an existing Publication within TopicManager::advertisedTopics for the given topic. This might be caused due to an async call to TopicManager::Init or TopicManager::Fini via gazebo::transport::init or gazebo::transport::fini within the same process(while loading a plugin maybe?), which seems to clear TopicManager::advertisedTopics.

  • If TopicManager::advertisedTopics is cleared between a call to ConnectSubToPub and ConnectPubToSub, the exception will occur.. I have made an attempt to repro the crash with a bit of a hack here -- https://github.com/ojasjoshi/gzserver_transport_failure/blob/main/README.md

  • Checking the existence of PublicationPtr such as in ConnectSubToPub via TopicManager::UpdatePublications(_pub.topic(), _pub.msg_type()) might help fix this.

  • I am not sure what the purpose of TopicManager::Init() is? What are the best practices to call gazebo::transport::init explicitly?

@chapulina chapulina added bug Something isn't working transport labels Nov 9, 2020
@chapulina
Copy link
Contributor

I am not sure what the purpose of TopicManager::Init() is?

Looking at what that method does, it seems that it's meant to work more like a "reset" than an "init". Maybe the original intent was to be able to reinitialize the topic manager during runtime? I'm not sure.

https://github.com/osrf/gazebo/blob/063d7386ed5499c2883de48e19ff5fb9dec01dff/gazebo/transport/TopicManager.cc#L69-L75

What are the best practices to call gazebo::transport::init explicitly?

I believe that's meant to be called once per process that will use gazebo::transport.

Internally, Gazebo already calls it once per process during setup, so I think that no plugins should ever call it.

https://github.com/osrf/gazebo/blob/063d7386ed5499c2883de48e19ff5fb9dec01dff/gazebo/gazebo_shared.cc#L75

If you're writing a standalone program that uses gazebo::tranport, this example shows transport::init being called before transport::run:

https://github.com/osrf/gazebo/blob/063d7386ed5499c2883de48e19ff5fb9dec01dff/examples/plugins/model_move/path_publisher.cc#L41-L44

Checking the existence of PublicationPtr such as in ConnectSubToPub

Agreed that it should prevent your crash. But you may still end up with a disconnected pub and sub? Or maybe it will be attempted again? I think that's worth investigating.

@sgundry
Copy link

sgundry commented Mar 19, 2021

We have continued investigating. Interestingly, the crashes that we have collected so far all occur for these two built-in topics:

/gazebo/<world>/user_cmd_stats
/gazebo/<world>/modify

The nullptr crashes occur for "sub" requests.

How would a subscription request to a built-in topic fail due to TopicManager missing the topic (since it is not found, nullptr)? Maybe a race condition between the TopicManager singleton and Advertise and Subscribe... We will continue investigating, sharing this finding for now.

@emersonknapp
Copy link
Contributor

emersonknapp commented Apr 19, 2021

I have a reliable reproduction case for this. Source follows (builds using same settings as examples/stand_alone/custom_main.cc)

#include <unistd.h>
#include <stdio.h>
#include <signal.h>
#include <gazebo/gazebo.hh>
#include <gazebo/common/common.hh>
#include <gazebo/physics/physics.hh>
#include <gazebo/msgs/msgs.hh>
#include <gazebo/gazebo_client.hh>

volatile sig_atomic_t stop;
const std::string TOPIC = "/chatter";

void inthand(int signum) {
    stop = 1;
}

void cb(ConstGzStringPtr &_msg)
{
  // Dump the message contents to stdout.
  std::cout << _msg->data() << std::endl;
}

// Main function for the child process, runs client
void client_main(int argc, char** argv)
{
  gazebo::client::setup(argc, argv);

  gazebo::transport::NodePtr node(new gazebo::transport::Node());
  node->Init();
  gazebo::transport::SubscriberPtr sub = node->Subscribe(TOPIC, cb);
  while (!stop) {
    gazebo::common::Time::MSleep(10);
  }

  gazebo::client::shutdown();
}

// Main function for the parent process, runs server
void server_main(int argc, char **argv)
{
  gazebo::setupServer(argc, argv);
  gazebo::physics::WorldPtr world = gazebo::loadWorld("worlds/empty.world");

  gazebo::transport::NodePtr node(new gazebo::transport::Node());
  node->Init();

  gazebo::transport::PublisherPtr pub =
    node->Advertise<gazebo::msgs::GzString>(TOPIC);
  pub.reset();

  while (!stop) {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    gazebo::msgs::GzString msg;
    msg.set_data("arst");
    // can't use the publisher because it got deleted 
    // pub->Publish(msg);
  }

  gazebo::shutdown();
}

/////////////////////////////////////////////////
int main(int argc, char **argv)
{
  signal(SIGINT, inthand);
  if (fork() == 0) {
    client_main(argc, argv);
  } else {
     server_main(argc, argv);
  }
  return 0;
}

In summary, the above example does:

  1. Fork a child process and run the gazebo client in it
    • Create a node, create a subscription to the topic "/chatter"
  2. In the main process
    • Create a node, create a publisher advertising the topic "/chatter"
    • Delete the "/chatter" publisher

In this case we fork a process, rather than spawning a thread, because this forces the use of the master networked transport. This problem does not occur for pure-local transport.

Everything above is perfectly legal usage of the Gazebo APIs, however what happens is as follows (paraphrasing to skip irrelevant bits):

Server process

  • Node::Advertise ->
  • TopicManager::Advertise ->
  • TopicManager::UpdatePublications ->
  • ConnectionManager::Advertise -> this->masterConn->EnqueueMsg(msgs::Package("advertise", msg));

master

  • "advertise" triggers send of "publisher_advertise"

client process

  • ConnectionManager::ProcessMessage -> "publisher_advertise" enqueues TopicManagerConnectionTask
  • TopicManagerConnectionTask::execute ->
  • TopicManager::ConnectSubToPub -> new PublicationTransport ->
  • PublicationTransport::Init -> this->connection->EnqueueMsg(msgs::Package("sub", sub));

Server process again now

  • ConnectionManager::OnRead -> if (packet.type() == "sub") ->
  • TopicManager::ConnectPubToSub ->
    • PublicationPtr publication = this->FindPublication(_topic); -> publication == nullptr

finally! In summary, the communication takes the following path

 (Server "advertise")->(Master "publisher_advertise")->(Client "sub")->Server

By the time that message has made its networked round-trip, the server process has long-since called TopicManager::Unadvertise for the given topic, which removed it from advertisedTopics, meaning that FindPublication returns NULL, every time.

This causes a null pointer derefence crash in the server process - I would expect that the real expected behavior would be a "failed subscription" of some sort on the Client side. It saw that a topic existed, but by the time its subscription request went through, there was no such topic.

This case can, of course, happen in ways other than this particular pathological usage - however this gives us a good reproducible example.

@chapulina do you know, or have an idea who would know, what the correct behavior would really be here? I notice that the all uses of TopicManager::FindPublication check the result for null, except for ConnectPubToSub and DisconnectPubFromSub, which have the implicit (incorrect, but somewhat reasonable) assumption that they can only be called if a publication actually exists. My first inclination is to simply change this to:

void TopicManager::ConnectPubToSub(const std::string &_topic,
                                   const SubscriptionTransportPtr _sublink)
{
  PublicationPtr publication = this->FindPublication(_topic);
  if (publication) {
    publication->AddSubscription(_sublink);
  }
}

But, I'm not sure if that will break the subscription in the Client somehow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working transport
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants