Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roscore in 1.13.6 is broken on macos: #1357

Closed
NikolausDemmel opened this issue Mar 30, 2018 · 37 comments
Closed

roscore in 1.13.6 is broken on macos: #1357

NikolausDemmel opened this issue Mar 30, 2018 · 37 comments

Comments

@NikolausDemmel
Copy link
Contributor

It throws errors like

process[rosout-1]: started with pid [72640]
libc++abi.dylib: terminating with uncaught exception of type std::length_error: vector
[rosout-1] process has died [pid 72640, exit code -6, cmd /opt/ros/lunar/lib/rosout/rosout __name:=rosout __log:=/Users/demmeln/.ros/log/8cd76b6b-345f-11e8-8884-6003088b8618/rosout-1.log].
log file: /Users/demmeln/.ros/log/8cd76b6b-345f-11e8-8884-6003088b8618/rosout-1*.log
[rosout-1] restarting process

Seems like it was working fine in 1.13.5. See also discussion at mikepurvis/ros-install-osx#114 . It seems like #1243 introduced the offending lines.

@NikolausDemmel
Copy link
Contributor Author

For the record, the immediate reason for the exception is that in

max_files = limit.rlim_max;

for some reason -1 is returned on macos...

@NikolausDemmel
Copy link
Contributor Author

But even with update of max_files commented out, there are further issues, since upon starting roscore it now shows the following errors:

[ERROR] [1522447653.950314000]: [registerService] Failed to contact master at [eddard:11311].  Retrying...
[ERROR] [1522447654.003783000]: [registerService] Failed to contact master at [eddard:11311].  Retrying...
[ERROR] [1522447654.062192000]: [registerPublisher] Failed to contact master at [eddard:11311].  Retrying..

@dirk-thomas
Copy link
Member

dirk-thomas commented Mar 30, 2018

According to the docs limit.rlim_max can be RLIM_INFINITY which is defined as -1. This case is not being catched in the current code.

@trainman419 @NikolausDemmel Can either of you create a pull request and check that it fixes the problem?

@NikolausDemmel
Copy link
Contributor Author

I'm not sure what the appropriate action is in that case? Sticking with the default of 1024? It would be great to get @trainman419's take on this.

However, even when reverting #1243, just like when leaving max_files with the default of 1024, I get the above Failed to contact master error. I have no clue, might be an unrelated regression...

I just checked overlaying ros_comm 1.13.5 (all packages), and with that roscore still works fine. (For it to compile, you need to cherry-pick #1239 and comment out the xmlrpcpp tests as suggested in mikepurvis/ros-install-osx#110 (comment))

@trainman419
Copy link
Contributor

I'll take a look and try to update tests for the RLIM_INFINITY case.

@trainman419
Copy link
Contributor

It's a bit of a hack, but lunar-devel...trainman419:lunar-devel will probably fix this.

I don't have an OSX machine handy and the existing tests don't mock out getrlimit(), so this is hard for me to test locally.

@trainman419
Copy link
Contributor

If you haven't already, please also try running the xmlrpcpp tests on OSX. I suspect the existing tests will catch this and may also show that my fix works.

@NikolausDemmel
Copy link
Contributor Author

Thanks for the quick response. I will test and report back.

@NikolausDemmel
Copy link
Contributor Author

@trainman419, unfortunately, I cannot run unit tests. When I try to compile them I get this:

[ 81%] Linking CXX shared library /Users/demmeln/work/ros/ros_comm_fix_ws/devel/.private/xmlrpcpp/lib/libtest_fixtures.dylib
cd /Users/demmeln/work/ros/ros_comm_fix_ws/build/xmlrpcpp/test && /usr/local/Cellar/cmake/3.11.0/bin/cmake -E cmake_link_script CMakeFiles/test_fixtures.dir/link.txt --verbose=1
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -O3 -DNDEBUG -dynamiclib -Wl,-headerpad_max_install_names  -o /Users/demmeln/work/ros/ros_comm_fix_ws/devel/.private/xmlrpcpp/lib/libtest_fixtures.dylib -install_name @rpath/libtest_fixtures.dylib CMakeFiles/test_fixtures.dir/test_fixtures.cpp.o  -L/opt/ros/lunar/lib -Wl,-rpath,/opt/ros/lunar/lib /usr/local/lib/libboost_system-mt.dylib /usr/local/lib/libboost_thread-mt.dylib /usr/local/lib/libboost_chrono-mt.dylib /usr/local/lib/libboost_date_time-mt.dylib /usr/local/lib/libboost_atomic-mt.dylib
Undefined symbols for architecture x86_64:
  "XmlRpc::XmlRpcValue::invalidate()", referenced from:
      XmlRpc::XmlRpcValue::operator=(char const*) in test_fixtures.cpp.o
  "XmlRpc::XmlRpcValue::operator=(XmlRpc::XmlRpcValue const&)", referenced from:
      XmlRpc::XmlRpcValue::operator=(char const*) in test_fixtures.cpp.o
  "XmlRpc::XmlRpcServer::bindAndListen(int, int)", referenced from:
      XmlRpcTest::SetUp() in test_fixtures.cpp.o
  "XmlRpc::XmlRpcServer::enableIntrospection(bool)", referenced from:
      XmlRpcTest::SetUp() in test_fixtures.cpp.o
  "XmlRpc::XmlRpcServer::work(double)", referenced from:
      XmlRpcTest::work() in test_fixtures.cpp.o
  "XmlRpc::XmlRpcServer::shutdown()", referenced from:
      XmlRpcTest::TearDown() in test_fixtures.cpp.o
  "XmlRpc::XmlRpcServer::XmlRpcServer()", referenced from:
      XmlRpcTest::XmlRpcTest() in test_fixtures.cpp.o
  "XmlRpc::XmlRpcServer::~XmlRpcServer()", referenced from:
      XmlRpcTest::XmlRpcTest() in test_fixtures.cpp.o
      XmlRpcTest::~XmlRpcTest() in test_fixtures.cpp.o
  "XmlRpc::setVerbosity(int)", referenced from:
      XmlRpcTest::TearDown() in test_fixtures.cpp.o
  "XmlRpc::XmlRpcServerMethod::XmlRpcServerMethod(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, XmlRpc::XmlRpcServer*)", referenced from:
      Hello::Hello(XmlRpc::XmlRpcServer*) in test_fixtures.cpp.o
  "XmlRpc::XmlRpcServerMethod::~XmlRpcServerMethod()", referenced from:
      XmlRpcTest::XmlRpcTest() in test_fixtures.cpp.o
      Hello::~Hello() in test_fixtures.cpp.o
      Hello::~Hello() in test_fixtures.cpp.o
      Hello::Hello(XmlRpc::XmlRpcServer*) in test_fixtures.cpp.o
      XmlRpcTest::~XmlRpcTest() in test_fixtures.cpp.o
  "testing::Test::Test()", referenced from:
      XmlRpcTest::XmlRpcTest() in test_fixtures.cpp.o
  "testing::Test::~Test()", referenced from:
      XmlRpcTest::XmlRpcTest() in test_fixtures.cpp.o
      XmlRpcTest::~XmlRpcTest() in test_fixtures.cpp.o
  "typeinfo for XmlRpc::XmlRpcServerMethod", referenced from:
      typeinfo for Hello in test_fixtures.cpp.o
  "typeinfo for testing::Test", referenced from:
      typeinfo for XmlRpcTest in test_fixtures.cpp.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [/Users/demmeln/work/ros/ros_comm_fix_ws/devel/.private/xmlrpcpp/lib/libtest_fixtures.dylib] Error 1
make[1]: *** [test/CMakeFiles/test_fixtures.dir/all] Error 2
make: *** [all] Error 2

See also: mikepurvis/ros-install-osx#110 (comment)

Are the unit tests using google-mock? That is currently not available on osx, see mikepurvis/ros-install-osx#110 (comment)

@NikolausDemmel
Copy link
Contributor Author

@trainman419, the patch works (with a small fix for signedness; turns out RLIM_INFINITY is unsgined int, so comparing to -1 signed int doesnt work).

With this some very basic local testing confirms that roscore works.

I still get these errors when starting roscore:

auto-starting new master
process[master]: started with pid [41674]
ROS_MASTER_URI=http://eddard:11311/

setting /run_id to 567d4f8c-34d0-11e8-b040-6003088b8618
process[rosout-1]: started with pid [41677]
started core service [/rosout]
[ERROR] [1522493005.013565000]: [registerService] Failed to contact master at [eddard:11311].  Retrying...
[ERROR] [1522493005.074404000]: [registerPublisher] Failed to contact master at [eddard:11311].  Retrying...
[ERROR] [1522493005.133195000]: [registerSubscriber] Failed to contact master at [eddard:11311].  Retrying...

It might have been there already earlier. Sometimes I get all three errors, sometimes only 2 or 1, sometimes none. Not sure which process those messages come from.

@mpflanzer
Copy link

I'm seeing the same "Failed to contact master" errors as @NikolausDemmel. I guess related to that is the problem that most of the time my callback functions for subscribed topics are not called. I need to restart the CPP node multiple times to be lucky that a simple subscriber program actually receives messages. Even though rostopic echo works fine every time. Seems to be a problem specific to the CPP part of ros_comm.

Since I have a MacBook I could and would be willing to help debugging this issue. Though I would need some guidance where to start looking.

@trainman419
Copy link
Contributor

@NikolausDemmel the error messages you're seeing are almost certainly coming from the rosout process.

I suspect there's some startup timing difference between Linux and OSX, and on OSX either the roscore node starts up more quickly, or the master starts up more slowly, resulting in a missed registration between the node and the master.

I suspect this is a result of some of the recent changes to error handling and recovery in xmlrpcpp, but it's hard to confirm because I have not had time to reproduce this locally.

@NikolausDemmel
Copy link
Contributor Author

NikolausDemmel commented Apr 2, 2018

@trainman419, what would be a good way to help narrow this down for @mpflanzer or myself? Maybe finding a commit where this doesn't happen and then bisecting to identify the commit that introduced this?

@mpflanzer, good point about roscpp vs rospy. This would explain, why my simple test with rostopic worked fine. I'd need to test again with roscpp Publishers / Subscribers.

@trainman419
Copy link
Contributor

Testing with the roscpp_tutorials and comparing to the python tutorials will help confirm that this is an issue with roscpp and not the master in general.

If you can get the unit tests to compile correctly, that will be a substantial benefit to confirm that the integration tests work the same way on OSX as they do on Linux. It looks like the errors are a result of trouble with the build files and are not related to gmock or any other external dependencies, so this should be solvable with build changes.

I think the oldest change that might have introduced this is #1216 . If you want to attempt a bisect, I would start with that change.

@NikolausDemmel
Copy link
Contributor Author

Ok thanks @trainman419, I'll see what I can do and when.

PS: I guess CI for mac OS would be nice in the future. I wounder how hard it would be to setup with travis these days. Might be a bit slow since you need to run build everything from source including dependencies, but that would be ok for testing only the main devel branches to detect regressions....

@mpflanzer
Copy link

I managed to compile the xmlrpcpp unit tests on macOS. The errors @NikolausDemmel was seeing are most likely caused by different compiler options between building the library and building the test. I had similar issues when linking against gtest (which I'm now linking statically).
The big issue is that -Wl,-wrap is unsupported on macOS and there doesn't seem to be an easy alternative. I ended up overloading the the actual functions with the mock counterparts instead of relying on the linker rewriting. I guess that's good enough to help debugging this issue but seems too much of a hack to commit it for regression tests.

Anyway, I manually ran all test binaries. Please find the respective outputs attached. In summary:

Test Status
HelloTest.log ok
TestValues.log ok
TestXml.log ok
test_base64.log ok
test_client.log ok
test_dispatch.log ok
test_dispatch_live.log fails & hangs
test_socket.log ok
test_ulimit.log fails
test_util.log ok

Are there other tests I should run as well? I'm now going to double check roscpp vs rospy.

@NikolausDemmel
Copy link
Contributor Author

@mpflanzer, cool, thanks for running the tests.

What version of the code are you testing?

Could you please still push your hacky workaround for the unit tests somewhere (maybe new branch on your fork of ros_comm)?

@mpflanzer
Copy link

I was using this archive: https://github.com/ros-gbp/ros_comm-release/archive/release/lunar/ros_comm/1.13.6-0.tar.gz
I could re-run with the current branch, though, looking at the commits it might not make much of difference. What is the best way to "clone" this repo into the catkin workspace? Or is the only way to copy the directories individually to the catkin src directory?

@mpflanzer
Copy link

My results from testing different combinations of rospy and roscpp:

Talker Listener Status
rospy rospy ok
rospy roscpp not ok
roscpp rospy ok
roscpp roscpp not ok

"no ok" means that I have to rerun the listener multiple times (~10-50) before messages are received. If messages are received there seem to be no further problems.
Gut feeling is that the CPP/CPP constellation is worse than the PY/CPP constellation. But I didn't actually count how many times I have to restart the listener.
When I use the CPP listener, independent of whether I can received messages or not, I get the already mentioned errors:

[ERROR] [1522767927.456924000]: [registerPublisher] Failed to contact master at [localhost:11311].  Retrying...
[ INFO] [1522767927.518072000]: Connected to master at [localhost:11311]
[ERROR] [1522767927.518382000]: [registerService] Failed to contact master at [localhost:11311].  Retrying...
[ INFO] [1522767927.580093000]: Connected to master at [localhost:11311]
[ERROR] [1522767927.604006000]: [registerSubscriber] Failed to contact master at [localhost:11311].  Retrying...
[ INFO] [1522767927.660666000]: Connected to master at [localhost:11311]

@NikolausDemmel
Copy link
Contributor Author

I could re-run with the current branch, though, looking at the commits it might not make much of difference. What is the best way to "clone" this repo into the catkin workspace? Or is the only way to copy the directories individually to the catkin src directory?

You can just clone the source repository into a new overlay test-ws, something like:

. /opt/ros/lunar/setup.bash
mkdir -p ~/test_ws/src
cd test_ws/src
git clone https://github.com/ros/ros_comm
cd ..
catkin init
# catkin config ...
catkin build

@trainman419
Copy link
Contributor

I was finally able to get ros_base from Lunar installed on my laptop. I've pushed a fix for the test linking issues to lunar-devel...trainman419:lunar-devel (I was under-linking some of the test libraries)

catkin no longer provides the catkin_add_gtest_build_flags() macro described on http://wiki.ros.org/gtest so I had to link against the gtest library by linking against ${GTEST_LIBRARIES} and assuming that catkin has already found it for me.

I'm now able to build the tests, but I can't run them with catkin test xmlrpcpp because the rostime tests are failing. I haven't used catkin build before, so any advice on how to get it to run the tests I've requested would be helpful.

I had a look at the test results from @mpflanzer , and it looks like the ulimit test is failing to set the ulimit at some point. This might be related to the fact that the ulimit is unlimited on OSX.

@mpflanzer I'm on OSX El Capitan and I have not had any trouble yet with the -Wl,-wrap option with my linker. Trying to remove these options and still run the tests will probably not correctly override the system call symbols in all cases.

@NikolausDemmel
Copy link
Contributor Author

I'm now able to build the tests, but I can't run them with catkin test xmlrpcpp because the rostime tests are failing. I haven't used catkin build before, so any advice on how to get it to run the tests I've requested would be helpful.

Not sure if that helps, but you can always find the build directory for each package in build/<pkg> and then execute make commands there directly.

@mpflanzer
Copy link

mpflanzer commented Apr 5, 2018

I'm now able to build the tests, but I can't run them with catkin test xmlrpcpp because the rostime tests are failing. I haven't used catkin build before, so any advice on how to get it to run the tests I've requested would be helpful.

You can do

catkin run_tests xmlrpcpp

That should build and run all tests. Here are some more infos: http://catkin-tools.readthedocs.io/en/latest/verbs/catkin_build.html#building-and-running-tests
Alternatively, you can run the tests manually. They are installed into devel/.private/xmlrpcpp/lib/xmlrpcpp/

edit: Forgot to mention that you might need to abort the run_tests at some point if its also hanging.

@mpflanzer I'm on OSX El Capitan and I have not had any trouble yet with the -Wl,-wrap option with my linker. Trying to remove these options and still run the tests will probably not correctly override the system call symbols in all cases.

Did you double-check that all tests have actually been build? I had the problem that for some reason catkin build didn't build everything and only running catkin run_tests would build all tests and fail with errors about -Wl,-wrap.
The risks that not all symbols are correctly overriden is why I wouldn't want to commit my workaround. But for testing I added some output to be sure that the mock functions are called and not the real system functions.

@trainman419
Copy link
Contributor

I'm used to having tests build with the rest of my build, but apparently catkin build doesn't build tests.

Now that I'm running catkin run_tests xmlrpcpp and I've resolved several other issues with outdated packages on my system, I'm seeing linker errors about --wrap.

@mpflanzer
Copy link

With these changes it builds for me: trainman419/ros_comm@lunar-devel...mpflanzer:lunar-devel

@mpflanzer
Copy link

I just compiled with with ros_comm version 1.13.4. I don't see any errors when running talker/listener and no need to restart the nodes multiple times. Will try a bisect now.

@mpflanzer
Copy link

According to git bisect the following is the first commit where the talker doesn't receive messages. Some earlier commit already introduced the "Failed to contact master" errors but the connection still worked fine.

a0f4cc3 is the first bad commit
commit a0f4cc3
Author: Guillaume Autran gautran@clearpathrobotics.com
Date: Fri Feb 2 12:29:50 2018 -0500

Topic subscription scalability fix (#1281)

Address a long standing issue where the `sockets_changed_` flag would never be reset causing the native poll set to be re-created everytime doing a poll.

:040000 040000 0e1212fe1a26ade2bc0af8c28ad92dbca635065a 7f507d95ca2c984d5beedc885c42f34a0e07f5e6 M clients

@mpflanzer
Copy link

The first commit showing the "Failed to contact master" errors is:

04961fe is the first bad commit
commit 04961fe
Author: Austin namniart+github@gmail.com
Date: Thu Nov 16 15:56:04 2017 -0800

Unit tests for XmlRpcDispatch (#1232)

* Unit tests for XmlRpcDispatch

* Fix handling of out-of-band (Exception) request and and event
processing to match previous select() implementation.
* Update comments to describe the events that actually trigger and
Exception state.
* Unit tests for XmlRpcDispatch to verify that it calls poll()
correctly, requests the correct events and calls the correct handler
for those events.

* Cleanup from internal review

* revert unnecessary whitespace changes

:040000 040000 4129deb486da81a74d9959ba92ed517232c6d5ef 11d27d0a433ec5c993721b0d8de3cb419cbba0b6 M utilities

@trainman419
Copy link
Contributor

The errors about "Failed to contact master" match my own debugging tonight. It looks like the semantics of the event flags returned by poll() are slightly different on OSX, and in a couple of cases this is preventing proper handling of a dropped connection. (This shows up in the test_dispatch_live failures).

I've also pushed some changes to my branch to build on OSX while maintaining the symbol overriding on Linux.

@Tobias-Fischer
Copy link
Contributor

Hi, are there any news on this issue? I just ran into it as well.

@dirk-thomas
Copy link
Member

dirk-thomas commented May 2, 2018

@NikolausDemmel @trainman419 @mpflanzer What is the status of this ticket?

@trainman419
Copy link
Contributor

I have a branch with compilation fixes for some of the issues reported in XmlRpcpp: #1392, but according to @mpflanzer it also looks like the changes introduced by @guillaumeautran in #1281 are causing subscriber issues on OSX.

Unfortunately I don't have the time to dig further into the issues introduced by #1281, and it's difficult to verify all of my fixes without also fixing those issues.

@guillaumeautran
Copy link
Contributor

Ok, I may have found the issue for the subscriber issues introduced by #1281. OSX does not have the epoll system call so it is using the poll system call instead.
When I did the fix for #1281 I missed two spots where we should trigger the rebuild of the ufds_ set. Here poll_set.cpp#L142 and here poll_set.cpp#L164.
Sorry for that. PR is incoming.

Would someone with access to OSX be able to double check the fix please?

@mpflanzer
Copy link

Nice 👍 I should be able to give it a try tomorrow

guillaumeautran added a commit to clearpathrobotics/ros_comm that referenced this issue May 9, 2018
guillaumeautran added a commit to clearpathrobotics/ros_comm that referenced this issue May 9, 2018
guillaumeautran added a commit to clearpathrobotics/ros_comm that referenced this issue May 9, 2018
@guillaumeautran
Copy link
Contributor

@mpflanzer PR is up #1393, I had to update the CMakeList.txt to allow me to build and run the unit tests.

guillaumeautran added a commit to clearpathrobotics/ros_comm that referenced this issue May 9, 2018
@mpflanzer
Copy link

I just gave it a try and can confirm that PR #1393 fixes the connection issues introduced by #1281

mikepurvis pushed a commit that referenced this issue May 10, 2018
* Change to force a rebuild of the pollset on flag changes

This change fixes issues seen in #1357, resulting from #1281
@mikepurvis
Copy link
Member

Fixed by #1393.

dirk-thomas pushed a commit that referenced this issue Aug 9, 2018
* Change to force a rebuild of the pollset on flag changes

This change fixes issues seen in #1357, resulting from #1281
dirk-thomas pushed a commit that referenced this issue Aug 20, 2018
* Change to force a rebuild of the pollset on flag changes

This change fixes issues seen in #1357, resulting from #1281
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants