Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using poll() in favor of select() in the XmlRPCDispatcher (extended) #1133

Closed

Conversation

kmactavish
Copy link
Contributor

@kmactavish kmactavish commented Aug 12, 2017

This replaces #1123. It builds on #833 to replace select() with poll() in the XmlRPCDispatcher. It:

@kmactavish kmactavish changed the title Xmlrpcdispatch poll redux Xmlrpcdispatch poll redux (extended) Aug 12, 2017
@kmactavish kmactavish changed the title Xmlrpcdispatch poll redux (extended) Using poll() in favor of select() in the XmlRPCDispatcher (extended) Aug 12, 2017
@kmactavish
Copy link
Contributor Author

kmactavish commented Aug 12, 2017

Builds 295-303 were triggered, since #1123 has passed/failed stochastically.
Edit: roughly half of them failed 😢.

Timing related

  1. testtimetest_sim took > 60s.
  2. testtime_sequencer took > 60s.
  3. testbuiltin_types took > 60s.
  4. testtest_subscriber took > 60s.

XmlRpc related

  1. test_random_play random_play.py:155 (abs(input_time - expect_time) < .5)
  2. name_remapped/parameterRemapping name_remapping.cpp:52 ros::param::get("mapfrom", param)
  3. service_call/callSrv service_call.cpp:56 ros::service::call("service_adv", req, res)
  4. name_remapped_with_ns/parameterRemapping name_remapping_with_ns.cpp:53 ros::param::get("test_full", param)

@kmactavish
Copy link
Contributor Author

kmactavish commented Aug 12, 2017

Just noticed that 9545294 loses the code attribution for @mgrrx / @jspricke from the original commits in #833. Edit: I cherry-picked the old commit and added mine on top.

* Using poll() in favor of select() in the XmlRPCDispatcher

* ros#832: initialize pollfd event with 0. Added check for POLLHUP and POLLNVAL

* Fix syntax
@kmactavish
Copy link
Contributor Author

kmactavish commented Aug 12, 2017

@dirk-thomas jobs 312-314 are hung. Sorry... I don't think much changed since I triggered 8 jobs (which all finished normally), but something went wrong. Oh good, there's a 2 hour timeout 😝.

I think that there was originally a problem where spurious events were getting triggered, which errored out in the handler, removing that source. That failed tests, but didn't hang. Fixing that meant that the handlers don't get removed, but there is a deeper problem being exposed. I'm concerned about iterator counts / invalidation (there is mention of iterators disappearing, and _sources size changing during handleEvents) running alongside the index of pollfds, so I changed the source iteration to an array as well.

@kmactavish
Copy link
Contributor Author

kmactavish commented Aug 12, 2017

316 was successful, triggered 318-322. Only one failure this time: name_remapped_with_ns/parameterRemapping name_remapping_with_ns.cpp:58

@kmactavish
Copy link
Contributor Author

kmactavish commented Aug 12, 2017

Triggered 341-348 for c66da35 (not checking POLLHUP). All passed except 345, which failed with the same unknown failure that devel has (see #1135)! Woo! Doing a bit cleanup now.

@kmactavish
Copy link
Contributor Author

kmactavish commented Aug 12, 2017

A reference simulating select using poll in linux is cloudius-systems/osv#35, specifically cloudius-systems/osv@b53d39a. The tcp poll implementation is at torvalds/linux:net/ipv4/tcp.c.

@kmactavish
Copy link
Contributor Author

kmactavish commented Aug 12, 2017

Triggered 352-359 on 6fc346e (checking exceptions even if not read/write). Update: The only failures were stochastic: a timing test and a compression ratio test, both addressed in #1135!

@kmactavish
Copy link
Contributor Author

I'm happy with this now, I'm going to squash my commits, and trigger a bunch more builds.

This commit makes sure that the poll flags emulate select (which it replaces).
It also double-checks event types to make sure they were requested (e.g. POLLERR might trigger both).
It keeps track of fd/src relationship through two parallel arrays, instead of an iterator / array hybrid.
@kmactavish
Copy link
Contributor Author

kmactavish commented Aug 12, 2017

Triggered builds 367,371-376 on the squashed commit (same code as builds 352-359, 6fc346e). Update: They all passed except one container that got disconnected (and built successfully after)! Woo!

@kmactavish
Copy link
Contributor Author

@dirk-thomas Thanks for the style edits. The failed build 383 for f01b3f9 is concerning. That is one of the stochastic failures that I have seen in the past for this PR, but didn't crop up in the previous 16 or so builds. How critical is the result of this test? If it needs to be rock solid, I'm afraid this PR is still not ready...

@dirk-thomas
Copy link
Member

Thank you to everyone for working on these patches!

I squashed my changes into the first commit and cherry-picked them in 0ea685c and 5f6ab7c.

@kmactavish
Copy link
Contributor Author

kmactavish commented Aug 14, 2017

Hopefully all is well, but for future reference, that test that failed was:

[test_roscpp.rosunit-subscribe_star/switchTypeInter][FAILURE]-------------------
/tmp/catkin_workspace/src/ros_comm/test/test_roscpp/test/src/subscribe_star.cpp:247
Value of: 0U
  Actual: 0
Expected: sub.getNumPublishers()
Which is: 1

I think it's a case of not realizing that the publisher disappeared. This could be cause by POLLHUP not being set in poll() when the publisher disappears. It should be, which causes an attempted read, which fails, which indicates that the publisher is gone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants