Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Fix subscription busy wait (melodic-devel) #1608

Closed

Conversation

peci1
Copy link
Contributor

@peci1 peci1 commented Jan 31, 2019

PR showing a solution for bug #1545 .

This is not yet meant to be merged - treat it like I'm offering a solution open for comments.

This PR consists of two commits.

The first one solves the issue in a backwards-compatible way, however I've noticed that it's performance when calling CallbackQueue::addCallback rapidly drops compared to upstream (to thousands/sec compared to hundreds of thousands/sec).

The second one takes it further, breaking ABI and API. This solution doesn't suffer from the performance issues, or at least not as much (there is some mutex locking added, so it is probably a bit slower, but taking into account that the upstream now fully throttles the CPU...)

What should the tests show: calling callOne() too often (as in upstream) is quite expensive, and since upstream enters a busy-wait-loop around callOne, it throttles 10 cores to maximum. The fixed versions should be much more friendly to your CPU, which is shown by the fact that it does not call excessive numbers of callOne (roughly 2-3 times the number of subscription callbacks pushed to the queue).

Another interesting metric showed to be the number of subscription callbacks that it's possible to push in the queue in the given time-frame. As I stated earlier, the version from first commit has a very strong drop in this metric. It might be interesting to test it with some real pusblisher/subscriber system, but I haven't yet had time to test it... The numbers are in the logs below. In the "callbacks out of xxx" , xxx stands for the number of subscription callbacks pushed to the queue.

Upstream version test log:

[ RUN      ] slow/CallbackQueueParamTest.threadedCallOneSlow/0
[ INFO] [1548938917.404902150]: Callback queue processed 3 callbacks out of 667009
[ INFO] [1548938917.404940298]: callOne() was called 3758024 times.
[ INFO] [1548938917.404954253]: ready() was called 3758014 times.
[ RUN      ] slow/CallbackQueueParamTest.threadedCallOneSlow/1
[ INFO] [1548938920.505888627]: Callback queue processed 3 callbacks out of 20
[ INFO] [1548938920.505926685]: callOne() was called 3337665 times.
[ INFO] [1548938920.505950341]: ready() was called 3337639 times.
[ RUN      ] slow/CallbackQueueParamTest.threadedCallOneSlow/2
[ INFO] [1548938923.608300818]: Callback queue processed 3 callbacks out of 1886
[ INFO] [1548938923.608333159]: callOne() was called 3068045 times.
[ INFO] [1548938923.608353024]: ready() was called 3068034 times.
[ RUN      ] slow/CallbackQueueParamTest.threadedCallOneSlow/3
[ INFO] [1548938925.709150726]: Callback queue processed 3 callbacks out of 419769
[ INFO] [1548938925.709195557]: callOne() was called 791304 times.
[ INFO] [1548938925.709217588]: ready() was called 625407 times.
[ RUN      ] slow/CallbackQueueParamTest.threadedCallOneSlow/4
[ INFO] [1548938927.810144076]: Callback queue processed 3 callbacks out of 355726
[ INFO] [1548938927.810197815]: callOne() was called 530189 times.
[ INFO] [1548938927.810222560]: ready() was called 360343 times.
[ RUN      ] slow/CallbackQueueParamTest.threadedCallOneSlow/5
[ INFO] [1548938935.456306184]: Callback queue processed 3 callbacks out of 15238870
[ INFO] [1548938935.456362651]: callOne() was called 15238871 times.
[ INFO] [1548938935.456381335]: ready() was called 15238870 times.
[ RUN      ] slow/CallbackQueueParamTest.threadedCallOneSlow/6
[ INFO] [1548938941.061697745]: Callback queue processed 3 callbacks out of 3588249
[ INFO] [1548938941.061720978]: callOne() was called 9609083 times.
[ INFO] [1548938941.061725780]: ready() was called 9609081 times.

This PR test log:

[ RUN      ] slow/CallbackQueueParamTest.threadedCallOneSlow/0
[ INFO] [1548938720.093230241]: Callback queue processed 3 callbacks out of 889294
[ INFO] [1548938720.093278796]: callOne() was called 985194 times.
[ INFO] [1548938720.093297881]: ready() was called 193324301 times.
[ RUN      ] slow/CallbackQueueParamTest.threadedCallOneSlow/1
[ INFO] [1548938723.201363924]: Callback queue processed 3 callbacks out of 20
[ INFO] [1548938723.201433658]: callOne() was called 301 times.
[ INFO] [1548938723.201456722]: ready() was called 3136 times.
[ RUN      ] slow/CallbackQueueParamTest.threadedCallOneSlow/2
[ INFO] [1548938726.309720284]: Callback queue processed 3 callbacks out of 1616
[ INFO] [1548938726.309784458]: callOne() was called 1959 times.
[ INFO] [1548938726.309807322]: ready() was called 287749 times.
[ RUN      ] slow/CallbackQueueParamTest.threadedCallOneSlow/3
[ INFO] [1548938728.413855536]: Callback queue processed 3 callbacks out of 1120820
[ INFO] [1548938728.413915389]: callOne() was called 1298560 times.
[ INFO] [1548938728.413951201]: ready() was called 15516349 times.
[ RUN      ] slow/CallbackQueueParamTest.threadedCallOneSlow/4
[ INFO] [1548938730.514903314]: Callback queue processed 3 callbacks out of 107694
[ INFO] [1548938730.514953862]: callOne() was called 154534 times.
[ INFO] [1548938730.514977241]: ready() was called 232839 times.
[ RUN      ] slow/CallbackQueueParamTest.threadedCallOneSlow/5
[ INFO] [1548938738.808855148]: Callback queue processed 3 callbacks out of 17052751
[ INFO] [1548938738.808911622]: callOne() was called 17052752 times.
[ INFO] [1548938738.808930124]: ready() was called 34105502 times.
[ RUN      ] slow/CallbackQueueParamTest.threadedCallOneSlow/6
[ INFO] [1548938746.801665835]: Callback queue processed 3 callbacks out of 6715023
[ INFO] [1548938746.801720953]: callOne() was called 7147840 times.
[ INFO] [1548938746.801738988]: ready() was called 172077382 times.

@dirk-thomas dirk-thomas changed the title Fix subscription busy wait (melodic-devel) [WIP] Fix subscription busy wait (melodic-devel) Jan 31, 2019
@peci1
Copy link
Contributor Author

peci1 commented Feb 1, 2019

Hmm, this is weird. The results of the tests are quite different from what I get on my machines (I tested on several of them with bionic/melodic). Is there something special about the CI containers? Do the tests run in parallel?

@dirk-thomas
Copy link
Member

The results of the tests are quite different from what I get on my machines (I tested on several of them with bionic/melodic).

Let's run them again to see if this was a fluke: @ros-pull-request-builder retest this please

Is there something special about the CI containers?

I don't think so. You can run the very same process locally: https://github.com/ros-infrastructure/ros_buildfarm/blob/master/doc/jobs/devel_jobs.rst#run-the-devel-job-locally

Do the tests run in parallel?

No.

@peci1 Can you please check #1557 and comment how this patch compares to the other one.

@dirk-thomas
Copy link
Member

@peci1 Friendly ping.

@peci1
Copy link
Contributor Author

peci1 commented Mar 4, 2019

I'm sorry for the "busy wait" on my side :)

I think the mechanics of the causes of bug #1545 (solved by this PR) and #1343 (solved by PR #1557) are completely different.

In #1343, the problem is mismatch between clock types which sometimes result in zero-time waits in places which are supposed to break a busy-wait loop. This only affects callAvailable() (according to the bug reporter).

In #1545, the busy wait is caused by returning TryAgain from SubscriptionQueue::call() without making sure the next trial will not happen right away. Which it does in case multiple threads work with the subscription queue. This bug only affects callOne(). See #1545 (comment) for details.

@dirk-thomas
Copy link
Member

@peci1 Thanks for clarifying.

I will keep this PR open until there is a patch which doesn't break API/ABI and (if possible) doesn't affect performance significantly. (Unfortunately I don't have the bandwidth to comment on the approach and suggest alternatives. My time is limited to review / merge patches which are ready as is.)

@peci1
Copy link
Contributor Author

peci1 commented Jun 18, 2019

Followup in #1684 .

@peci1 peci1 closed this Jun 18, 2019
cwecht pushed a commit to cwecht/ros_comm that referenced this pull request Feb 15, 2020
cwecht pushed a commit to cwecht/ros_comm that referenced this pull request Jul 30, 2020
dirk-thomas added a commit that referenced this pull request Jul 31, 2020
* Fix a busy-wait loop in subscription queue.

* Better fix for slow callbacks CPU throttling.

* More versatile callback queue test.

* Finished cherry-pick merge to melodic-devel.

* libros: moved define

* roscpp: implementet #1608 without ABI/API breaks

* /test_roscpp: fake_message is in a header now...

* test_roscpp: fixed sign-compare warning

* stabilized test

* CallbackQueue: use SteadyTime instead of WallTime to get independent of system-time changes

* style only

* Update clients/roscpp/include/ros/callback_queue.h

Co-authored-by: Johannes Meyer <johannes@intermodalics.eu>

Co-authored-by: Martin Pecka <peci1@seznam.cz>
Co-authored-by: CTU base <robot@ctu-base>
Co-authored-by: Martin Pecka <peckama2@fel.cvut.cz>
Co-authored-by: Christopher Wecht <christopher.wechtstudent.kit.edu>
Co-authored-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>
Co-authored-by: Johannes Meyer <johannes@intermodalics.eu>
meyerj added a commit to meyerj/ros_comm that referenced this pull request Oct 7, 2024
* Better fix for slow callbacks CPU throttling.

* More versatile callback queue test.

* Finished cherry-pick merge to melodic-devel.

* libros: moved define

* roscpp: implementet ros#1608 without ABI/API breaks

* /test_roscpp: fake_message is in a header now...

* test_roscpp: fixed sign-compare warning

* stabilized test

* CallbackQueue: use SteadyTime instead of WallTime to get independent of system-time changes

* style only

* Update clients/roscpp/include/ros/callback_queue.h

Co-authored-by: Johannes Meyer <johannes@intermodalics.eu>

Co-authored-by: Martin Pecka <peci1@seznam.cz>
Co-authored-by: CTU base <robot@ctu-base>
Co-authored-by: Martin Pecka <peckama2@fel.cvut.cz>
Co-authored-by: Christopher Wecht <christopher.wechtstudent.kit.edu>
Co-authored-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>
Co-authored-by: Johannes Meyer <johannes@intermodalics.eu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants