rosbridge_server hanging randomly under load #425
Comments
Somebody else was having issues with Tornado 4.4.3; you might try 4.5.3. Yielding outside the lock disconnects websocket backpressure and can lead to OOM on the server if it can't keep up with outgoing messages. If your application is not very demanding, this is fine. |
Thanks for your help with this. This is a production system and raspbian.raspberrypi.org are currently shipping 4.4.3, so I'm not brave enough at this juncture to pursue that test. For information, here's our latest:
I'll report back if we learn anything further - thanks again. |
I hope to solve most Tornado issues by removing Tornado from rosbridge. |
Experiencing the same issues: some data is not being published anymore and some data is keeping its old value. Will try #426. |
What is the "hanging" you experience? We sometimes see old messages being received via the websocket on the client side, and some messages just not coming in anymore on the client side. |
The hanging occurred again with version 0.11.4. In my case, when the websocket hangs, I can still rosnode ping it and I can still netcat to its TCP port, but nothing else: for example, an HTTP GET on it as above doesn't give anything back. And when it starts hanging, clients get disconnected, they can't reconnect, and the /connected_clients list doesn't get updated. |
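For anyone who wants to script that liveness check, a minimal sketch (the host below is a placeholder; a healthy rosbridge endpoint answers a plain HTTP GET with its splash page, a hung one just times out):

```python
import urllib.request

# Probe the rosbridge WebSocket port with a plain HTTP GET.
# A responsive server returns the autobahn/tornado splash page;
# a hung one never answers and the request times out.
try:
    with urllib.request.urlopen("http://192.168.1.10:9090", timeout=5) as resp:
        print("server responded with HTTP", resp.status)
except Exception as exc:
    print("no response (possibly hung):", exc)
```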
I'll be working on an automated stress test to reproduce these kinds of issues. |
@mvollrath here are some more details of all the testing we did and their results. I haven't started diving into the code yet. @nickvaras the problem occurs when switching networks, as we reproduced it, but it is not limited to that. We managed to reproduce the problem as follows:
Note that this problem happens with 0.11.3 and 0.11.4, so it seems independent of the websocket implementation. We have had other occurrences of this hanging but couldn't reproduce them. Note as well that we ran the following stress test: opening about 10 front end instances at the same time so that the rosbridge server load would hit 100% CPU on our embedded PC when registering all topics. This didn't cause any issues. But when we did the same thing with just one client connecting as described above, it did. So I'm not sure at all that this is related to a load issue; it looks more like a TCP socket still alive problem... |
Ok, I'll see about automating a WebSocket client abort. The stress test was working but not revealing anything particularly interesting to this hanging problem. You might try making sure that all topic subscriptions have a non-zero queue_length. |
Thanks for the update @Axel13fr , that's encouraging progress. When the issue is reproduced, does the issue include the failure to display the "splash" screen, for lack of a better word, when visiting the websocket server on the browser? i.e., http://<ip_address>:9090
Thanks again!! |
@nickvaras yes it does include the failure of the websocket splash screen when the issue is reproduced. |
@mvollrath I tried a queue length of 1 and of 5 for all my subs but it didn't help. Reproducing the issue is easier under high load. The issue doesn't happen under kinetic on 16.04 (partially because its tornado version doesn't kick clients after ping timeout). On 18.04 with 0.11.4, when setting very high timeouts so that suddenly disconnected clients are not kicked out, the websocket is under high load on reconnection (probably due to full TCP buffers), ROS messages stop coming, BUT the websocket thread is still alive (the autobahn splash screen still responds). Compared to 16.04, the websocket still hung in some cases... |
I've seen this too. Furthermore, I've seen what appears to be a frozen topic for a few minutes, only to quickly unravel a bunch of buffered messages after a while, i.e., messages that publish at 1Hz stop coming through, and then all of a sudden they all come in quick succession at a much higher frequency. |
The bunch of buffered messages is due to the following: when the rosbridge server doesn't kick the client out on timeout, it assumes the client is still connected and continues to write on the TCP socket. The OS buffer of the TCP socket fills up, so that when your front end / client comes back online, it receives the whole buffered content as fast as the connection allows. But after that, I've seen the websocket hanging in some cases. |
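A toy sketch of that buffering effect, with a local socket pair standing in for the real TCP connection (the write size is arbitrary):

```python
import socket

# If the peer stops reading, the kernel send buffer eventually fills up and
# further writes can no longer proceed, which is what a silently disconnected
# websocket client looks like to the server.
sender, receiver = socket.socketpair()  # receiver never calls recv()
sender.setblocking(False)

sent = 0
try:
    while True:
        sent += sender.send(b"x" * 4096)  # succeeds until the buffer is full
except BlockingIOError:
    print("send buffer full after", sent, "bytes; further writes would block")
finally:
    sender.close()
    receiver.close()
```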
I tried to repro this with a couple of different mobile devices but couldn't quite hit the magic. Then I tried sending SIGSTOP to a Firefox client and got the condition where other clients are hanging. The problem in rosbridge is that while a The solution is to not hold the queue |
Cheers, I'll thoroughly test it next week on all our cases and variants. |
So the PR solves the issue when reconnecting a front end while rosbridge assumed the connection was still alive (no kick out by timeout). But if a ping timeout parameter is configured and the sudden disconnection exceeds it, resulting in rosbridge kicking the client, then on reconnection the hanging still occurs. |
This could be the issue with unregistration: when I isolated it today I found that under some conditions a reconnecting client can't register the topics it had previously if it was the only client on those topics. The known hack for this is to never unregister anything, see #399. You can try this branch to see if it fixes the issue now described. It remains to be seen whether this is something we can merge. |
We tested the branch that never unregisters anything, but it still hangs on reconnection after kicking the client due to timeout. In this case, the websocket server itself seems to hang, as HTTP GET requests on its port do not return the usual blue autobahn banner mentioned above by nickvaras. |
Yes, I tested the changes of 464 on a never-unregister clone, and it still hangs. |
Same here, tested using 464. What we are seeing:
Anybody who is experiencing this issue while using this branch (which includes #464) please comment with this survey on which features of rosbridge you are using, and maybe we can find correlation.
The application I'm working on does NOT experience the issue: My ROS distro: melodic
For our application: ROS Melodic:
In mine (does experience hang): My ROS distro: Melodic
Thank you, I've updated the survey with "authentication" and "bson encoding", please update if you use either of these features. Also let me know if there are any more features I'm forgetting or don't know about that might be relevant. |
Hi Gents, My ROS distro: Melodic
I'm running into this regularly when I publish a high frequency /tf topic at 100hz across the bridge. Client disconnects/reconnects are not possible and the entire bridge process appears to be deadlocked and needs to be killed. Only way to mitigate this is to slow down the publishing rate. @mvollrath I will try out your fix! Great timing! |
I'll roll out 0.11.7 with #496 and we will see what happens. Quite possible we will find more concurrency glitches in rosbridge_library since we've closed the door on deadlock and opened the door to race conditions. |
Note: this is for the standard released version, not mvollrath's fork, which I'm very eager to try when I have a chance. Not sure if this is mentioned above, might be old news, but I'll post it just so people can find it a bit easier: [ WARN ] [/rosbridge_websocket]: StreamClosedError: Tried to write to a closed stream |
We have tested #496 under high load; it seems to not hang anymore, but we've noticed 2 issues so far:
Thanks @Axel13fr , now working on thread safety for capabilities. |
Please try #500 which adds thread safety at capability level. |
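For anyone following along without reading the diff, the general shape of "thread safety at capability level" is a lock guarding each capability's shared bookkeeping; a purely illustrative sketch (not the actual rosbridge code):

```python
import threading

class SubscriptionBookkeeping(object):
    """Illustrative stand-in for per-capability shared state."""

    def __init__(self):
        self._lock = threading.Lock()
        self._clients = {}  # client_id -> subscription parameters

    def subscribe(self, client_id, params):
        # Guard mutations that may arrive from multiple client threads
        # so concurrent (un)subscriptions cannot race.
        with self._lock:
            self._clients[client_id] = params

    def unsubscribe(self, client_id):
        with self._lock:
            self._clients.pop(client_id, None)
```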
We tested #500, and the hanging still happened on that version: clients cannot connect to the socket anymore and an HTTP GET on the autobahn websocket port doesn't respond. |
Thanks @Axel13fr I'll add a handler for the ping kick exception instead as a quick fix, since we need to finish shutting down the connection. |
#502 should fix the badness caused by the ping kick error, but it will still be logged. |
@mvollrath we tested #502 and we can still cause it to hang. When we stopped ROS message transmission to the bridge subscribers, so that nothing was sent to the frontend, the hanging didn't happen. |
I've updated #502 to finish the protocol in the IncomingQueue thread, this should break the incoming/outgoing loop in onClose and unblock the reactor. There may be exceptions logged during shutdown. |
(Update: I'm beginning to believe I have a bit of this too: #437) Original post: Thank you @mvollrath for keeping the hope alive. Question: does #502 include #464? I tested today and it seems like an improvement regarding the hanging from sudden disconnection (the autobahn splash screen remains accessible, services over websocket continue working), but (at least some) topics ceased to come through when an additional client was connected, even though they are still being published (rostopic hz, etc.). After a 5-10 minute wait, the clients started receiving the topic again. I seem to be able to replicate the hanging at will by: That seems to do it. Also, I just timed how long it takes the topics to come back to life: ~16 minutes. |
Also, if you're not already, make sure all your topics have a non-zero queue_length. |
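In case it helps, here is what an explicit, non-zero queue_length looks like in a rosbridge v2 subscribe request (the topic and message type below are placeholders; if I remember the roslibjs API correctly, this corresponds to the queue_length option on ROSLIB.Topic):

```python
import json

# A rosbridge v2 "subscribe" op with an explicit queue_length, so the server
# keeps only a bounded backlog per client instead of an unbounded queue.
subscribe_msg = {
    "op": "subscribe",
    "topic": "/example/odom",       # placeholder topic
    "type": "nav_msgs/Odometry",    # placeholder message type
    "queue_length": 1,              # keep only the newest message
}
print(json.dumps(subscribe_msg))
```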
@mvollrath you are the man! This latest version along with explicit, non-zero queue_length for topics has finally done it. I did try non-zero queue_length on a previous server version and still couldn't get it to stay up. Now I can pull the plug on one client and other clients keep receiving their messages the way one would expect. I've spent the greater part of these last 20 hrs looking into this and finally things are looking up. You have my profound, heartfelt gratitude! This really helped! 👍 |
Releasing #502 in 0.11.8 for kinetic and melodic. Thanks to everybody who has helped track this down and test. @benmitch let me know how it goes, of course; my application only uses a small set of rosbridge's features, so we depend on your reports. |
Congratulations @mvollrath for supporting us all along on this. We have tested it along with the queue length in the frontend and so far there is no way to reproduce the hanging! We will continue production testing in the coming weeks and keep you posted. Shouldn't there be a default queue length of 1 in roslibjs to make sure this will not happen again to anyone? |
Hanging still seems to occur with |
When setting this |
This issue has been marked as stale because it has been open for 180 days with no activity. Please remove the stale label or add a comment to keep it open. |
This issue has the most detailed info about the bug that is still happening:
I believe this is also linked to the newer open issues #829, #765, and maybe #720. I am posting an update here because the following comment lists the exact steps we use to reproduce this issue:
Expected Behavior
rosbridge_server does not hang.
Actual Behavior
rosbridge_server hangs under some load (several connections, quite high rates (50Hz), but relatively little data throughput).
Steps to Reproduce the Problem
Specifications
- ROS version (`echo $ROS_DISTRO`): kinetic
- OS version (`grep DISTRIB_CODENAME /etc/lsb-release`): Raspbian GNU/Linux 9 (stretch)
- rosbridge_server version (`roscat rosbridge_server package.xml | grep '<version>'`): 0.11.2
- Tornado version (`python -c 'import tornado; print tornado.version'`): 4.4.3

We are having trouble with rosbridge_server (0.11.2) freezing when under a fair load (a number of connections on a somewhat flaky network, rather than high throughput). The freeze appears to be random, and occurs after some period, typically 10-30 seconds.
I investigated, and it looks like it's a race condition for _write_lock, similar to that in issue #403, but not solved by the change of #413 (which is present at 0.11.2).
After some testing, and reading the issue reports, I noticed that somebody had suggested that the behaviour of "yield" inside a "with lock" section wasn't certain. So I tried:
That is, I un-indented after the line "future =", so the lock is released once the call has been made, but before the yield call.
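For readers without the diff in front of them, a rough sketch of the kind of change being described, using stand-ins for the handler, lock, and write call rather than the exact 0.11.2 source:

```python
import threading

from tornado import gen
from tornado.concurrent import Future


class HandlerSketch(object):
    """Purely illustrative stand-in for the rosbridge websocket handler."""

    def __init__(self):
        self._write_lock = threading.RLock()

    def write_message(self, message, binary=False):
        # Stub: the real Tornado write_message returns a Future that resolves
        # once the message has been flushed to the client.
        future = Future()
        future.set_result(None)
        return future

    # Before: yielding on the write future while still holding _write_lock,
    # so one slow client's pending write stalls every other writer.
    @gen.coroutine
    def prewrite_locked(self, message, binary=False):
        with self._write_lock:
            future = self.write_message(message, binary)
            yield future

    # After (the un-indent described above): the lock is released as soon as
    # write_message() has been called; the yield happens outside the lock.
    @gen.coroutine
    def prewrite_unlocked(self, message, binary=False):
        with self._write_lock:
            future = self.write_message(message, binary)
        yield future
```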
Immediately, the behaviour seems to be fixed, and it has now been running robustly for several minutes without freezing again. I cannot comment at this stage on whether there have been any side effects, or data corruption, but the upstream signals look fine so far.