rosbridge_websocket: WebSocket opening handshake timed out #403
I've got more information: I managed to reproduce the bug on our project. As a workaround, I've created a node that can kill the rosbridge so that roslaunch restarts it. Interesting facts:
If someone has the same problem, here is the node I wrote: https://gist.github.com/Alabate/9b2467018503e8d96c366e3448d34d2c. I know it's not an easy bug to troubleshoot because I don't know how to trigger it, but I would like some help gathering more information.
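For illustration, here is a minimal sketch of what such a watchdog can look like (this is not the gist's code; it assumes the websocket-client package and a rosbridge node named /rosbridge_websocket launched with respawn="true"):

#!/usr/bin/env python
import subprocess

import rospy
import websocket  # pip install websocket-client

def handshake_ok(url, timeout):
    # Try a real WebSocket handshake; this is the step that hangs when the bug occurs.
    try:
        ws = websocket.create_connection(url, timeout=timeout)
        ws.close()
        return True
    except Exception:
        return False

def main():
    rospy.init_node('rosbridge_watchdog')
    url = rospy.get_param('~url', 'ws://localhost:9090')
    timeout = rospy.get_param('~timeout', 5.0)
    rate = rospy.Rate(0.2)  # one check every 5 seconds
    while not rospy.is_shutdown():
        if not handshake_ok(url, timeout):
            rospy.logwarn('rosbridge handshake failed, killing the node')
            # roslaunch restarts the node thanks to respawn="true".
            subprocess.call(['rosnode', 'kill', '/rosbridge_websocket'])
        rate.sleep()

if __name__ == '__main__':
    main()
|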
I managed to trigger it manually! I've made a Python script that connects to the websocket, subscribes to /client_count, and reads one message, running many of these connections from a pool of processes. With 3 processes or more, it fails nearly instantly. The "test" script stops printing, and if you kill it, you can see that the bridge is left in the stuck state. Here is the script, with 8 processes in the pool:

#!/usr/bin/env python3
import asyncio
from multiprocessing import Pool

import websockets

class RosbridgeTester():
    def __init__(self):
        with Pool(8) as p:
            p.map(self.hello_process, range(100))

    def hello_process(self, i):
        asyncio.get_event_loop().run_until_complete(self.hello(i))

    async def hello(self, i=0):
        async with websockets.connect('ws://localhost:9090') as websocket:
            print(i)
            req = '{ "op": "subscribe", "topic": "/client_count", "type": "std_msgs/Int32" }'
            await websocket.send(req)
            print(f"{i} > {req}")
            res = await websocket.recv()
            print(f"{i} < {res}")

if __name__ == "__main__":
    rosbridge_tester = RosbridgeTester()

Here is the full log:
|
I've done a little more digging, and the hanging call is the acquisition of self._write_lock in prewrite_message. Now that we are in generator-based coroutine territory, it seems that the lock stays locked when the generator holding it is suspended at a yield. Here is a small experiment:

import threading

lock = threading.Lock()

def check_locked():
    if lock.acquire(False):
        print("Not locked")
        lock.release()
    else:
        print("LOCKED")

def gen_func():
    yield 0
    with lock:
        yield 1
    yield 2

gen = gen_func()
# At first the lock shouldn't be locked
check_locked()
# It shouldn't be locked after the first yield either
print(next(gen))
check_locked()
# What happens at the second yield, within the `with`?
print(next(gen))
check_locked()
# The last one shouldn't be locked either
print(next(gen))
check_locked()

And the output:

Not locked
0
Not locked
1
LOCKED
2
Not locked
Can be closed if #408 is accepted. |
There is another condition under which the lock is released: garbage collection of the generator.

#!/usr/bin/env python3
import gc
import threading

lock = threading.Lock()

def check_locked():
    if lock.acquire(False):
        print("Not locked")
        lock.release()
    else:
        print("LOCKED")

def gen_func():
    with lock:
        yield 0

gen = gen_func()
# At first the lock shouldn't be locked
check_locked()
# We expect it to be locked here
print(next(gen))
check_locked()
# But not after garbage collection
gen = None
gc.collect()
check_locked()
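This is consistent with how CPython finalizes generators: collecting a suspended generator calls its close() method, which raises GeneratorExit at the suspension point, so the with block's exit handler runs and releases the lock. The same thing can be shown explicitly (my sketch, not from the thread):

import threading

lock = threading.Lock()

def gen_func():
    with lock:
        yield 0

gen = gen_func()
next(gen)             # suspended inside the `with`, so the lock is held
print(lock.locked())  # True
gen.close()           # raises GeneratorExit at the yield
print(lock.locked())  # False: the `with` block released the lock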
|
The return value of self.write_message is a Future.
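In Tornado >= 4.3 that Future resolves once the message has been written out, so a coroutine can wait for the write to complete. A tiny sketch (write_and_wait is a placeholder name; handler stands for the WebSocketHandler instance):

from tornado import gen

@gen.coroutine
def write_and_wait(handler, message):
    future = handler.write_message(message)  # returns a Future since Tornado 4.3
    yield future  # the coroutine resumes when the write has completed
|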
For example, if you take the lock and the inner generator is never exhausted, the lock is never released. Don't you wish Tornado had features to solve this problem?
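The feature being hinted at is presumably tornado.locks.Lock (Tornado >= 4.2), a coroutine-aware lock that is released when its with block exits, even if the coroutine is never resumed. A minimal sketch of the documented pattern (guarded_write and do_write are placeholder names):

from tornado import gen, locks

write_lock = locks.Lock()

@gen.coroutine
def guarded_write(do_write, message):
    # acquire() returns a Future; yielding it produces a context manager
    # that releases the lock when the block exits.
    with (yield write_lock.acquire()):
        yield do_write(message)
|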
I don't see any 'inner generator'; are you talking about prewrite_message? Anyway, the idea is the same: we need to avoid relying on the garbage collector and ensure the lock is always released.
I managed to do a POC using add_done_callback() on the future returned by write_message:

@coroutine
def prewrite_message(self, message, binary):
    self._write_lock.acquire()
    future_handle = None
    # Use a try block because the log decorator doesn't cooperate with @coroutine.
    try:
        future_handle = self.write_message(message, binary)
    except WebSocketClosedError:
        rospy.logwarn('WebSocketClosedError: Tried to write to a closed websocket')
        raise
    except BadYieldError:
        # Tornado <4.5.0 doesn't like its own yield and raises BadYieldError.
        # This does not affect functionality, so pass silently only in this case.
        if tornado_version_info < (4, 5, 0, 0):
            pass
        else:
            _log_exception()
            raise
    except:
        _log_exception()
        raise
    finally:
        # On failure, release the lock here; on success the done callback
        # below takes over the release.
        if future_handle is None:
            self._write_lock.release()

    # Ensure the lock is released when write_message actually completes.
    if future_handle is None:
        rospy.logwarn('write_message was canceled')
    else:
        future_handle.add_done_callback(lambda f: self._write_lock.release())
    yield future_handle |
Also, I don't understand the need for locking here:

def send_message(self, message):
    if type(message) == bson.BSON:
        binary = True
    elif type(message) == bytearray:
        binary = True
        message = bytes(message)
    else:
        binary = False

    with self._write_lock:
        IOLoop.instance().add_callback(partial(self.prewrite_message, message, binary))

We are just adding the function as a callback; it is not executed right away (and we will take the lock anyway when we actually write in Tornado). And as the documentation says, add_callback is thread-safe.
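In other words (my reading of this comment, not code from the thread), send_message could presumably drop the lock entirely, since add_callback is the one IOLoop method documented as thread-safe and the write itself is already serialized by self._write_lock inside prewrite_message:

def send_message(self, message):
    if type(message) == bson.BSON:
        binary = True
    elif type(message) == bytearray:
        binary = True
        message = bytes(message)
    else:
        binary = False

    # No lock here: add_callback is thread-safe, and prewrite_message
    # takes self._write_lock before actually writing.
    IOLoop.instance().add_callback(partial(self.prewrite_message, message, binary))
|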
Having a blocking function in a coroutine is part of this exercise: it is waiting on an event from another thread. The best case is:

The order in which threads waiting for a lock acquire it is undefined and may vary across implementations, so we can't expect true FIFO, but the result should always be that the infinite Tornado queue is reduced to a few messages on deck for each topic. Controlling and optimizing this behavior would be an interesting project. At some point Tornado is probably just getting in the way.
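One direction for that project (my speculation, not something proposed in the thread) would be to replace the unbounded callback queue with a small bounded queue per topic that drops stale messages, e.g. with tornado.queues:

from tornado import gen, queues

# A small bounded queue; old messages are dropped instead of piling up
# in Tornado's unbounded callback queue.
queue = queues.Queue(maxsize=2)

def enqueue(message):
    try:
        queue.put_nowait(message)
    except queues.QueueFull:
        queue.get_nowait()  # drop the oldest message on deck
        queue.put_nowait(message)

@gen.coroutine
def writer(do_write):
    while True:
        message = yield queue.get()
        yield do_write(message)
|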
Unfortunately, that didn't pan out. However, playing around with the code, I found another way to fix the issue:

def prewrite_message(self, message, binary):
    # Use a try block because the log decorator doesn't cooperate with @coroutine.
    try:
        with self._write_lock:
            future_handle = self.write_message(message, binary)

        # When closing, self.write_message() returns None, even though that is
        # an undocumented return value. Treat it as a WebSocketClosedError.
        if future_handle is None:
            raise WebSocketClosedError

        yield future_handle
    except WebSocketClosedError:
        rospy.logwarn('WebSocketClosedError: Tried to write to a closed websocket')
        raise

As we've seen before, self.write_message() sometimes returns None while the connection is closing, and this version treats that as a WebSocketClosedError.
The question is now: why?
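Part of the answer, at least for the lock (my explanation, not from the thread): in this version the with block no longer contains a yield, so the lock is only ever held across synchronous code and can no longer be left locked by a suspended or garbage-collected generator:

import threading

lock = threading.Lock()

def compute():
    return 42

def gen_func():
    with lock:
        value = compute()  # no yield while the lock is held
    yield value  # the suspension point is after the release

gen = gen_func()
next(gen)             # suspended at the yield
print(lock.locked())  # False: the lock was already released
|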
That looks great. |
Expected Behavior
rosbridge_websocket should never stop accepting new connections.

Actual Behavior
We use rosbridge_websocket with our web app (via roslibjs). Every time we start the bridge and our app, everything works as expected. However, after a random amount of time, rosbridge_websocket stops accepting new connections and we get the following error in the Chrome web console:

WebSocket opening handshake timed out

Once the rosbridge is in this state, we've tried to connect from another browser and even from a bare websocket client, and we get no response from the bridge.
If we check the logs of the bridge, we don't see the problem directly, but we do have some interesting data:

But we never get a "Client connected" line, as we usually do.
Please note that our web app automatically reconnects when the connection is lost. That may lead to very fast disconnection/reconnection cycles.

More data that could help you troubleshoot:
The bridge seems to think that there is still a client connected:
The bridge is still registered as listening in the Linux network stack:
Specifications
0.11.0
(we've just updated, but the issue is not new)