Occasional hang on AWSIoTMQTTClient.connect() #197

Closed
samvrlewis opened this issue Apr 10, 2019 · 3 comments

Comments

@samvrlewis

I'm occasionally seeing AWSIoTMQTTClient.connect() hang indefinitely. It seems to be the same issue as reported in #40, but no proper resolution was found there.

Logs when this happens:

Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,819 - AWSIoTPythonSDK.core.protocol.internal.clients - DEBUG - Initializing MQTT layer...
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,824 - AWSIoTPythonSDK.core.protocol.internal.clients - DEBUG - Registering internal event callbacks to MQTT layer...
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,825 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - MqttCore initialized
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,825 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Client id: 020000035
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,826 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Protocol version: MQTTv3.1.1
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,827 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Authentication type: TLSv1.2 certificate based Mutual Auth.
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,828 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Configuring endpoint...
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,828 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Configuring certificates...
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,830 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Configuring offline requests queueing: max queue size: 0
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,834 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Configuring offline requests queue draining interval: 0.500000 sec
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,836 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Configuring connect/disconnect time out: 10.000000 sec
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,837 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Configuring MQTT operation time out: 30.000000 sec
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,838 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Performing sync connect...
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,839 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Performing async connect...
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,839 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Keep-alive: 600.000000 sec
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,846 - AWSIoTPythonSDK.core.protocol.internal.workers - DEBUG - Event consuming thread started
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,847 - AWSIoTPythonSDK.core.protocol.mqtt_core - DEBUG - Passing in general notification callbacks to internal client...
Mar 28 19:29:35 my_application.py[1937]: 2019-03-28 19:29:35,848 - AWSIoTPythonSDK.core.protocol.internal.clients - DEBUG - Filling in fixed event callbacks: CONNACK, DISCONNECT, MESSAGE

Comparing this to the logs when everything works, it looks as though it's hanging before the Starting network I/O thread... log message is printed from clients.py:

self._logger.debug("Starting network I/O thread...")

Which leads me to believe it's probably hanging on one of the Lock.acquire calls in either reconnect:

or connect_async:
def connect_async(self, host, port=1883, keepalive=60, bind_address=""):

It's a frustrating issue as it's difficult to detect when it has occurred. If there are contention issues for the Locks, I'd much rather the SDK throw an exception than hang forever so that my application can still recover.
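
To illustrate the "throw an exception rather than hang" behaviour asked for above, here is a minimal sketch of a lock acquisition with a timeout. It assumes Python 3.2 or later, where `threading.Lock.acquire` accepts a `timeout` argument. `LockTimeoutError` and `acquire_or_raise` are hypothetical names made up for this example; they are not part of the SDK, and nothing here claims the SDK's locks can be wrapped this way without changes to the library.

```python
import threading
from contextlib import contextmanager


class LockTimeoutError(RuntimeError):
    """Hypothetical exception: the lock was not obtained in time."""


@contextmanager
def acquire_or_raise(lock, timeout_sec):
    # acquire(timeout=...) returns False on timeout instead of blocking forever.
    if not lock.acquire(timeout=timeout_sec):
        raise LockTimeoutError("lock not acquired within %.2f s" % timeout_sec)
    try:
        yield
    finally:
        # Always release, even if the guarded block raises.
        lock.release()


# Usage: the caller gets an exception it can handle instead of a silent hang.
mutex = threading.Lock()
with acquire_or_raise(mutex, timeout_sec=0.1):
    pass  # critical section
```

With something like this, contention would surface as an exception the application can catch and recover from.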

Frustratingly, I haven't found a way to try to replicate this yet.
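
Since the hang can't be reproduced on demand, one option is to capture where every thread is stuck the next time it happens in the field. The sketch below uses the standard library's `faulthandler` module for that. `call_with_hang_dump` and its parameters are hypothetical names for this example, and the call being watched is not interrupted; the only effect is a stack dump once the deadline passes.

```python
import faulthandler
import sys


def call_with_hang_dump(blocking_call, dump_after_sec=60, file=None):
    """Run blocking_call(); if it has not returned after dump_after_sec,
    write the stack of every thread to `file` (stderr by default)."""
    target = file if file is not None else sys.stderr
    # Arms a watchdog that dumps all thread tracebacks after the timeout.
    faulthandler.dump_traceback_later(dump_after_sec, exit=False, file=target)
    try:
        return blocking_call()
    finally:
        # Disarm the watchdog if the call returned (or raised) in time.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the connect call, for example `call_with_hang_dump(client.connect, 60)` where `client` is your own client object, should show in the dump whether the main thread really is blocked in a `Lock.acquire` and which one.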

Appreciate any help or insight!

@github-actions

Greetings! Sorry to say but this is a very old issue that is probably not getting as much attention as it deserves. We encourage you to check if this is still an issue in the latest release and if you find that this is still a problem, please feel free to open a new one.

@github-actions bot added the closing-soon and closed-for-staleness labels and removed the closing-soon label on May 13, 2020.
@jackhamburger

@samvrlewis Did you ever resolve this? I am about to put a timeout around the call and retry on an interval slightly longer than self._operation_timeout_sec. If this is a lock-based issue, as you suggest, I'm not certain the retry will ever succeed.
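
The timeout-and-retry workaround described above could look roughly like the sketch below. `call_with_timeout` and `connect_with_retry` are hypothetical helpers, and `connect_fn` stands in for whatever callable performs the connect. Python cannot kill a blocked thread, so a timed-out attempt is simply abandoned in a daemon thread. If that thread is stuck on a lock that is never released, later attempts on the same client object may well block on the same lock, which is the concern raised in the comment.

```python
import threading


def call_with_timeout(fn, timeout_sec):
    """Run fn() in a daemon thread. Returns (finished, value)."""
    outcome = {}

    def runner():
        try:
            outcome["value"] = fn()
        except Exception as exc:  # hand the error back to the caller
            outcome["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_sec)
    if worker.is_alive():
        return False, None  # still blocked; the thread is abandoned
    if "error" in outcome:
        raise outcome["error"]
    return True, outcome.get("value")


def connect_with_retry(connect_fn, timeout_sec, attempts=3):
    """Try connect_fn up to `attempts` times, each bounded by timeout_sec."""
    for _ in range(attempts):
        finished, value = call_with_timeout(connect_fn, timeout_sec)
        if finished:
            return value
    raise TimeoutError("connect did not finish in %d attempts" % attempts)
```

If retries on the same object keep timing out, restarting the process is the only recovery that is certain to clear a leaked lock.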

Additionally, I think you may be right as I observe this failure non-deterministically after a publish failure. Deep in the call stack of a publish there is the acquisition of the '_out_pack_mutex' around:

I think if the timeout occurs with this acquired, we would see the hanging. I am not sure what the probability of that is though.
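
The suspected mechanism can be shown in isolation. This is only an illustration of how a lock leaks when a code path exits between `acquire()` and `release()`; whether the SDK actually does this around `_out_pack_mutex` is not confirmed anywhere in this thread. All function names are hypothetical, and `lock.acquire(timeout=...)` assumes Python 3.2 or later.

```python
import threading


def leaky_publish(lock, fail):
    lock.acquire()
    if fail:
        # Early exit with the lock still held: release() is never reached.
        raise TimeoutError("simulated publish timeout")
    lock.release()


def safe_publish(lock, fail):
    # The with-statement releases the lock on every exit path.
    with lock:
        if fail:
            raise TimeoutError("simulated publish timeout")


def later_connect(lock, timeout_sec=0.1):
    """Return True if a later caller can still take the lock."""
    got = lock.acquire(timeout=timeout_sec)
    if got:
        lock.release()
    return got
```

After a failed `leaky_publish`, `later_connect` returns False, and a plain `lock.acquire()` without a timeout would block forever, which would look exactly like the hang in the logs. After a failed `safe_publish` the lock is free again.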

It seems like the fix would require a fairly large refactor of how mutexes are handled, or an additional wrapper on top of this internal to the lib.

I'll continue looking for a workaround.

@samvrlewis
Author

@jackhamburger when I came across this issue I think AWS was in the process of writing the v2 python version (https://github.com/aws/aws-iot-device-sdk-python-v2) which looks like it's ready for general use now. Maybe it's worth trying that library instead?

My "solution" here at the time was to migrate to using Golang (with a non-AWS MQTT library) instead, which did work for my use case but potentially isn't very helpful for you.. sorry!
