When losing a connection and then calling closeNow and open on a DeviceClient a connection thread is leaked #541
From my initial testing, upgrading the Paho MQTT dependency to 1.2.1 resolves this issue. However, when I upgrade our dependency it breaks all MQTT_WS connections, so I can't commit this fix yet. I have opened an issue on Paho to investigate why the upgrade breaks us here.
Upon more testing on the latest version (1.17.4), it seems that MQTT threads are only leaked if you disconnect while the device is still in the DISCONNECTED_RETRYING state. If you allow your retry policy to end naturally, or if you set the retry policy to never retry, no MQTT threads are leaked after you call closeNow(). For what it's worth, AMQP does not appear to leak threads regardless of what state the SDK is in when you call closeNow().
Hello @timtay-microsoft, I did try setting the retry policy to never retry using the NoRetry class, but I always receive a DISCONNECTED_RETRYING status callback through the IotHubConnectionStatusChangeCallback. This is what I added after creating the DeviceClient object:
DeviceClient client = DeviceClient.createFromSecurityProvider(
But every time there is a connection loss I still see DISCONNECTED_RETRYING. In this case, though, I do not see any lingering MQTT threads from the previous connection.
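The snippet above was truncated in the thread; a minimal sketch of the setup being described, with a placeholder hub URI and device id, and a `securityProvider` assumed to have been created earlier:

```java
import com.microsoft.azure.sdk.iot.device.DeviceClient;
import com.microsoft.azure.sdk.iot.device.IotHubClientProtocol;
import com.microsoft.azure.sdk.iot.device.transport.NoRetry;

DeviceClient client = DeviceClient.createFromSecurityProvider(
        "my-hub.azure-devices.net", // placeholder hub URI
        "my-device-id",             // placeholder device id
        securityProvider,           // SecurityProvider created earlier
        IotHubClientProtocol.MQTT);

// Never retry: on connection loss the SDK should go straight to DISCONNECTED.
client.setRetryPolicy(new NoRetry());
client.open();
```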
Hello @timtay-microsoft, on a side note, do you know if the Azure SDK distinguishes between connections over LTE and Wi-Fi?
Even though you get the callback saying DISCONNECTED_RETRYING, the SDK isn't doing any retry logic. Since you have set the retry policy to NoRetry, the SDK should transition immediately from DISCONNECTED_RETRYING to DISCONNECTED without trying to re-establish the connection.
No, the SDK does not distinguish between LTE and Wi-Fi connections. When you say connection limits, you are referring to the limit of active connections to IotHub, correct?
Hello @timtay-microsoft, thanks for the clarification on the above two questions. Yes, I am referring to the maximum number of active connections allowed to the IoT Hub at a time from the service running on the device. Do you think the IoT Hub server is implementing any session persistence that keeps the connection active rather than letting it close completely?
Hello @timtay-microsoft. Through our testing we observed that on connection loss/regain, when we call closeNow, the Azure SDK sometimes calls socket.close() plus socket.shutdownInput(), and sometimes calls only socket.shutdownInput() before trying to connect again to the same server. In the cases where only socket.shutdownInput() is called, we still retain the connection and keep waiting for the client app (i.e. the Azure client) to call close on the socket. The result is another connection consumed by the Azure client. One theory for why the Azure client is not calling close is that it is not blocked on an InputStream.read, so it never gets the IOException for the socket timeout caused by the connection failure, and instead triggers on some other Azure framework event that only calls close on the input stream. Normally sockets expect a socket close to propagate the stream closure and cleanup. Can you help answer why in some cases the Azure client calls close on connection drop and in others it doesn't?
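As a plain-Java illustration of the difference being described, a minimal sketch (the host is a placeholder): shutdownInput() only half-closes the stream, while the socket, and the connection it represents, stays held until close():

```java
import java.net.Socket;

public class ShutdownDemo {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket("example.com", 80); // placeholder host

        // Half-close: the input side reports EOF, but the TCP connection
        // (and its file descriptor) is still held by this process.
        socket.shutdownInput();
        System.out.println("closed? " + socket.isClosed());                  // false
        System.out.println("input shut down? " + socket.isInputShutdown()); // true

        // Full close: actually releases the connection and its resources.
        socket.close();
        System.out.println("closed? " + socket.isClosed());                  // true
    }
}
```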
For HTTP it's a common misconception that a call to close from the client actually closes the connection. Most HTTP servers implement persistent connections, whereby the server controls when the connection is actually closed even if the client calls close on its side. For HTTP we recommend adding the header "Connection: close" to the request to inform the server not to use a persistent connection for that request. Without this option an HTTP server can hold onto a connection for minutes (typically around 5, upwards of 30). In the Azure case the underlying protocol between client and server for the IoT Hub communication is MQTT, and there may be a similar persistent session mechanism at work. There is reason to believe the server holds sessions this long to avoid re-handshaking on every client poll. Can you provide feedback on how the server holds onto a connection, and on when a close from the SDK will actually free the resources? If MQTT operates like HTTP, then on every Wi-Fi/LTE switch you could be holding on to 1-3 connections (DPS, Hub, upload server); it would only take a few switches to consume the maximum connections allowed.
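For the HTTP case described above, a minimal sketch of adding that header with the standard HttpURLConnection API (the URL is a placeholder):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class ConnectionCloseDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/"); // placeholder endpoint
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Ask the server not to keep this connection alive after the response.
        conn.setRequestProperty("Connection", "close");

        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
```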
Also, Anil suggested that we can avoid these open connections altogether by using the same DeviceClient object for every connection, without calling closeNow on every connection loss, and instead relying on the retry mechanism to trigger the CONNECTED status callback. In the test below, the SDK logged lines like:
DeviceIO: Message with messageid 6196d513-96af-41e4-8bb2-dc1dd570aa18 along with callback and callbackcontext is added to the queue, method name is getMethodName
And once I went into an area with good internet connectivity, I saw these messages being sent out. Can you confirm whether the SDK should report a connection loss and send us a DISCONNECTED_RETRYING callback? A sketch of this approach follows.
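A sketch of that approach, assuming `client` is a single long-lived DeviceClient with the default retry policy (payload and callback are illustrative):

```java
import com.microsoft.azure.sdk.iot.device.Message;

// Open once and never call closeNow() on connection loss;
// the retry policy handles reconnection.
client.open();

// While the client is DISCONNECTED_RETRYING, messages are queued
// ("... is added to the queue") and flushed once CONNECTED fires again.
client.sendEventAsync(new Message("telemetry payload"),
        (statusCode, context) -> System.out.println("send result: " + statusCode),
        null);
```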
There was another test that I ran in which I turned off the network adapter while connected to IoT Hub. In this case I did see the Azure SDK detecting the connection loss and sending us a DISCONNECTED_RETRYING. At that point I turned the network adapter on and confirmed that my device was connected to the internet. But I did not see a CONNECTED callback from Azure, and it timed out with a DISCONNECTED callback after 4 minutes, which is the default timeout specified by DEFAULT_OPERATION_TIMEOUT = 4 * 60 * 1000; // 4 minutes. Can you check this on your end and help? Thanks.
The SDK's retry logic should handle this scenario gracefully for you, and re-establish its connection to IotHub. Can you provide the connection status change callback values that you see, including any throwables given by the callback?
I second this recommendation
This may be the expected behavior, depending on how long you believe the device lost connectivity. The SDK will not report DISCONNECTED_RETRYING immediately on pulling the ethernet plug/disabling the router. It will take some time for the underlying transport libraries to report to the SDK that connection was lost. If the SDK successfully regains connectivity to IotHub before the underlying transport libraries decide that connectivity was lost, then you will see the behavior you described above.
Again, this is expected. The SDK may not report DISCONNECTED_RETRYING if the network outage is brief enough.
Here is the test we ran.
Scenario: turn the network adapter on/off to verify that the device connects to IoT Hub while retrying.
Observe the following being logged by the SDK, indicating a connection loss:
IOT HUB CONNECTION STATUS CHANGED TO : DISCONNECTED
In the connection status callback, there is a field that contains the relevant throwable that caused the connection status to change:
public void execute(IotHubConnectionStatus status, IotHubConnectionStatusChangeReason statusChangeReason, Throwable throwable, Object callbackContext)
What stack traces do you see from that throwable object when you run your test above?
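A minimal sketch of such a callback that surfaces the throwable, assuming the v1 Java device SDK package layout:

```java
import com.microsoft.azure.sdk.iot.device.IotHubConnectionStatusChangeCallback;
import com.microsoft.azure.sdk.iot.device.transport.IotHubConnectionStatus;
import com.microsoft.azure.sdk.iot.device.transport.IotHubConnectionStatusChangeReason;

public class LoggingStatusCallback implements IotHubConnectionStatusChangeCallback {
    @Override
    public void execute(IotHubConnectionStatus status,
                        IotHubConnectionStatusChangeReason statusChangeReason,
                        Throwable throwable, Object callbackContext) {
        System.out.println("IOT HUB CONNECTION STATUS CHANGED TO : " + status
                + " (reason: " + statusChangeReason + ")");
        if (throwable != null) {
            throwable.printStackTrace(); // the stack trace being asked about
        }
    }
}
```

Registered once on the client, e.g. `client.registerConnectionStatusChangeCallback(new LoggingStatusCallback(), null);`.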
I see this. Once I get the DISCONNECTED_RETRYING, I turn the underlying network adapter on.
06-26 14:17:03.948 IOT HUB CONNECTION STATUS CHANGED TO : CONNECTED
06-26 14:17:13.260 IOT HUB CONNECTION STATUS CHANGED TO : DISCONNECTED_RETRYING
06-26 14:18:17.196 IOT HUB CONNECTION STATUS CHANGED TO : DISCONNECTED
From the timestamps on the above logs, it seems that the device fails to reconnect, but it only tries for about a minute. While typically I would expect the device to reconnect in under a minute, could you try extending the device operation timeout to a long period of time?
client.setOperationTimeout(30 * 60 * 1000); // 30 minute timeout
I'm curious whether the timeout is firing sooner than expected, or whether the reconnection attempts will always fail for you. Also, just for comparison, does AMQPS or AMQPS_WS reconnect more reliably than MQTT for you?
A member of the MQTT Paho team is actively looking into the bug in Paho 1.2.1 that was preventing us from upgrading our dependency. We plan on updating and releasing our SDK as soon as that new Paho version becomes available. That new version of Paho should resolve this issue.
As of this release, the device client and deps libraries will now use the latest Paho version, 1.2.2, which fixes this issue. As such, I'm closing this issue.
@hurtmil, @soumithb, thank you for your contribution to our open-sourced project! Please help us improve by filling out this 2-minute customer satisfaction survey.
OS Version
Android 8.1.0
Azure SDK Version
Description of the issue:
While we are connected to the IoT Hub via the DeviceClient, we turn LTE/Wi-Fi off on the connection we are currently using. After we detect the adapter is off, we call closeNow on the DeviceClient and wait for the callback telling us we are in the DISCONNECTED state (rather than the DISCONNECTED_RETRYING state). We then call open on the DeviceClient again once we detect an adapter is available (LTE or Wi-Fi). After going through this disconnect/reconnect cycle, we noticed that a thread appears to be leaked under the name "MQTT Con:". Here is a sample of how it looks after a few disconnect/reconnect cycles (a sketch of the cycle follows the listing):
system 2293 4318 633 4365148 64608 0 0 S MQTT Con: 437tv
system 2293 4579 633 4365148 64608 0 0 S MQTT Con: 437tv
system 2293 4672 633 4365148 64608 0 0 S MQTT Con: 437tv
system 2293 4761 633 4365148 64608 0 0 S MQTT Con: 437tv
system 2293 4831 633 4365148 64608 0 0 S MQTT Call: 437t
system 2293 4836 633 4365148 64608 0 0 S MQTT Rec: 437tv
system 2293 4837 633 4365148 64608 0 0 S MQTT Snd: 437tv
system 2293 4840 633 4365148 64608 0 0 S MQTT Ping: 437t
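For reference, a minimal sketch of the cycle described above, assuming `client` is an already-opened DeviceClient (error handling and adapter detection elided):

```java
// On detecting the network adapter going down:
client.closeNow(); // wait here for the DISCONNECTED status callback

// Later, once an adapter (LTE or Wi-Fi) is available again:
client.open(); // each such cycle appears to leave behind an extra "MQTT Con:" thread
```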