Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When losing a connection and then calling closeNow and open on a DeviceClient a connection thread is leaked #541

Closed
hurtmil opened this issue Jun 10, 2019 · 20 comments
Assignees
Labels
bug fix checked in Fix checked into main or preview, but not yet released. Underlying Library Issue Issue is a result of a known issue with a 3rd party library that we use

Comments

@hurtmil
Copy link

hurtmil commented Jun 10, 2019

OS Version

Android 8.1.0

Azure SDK Version

implementation 'com.microsoft.azure.sdk.iot:iot-device-client:1.17.3'
implementation 'com.microsoft.azure.sdk.iot.provisioning:provisioning-device-client:1.7.1'
implementation 'com.microsoft.azure.sdk.iot.provisioning.security:x509-provider:1.1.3'

Description of the issue:

While we are connected to the IotHub via the DeviceClient, we turn lte/wifi off on the connection we are currently using, after we detect the adapter is off we proceed to call closeNow on the DeviceClient and wait for the callback that we are in the DISCONNECTED state instead of the DISCONNECTED_RETRYING state, after that we then try to call open on DeviceClient again when we detect we have an adapter available (lte or wifi). After we go through this disconnect and reconnect cycle we noticed that a thread seems to be leaked under the name: "MQTT Con:" Here is a sample of how it looks after a few disconnect/reconnect cycles:
system 2293 4318 633 4365148 64608 0 0 S MQTT Con: 437tv
system 2293 4579 633 4365148 64608 0 0 S MQTT Con: 437tv
system 2293 4672 633 4365148 64608 0 0 S MQTT Con: 437tv
system 2293 4761 633 4365148 64608 0 0 S MQTT Con: 437tv
system 2293 4831 633 4365148 64608 0 0 S MQTT Call: 437t
system 2293 4836 633 4365148 64608 0 0 S MQTT Rec: 437tv
system 2293 4837 633 4365148 64608 0 0 S MQTT Snd: 437tv
system 2293 4840 633 4365148 64608 0 0 S MQTT Ping: 437t

@anhashia anhashia added the bug label Jun 17, 2019
@timtay-microsoft
Copy link
Member

From my initial testing, upgrading the paho MQTT dependency to 1.2.1 resolves this issue. However, when I upgrade our dependency it breaks all MQTT_WS connections, so I can't commit this fix yet. I have opened an issue on paho to investigate why upgrading is breaking us here

eclipse-paho/paho.mqtt.java#679

@timtay-microsoft
Copy link
Member

upon more testing on the latest version (1.17.4), it seems that MQTT threads are only leaked if you disconnect while the device is still within the DISCONNECTED_RETRYING state. If you allow your retry policy to end naturally, or if you set the retry policy to never retry, no MQTT threads are leaked after you call closeNow().

AMQP seems to not leak threads regardless of what state the SDK is in when you call closeNow(), for what it's worth.

@timtay-microsoft timtay-microsoft added the Underlying Library Issue Issue is a result of a known issue with a 3rd party library that we use label Jun 18, 2019
@soumithb
Copy link

Hello @timtay-microsoft , i did try setting the retry policy to never retry using the NoRetry class. But i always receive a DISCONNECTED_RETRYING status callback through the IotHubConnectionStatusChangeCallback.

This is what i added after creating the deviceclient object

DeviceClient client = DeviceClient.createFromSecurityProvider(
iotHubUri, deviceId, securityProvider, protocol);
NoRetry retryPolicy = new NoRetry();
client.setRetryPolicy(retryPolicy);

But everytime there is a connection loss i still see the DISCONNECTED_RETRYING .

But in this case i do not see any lingering MQTT threads from the previous connection.

@soumithb
Copy link

Hello @timtay-microsoft , on a side note , do you know if the azure sdk distinguishes between connections over LTE and Wifi ?
I ask this because, we use an internal library to keep count of the number of connections.
From our testing we observed us hitting the connection limit when we connect and disconnect over LTE (even though we have no active connections) , but we don't observe the same over WIFI.

@timtay-microsoft
Copy link
Member

Even though you get the callback saying DISCONNECTED_RETRYING, the SDK isn't doing any retry logic. Since you have set the retry policy to NoRetry, the SDK should transition immediately from DISCONNECTED_RETRYING to DISCONNECTED without trying to re-establish the connection.

@timtay-microsoft
Copy link
Member

Hello @timtay-microsoft , on a side note , do you know if the azure sdk distinguishes between connections over LTE and Wifi ?
I ask this because, we use an internal library to keep count of the number of connections.
From our testing we observed us hitting the connection limit when we connect and disconnect over LTE (even though we have no active connections) , but we don't observe the same over WIFI.

No, the SDK does not distinguish between LTE or Wifi connections. When you say connection limits, you are referring to the limit of active connections to IotHub, correct?

@soumithb
Copy link

Hello @timtay-microsoft . Thanks for the clarification on the above two questions.

Yes, i am referring the maximum active connections allowed to the IOT at at time from the service running on the device.
Regarding the question about distinguishing between LTE and wifi , I ask this because of the same reason i mentioned above. Over LTE whenvert we call a deviceClient.closeNow() , it doesn't appear that we are actually closing the connection since when we run a test where we keep calling the closeNow() over lte we hit the connection limit. But do not see the same happening over wifi. Are there any logs that you could help point to, which can help us identify, if the connections are in fact being closed.

Do you think the IOT server is implementing any Session Persistence, which is keeping the connection still active from being closed completely?

@soumithb
Copy link

soumithb commented Jun 24, 2019

Hello @timtay-microsoft. Through our testing we observed that on connection loss/regain , when we call closeNow, the azure sdk sometimes calls socket.close + sock.shutdowninput and sometimes just calls socket.shutdowninput before trying to connect again to the same server. For the cases that only socket.shutdowninput is called we still retain the connection and keep waiting for the client app (i.e. azure client) to call close on the socket. The result is another used connection by the azure client.

One theory is that the azure client is not calling close is that its not waiting (blocked) on an inputstream.read and does not get the IOException for the sockettimeoutevent caused by the connection failure and instead triggers on some other azure framework event that only calls close on the inputstream. Normally sockets expect a socket close to propagate the stream closure and cleanup.

Can you help answer why in some cases the azure client calls close on connection drop and on others it doesn't ?

@soumithb
Copy link

For HTTP it's a common misconception that a call to close from the client actually closes the connection. Most HTTP servers implement persistent connections whereby the server controls when the connection is actually closed even if the client calls close on itself.
https://en.wikipedia.org/wiki/HTTP_persistent_connection

For HTTP we recommend that HTTP header "Connection: close" be added to the HTTP request to inform the server to not use persistent connections for a given connection. Without this option an HTTP server can hold onto a connection for minutes (typically around 5mins upwards of 30mins).

For the Azure case the underlying protocol between the client/server for the Iot Hub communication is MQTT. There may be a similar persistent session mechanism at work. There is reason to believe the server is holding the sessions for this long to avoid re-handshaking on every client poll.

Can you provide feedback on how the server holds onto connection and when a close the SDK will actually free the resources?

If MQTT operates like HTTP then on every WIFI/LTE switch you could be holding on to 1-3 connections (DPS, Hub, Upload server). It would only take a few switches to consume the max connections allowed.

@soumithb
Copy link

Also, Anil suggested that we can avoid these open connections altogether by using the same device client object for every connection without calling the closeNow on every connection loss and instead rely on the retry mechanism to trigger the CONNECTED status callback.

In the below case
So i ran a simple test to not call the closeNow on a connection loss scenario (going into an area with no internet connectivity). Instead i was waiting for the sdk to reply back with a CONNECTED callback. I did not see the sdk reporting us of a connection loss and sending us a DISCONNECTED_RETRYING callback. At this point i tried sending messages to the IOTserver and noticed the azure sdk queuing up these messages instead.

DeviceIO: Message with messageid 6196d513-96af-41e4-8bb2-dc1dd570aa18 along with callback and callbackcontext is added to the queue, method name is getMethodName.

And once i went into an area with good internet connectivity i saw these messages being sent out.

Can you confirm if the sdk should report a connection loss and send us a Disconnected Retrying Callback ?

@soumithb
Copy link

soumithb commented Jun 24, 2019

There was another test that I ran in which i turned off the Network adapter while connected to IOT. In this case i did see the azure sdk detecting this connection loss and sending us a DISCONNECTED_RETRYING. And at this point i turned the network adapter on and confirmed that my device was connected to the internet. But i did not see a CONNECTED callback from azure and it timed out with a DISCONNECTED call back after 4 minutes which is the default timeout specified DEFAULT_OPERATION_TIMEOUT = 4 * 60 * 1000; //4 minutes

Can you check this on your end and help. Thanks .

@timtay-microsoft
Copy link
Member

There was another test that I ran in which i turned off the Network adapter while connected to IOT. In this case i did see the azure sdk detecting this connection loss and sending us a DISCONNECTED_RETRYING. And at this point i turned the network adapter on and confirmed that my device was connected to the internet. But i did not see a CONNECTED callback from azure and it timed out with a DISCONNECTED call back after 4 minutes which is the default timeout specified DEFAULT_OPERATION_TIMEOUT = 4 * 60 * 1000; //4 minutes

Can you check this on your end and help. Thanks .

The SDK's retry logic should handle this scenario gracefully for you, and re-establish its connection to IotHub. Can you provide the connection status change callback values that you see, including any throwables given by the callback?

@timtay-microsoft
Copy link
Member

Also, Anil suggested that we can avoid these open connections altogether by using the same device client object for every connection without calling the closeNow on every connection loss and instead rely on the retry mechanism to trigger the CONNECTED status callback.

I second this recommendation

In the below case
So i ran a simple test to not call the closeNow on a connection loss scenario (going into an area with no internet connectivity). Instead i was waiting for the sdk to reply back with a CONNECTED callback. I did not see the sdk reporting us of a connection loss and sending us a DISCONNECTED_RETRYING callback. At this point i tried sending messages to the IOTserver and noticed the azure sdk queuing up these messages instead.

This may be the expected behavior, depending on how long you believe the device lost connectivity. The SDK will not report DISCONNECTED_RETRYING immediately on pulling the ethernet plug/disabling the router. It will take some time for the underlying transport libraries to report to the SDK that connection was lost. If the SDK successfully regains connectivity to IotHub before the underlying transport libraries decide that connectivity was lost, then you will see the behavior you described above.

And once i went into an area with good internet connectivity i saw these messages being sent out.

Again, this is expected. The SDK may not report DISCONNECTED_RETRYING if the network outage is brief enough.

@soumithb
Copy link

There was another test that I ran in which i turned off the Network adapter while connected to IOT. In this case i did see the azure sdk detecting this connection loss and sending us a DISCONNECTED_RETRYING. And at this point i turned the network adapter on and confirmed that my device was connected to the internet. But i did not see a CONNECTED callback from azure and it timed out with a DISCONNECTED call back after 4 minutes which is the default timeout specified DEFAULT_OPERATION_TIMEOUT = 4 * 60 * 1000; //4 minutes
Can you check this on your end and help. Thanks .

The SDK's retry logic should handle this scenario gracefully for you, and re-establish its connection to IotHub. Can you provide the connection status change callback values that you see, including any throwables given by the callback?

Here is the test we Ran.

Scenario : . Turn Network Adapter On/OFF to verify that the device connects to iot hub while retrying

  1. First let device connect to IOT
  2. After verifying iot connection turn network adapter off and verify we get the DISCONNECTED_RETRYING callback and azure sdk reporting a connection loss .

Observe the following being logged by the sdk, indicating a connection loss
IotHubTransport( 2280): The messages in progress are buffered to be sent again due to a connection loss, method name is getMethodName
Verify the callback received from the sdk.
IOT HUB CONNECTION STATUS CHANGED TO : DISCONNECTED_RETRYING

  1. Once the DISCONNECTED_RETRYING is received , turn the network adapter on and verify internet connectivity.
  2. Observe here that the retry mechanism times-out after 4 minutes (default operation timeout) and we receive a DISCONNECTED callback.

IOT HUB CONNECTION STATUS CHANGED TO : DISCONNECTED

@timtay-microsoft
Copy link
Member

In the connection status callback, there is a field that contains the relevant throwable that caused the connection status to change

public void execute(IotHubConnectionStatus status, IotHubConnectionStatusChangeReason statusChangeReason, Throwable throwable, Object callbackContext)

What stacktraces do you see from that throwable object when you run your above test?

@soumithb
Copy link

soumithb commented Jun 26, 2019

In the connection status callback, there is a field that contains the relevant throwable that caused the connection status to change

public void execute(IotHubConnectionStatus status, IotHubConnectionStatusChangeReason statusChangeReason, Throwable throwable, Object callbackContext)

What stacktraces do you see from that throwable object when you run your above test?

I see this . once i get the DISCONNECTED_RETRYING , i turn the underlying network adapter on.

06-26 14:17:03.948 IOT HUB CONNECTION STATUS CHANGED TO : CONNECTED
06-26 14:17:03.948 IOT HUB CONNECTION STATUS CHANGED due to : CONNECTION_OK
06-26 14:17:03.948 IOT HUB CONNECTION STATUS throwable : null

06-26 14:17:13.260 IOT HUB CONNECTION STATUS CHANGED TO : DISCONNECTED_RETRYING
06-26 14:17:13.260 IOT HUB CONNECTION STATUS CHANGED due to : NO_NETWORK
06-26 14:17:13.260 IOT HUB CONNECTION STATUS throwable : com.microsoft.azure.sdk.iot.device.exceptions.ProtocolException: Mqtt connection lost

06-26 14:18:17.196 IOT HUB CONNECTION STATUS CHANGED TO : DISCONNECTED
06-26 14:18:17.197 IOT HUB CONNECTION STATUS CHANGED due to : RETRY_EXPIRED
06-26 14:18:17.197 IOT HUB CONNECTION STATUS throwable : com.microsoft.azure.sdk.iot.device.exceptions.DeviceOperationTimeoutException: Device operation for reconnection timed out

@timtay-microsoft
Copy link
Member

From the timestamps on the above logs, it seems that the device fails to reconnect, but it only tries for a minute. While typically I would expect the device to reconnect in under a minute, could you try extending the device operation timeout to a long period of time?

client.setOperationTimeout(30*60*1000); //30 minute timeout

I'm curious if the timeout is taking longer than expected, or if the reconnection attempts will always fail for you.

Also, just for comparison, does AMQPS or AMQPS_WS reconnect more reliably than MQTT for you?

@timtay-microsoft
Copy link
Member

A member of the MQTT Paho team is actively looking into the bug in Paho 1.2.1 that was preventing us from upgrading our dependency. We plan on updating and releasing our SDK as soon as that new Paho version becomes available. That new version of Paho should resolve this issue.

@timtay-microsoft timtay-microsoft added the fix checked in Fix checked into main or preview, but not yet released. label Sep 30, 2019
@timtay-microsoft
Copy link
Member

As of this release, the Device client and deps libraries will now use the latest Paho version 1.2.2 which fixes this issue. As such, I'm closing this issue

@az-iot-builder-01
Copy link
Collaborator

@hurtmil, @soumithb, thank you for your contribution to our open-sourced project! Please help us improve by filling out this 2-minute customer satisfaction survey

@timtay-microsoft timtay-microsoft self-assigned this Nov 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug fix checked in Fix checked into main or preview, but not yet released. Underlying Library Issue Issue is a result of a known issue with a 3rd party library that we use
Projects
None yet
Development

No branches or pull requests

5 participants