iot-hub-device-client unable to connect to IoT-Hub when switched from network #548
Comments
To be clear: |
How are you switching networks? Is this done by switching cables, bringing interfaces up and down, or by some other method? |
@BertKleewein Yes, switching cables. |
@BertKleewein Are you able to reproduce this? |
@BertKleewein |
It sure feels like a DNS issue to me, but I'm not 100% confident on that. I'd be curious to see the output from |
I’ll rebuild that setup and let you know. |
Hi, |
@BertKleewein |
Hello, I have a similar issue running version 2.2.0 of the SDK. The "connection refused" error (as in issue #606), followed by the inability to reconnect, is triggered by at least 2 situations:
|
@electrozen - I'm not sure if your issue is related to this one. Are you switching between two networks? If this isn't related to switching between networks, can you open a new issue please? For situation number 1, can you provide more information, or even better, a log? Simply overloading the CPU doesn't seem to be sufficient to cause problems, and I suspect there is something more specific happening. In your log for situation #2, it looks like the SAS credentials that are built from your symmetric key are expiring while the system is asleep. Then, when the code tries to automatically reconnect, the connection is refused because the credentials are old. The client treats "UnauthorizedError" as a condition which requires user intervention, so it stops the automatic reconnection logic. Unfortunately, I don't think calling the 'connect' method to force a reconnect would work either, because the token renewal is based on a timer countdown and not clock time. In other words, when you connect, the client connects using credentials which are valid for about an hour of clock time (give or take). Then the client starts counting down, and after 50 minutes, it renews the credentials. Normally this means that the client renews the connection credentials before they expire. However, when you put the computer to sleep, the countdown pauses but the credentials still expire. If you go to sleep when the token still has 20 minutes before it needs to be renewed, the library assumes it still has 20 minutes of validity after it wakes up. |
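To make the countdown-versus-clock distinction concrete, here is a minimal sketch. It is an illustration only, not the SDK's renewal code: the function names and the two constants are hypothetical, chosen to match the "about an hour" and "50 minutes" figures mentioned in the comment above.

```python
# Hypothetical figures taken from the explanation above.
TOKEN_LIFETIME_S = 60 * 60   # credentials valid for about an hour of clock time
RENEW_AFTER_S = 50 * 60      # countdown fires after 50 minutes of awake time


def renewal_due_by_countdown(awake_seconds: float) -> bool:
    """Countdown view: only time the process spent awake is counted."""
    return awake_seconds >= RENEW_AFTER_S


def token_expired_by_clock(issued_at: float, now: float) -> bool:
    """Clock view: compare wall-clock time against the token's expiry."""
    return now >= issued_at + TOKEN_LIFETIME_S


# Scenario: connect, run for 30 minutes, sleep for 2 hours, wake up.
issued_at = 0.0
awake_seconds = 30 * 60
asleep_seconds = 2 * 60 * 60
now = issued_at + awake_seconds + asleep_seconds

countdown_says_renew = renewal_due_by_countdown(awake_seconds)
clock_says_expired = token_expired_by_clock(issued_at, now)
print(countdown_says_renew, clock_says_expired)
```

In this scenario the countdown still believes 20 minutes remain before renewal, while the wall clock shows the token expired long ago, which is the mismatch described above.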
Thanks. Indeed, the issue does not appear specifically when changing networks. I will open another issue with logs for a CPU overload situation. Overall, my use case involves user devices sleeping, changing networks, hitting high CPU load, and so on. So far I have been unable to get device connections to reliably last more than a couple of days on my test fleet. Based on your explanation, I will have to set up a specific Azure client status check and process restart in my separate watcher service. |
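A watcher-side status check of the kind mentioned above could look roughly like the following. This is a sketch under assumptions, not part of the SDK: `DisconnectWatchdog` and its threshold are hypothetical, and the caller is assumed to supply the connection state and a timestamp on each poll.

```python
from typing import Optional


class DisconnectWatchdog:
    """Tracks how long a client has been disconnected (hypothetical helper)."""

    def __init__(self, max_disconnected_s: float) -> None:
        self.max_disconnected_s = max_disconnected_s
        self._down_since: Optional[float] = None

    def observe(self, connected: bool, now: float) -> bool:
        """Record one poll; return True when a restart looks warranted."""
        if connected:
            # Any successful poll resets the outage timer.
            self._down_since = None
            return False
        if self._down_since is None:
            # First poll of a new outage: remember when it started.
            self._down_since = now
        return now - self._down_since >= self.max_disconnected_s
```

The watcher would call `observe()` periodically and restart the process when it returns True. Given the sleep behaviour discussed above, passing wall-clock time as `now` means time spent asleep counts toward the outage.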
@robbinvandamme, I was finally able to reproduce this and I'm afraid that I can't do anything to fix it. Before I go deeper, I need to explain that it takes our MQTT transport library (Paho) about 2 minutes to detect a broken connection on Linux. This is because the broken connection doesn't cause the socket to fail on Linux like it does on Windows. Instead, Paho has to detect the break by sending a This means that you can unplug your network cable for up to 2 minutes without the "dropped connection" code being executed. After those 2 minutes, it closes the socket and tries to reopen it. Back to your bug, it looks like an OS component, maybe Network Manager, can get into a broken state where it doesn't recognize when Ethernet cables are unplugged. Once I had reproduced this bug the first time, I could reproduce it every time. I assume it won't reproduce once I reboot. I'm not sure exactly how it got into a broken state, but I kept repeating your steps with different timings (sometimes < 2 minutes, sometimes > 2 minutes). After about 10 or 15 attempts, it reproduced. When broken:
If I watch syslog (
I see this message 3 times, for 3 tries, and then the connection starts working. I assume Network Manager does something to repair the connection but I didn't dig any deeper. I was able to find discussions on similar problems by searching for |
@BertKleewein You speak about an OS issue? Network Manager going into a broken state, for example. Switching networks did not break my internet connection. |
I think we had the same issue in one of our C++ applications. Maybe it has something to do with this...? |
@BertKleewein if needed, I'm willing to prepare a test-setup and have a video call. |
@robbinvandamme - it sounds like the |
@robbinvandamme - I'm sorry, I am unable to do anything to help you with this issue. Almost everything about this bug points to a problem with the underlying network stack, and the workaround calling All of the reading and experimenting that I've done points to the conclusion that the DNS resolver landscape is a mess. Most recently, it's this article ( https://tailscale.com/blog/sisyphean-dns-client-linux/ ) and a conversation that goes along with it ( https://news.ycombinator.com/item?id=26821298 ). This article talks about the challenging task of getting a single resolver configuration correct. Trying to come up with a solution that fixes a bug in a particular configuration without affecting any other configurations is an impossible task. You have been so incredibly patient with us and with me on this issue. You have endured my slow response. You have offered your time and assistance. I appreciate all of this more than I can say. I only wish I could help more (or at all), and I hope this doesn't cause you more pain. (For what it's worth, I wonder if the reason that |
@BertKleewein I'm using azure-iot-device==2.9.0. What happens:
I don't have many logs for the moment. I still think this issue has something to do with the resolver not getting updated in the application/SDK. Something I don't get is: why do the REST calls to other servers recover, but the iot-hub-device-sdk does not? Kind regards, Robbin |
@BertKleewein I'll come back with the results :-). |
@BertKleewein When the connection drops due to a network switch, I check the connection state and, if it's disconnected, I call __res_init() again. That did the trick, but I'm not convinced the user of the SDK should have to keep the resolver/DNS settings up to date. Could you give me your opinion on that? I got hold of __res_init() like this, and I call it when we are disconnected, right before connect()
I think this will also fix the scenario where the router and the application are booted together and the router is not yet fully up and running at the time we call our first connect(). Kind regards, Robbin |
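A minimal sketch of one way to reach `__res_init()` from Python and call it before reconnecting, not necessarily the code used in the comment above. It assumes a glibc-based Linux system; `refresh_resolver` and `reconnect` are hypothetical helper names, and the client is assumed to expose a `connected` property and a `connect()` method.

```python
import ctypes


def refresh_resolver() -> bool:
    """Ask glibc to re-read its resolver configuration; True on success."""
    try:
        # CDLL(None) exposes symbols already loaded into this process.
        libc = ctypes.CDLL(None)
        # getattr with a string avoids Python's double-underscore name mangling.
        res_init = getattr(libc, "__res_init")
    except (OSError, AttributeError, TypeError):
        # Not glibc, or the symbol is not exported on this platform.
        return False
    return res_init() == 0


def reconnect(client) -> bool:
    """If the client is disconnected, refresh DNS settings and connect again.

    Returns True when a connect() attempt was made.
    """
    if client.connected:
        return False
    refresh_resolver()
    client.connect()
    return True
```

On platforms where the symbol is missing, `refresh_resolver()` returns False and the reconnect proceeds without touching the resolver, so the helper degrades to a plain `connect()` call.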
@BertKleewein do you know how they handled this in the c-sdk? |
Context
Description of the issue
When switching networks while the application is running and connected, the connection to IoT-Hub fails on the new network.
Steps to reproduce:
Code sample exhibiting the issue
No code sample needed; just make sure iot-hub-device-client.connect() is called while on network1.
AB#7366704