[openhabcloud] Reconnection Fixes #14251
Conversation
Signed-off-by: Dan Cunningham <dan@digitaldan.com>
@ssalonen there seems to be a pretty bad issue with our cloud binding; I would appreciate some eyes on this one. Unfortunately I'm leaving town tomorrow, but I will try to look at this when I can until I'm back mid next week. When debugging connection issues on our cloud service, I noticed our nginx load balancers are being slammed with the following logs hundreds of times per minute (or more; it's very noisy).
If you look you can see that there are a few polling calls, then a websocket attempt, which fails with a 400 error. At first I thought we had an issue with something server side preventing websockets and forcing polling, so I started going down that route. But on my test system I noticed that even when I stopped the cloud binding on my local machine through the karaf console, the binding continued to log connection events!
What I have realized is that the binding is somehow creating orphaned socket.io connections, i.e. a thread leak. I don't think these connections ever fully connect; the session gets rejected. As a test I stopped the binding (using the console) on my home production system, which has been running for a few days, and I can see the same behavior: my log continues to show these ping and pong messages, so I assume this is extremely widespread.

As a start I went through the code looking for how something like this could occur. The first thing I noticed was that the reconnect code schedules a future but does not keep a reference to it for cleanup purposes. This seems like a good place to start, as I could see this connecting a socket after the binding was stopped or restarted (like when saving configs). I'm going to let my test system run for a bit; so far I am not seeing the same behavior after this fix, but I need to do more testing.
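For reference, the shape of the quick fix I'm testing is roughly the following. This is only a sketch: `scheduler`, `socket`, `connect()` and the field names stand in for the real members in CloudClient rather than being the exact code in this PR.

```java
// Sketch only: keep a reference to the scheduled reconnect so it can be
// cancelled when the client shuts down, instead of firing later and opening
// an orphaned socket.io connection. `scheduler`, `socket` and `connect()`
// stand in for the existing CloudClient members.
private @Nullable ScheduledFuture<?> reconnectFuture;

private synchronized void scheduleReconnect(long delayMs) {
    cancelReconnect();
    reconnectFuture = scheduler.schedule(this::connect, delayMs, TimeUnit.MILLISECONDS);
}

private synchronized void cancelReconnect() {
    ScheduledFuture<?> future = reconnectFuture;
    if (future != null) {
        future.cancel(false);
        reconnectFuture = null;
    }
}

public void shutdown() {
    cancelReconnect();
    socket.disconnect();
}
```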
FYI, the …
I will have a look at this PR. In #11153 the reconnect behaviour changed such that the addon retries connecting, also on "errors during connection" (Line 266 in dff0bda). It was later made to use the scheduler: #13421.

Re: HTTP 400 and retry — actually an invalid uuid was one of the scenarios I tested, see #11153 (comment). Probably tuning the time between reconnects would be nice?
I suggest following the thread-safe pattern established in many bindings, e.g. https://github.com/openhab/openhab-addons/blob/main/bundles/org.openhab.binding.webthing/src/main/java/org/openhab/binding/webthing/internal/WebThingHandler.java
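Roughly this shape, i.e. a simplified sketch of that pattern rather than the exact WebThingHandler code (a `scheduler` ScheduledExecutorService and a `connect()` method are assumed):

```java
// Simplified illustration of the pattern, not the exact WebThingHandler code.
// A flag guards against work running after dispose, and the scheduled future
// is swapped atomically so a stale reconnect task can always be cancelled.
private final AtomicBoolean isDisposed = new AtomicBoolean(false);
private final AtomicReference<Optional<ScheduledFuture<?>>> reconnectJob =
        new AtomicReference<>(Optional.empty());

private void scheduleReconnect(long delaySeconds) {
    if (isDisposed.get()) {
        return; // never reconnect once the service has been deactivated
    }
    ScheduledFuture<?> job = scheduler.schedule(() -> {
        if (!isDisposed.get()) {
            connect();
        }
    }, delaySeconds, TimeUnit.SECONDS);
    // cancel whatever was scheduled before, keeping only the latest job
    reconnectJob.getAndSet(Optional.of(job)).ifPresent(previous -> previous.cancel(false));
}

public void dispose() {
    isDisposed.set(true);
    reconnectJob.getAndSet(Optional.empty()).ifPresent(job -> job.cancel(true));
}
```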
Reading that PR again, I noticed you mentioned:
So with the new code (well, newish at this point), we are now retrying indefinitely where before the binding gave up? I'm wondering if this is the source of so many rejections (the logs are full of them). I have noticed this in the node logs as well.
If the server specifically rejects us, worst case we should not be retrying very frequently; ideally we would not try to reconnect at all until the user takes some action, but the binding is not set up for that right now. Long term it's probably a good idea to revisit the whole UUID / secret thing: make it part of the binding configuration, maybe have the server generate the secret and the user enter it in the binding config in the Main UI, at which point we try a connection but do not retry if the server specifically rejects it. (This is all off the top of my head, so I probably need to think about it more.)
Will do, and thanks. I was also concerned about the multi-threaded aspect of this quick fix, but wanted to get something up for discussion.
Also, whatever we do here, I would want to make it part of the 3.4.2 fix release. Let me know what you think about tuning for specific server rejections; I would love to reduce that load if possible.
Yes, agreed that we should not retry if the server explicitly rejects us. That makes sense. Perhaps there is a way to differentiate this from other errors... And indeed: before the PR, any error during the connect phase was "fatal" and no new attempts were made.
Yeah, I don't see that specifically in the cloud code, which is unfortunate. We could add it, i.e. send back a 401 or similar if we can't match credentials. I'm not sure how that gets passed down to the Java socket.io client; I would have to do some testing.
In the worst case, we should be able to look for some string like "unauthorized" in the error ... whatever gets logged down the line.
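As a sketch of what I mean, the check could live in the existing EVENT_ERROR handler. The exact error text and the `scheduleReconnect` / `reconnectBackoffMillis` helpers below are assumptions for illustration, not code that exists today:

```java
.on(Socket.EVENT_ERROR, args -> {
    String message = args.length > 0 ? String.valueOf(args[0]) : "";
    logger.warn("Error connecting to the openHAB Cloud instance: {}", message);
    // Assumption: a credential rejection surfaces as text containing
    // "unauthorized"; anything else is treated as a transient failure.
    if (message.toLowerCase().contains("unauthorized")) {
        logger.error("openHAB Cloud rejected the credentials, not scheduling a reconnect");
        return;
    }
    scheduleReconnect(reconnectBackoffMillis());
})
```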
Hi @ssalonen, I'm back spending some time on this. My latest push incorporates your suggestions. I also added an option to have the library only use websocket connections, which removes the annoying polling requests used to create the session and instead does it all on a single request (and we never actually use polling anyway, afaik). This part is a little tl;dr, but I would appreciate you taking a look since you know this code better than most.

I have another question for you about the socket.io automatic reconnect logic. During an event when my system experienced this problem, my logs show that the custom reconnect logic was never actually used. Socket.io was reconnecting internally, which led to the split brain. I think I may understand why, but would like your opinion since you also worked on this.

Before I post my logs, I'll mention a couple of important things. The socket.io client has reconnects enabled by default, so if the connection dies, it tries to connect again behind the scenes. I believe you put in some nice logging here, so it is much easier to debug the connection lifecycle. The other piece here is the okhttp client: by default it has a read timeout of 10 seconds before throwing an error. This will be important below. So in the logs below you will see that we …
The problem, I believe, is that on the first disconnect socket.io tries reconnecting, but then gives up after 10 seconds (the okhttp read timeout). Unfortunately the server may be very busy and still processing that request, while the client immediately connects again, so our service is processing 2 connections. We get a race condition which can lead to the first reconnection overwriting the DB entry of the second reconnection (which started later, but completed before the first one).
I have a couple of ideas here. First, I think we need to give the okhttp client a larger read timeout, something like:

```java
okHttpBuilder.readTimeout(1, TimeUnit.MINUTES);
```

Second, there are a number of options we can set on the socket.io client around reconnects. For example, I would suggest a much slower reconnect cycle with longer timeouts:

```java
.setReconnection(true)
.setReconnectionAttempts(Integer.MAX_VALUE) // default value Integer.MAX_VALUE
.setReconnectionDelay(5_000)                // default value 1_000
.setReconnectionDelayMax(60_000)            // default value 5_000
.setRandomizationFactor(0.5)                // default value 0.5
.setTimeout(60_000)                         // default value 20_000
```

The other thing I am thinking about (but need to sleep on) is using redis as a network lock: when a new connection comes in, it creates an entry in redis with a TTL on it. Once the connection setup is complete that lock is removed, or if something dies, it expires after a reasonable amount of time. If a second connection comes in, it will see the lock and fail. wdyt?
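To make the lock idea concrete, the protocol would be an atomic SET ... NX EX when a connection starts, a delete on clean completion, and the TTL as the safety net. The snippet below is purely an illustration in Java with Jedis (the cloud service itself is Node), and the key name and TTL are made up:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class ConnectionLock {
    private final Jedis redis = new Jedis("localhost", 6379);

    /**
     * Try to take the per-instance connection lock. SET ... NX EX is atomic, so
     * only one of two racing reconnects gets the lock, and the TTL guarantees
     * the lock disappears even if the holder dies mid-handshake.
     */
    public boolean tryLock(String uuid, String connectionId) {
        String result = redis.set("connlock:" + uuid, connectionId,
                SetParams.setParams().nx().ex(30));
        return "OK".equals(result);
    }

    /** Release the lock once the connection setup has finished (or failed cleanly). */
    public void unlock(String uuid) {
        redis.del("connlock:" + uuid);
    }
}
```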
Creates thread safe reconnection, reduces unnecessary polling on setup, removes unused variables. Signed-off-by: Dan Cunningham <dan@digitaldan.com>
force-pushed from 38ad2d8 to 6467cd7
Signed-off-by: Dan Cunningham <dan@digitaldan.com>
@ssalonen I went ahead and added the backoff settings to the internal socket.io options and increased the timeout values as I mentioned above.
And I have a redis solution I am testing on my system.
redis PR
I will answer the quick points first
Agreed. Those are fairly big numbers (1), but I guess still reasonable: a delay of at most 60s between reconnects.
It will recover certain connection failures which would not otherwise be retried (I can try to find the details from the old PR). Or that is the attempt at least; the intent was to see if this was the culprit, before all the debugging and logging done by the community. Indeed, many failures are covered by socket.io's internal logic, and no separate reconnection logic is applied on the binding side. Actually, I have no records of any user reports where this would have "fixed" a connection. I might even go further and say we remove this logic altogether. Let me still check the old PR, hopefully later today.
This is a great finding and... plausible. I will check the redis PR; initially this sounds like a very robust solution and exactly what we need! Was redis part of the stack already?
(1) see earlier experiments: #12121 (comment)
From my testing, that code gets used when the server specifically calls …
Yes, it's used as the backing session store for express as well as for mongo query caching using Mongoose. Beyond using it as a network connection lock, I think there is a good opportunity to use it for other purposes, like offline tracking, proxy route lookups, etc. I'm right now looking to upgrade our Redis and Mongo servers to the latest versions; there are a few improvements that we could use to make the lock solution better with fewer calls.
Yikes, if I had gone back and read your post again, it probably would have saved me a lot of time, as you clearly pointed out the reconnection settings in socket.io! Apologies for not responding back to that post! Thanks for taking a look!
Ok, sounds promising. I must say I have a hard time reading the socket/engine client code and understanding when that is triggered. I think I have only seen it with first connect errors (the wrong uuid example).
Actually, you may be right. I think this gets called when the server calls `socket.disconnect()`:

```java
.on(Socket.EVENT_DISCONNECT, args -> {
    if (args.length > 0) {
        logger.warn("Socket.IO disconnected: {}", args[0]);
    } else {
        logger.warn("Socket.IO disconnected");
    }
    isConnected = false;
    onDisconnect();
})
```

This means we are no longer connected, but no retry logic is attempted. It's likely this is occasionally the cause of the binding needing to be restarted after server issues?
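If that's right, one option is to schedule our own reconnect when the disconnect was server initiated, something along these lines. This is just a sketch: `scheduleReconnect` / `reconnectBackoffMillis` stand in for whatever backoff helper we end up with, and I'd want to verify the exact reason string the Java client reports.

```java
.on(Socket.EVENT_DISCONNECT, args -> {
    String reason = args.length > 0 ? String.valueOf(args[0]) : "";
    logger.warn("Socket.IO disconnected: {}", reason);
    isConnected = false;
    onDisconnect();
    // Assumption: "io server disconnect" is what the client reports when the
    // server explicitly calls socket.disconnect(); socket.io does not reconnect
    // on its own in that case, so schedule our own attempt.
    if ("io server disconnect".equals(reason)) {
        scheduleReconnect(reconnectBackoffMillis());
    }
})
```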
Hmm, you mean that the server calling socket.disconnect would in some cases result in an error on the client side?
Signed-off-by: Dan Cunningham <dan@digitaldan.com>
Signed-off-by: Dan Cunningham <dan@digitaldan.com>
If the server calls socket.disconnect() …
Ok, I'm good with this PR. I think this should also be part of the next 3.4.x release.
Signed-off-by: Dan Cunningham <dan@digitaldan.com>
Signed-off-by: Dan Cunningham <dan@digitaldan.com>
Looks good. If you like, add a comment referencing the socket.io Java client code.
LGTM, thank you
@jlaur: could you please cherry-pick this PR into the 3.4.x branch?
* [openhabcloud] Possible connection leak
* Creates thread safe reconnection, reduces unnecessary polling on setup, removes unused variables.
* adds the reconnect settings to the internal socket.io options.
* Up the min reconnect time
* Use @ssalonen suggestion for backoff mins and randomness
* Reconnect after server initiated disconnect
* Remove unhelpful comments

Signed-off-by: Dan Cunningham <dan@digitaldan.com>
Done. Can you have a look at these, which I would also consider for 3.4.x:
I just need a second pair of eyes to make the decision, especially for the last one since it's my own PR.
I already answered for the wemo PR ;) Will have a look at the JDBC one.
I agree with you on the JDBC PR.
Thanks!
This pull request has been mentioned on openHAB Community. There might be relevant details there: https://community.openhab.org/t/connection-to-openhab-cloud-fails-for-openhab-3-4-1/147253/1
Signed-off-by: Dan Cunningham dan@digitaldan.com