
[openhabcloud] Reconnection Fixes #14251

Merged
merged 7 commits on Jan 29, 2023

Conversation

digitaldan
Contributor

Signed-off-by: Dan Cunningham <dan@digitaldan.com>
@digitaldan digitaldan added the work in progress A PR that is not yet ready to be merged label Jan 21, 2023
@digitaldan digitaldan requested a review from ssalonen January 21, 2023 04:58
@digitaldan
Contributor Author

digitaldan commented Jan 21, 2023

@ssalonen there seems to be a pretty bad issue with our cloud binding. Would appreciate some eyes on this one. Unfortunately I'm leaving town tomorrow, but will try to look at this when I can until I'm back mid next week.

When debugging connection issues on our cloud service, I noticed our nginx load balancers are being slammed with the following logs hundreds of times per minute (or more; it's very noisy).

1.2.3.4 - - [21/Jan/2023:04:53:02 +0000] "GET /socket.io/?EIO=3&transport=polling HTTP/1.1" 200 99 "-" "okhttp/3.8.1"
1.2.3.4 - - [21/Jan/2023:04:53:02 +0000] "GET /socket.io/?EIO=3&transport=polling&sid=9nGwDNdsO20YmYRsKHNf HTTP/1.1" 200 21 "-" "okhttp/3.8.1"
1.2.3.4 - - [21/Jan/2023:04:53:02 +0000] "POST /socket.io/?EIO=3&transport=polling&sid=9nGwDNdsO20YmYRsKHNf HTTP/1.1" 200 2 "-" "okhttp/3.8.1"
1.2.3.4 - - [21/Jan/2023:04:53:03 +0000] "GET /socket.io/?EIO=3&transport=websocket&sid=9nGwDNdsO20YmYRsKHNf HTTP/1.1" 400 18 "-" "okhttp/3.8.1"
1.2.3.4 - - [21/Jan/2023:04:53:32 +0000] "GET /socket.io/?EIO=3&transport=polling HTTP/1.1" 200 99 "-" "okhttp/3.8.1"
1.2.3.4 - - [21/Jan/2023:04:53:32 +0000] "GET /socket.io/?EIO=3&transport=polling&sid=QEz-JkWRBJNHY-D3KHPe HTTP/1.1" 200 21 "-" "okhttp/3.8.1"
1.2.3.4 - - [21/Jan/2023:04:53:32 +0000] "POST /socket.io/?EIO=3&transport=polling&sid=QEz-JkWRBJNHY-D3KHPe HTTP/1.1" 200 2 "-" "okhttp/3.8.1"
1.2.3.4 - - [21/Jan/2023:04:53:33 +0000] "GET /socket.io/?EIO=3&transport=websocket&sid=QEz-JkWRBJNHY-D3KHPe HTTP/1.1" 400 18 "-" "okhttp/3.8.1"

If you look, you can see that there are a few polling calls and then a websocket attempt, which fails with a 400 error. At first I thought we had an issue with something server side preventing websockets and forcing polling, so I started going down that route.

But on my test system I noticed that even when I stopped the cloud binding on my local machine through the Karaf console, the binding continued to log connection events!

2023-01-20 21:04:42.922 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO ping
2023-01-20 21:04:42.996 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO pong: 74 m

What I have realized is that the binding is somehow creating orphaned socket.io connections, so a thread leak. I don't think these connections ever fully connect (the session gets rejected with the 400 error), but nevertheless they are slamming our servers and making debugging very hard.

As a test I stopped the binding (using the console) on my home production system, which has been running for a few days, and I can see the same behavior: my log continues to show these ping and pong messages, so I assume this is extremely widespread.

As a start I went through the code looking for how something like this could occur. The first thing I noticed was that the reconnect code schedules a future but does not keep a reference to it for cleanup purposes. This seems like a good place to start, as I could see it connecting a socket after the binding was stopped or restarted (like when saving configs). I'm going to let my test system run for a bit, but so far I am not seeing the same behavior after this fix; I still need to do more testing.
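
For illustration, a minimal sketch of that idea (class and method names are mine, not the binding's actual code): keep a reference to the scheduled future so stopping the binding can cancel any pending reconnect.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;

    public class ReconnectSketch {
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Keep a handle to the pending reconnect so stop() can cancel it.
        private ScheduledFuture<?> reconnectFuture;

        void scheduleReconnect(long delaySeconds) {
            reconnectFuture = scheduler.schedule(this::connect, delaySeconds, TimeUnit.SECONDS);
        }

        void stop() {
            // Without this cancel, an already scheduled task can still open a new
            // socket after the binding has been stopped or restarted.
            if (reconnectFuture != null) {
                reconnectFuture.cancel(false);
            }
        }

        void connect() {
            // open the socket.io connection here
        }
    }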

@digitaldan digitaldan added the bug An unexpected problem or unintended behavior of an add-on label Jan 21, 2023
@digitaldan
Contributor Author

FYI, the 400 errors and the leak here may not be related. I thought they were, but they may be coming from openHAB instances that have installed the binding but have not actually signed up for an account (likely a lot of them). The leak problem still exists, and I'm not sure if it could cause other issues, but it may not be responsible for the noise in the logs.

@ssalonen
Contributor

ssalonen commented Jan 21, 2023

I will have a look at this PR.

In #11153 the reconnect behaviour changed such that the add-on retries connecting, also on "errors during connection":

// We are not connected currently, manual reconnection is needed to keep trying to (re-)establish

It was later made to use scheduler: #13421

Re: HTTP 400 and retry

Actually an invalid UUID was one of the methods I tested: #11153 (comment)

Probably tuning the time between reconnects would be nice?

Contributor

@ssalonen ssalonen left a comment


@digitaldan
Contributor Author

In #11153 the reconnect behaviour changed such that the addon retries connect, also with "errors during connection":

Reading that PR again, I noticed you mentioned:

One way to confirm changed behaviour is to introduce an invalid UUID, which is rejected by the server ("not authorized" error). With the new code, the client would automatically retry. The old implementation stopped altogether without retrying.

So with the new code (well, newish at this point), we are now retrying indefinitely where before the binding gave up? I'm wondering if this is the source of so many rejections (the logs are full of them). I have noticed this in the Node logs as well.

Probably tuning of time between reconnects would be nice?

If the server specifically rejects us, worst case we should not be trying to reconnect very frequently; ideally we would not try to reconnect at all until the user takes some action, but the binding is not set up for that right now.

Long term it's probably a good idea to revisit the whole UUID/secret thing: make it part of the binding configuration, and maybe have the server generate the secret and the user enter it in the binding config in the Main UI, at which point we try a connection but do not retry if the server specifically rejects it. (This is all off the top of my head, so I probably need to think about it more.)

I suggest to follow the thread-safe pattern established in many bindings,

Will do, and thanks. I was also concerned about the multi-threaded aspect of this quick fix, but wanted to get something up for discussion.
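
For reference, a hedged sketch of that kind of thread-safe handling, extending the earlier sketch with a disposed flag and synchronized access (again illustrative names, not the actual binding code):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;

    public class ThreadSafeReconnectSketch {
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        private ScheduledFuture<?> reconnectFuture;
        private boolean disposed;

        synchronized void scheduleReconnect(long delaySeconds) {
            if (disposed) {
                return; // never schedule once the service has been deactivated
            }
            reconnectFuture = scheduler.schedule(this::connect, delaySeconds, TimeUnit.SECONDS);
        }

        synchronized void dispose() {
            disposed = true;
            if (reconnectFuture != null) {
                reconnectFuture.cancel(false);
                reconnectFuture = null;
            }
        }

        void connect() {
            // (re-)establish the socket.io connection
        }
    }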

@digitaldan
Contributor Author

Also, whatever we do here, I would want to make this part of the 3.4.2 fix release. Let me know what you think about tuning for specific server rejections; I would love to reduce that load if possible.

@ssalonen
Contributor

Yes, agreed that we should not retry if the server explicitly rejects. That makes sense.

Perhaps there is a way to differentiate this from other errors...

And indeed: before the PR, any error during connect phase was "fatal", no new attempts were made.

@digitaldan
Contributor Author

Perhaps there is a way to differentiate this from other errors...

Yeah, I don't see that specifically in the cloud code, which is unfortunate. We could add it, so send back a 401 or similar if we can't match credentials. I'm not sure how that gets passed down to the Java socket.io client; I would have to do some testing.

@ssalonen
Contributor

In the worst case, we should be able to look for some string like "unauthorized" in the error... whatever gets logged down.

@digitaldan
Contributor Author

Hi @ssalonen, I'm back spending some time on this. My latest push incorporates your suggestions. I also added an option to have the library only use websocket connections, which removes the annoying polling requests to create the session and instead does it all in a single request (and we never actually use polling anyway, AFAIK).
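
For context, forcing websocket-only transport with the socket.io Java client looks roughly like this; a sketch using the client's public IO.Options.transports field, with a class and method that are purely illustrative:

    import java.net.URISyntaxException;

    import io.socket.client.IO;
    import io.socket.client.Socket;
    import io.socket.engineio.client.transports.WebSocket;

    public class WebsocketOnlySketch {
        static Socket create(String baseUrl) throws URISyntaxException {
            IO.Options options = new IO.Options();
            // Skip the HTTP long-polling handshake and connect directly over websocket.
            options.transports = new String[] { WebSocket.NAME };
            return IO.socket(baseUrl, options);
        }
    }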

The next part is a little long (tl;dr warning), but I would appreciate you taking a look since you know this code better than most.

I have another question for you about the socket.io automatic reconnect logic. During an event when my system experienced this problem, my logs show that the custom reconnect logic was never actually used. Socket.io was reconnecting internally, which led to the split brain. I think I may understand why, but would like your opinion since you also worked on this.

Before I post my logs, I'll mention a couple of important things.

It seems that the socket.io client has reconnects enabled by default, so if the connection dies, it tries to connect again behind the scenes. I believe you put in some nice logging here, so it is much easier to debug the connection lifecycle.

The other piece here is the okhttp client; by default it has a read timeout of 10 seconds before throwing an error. This will be important below.

So in the logs below you will see that we:

  1. have a ping timeout
  2. socket.io disconnects
  3. socket.io tries to reconnect (attempt 1)
  4. after 10 seconds we get this "EngineIOException xhr poll error"
  5. socket.io immediately tries again (attempt 2)
  6. socket.io then successfully connects
  7. but then disconnects 3 seconds later with the same poll error
  8. and starts over again with a new attempt 1, which now succeeds

The problem, I believe, is that on the first disconnect socket.io tries reconnecting, but then gives up after 10 seconds. Unfortunately the server may have been very busy and still be processing that request, while the client is already connecting again, so our service is processing two connections, and we get a race condition in which the first reconnection can overwrite the DB entry written by the second reconnection (which started later, but also completed before the first one).

2023-01-25 22:36:04.851 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO CLOSE: ping timeout
2023-01-25 22:36:04.853 [WARN ] [io.openhabcloud.internal.CloudClient] - Socket.IO disconnected: ping timeout
2023-01-25 22:36:04.854 [INFO ] [io.openhabcloud.internal.CloudClient] - Disconnected from the openHAB Cloud service (UUID = 1b...8a, base URL = http://localhost:8080)
2023-01-25 22:36:05.517 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO re-connecting (attempt 1)
2023-01-25 22:36:16.249 [DEBUG] [io.openhabcloud.internal.CloudClient] - Error connecting to the openHAB Cloud instance: EngineIOException xhr poll error. Should reconnect automatically.
2023-01-25 22:36:16.251 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO re-connect attempt error: SocketIOException Connection error
2023-01-25 22:36:18.971 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO re-connecting (attempt 2)
2023-01-25 22:36:19.612 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO OPEN
2023-01-25 22:36:19.613 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO re-connected successfully (attempt 2)
2023-01-25 22:36:23.370 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO connected
2023-01-25 22:36:23.370 [INFO ] [io.openhabcloud.internal.CloudClient] - Connected to the openHAB Cloud service (UUID = 1b...8a, base URL = http://localhost:8080)
2023-01-25 22:36:26.079 [WARN ] [io.openhabcloud.internal.CloudClient] - Error during communication: EngineIOException xhr poll error
2023-01-25 22:36:26.081 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO CLOSE: transport error
2023-01-25 22:36:26.082 [WARN ] [io.openhabcloud.internal.CloudClient] - Socket.IO disconnected: transport error
2023-01-25 22:36:26.083 [INFO ] [io.openhabcloud.internal.CloudClient] - Disconnected from the openHAB Cloud service (UUID = 1b...8a, base URL = http://localhost:8080)
2023-01-25 22:36:26.767 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO re-connecting (attempt 1)
2023-01-25 22:36:28.515 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO OPEN
2023-01-25 22:36:28.516 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO re-connected successfully (attempt 1)
2023-01-25 22:36:38.364 [DEBUG] [io.openhabcloud.internal.CloudClient] - Socket.IO connected
2023-01-25 22:36:38.365 [INFO ] [io.openhabcloud.internal.CloudClient] - Connected to the openHAB Cloud service (UUID = 1b...8a, base URL = http://localhost:8080)

I have a couple of ideas here. The first is that I think we need to give the okhttp client a larger read timeout, so something like:

okHttpBuilder.readTimeout(1, TimeUnit.MINUTES);
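
Roughly in context that could look like the following, assuming the customized client is then handed to socket.io through the okhttp factory fields the 1.x Java client exposes on IO.Options (a sketch, not the exact diff):

    import java.util.concurrent.TimeUnit;

    import io.socket.client.IO;
    import okhttp3.OkHttpClient;

    class TimeoutSketch {
        static IO.Options buildOptions() {
            OkHttpClient okHttpClient = new OkHttpClient.Builder()
                    // default read timeout is 10 s; give a busy server more time to answer
                    .readTimeout(1, TimeUnit.MINUTES)
                    .build();
            IO.Options options = new IO.Options();
            // OkHttpClient implements both Call.Factory and WebSocket.Factory, so one
            // instance can back the polling and websocket transports
            options.callFactory = okHttpClient;
            options.webSocketFactory = okHttpClient;
            return options;
        }
    }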

Second, there are a number of options we can set on the socket.io client around reconnects; for example, I would maybe suggest a much slower reconnect cycle with longer timeouts:

    .setReconnection(true)
    .setReconnectionAttempts(Integer.MAX_VALUE) //default value Integer.MAX_VALUE
    .setReconnectionDelay(5_000) //default value 1_000
    .setReconnectionDelayMax(60_000) // default value 5_000
    .setRandomizationFactor(0.5) // default value 0.5
    .setTimeout(60_000) // default value 20_000
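
(For the 1.x Java client, which the EIO=3 in the nginx logs above suggests the binding is using, the same knobs are plain fields on IO.Options rather than builder methods; a rough equivalent:)

    import io.socket.client.IO;

    class ReconnectOptionsSketch {
        static IO.Options buildOptions() {
            IO.Options options = new IO.Options();
            options.reconnection = true;
            options.reconnectionAttempts = Integer.MAX_VALUE;
            options.reconnectionDelay = 5_000;     // ms, default 1_000
            options.reconnectionDelayMax = 60_000; // ms, default 5_000
            options.randomizationFactor = 0.5;     // default 0.5
            options.timeout = 60_000;              // ms, default 20_000
            return options;
        }
    }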

The other thing I am thinking about (but need to sleep on) is using Redis as a network lock: when a new connection comes in, it creates an entry in Redis with some TTL on it. Once the connection setup is complete that lock is removed, or if something dies, it expires after a reasonable amount of time. If a second connection comes in while the lock is held, it will see the lock and fail.

wdyt?

Creates thread safe reconnection, reduces unnecessary polling on setup, removes unused variables.

Signed-off-by: Dan Cunningham <dan@digitaldan.com>
Signed-off-by: Dan Cunningham <dan@digitaldan.com>
@digitaldan
Contributor Author

@ssalonen I went ahead and added the backoff settings to the internal socket.io options, and increased the timeout values as mentioned above.

@digitaldan
Contributor Author

redis PR
openhab/openhab-cloud#413

@ssalonen
Contributor

ssalonen commented Jan 28, 2023

I will answer the quick points first

a much slower connect cycle with longer timeouts

Agreed. Those are fairly big numbers (1), but I guess still reasonable: a delay of at most 60 s between reconnects.

socket.io automatic reconnect logic. During an event when my system experienced this problem, my logs show that actually the custom reconnect logic was never actually used.

It will recover certain connection failures which would not be otherwise retried (I can try to find the details from the old PR). Or that is the attempt at least. The intent was to see if this was the culprit, before all the debugging and logging done by the community.

Indeed many failures are reconnected by socket.io's internal logic, and no separate reconnection logic is applied on the binding side.

Actually, I have no records of any user reports where this would have "fixed" a connection. I might even go further and say we remove this logic altogether. Let me still check the old PR, hopefully later today.

we get a race condition which can lead to the first reconnection overwriting the DB on the second reconnection (which started after, but also completed before the first one)

this is a great finding and... plausible

I will check the Redis PR; initially this sounds like a very robust solution and exactly what we need! Was Redis part of the stack already?

(1) see earlier experiments #12121 (comment)

@digitaldan
Contributor Author

It will recover certain connection failures which would not be otherwise retried (I can try to find the details from the old PR).

From my testing, that code gets used when the server specifically calls socket.disconnect() (which can happen for a number of server-related issues), or I think if the initial attempt to connect does not succeed (you point this out as well in #12121). It's basically doing the same thing as the internal socket.io code does once it's connected. It's probably still useful, so I would say we leave it in and just tune it.

Was redis part of the stack already?

Yes, it's used as the backing session store for Express as well as for Mongo query caching through Mongoose. Beyond using it as a network connection lock, I think there is a good opportunity to use it for other purposes, like offline tracking, proxy route lookups, etc.

I'm right now looking to upgrade our Redis and Mongo servers to the latest versions; there are a few improvements that we could use to make the lock solution better with fewer calls.

(1) see earlier experiments #12121 (comment)

Yikes, if I had gone back and read your post again, it probably would have saved me a lot of time, as you clearly pointed out the reconnection settings in socket.io! Apologies for not responding to that post!

Thanks for taking a look!

@ssalonen
Contributor

From my testing, that code gets used when the server specifically calls socket.disconnect()

OK, sounds promising. I must say I have a hard time reading the socket/engine client code and understanding when that is triggered. I think I have only seen it with first connect errors (the wrong UUID example).

@digitaldan
Contributor Author

I think I have only seen it with first connect errors (wrong uuid example)..

Actually you may be right. I think this gets called when the server calls socket.disconnect():

                .on(Socket.EVENT_DISCONNECT, args -> {
                    if (args.length > 0) {
                        logger.warn("Socket.IO disconnected: {}", args[0]);
                    } else {
                        logger.warn("Socket.IO disconnected");
                    }
                    isConnected = false;
                    onDisconnect();
                })//

This means we are no longer connected, but no retry logic is attempted. It's likely this is occasionally the reason the binding needs to be restarted after server issues?
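
A hedged sketch of what a retry on a server-initiated disconnect could look like, reusing the handler quoted above (the scheduleReconnect() helper is an assumption for illustration, not the binding's actual code):

                .on(Socket.EVENT_DISCONNECT, args -> {
                    if (args.length > 0) {
                        logger.warn("Socket.IO disconnected: {}", args[0]);
                    } else {
                        logger.warn("Socket.IO disconnected");
                    }
                    isConnected = false;
                    onDisconnect();
                    // socket.io does not retry after an explicit server-side disconnect,
                    // so schedule our own reconnect attempt (hypothetical helper).
                    scheduleReconnect();
                })//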

@ssalonen
Contributor

Hmm, you mean that the server calling socket.disconnect() would in some cases result in an error on the client side?

Signed-off-by: Dan Cunningham <dan@digitaldan.com>
Signed-off-by: Dan Cunningham <dan@digitaldan.com>
@digitaldan
Contributor Author

you mean that server calling socket.disconnect would in some cases result in error on client side?

If the server calls socket.disconnect() then that code path gets executed, which does not try to reconnect. We call disconnect on the server if we have a general runtime error, like if we are not able to talk to Mongo. I'll play with that some more, but maybe in another PR as I have already made this one a little broad 😱

@digitaldan digitaldan removed the work in progress A PR that is not yet ready to be merged label Jan 28, 2023
@digitaldan
Contributor Author

OK, I'm good with this PR. I think this should also be part of the next 3.4.x release as well.

@digitaldan digitaldan changed the title [openhabcloud] Possible connection leak [openhabcloud] Connection leak fix and reasonable reconnect values Jan 28, 2023
Signed-off-by: Dan Cunningham <dan@digitaldan.com>
Signed-off-by: Dan Cunningham <dan@digitaldan.com>
@digitaldan digitaldan requested a review from ssalonen January 29, 2023 05:14
Contributor

@ssalonen ssalonen left a comment


Looks good. If you like, add a comment referencing the socket.io Java client code.

@digitaldan digitaldan changed the title [openhabcloud] Connection leak fix and reasonable reconnect values [openhabcloud] Reconnection Fixes Jan 29, 2023
Contributor

@lolodomo lolodomo left a comment


LGTM, thank you

@lolodomo lolodomo merged commit 9ba3c07 into openhab:main Jan 29, 2023
@lolodomo lolodomo added this to the 4.0 milestone Jan 29, 2023
@lolodomo
Contributor

I think this should also be part of the next 3.4.x release as well.

@jlaur: could you please cherry-pick that PR into the 3.4.x branch?

@lolodomo lolodomo added the patch A PR that has been cherry-picked to a patch release branch label Jan 29, 2023
jlaur pushed a commit that referenced this pull request Jan 29, 2023
* [openhabcloud] Possible connection leak
* Creates thread safe reconnection, reduces unnecessary polling on setup, removes unused variables.
* adds the reconnect settings to the internal socket.io options.
* Up the min reconnect time
* Use @ssalonen suggestion for backoff mins and randomness
* Reconnect after server initiated disconnect
* Remove unhelpful comments

Signed-off-by: Dan Cunningham <dan@digitaldan.com>
@jlaur
Contributor

jlaur commented Jan 29, 2023

could you please cherry-pick that PR in 3.4.x branch ?

Done. Can you have a look at these, which I would also consider for 3.4.x:

I just need a second pair of eyes to make the decision. Especially the last one since it's my own PR.

@lolodomo
Contributor

I already answered for the wemo PR ;)

Will have a look at the JDBC one.

@lolodomo
Contributor

Agree with you on the JDBC PR.

@digitaldan
Contributor Author

Thanks!

@openhab-bot
Collaborator

This pull request has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/openhab-cloud-connection-stops-working-does-not-try-to-reconnect/133608/77

@openhab-bot
Collaborator

This pull request has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/connection-to-openhab-cloud-fails-for-openhab-3-4-1/147253/1
