
[hue] Improve connection stability (API v2) #15477

Merged
merged 12 commits into openhab:main from hue-connection-improvements on Aug 26, 2023

Conversation

andrewfg
Contributor

@andrewfg andrewfg commented Aug 21, 2023

This PR contains several changes to improve the stability of HTTP/2 connections.

Resolves #15350
Resolves #15460 (part 2)
Related to #15468 (temporary fix)
Resolves the issue with duplicate event messages after a session recycle, as reported here

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@andrewfg andrewfg added the enhancement (An enhancement or new feature for an existing add-on) and additional testing preferred (The change works for the pull request author. A test from someone else is preferred though.) labels on Aug 21, 2023
@andrewfg andrewfg self-assigned this Aug 21, 2023
@andrewfg
Contributor Author

@jlaur / @maniac103 I am still testing this code on my production system, but I am posting this PR now so that you can a) test it yourselves, and b) critique the code changes.

@andrewfg
Contributor Author

NOTA BENE: due to the added HTTP status checking, this PR now fails with '404' errors on some requests because of #15468!
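For context, the added status checking is of the following general shape; the helper class and the use of a plain IOException are illustrative only, not the binding's actual code.

```java
import java.io.IOException;

// Hypothetical sketch of the kind of status check added in this PR; names are illustrative.
final class HttpStatusCheck {

    static void checkHttpStatus(int status, String url) throws IOException {
        if (status / 100 != 2) {
            // e.g. the spurious 404 responses caused by #15468 would surface here
            throw new IOException("HTTP " + status + " returned for " + url);
        }
    }
}
```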

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@andrewfg
Contributor Author

andrewfg commented Aug 22, 2023

TODO: currently the code waits until it receives a GO_AWAY before recycling the session. The Throttler and SessionSynchronizer objects should prevent most conflicts during the session recycle phase. But there is still a slim chance of a race if a) the binding makes two GET requests almost concurrently, b) the bridge sends the GO_AWAY in response to the first GET, and c) the first GET takes more than 100 msec to complete. In that case the second GET might pass through the Throttler lock and start executing before the session recycle process has taken the SessionSynchronizer lock. The GET/PUT stream count would then exceed the nginx 1000 limit, and the second GET would fail catastrophically.

I think the only solution is for the binding to start the session recycle process on its own side when the GET/PUT stream count is at least 3 below the nginx limit (3 because the binding may make up to 3 GET calls concurrently). It is tricky to imagine exactly how such timing can play out, and impossible to simulate in tests, so I would appreciate your thoughts on this -- especially @maniac103 ..
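For illustration only, the coordination pattern described above -- a Throttler limiting concurrent streams plus a SessionSynchronizer read/write lock guarding the recycle -- might look roughly like this; the class and method names are my own sketch, not the binding's actual code:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Rough sketch of the coordination pattern discussed above; not the binding's actual code.
class SessionCoordination {
    // 'Throttler' stand-in: limits concurrent streams (the real binding also rate-limits calls)
    private final Semaphore throttler = new Semaphore(3);
    // 'SessionSynchronizer' stand-in: fair read/write lock guarding the session
    private final ReentrantReadWriteLock sessionLock = new ReentrantReadWriteLock(true);

    // Normal GET/PUT calls hold a throttle permit and a (shared) read lock.
    void doRequest(Runnable request) throws InterruptedException {
        throttler.acquire();
        try {
            // The race described above arises when this read lock is acquired
            // before the recycle has requested the write lock.
            sessionLock.readLock().lock();
            try {
                request.run();
            } finally {
                sessionLock.readLock().unlock();
            }
        } finally {
            throttler.release();
        }
    }

    // The recycle takes the (exclusive) write lock, so it waits for in-flight
    // requests to finish and blocks new ones until the new session is ready.
    void recycleSession(Runnable recycle) {
        sessionLock.writeLock().lock();
        try {
            recycle.run();
        } finally {
            sessionLock.writeLock().unlock();
        }
    }
}
```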


EDIT: resolved; see the next post.

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@andrewfg
Contributor Author

Apropos my prior post on session recycling, I just committed a change whereby it recycles the session 6 calls before the nginx 1000 limit is reached. This eliminates the potential timing error described above, since the Hue bridge server would still have 6 calls in hand before it would send the GO_AWAY message, and could therefore accept a handful of GET calls slipping past the SessionSynchronizer locks.

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@maniac103
Contributor

Apropos my prior post on session recycling, I just committed a change whereby it recycles the session 6 calls before the nginx 1000 limit is reached

I'm not sure whether this really is the ideal solution, given the 1000 request limit can change with any bridge FW update.

In that case the second GET might pass through the Throttler lock and start executing before the session recycle process has taken the SessionSynchronizer lock. The GET/PUT stream count would then exceed the nginx 1000 limit, and the second GET would fail catastrophically.

Can't we detect that situation (by receiving the error and noticing the session of the failed request doesn't match the reestablished or closed session) and thus just issue a retry in that case?

@andrewfg
Contributor Author

andrewfg commented Aug 22, 2023

not sure whether this really is the ideal solution, given the 1000 request limit can change with any bridge FW update.

Indeed. For that reason I am recycling the session a) after NGINX_MAX_REQUEST_COUNT - 1 - (MAX_CONCURRENT_STREAMS * 2) (i.e. 993) GET/PUT calls, and/or b) when the server actually sends its GO_AWAY message -- whichever comes sooner. The former pre-emptive approach is guaranteed to avoid the edge-case risk of timing errors entirely, whilst the latter is not (entirely) .. although the risk of three unlikely things happening at the same time is IMHO pretty small. (I will post a flow chart to try to explain it.)
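Sketched in code, the pre-emptive half of that rule might look roughly as follows; the names and structure are illustrative, not the actual implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the pre-emptive recycle threshold described above.
class RecycleThreshold {
    static final int NGINX_MAX_REQUEST_COUNT = 1000; // presumed bridge (nginx) request limit
    static final int MAX_CONCURRENT_STREAMS = 3;     // binding makes at most 3 concurrent GETs
    static final int RECYCLE_AFTER =
            NGINX_MAX_REQUEST_COUNT - 1 - (MAX_CONCURRENT_STREAMS * 2); // i.e. 993

    private final AtomicInteger requestCount = new AtomicInteger();

    // Called once per GET/PUT; returns true when the session should be recycled pre-emptively.
    boolean shouldRecycle() {
        return requestCount.incrementAndGet() >= RECYCLE_AFTER;
    }

    // Reset the counter once a new HTTP/2 session has been established.
    void onNewSession() {
        requestCount.set(0);
    }
}
```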

My hope is that you -- @maniac103 -- have an amazingly bright idea for fixing my synchronization code so that the current edge-case risk of timing errors can be eliminated. But if you don't have any such ideas, then we have to figure out a way to ameliorate rather than eliminate the risk.

I thought about adding a config param for the bridge thing whereby one could manually change the 1000 limit in case the nginx firmware did reduce that number (it would not be a problem if they increased it). However I concluded that if that were to happen, it would be better, as a courtesy to the users, to make a new PR changing the NGINX_MAX_REQUEST_COUNT constant in the binding code, rather than telling them all to change a config param.

As I write this, it occurs to me that we could perhaps even make the above behaviour self-adaptive. If we see that the Hue server consistently sends GO_AWAY messages before the presumed 1000 limit, we could dynamically reduce that limit in code.
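Purely as an illustration of that idea (it is not part of this PR), the self-adaptive limit could look something like this:

```java
// Purely illustrative sketch of the self-adaptive idea floated above -- not part of this PR.
class AdaptiveLimit {
    private int presumedLimit = 1000; // start from the presumed nginx request limit
    private int requestsOnSession = 0;

    // Called once per GET/PUT on the current session.
    void onRequest() {
        requestsOnSession++;
    }

    // If the bridge sends GO_AWAY earlier than expected, adopt the observed count as the new limit.
    void onGoAway() {
        if (requestsOnSession < presumedLimit) {
            presumedLimit = requestsOnSession;
        }
        requestsOnSession = 0;
    }

    int presumedLimit() {
        return presumedLimit;
    }
}
```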

Can't we detect that situation (...) and thus just issue a retry in that case?

Well, we do detect the error too, and this does trigger a connection restart. But the OH handler architecture is asynchronous and blocking, so although we can BLOCK one single GET call until a recycle has completed, we cannot issue a call, receive an exception, trigger a restart, and then RE-ISSUE a duplicate of the same call. We would need to create a mechanism to cache such failed calls and repeat them after reconnection until they succeed, or time out, or whatever. I think that would be horribly messy.

@andrewfg
Contributor Author

(I will post a flow chart to try to explain it).

I spent a few hours making flow charts. And as a result I am pretty sure that the GO_AWAY synchronization scheme DOES work in all cases after all. In which case you can ignore the 'nightmares' in my prior posts. However I want to complete those flow charts, and post them here for you to critique. And I also need to do some timing tests on the Hue Bridge server to determine its exact sequence of events. And study the source code of ReentrantReadWriteLock too. I will get back to you ASAP.

@andrewfg
Contributor Author

andrewfg commented Aug 23, 2023

I am very happy! A few things..

  • Research of the ReentrantReadWriteLock code and JavaDocs shows that if the lock is created as 'fair' then pending write locks always take precedence over pending read locks (see the demo sketch after this list).
  • Measurements of the Hue Bridge show that it sends the GO_AWAY within 1ms of the client opening the last request that would succeed.
  • Measurements of the Hue Bridge show that fast GET requests can take less time than the throttle interval of 50ms to complete and that slow GETs can take more.
  • Attached HERE is a timing chart for two overlapped GET requests, the first being the GET that triggers the GO_AWAY, and the second immediately thereafter. The chart contains the two possible synchronization scenarios -- namely if the first GET takes less than the 50ms throttle interval, and if it takes more.
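Here is a minimal, self-contained demo of the first bullet: with a 'fair' ReentrantReadWriteLock, a reader arriving after a queued writer waits until the writer is done. The thread roles mirror the GET/recycle scenario, but the code is purely illustrative.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Demo of 'fair' ReentrantReadWriteLock ordering; the GET/recycle roles are illustrative only.
public class FairRwLockDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair = true

        lock.readLock().lock(); // simulates an in-flight GET holding a read lock

        Thread recycler = new Thread(() -> {
            lock.writeLock().lock(); // queues behind the in-flight GET
            System.out.println("recycle: write lock acquired");
            lock.writeLock().unlock();
        });
        recycler.start();
        TimeUnit.MILLISECONDS.sleep(100); // let the 'recycle' writer queue up

        Thread lateGet = new Thread(() -> {
            lock.readLock().lock(); // fair mode: queues behind the waiting writer
            System.out.println("late GET: read lock acquired (after the recycle)");
            lock.readLock().unlock();
        });
        lateGet.start();
        TimeUnit.MILLISECONDS.sleep(100);

        lock.readLock().unlock(); // in-flight GET completes; the writer goes first
        recycler.join();
        lateGet.join();
    }
}
```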

The above analysis and chart prove that the GO_AWAY thread synchronization should always succeed -- specifically..

  • After the GO_AWAY no further GET will be made on that session.
  • The session will be recycled cleanly without errors.
  • Subsequent concurrent GETs just prior to the GO_AWAY are postponed and made on the new session.
  • And (therefore) no GET calls will be lost.
  • Note: the situation for PUTs is even safer, since the Throttler prevents concurrent calls during PUTs.

Conclusions..

  1. Please ignore my prior 'nightmare' posts on this topic.
  2. We can dispense with the idea of recycling the session prior to the (opaque) 'nginx' limit, as the GO_AWAY process is fine.
  3. I shall make some more changes based on the above findings, and I will commit them shortly.
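For reference, Jetty's HTTP/2 client surfaces GO_AWAY via org.eclipse.jetty.http2.api.Session.Listener; a minimal hook of the kind discussed above might look like this, assuming the Jetty 9.x HTTP/2 API that openHAB builds on -- the recycle wiring is hypothetical, not a copy of the binding's code:

```java
import org.eclipse.jetty.http2.api.Session;
import org.eclipse.jetty.http2.frames.GoAwayFrame;

// Hypothetical GO_AWAY hook; the recycle wiring is illustrative, not the binding's actual code.
public class GoAwaySessionListener extends Session.Listener.Adapter {
    private final Runnable recycleTask;

    public GoAwaySessionListener(Runnable recycleTask) {
        this.recycleTask = recycleTask;
    }

    @Override
    public void onGoAway(Session session, GoAwayFrame frame) {
        // The bridge (nginx) sends GO_AWAY once its request limit is reached;
        // schedule a clean session recycle so no further GETs use this session.
        recycleTask.run();
    }
}
```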

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@andrewfg
Contributor Author

^
FYI I am running it on my production system with a rule that toggles a test lamp every 5 seconds (so it reaches the GO_AWAY limit in just over 1 hour), and I will test until tomorrow to confirm all is Ok.

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@andrewfg
Contributor Author

andrewfg commented Aug 24, 2023

I will test until tomorrow to confirm all is Ok

So far all is looking good :)


EDIT: just now I was able to actually observe a tight batch of 4 GET requests where the 2nd request triggered the GO_AWAY limit, and I can confirm that the first two calls were made on the original session and the last two were postponed to the new session -- i.e. real proof that the synchronization does work.

@andrewfg andrewfg removed the additional testing preferred (The change works for the pull request author. A test from someone else is preferred though.) label on Aug 24, 2023
Contributor

@jlaur jlaur left a comment


Thanks for the improvements. I'm now also running this version in my production system. I have added a few minor comments. @maniac103 - as always, thanks for reviewing.

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@andrewfg andrewfg requested a review from jlaur August 25, 2023 09:57
@jlaur
Contributor

jlaur commented Aug 26, 2023

@maniac103 - do you want to check your comment resolutions before merging this PR?

@maniac103
Contributor

@jlaur Looks good to me as far as I am concerned.

@andrewfg andrewfg requested a review from jlaur August 26, 2023 13:28
Contributor

@jlaur jlaur left a comment


LGTM

@jlaur jlaur merged commit 7fb9efc into openhab:main Aug 26, 2023
2 checks passed
@jlaur jlaur added this to the 4.1 milestone Aug 26, 2023
jlaur pushed a commit that referenced this pull request Aug 26, 2023
Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@jlaur jlaur added the patch (A PR that has been cherry-picked to a patch release branch) label on Aug 26, 2023
@jlaur jlaur changed the title [hue] Improve connection stability [hue] Improve connection stability (API v2) Dec 22, 2023
austvik pushed a commit to austvik/openhab-addons that referenced this pull request Mar 27, 2024
Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
Signed-off-by: Jørgen Austvik <jaustvik@acm.org>
@andrewfg andrewfg deleted the hue-connection-improvements branch August 25, 2024 15:18
Labels
enhancement (An enhancement or new feature for an existing add-on), patch (A PR that has been cherry-picked to a patch release branch)
Development

Successfully merging this pull request may close these issues.

[hue] Error handling improvements
[hue] ApiException fails to reconnect bridge
3 participants