
[hue] Improve connection stability (API v2) #15477

Merged
merged 12 commits into openhab:main from hue-connection-improvements on Aug 26, 2023

Conversation

andrewfg
Contributor

@andrewfg andrewfg commented Aug 21, 2023

This PR contains several changes to improve the stability of HTTP/2 connections.

Resolves #15350
Resolves #15460 (part 2)
Related to #15468 (temporary fix)
Resolves the issue with duplicate event messages after a session recycle, as reported here

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@andrewfg andrewfg added the enhancement (An enhancement or new feature for an existing add-on) and additional testing preferred (The change works for the pull request author. A test from someone else is preferred though.) labels on Aug 21, 2023
@andrewfg andrewfg self-assigned this Aug 21, 2023
@andrewfg
Contributor Author

@jlaur / @maniac103 I am still testing this code on my production system, but I am posting this PR now so that you can a) test it yourselves, and b) critique the code changes.

@andrewfg
Contributor Author

NOTA BENE: due to the added HTTP status checking, this PR now fails with '404' errors on some requests because of #15468!
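For context, the added status checking is of the following general shape; the helper class and the use of a plain IOException are illustrative only, not the binding's actual code.

```java
import java.io.IOException;

// Hypothetical sketch of the kind of status check added in this PR; names are illustrative.
final class HttpStatusCheck {

    static void checkHttpStatus(int status, String url) throws IOException {
        if (status / 100 != 2) {
            // e.g. the spurious 404 responses caused by #15468 would surface here
            throw new IOException("HTTP " + status + " returned for " + url);
        }
    }
}
```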

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@andrewfg
Contributor Author

andrewfg commented Aug 22, 2023

TODO: currently the code waits until it receives a GO_AWAY before recycling the session. The Throttler and SessionSynchronizer objects should prevent most conflicts during the session recycle phase. But there is still a slim chance of a race if a) the binding makes two GET requests almost concurrently, b) the bridge sends the GO_AWAY in response to the first GET, and c) the first GET takes more than 100 msec to complete. In that case the second GET might pass through the Throttler lock and start executing before the session recycle process has taken the SessionSynchronizer lock. The GET/PUT stream count would then exceed the nginx 1000 limit, and the second GET would fail catastrophically.

I think the only solution is for the binding to start the session recycle process on its own side when the GET/PUT stream count is at least 3 below the nginx limit (3 because the binding may make up to 3 GET calls concurrently). It is tricky to imagine exactly how such timing can play out, and impossible to simulate in tests, so I would appreciate your thoughts on this -- especially @maniac103 ..
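For illustration only, the coordination pattern described above -- a Throttler limiting concurrent streams plus a SessionSynchronizer read/write lock guarding the recycle -- might look roughly like this; the class and method names are my own sketch, not the binding's actual code:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Rough sketch of the coordination pattern discussed above; not the binding's actual code.
class SessionCoordination {
    // 'Throttler' stand-in: limits concurrent streams (the real binding also rate-limits calls)
    private final Semaphore throttler = new Semaphore(3);
    // 'SessionSynchronizer' stand-in: fair read/write lock guarding the session
    private final ReentrantReadWriteLock sessionLock = new ReentrantReadWriteLock(true);

    // Normal GET/PUT calls hold a throttle permit and a (shared) read lock.
    void doRequest(Runnable request) throws InterruptedException {
        throttler.acquire();
        try {
            // The race described above arises when this read lock is acquired
            // before the recycle has requested the write lock.
            sessionLock.readLock().lock();
            try {
                request.run();
            } finally {
                sessionLock.readLock().unlock();
            }
        } finally {
            throttler.release();
        }
    }

    // The recycle takes the (exclusive) write lock, so it waits for in-flight
    // requests to finish and blocks new ones until the new session is ready.
    void recycleSession(Runnable recycle) {
        sessionLock.writeLock().lock();
        try {
            recycle.run();
        } finally {
            sessionLock.writeLock().unlock();
        }
    }
}
```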


EDIT: resolved; see the next post.

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@andrewfg
Contributor Author

Apropos my prior post on session recycling, I just committed a change whereby it recycles the session 6 calls before the nginx 1000 limit is reached. This eliminates the potential timing error described above, since the Hue bridge server would still have 6 calls in hand before it would send the GO_AWAY message, and could therefore accept a handful of GET calls slipping past the SessionSynchronizer locks.

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@maniac103
Contributor

Apropos my prior post on session recycling, I just committed a change whereby it recycles the session 6 calls before the nginx 1000 limit is reached

I'm not sure whether this really is the ideal solution, given the 1000 request limit can change with any bridge FW update.

In that case the second GET might pass through the Throttler lock and start executing before the session recycle process has taken the SessionSynchronizer lock. The GET/PUT stream count would then exceed the nginx 1000 limit, and the second GET would fail catastrophically.

Can't we detect that situation (by receiving the error and noticing the session of the failed request doesn't match the reestablished or closed session) and thus just issue a retry in that case?

@andrewfg
Contributor Author

andrewfg commented Aug 22, 2023

not sure whether this really is the ideal solution, given the 1000 request limit can change with any bridge FW update.

Indeed. For that reason I am recycling the session a) after NGINX_MAX_REQUEST_COUNT - 1 - (MAX_CONCURRENT_STREAMS * 2) (i.e. 993) GET/PUT calls, and/or b) when the server actually sends its GO_AWAY message -- whichever comes sooner. The former pre-emptive approach is guaranteed to avoid the edge-case risk of timing errors entirely, whilst the latter is not (entirely) .. although the risk of three unlikely things happening at the same time is IMHO pretty small. (I will post a flow chart to try to explain it.)
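Sketched in code, the pre-emptive half of that rule might look roughly as follows; the names and structure are illustrative, not the actual implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the pre-emptive recycle threshold described above.
class RecycleThreshold {
    static final int NGINX_MAX_REQUEST_COUNT = 1000; // presumed bridge (nginx) request limit
    static final int MAX_CONCURRENT_STREAMS = 3;     // binding makes at most 3 concurrent GETs
    static final int RECYCLE_AFTER =
            NGINX_MAX_REQUEST_COUNT - 1 - (MAX_CONCURRENT_STREAMS * 2); // i.e. 993

    private final AtomicInteger requestCount = new AtomicInteger();

    // Called once per GET/PUT; returns true when the session should be recycled pre-emptively.
    boolean shouldRecycle() {
        return requestCount.incrementAndGet() >= RECYCLE_AFTER;
    }

    // Reset the counter once a new HTTP/2 session has been established.
    void onNewSession() {
        requestCount.set(0);
    }
}
```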

My hope is that you -- @maniac103 -- have an amazingly bright idea for fixing my synchronization code so that the current edge-case risk of timing errors can be eliminated. But if you don't have any such ideas, then we have to figure out a way to ameliorate rather than eliminate the risk.

I thought about adding a config param for the bridge thing whereby one could manually change the 1000 limit in case the nginx firmware did reduce that number (it would not be a problem if they increased it). However I concluded that if that were to happen, it would be better, as a courtesy to the users, to make a new PR changing the NGINX_MAX_REQUEST_COUNT constant in the binding code, rather than telling them all to change a config param.

As I write this, it occurs to me that we could perhaps even make the above behaviour self-adaptive. If we see that the Hue server consistently sends GO_AWAY messages before the presumed 1000 limit, we could dynamically reduce that limit in code.
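Purely as an illustration of that idea (it is not part of this PR), the self-adaptive limit could look something like this:

```java
// Purely illustrative sketch of the self-adaptive idea floated above -- not part of this PR.
class AdaptiveLimit {
    private int presumedLimit = 1000; // start from the presumed nginx request limit
    private int requestsOnSession = 0;

    // Called once per GET/PUT on the current session.
    void onRequest() {
        requestsOnSession++;
    }

    // If the bridge sends GO_AWAY earlier than expected, adopt the observed count as the new limit.
    void onGoAway() {
        if (requestsOnSession < presumedLimit) {
            presumedLimit = requestsOnSession;
        }
        requestsOnSession = 0;
    }

    int presumedLimit() {
        return presumedLimit;
    }
}
```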

Can't we detect that situation (...) and thus just issue a retry in that case?

Well, we do detect the error too, and this does trigger a connection restart. But the OH handler architecture is asynchronous and blocking, so although we can BLOCK one single GET call until a recycle has completed, we cannot issue a call, receive an exception, trigger a restart, and then RE-ISSUE a duplicate of the same call. We would need to create a mechanism to cache such failed calls and repeat them after reconnection until they succeed, or time out, or whatever. I think that would be horribly messy.

@andrewfg
Contributor Author

(I will post a flow chart to try to explain it).

I spent a few hours making flow charts. And as a result I am pretty sure that the GO_AWAY synchronization scheme DOES work in all cases after all. In which case you can ignore the 'nightmares' in my prior posts. However I want to complete those flow charts, and post them here for you to critique. And I also need to do some timing tests on the Hue Bridge server to determine its exact sequence of events. And study the source code of ReentrantReadWriteLock too. I will get back to you ASAP.

@andrewfg
Contributor Author

andrewfg commented Aug 23, 2023

I am very happy! A few things..

  • Research of the ReentrantReadWriteLock code and JavaDocs shows that if the lock is created as 'fair' then pending write locks always take precedence over pending read locks (see the demo sketch after this list).
  • Measurements of the Hue Bridge show that it sends the GO_AWAY within 1ms of the client opening the last request that would succeed.
  • Measurements of the Hue Bridge show that fast GET requests can take less time than the throttle interval of 50ms to complete and that slow GETs can take more.
  • Attached HERE is a timing chart for two overlapped GET requests, the first being the GET that triggers the GO_AWAY, and the second immediately thereafter. The chart contains the two possible synchronization scenarios -- namely if the first GET takes less than the 50ms throttle interval, and if it takes more.
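Here is a minimal, self-contained demo of the first bullet: with a 'fair' ReentrantReadWriteLock, a reader arriving after a queued writer waits until the writer is done. The thread roles mirror the GET/recycle scenario, but the code is purely illustrative.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Demo of 'fair' ReentrantReadWriteLock ordering; the GET/recycle roles are illustrative only.
public class FairRwLockDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair = true

        lock.readLock().lock(); // simulates an in-flight GET holding a read lock

        Thread recycler = new Thread(() -> {
            lock.writeLock().lock(); // queues behind the in-flight GET
            System.out.println("recycle: write lock acquired");
            lock.writeLock().unlock();
        });
        recycler.start();
        TimeUnit.MILLISECONDS.sleep(100); // let the 'recycle' writer queue up

        Thread lateGet = new Thread(() -> {
            lock.readLock().lock(); // fair mode: queues behind the waiting writer
            System.out.println("late GET: read lock acquired (after the recycle)");
            lock.readLock().unlock();
        });
        lateGet.start();
        TimeUnit.MILLISECONDS.sleep(100);

        lock.readLock().unlock(); // in-flight GET completes; the writer goes first
        recycler.join();
        lateGet.join();
    }
}
```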

The above analysis and chart prove that the GO_AWAY thread synchronization should always succeed -- specifically..

  • After the GO_AWAY no further GET will be made on that session.
  • The session will be recycled cleanly without errors.
  • Subsequent concurrent GETs just prior to the GO_AWAY are postponed and made on the new session.
  • And (therefore) no GET calls will be lost.
  • Note: the situation for PUTs is even safer, since the Throttler prevents concurrent calls during PUTs.

Conclusions..

  1. Please ignore my prior 'nightmare' posts on this topic.
  2. We can dispense with the idea of recycling the session prior to the (opaque) 'nginx' limit, as the GO_AWAY process is fine.
  3. I shall make some more changes based on the above findings, and I will commit them shortly.
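For reference, Jetty's HTTP/2 client surfaces GO_AWAY via org.eclipse.jetty.http2.api.Session.Listener; a minimal hook of the kind discussed above might look like this, assuming the Jetty 9.x HTTP/2 API that openHAB builds on -- the recycle wiring is hypothetical, not a copy of the binding's code:

```java
import org.eclipse.jetty.http2.api.Session;
import org.eclipse.jetty.http2.frames.GoAwayFrame;

// Hypothetical GO_AWAY hook; the recycle wiring is illustrative, not the binding's actual code.
public class GoAwaySessionListener extends Session.Listener.Adapter {
    private final Runnable recycleTask;

    public GoAwaySessionListener(Runnable recycleTask) {
        this.recycleTask = recycleTask;
    }

    @Override
    public void onGoAway(Session session, GoAwayFrame frame) {
        // The bridge (nginx) sends GO_AWAY once its request limit is reached;
        // schedule a clean session recycle so no further GETs use this session.
        recycleTask.run();
    }
}
```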

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@andrewfg
Contributor Author

^
FYI I am running it on my production system with a rule that toggles a test lamp every 5 seconds (so it reaches the GO_AWAY limit in just over 1 hour), and I will test until tomorrow to confirm all is Ok.

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@andrewfg
Contributor Author

andrewfg commented Aug 24, 2023

I will test until tomorrow to confirm all is Ok

So far all is looking good :)


EDIT: just now I was able to actually observe a tight batch of 4 GET requests where the 2nd request triggered the GO_AWAY limit, and I can confirm that the first two calls were made on the original session and the last two were postponed to the new session -- i.e. real proof that the synchronization does work.

@andrewfg andrewfg removed the additional testing preferred (The change works for the pull request author. A test from someone else is preferred though.) label on Aug 24, 2023
Contributor

@jlaur jlaur left a comment


Thanks for the improvements. I'm now also running this version in my production system. I have added a few minor comments. @maniac103 - as always, thanks for reviewing.

Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@andrewfg andrewfg requested a review from jlaur August 25, 2023 09:57
@jlaur
Contributor

jlaur commented Aug 26, 2023

@maniac103 - do you want to check your comment resolutions before merging this PR?

@maniac103
Contributor

@jlaur Looks good to me as far as I am concerned.

@andrewfg andrewfg requested a review from jlaur August 26, 2023 13:28
Contributor

@jlaur jlaur left a comment


LGTM

@jlaur jlaur merged commit 7fb9efc into openhab:main Aug 26, 2023
2 checks passed
@jlaur jlaur added this to the 4.1 milestone Aug 26, 2023
jlaur pushed a commit that referenced this pull request Aug 26, 2023
Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
@jlaur jlaur added the patch (A PR that has been cherry-picked to a patch release branch) label on Aug 26, 2023
@jlaur jlaur changed the title [hue] Improve connection stability [hue] Improve connection stability (API v2) Dec 22, 2023
austvik pushed a commit to austvik/openhab-addons that referenced this pull request Mar 27, 2024
Signed-off-by: Andrew Fiddian-Green <software@whitebear.ch>
Signed-off-by: Jørgen Austvik <jaustvik@acm.org>
@andrewfg andrewfg deleted the hue-connection-improvements branch August 25, 2024 15:18
Labels
enhancement (An enhancement or new feature for an existing add-on), patch (A PR that has been cherry-picked to a patch release branch)
Development

Successfully merging this pull request may close these issues.

[hue] Error handling improvements
[hue] ApiException fails to reconnect bridge
3 participants