Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hue] ApiException fails to reconnect bridge #15350

Closed
andrewfg opened this issue Aug 1, 2023 · 10 comments · Fixed by #15477
Closed

[hue] ApiException fails to reconnect bridge #15350

andrewfg opened this issue Aug 1, 2023 · 10 comments · Fixed by #15477
Assignees
Labels
bug An unexpected problem or unintended behavior of an add-on

Comments

@andrewfg
Copy link
Contributor

andrewfg commented Aug 1, 2023

Problem

On GET and PUT requests, (for as yet unknown reasons), the Hue Bridge server sometimes returns an HTML payload instead of the expected XML payload. This causes an ApiException: Unexpected Content-Type: text/html. And this causes the bridge handler to try to reconnect to the server. It attempts to reconnect for RECONNECT_MAX_TRIES (5) using an exponential back off delay of 0, 1, 2, 4, 8 minutes. However even after the maximum number of attempts (a total of 15 minutes), the reconnection attempt fails, and the bridge goes permanently offline.

Work Around

The work around is to manually disable and then reenable the bridge thing.

Solution

The bridge thing should go online automatically.

@andrewfg andrewfg added the bug An unexpected problem or unintended behavior of an add-on label Aug 1, 2023
@andrewfg andrewfg changed the title [hue] An ApiException during discovery scan should cause bridge thing to reconnect [hue] ApiException: Unexpected Content-Type: text/html fails to reconnect bridge Aug 1, 2023
@andrewfg andrewfg changed the title [hue] ApiException: Unexpected Content-Type: text/html fails to reconnect bridge [hue] ApiException fails to reconnect bridge Aug 1, 2023
@andrewfg andrewfg self-assigned this Aug 2, 2023
@jlaur
Copy link
Contributor

jlaur commented Aug 6, 2023

This issue is probably the same as this one I just noticed when getting back home after a few days away and noticing some power management not working and the bridge being offline?

2023-08-01 04:20:44.601 [WARN ] [.hue.internal.connection.Clip2Bridge] - Fatal error SessionListenerAdapter GO_AWAY => closing session.
2023-08-02 16:02:15.310 [WARN ] [.hue.internal.connection.Clip2Bridge] - Fatal error SessionListenerAdapter GO_AWAY => closing session.
2023-08-02 21:32:34.078 [WARN ] [.hue.internal.connection.Clip2Bridge] - Fatal error SessionListenerAdapter GO_AWAY => closing session.
2023-08-02 22:02:27.068 [WARN ] [e.internal.handler.Clip2ThingHandler] - Command 'ON' for thing 'hue:device:home:hueplug1', channel 'hue:device:home:hueplug1:switch' failed with error 'Unexpected Content-Type: text/html'.
2023-08-02 22:02:28.083 [WARN ] [.hue.internal.connection.Clip2Bridge] - Fatal error SessionListenerAdapter RESET => closing session.

@andrewfg
Copy link
Contributor Author

andrewfg commented Aug 6, 2023

@jlaur I think the binding DID successfully restart after the 'normal' GO_AWAY errors. But it did NOT restart after the 'Unexpected Content Type' error.

As mentioned elsewhere it seems that overlapping PUT requests can lead to Unexpected Content Type so my other open PR should prevent the majority of such cases.

Nevertheless it seems that Unexpected Content Type can still be triggered by something else as well. Maybe bridge firmware updating?? I have been logging to debug for about one week, but still have not tracked down what is happening. So maybe you help with logging to debug too?

@andrewfg
Copy link
Contributor Author

andrewfg commented Aug 6, 2023

PS in case of Unexpected Content Type the handler restart code tries for 15 minutes and fails to restart. Whereas calling dispose() and initialize() does seem to fix it ('hammer' restart) .. but I am not sure why..

@jlaur
Copy link
Contributor

jlaur commented Aug 6, 2023

Nevertheless it seems that Unexpected Content Type can still be triggered by something else as well. Maybe bridge firmware updating??

I'm on the latest firmware, but I don't auto-update firmware, so in this scenario I experienced it wasn't caused by an ongoing firmware update.

I have been logging to debug for about one week, but still have not tracked down what is happening. So maybe you help with logging yo debug too?

Sure, I might have a look and try to add some tailored logging for this specific scenario. I'm a little hesitant to enable full debug logging in my production system since it would need to be enabled for long time period and it's quite verbose.

@jlaur
Copy link
Contributor

jlaur commented Aug 6, 2023

I added logging of http code and returned content just above this line:

So the next time it happens, at least I'll know that, which could give a hint. Unless you already know this?

@andrewfg
Copy link
Contributor Author

andrewfg commented Aug 6, 2023

Unless you already know this?

Well, I DO know that whenever the bridge returns HTML instead of JSON it means that it is more difficult to restart the connection than when any other error occurs. But what I do NOT know is a) why it returns HTML rather than JSON, and b) why it is more difficult to restart.

@andrewfg
Copy link
Contributor Author

Just for info: I was reading more about HTTP2 GO_AWAY; the Hue Bridge runs on Linux and it uses the ‘nginx’ web server; and it seems that by default the nginx server sends a GO_AWAY after 1000 GET or PUT requests.

The binding GET / PUT request count is roughly as in the following formula; so it is likely that the 1000 limit will occur every 1…2 days depending on your system size and load.

GET / PUT count ~= (quantity of things x 4) + (hours up time X 24) + (number of thing commands issued)

I shall run some more tests on my own system to try to confirm this 1000 GET / PUT hypothesis. And if it seems to be correct, then I shall modify the binding to keep count of the number of GET / PUT calls sent, and to softly reset the HTTP2 connection on the client side BEFORE the Hue Bridge encounters the 1000 call GO_AWAY limit. That would certainly very much reduce the occurrences of the bug that we are observing.

@jlaur
Copy link
Contributor

jlaur commented Aug 18, 2023

Just for info: I was reading more about HTTP2 GO_AWAY

Thanks. Your research made some new searches possible. If you didn't already read this, this seems interesting: https://stackoverflow.com/questions/55087292/how-to-handle-http-2-goaway-with-httpclient

Also confirmed that nginx has this default configuration (previously 100, now 1000).

Keep in mind that other clients can also connect to the bridge, so we should aim for correct handling rather than trying to predict when it will happen.

@andrewfg
Copy link
Contributor Author

Your research made some new searches possible

Actually it was the other way round. :) The post that you cited is among several that I had already googled before I made my own post above.

Keep in mind that other clients can also connect to the bridge

The 1000 limit would be the max number of streams that could be opened on any given HTTP2 session. So the presence of other clients is irrelevant in this context. Nevertheless when I make the next PR it will probably a) count the used streams and softly restart the session at (say) 90% of the 1000 limit, b) make the 1000 a config param (in case nginx changes in future), c) nevertheless process any eventual GO_AWAY, by d) checking its error code to see if it is a soft or a hard GO_AWAY and if the former we can be more soft on restart than if the latter.

PS I am currently running some tests on my operative system to do more detailed logging of stream counts and soft/hard GO_AWAY occurrences; I have a rule running that power cycles a test lamp every n seconds and logging to correlate GO_AWAY with PUT call stream count..

@andrewfg
Copy link
Contributor Author

andrewfg commented Aug 19, 2023

^
In the meantime I can confirm that the GO_AWAY does indeed occur every 1000 GET / PUT calls.


EDIT: actually it occurs when the 999th call is made, between receiving the GET/PUT call and sending the content response.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An unexpected problem or unintended behavior of an add-on
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants