[remoteopenhab] Reliable SSE reconnection mechanism #9680

lolodomo · 2021-01-03T18:53:56Z

A mechanism is already implemented to detect remote servers that are no longer reachable and to reconnect automatically when the remote server is reachable again.
Unfortunately, this mechanism is not yet 100% reliable: the binding can sometimes fail to notice that the remote server was down for a while, in particular if the remote server was restarted on a fast machine. In that case the SSE connection does not fail with an error, but it receives no new events once the remote server is alive again.

One option could be to restart the SSE connection frequently, but I think there is a risk of losing some events, and it seems excessive just to guard against a potential remote server restart (which almost never happens in normal operation).

Another option I just thought of would be to open a new SSE connection at regular intervals and, once it is open, close the old one. That way we are sure to re-establish the SSE link without losing events in normal situations. We would just have to avoid handling some events twice (once for each live SSE link). I will try to implement this solution.
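A minimal sketch of the de-duplication this option would need, not the binding's actual code. It assumes each event can be reduced to a string key (for example topic plus payload); the class `OverlapDeduplicator`, its method names, and the connection ids are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Drops the second copy of an event while two SSE connections overlap. */
final class OverlapDeduplicator {
    private final Set<String> open = new HashSet<>();
    // Per connection: event keys it delivered that no other connection has matched yet.
    private final Map<String, Map<String, Integer>> unmatched = new HashMap<>();

    synchronized void connectionOpened(String connectionId) {
        open.add(connectionId);
    }

    synchronized void connectionClosed(String connectionId) {
        open.remove(connectionId);
        unmatched.remove(connectionId);
        if (open.size() < 2) {
            unmatched.clear(); // no overlap left, nothing to de-duplicate
        }
    }

    /** Returns true if the event should be handled, false if another link already delivered it. */
    synchronized boolean accept(String connectionId, String eventKey) {
        if (open.size() < 2) {
            return true; // single link: every event is new
        }
        for (Map.Entry<String, Map<String, Integer>> other : unmatched.entrySet()) {
            if (other.getKey().equals(connectionId)) {
                continue;
            }
            Integer count = other.getValue().get(eventKey);
            if (count != null) {
                // Already seen on the other link: consume one pending copy and drop this one.
                if (count == 1) {
                    other.getValue().remove(eventKey);
                } else {
                    other.getValue().put(eventKey, count - 1);
                }
                return false;
            }
        }
        unmatched.computeIfAbsent(connectionId, id -> new HashMap<>()).merge(eventKey, 1, Integer::sum);
        return true;
    }
}
```

Counting pending copies per connection, rather than remembering keys globally, means two genuinely identical events sent one after the other are each handled once instead of the second being swallowed.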

If someone has a better way to detect server disconnection on an SSE link, they are welcome to propose it.

PS: Ideally, the openHAB server would send an ALIVE message to SSE clients, so that any client could detect a dead server. This could be implemented in OH3, but of course it would be missing in OH2.x servers.

fwolter · 2021-02-04T22:25:12Z

Couldn't you poll some state that is available in all OH instances?

lolodomo · 2021-02-05T08:12:10Z

This is what is already done, but it is not reliable enough.
That is not where the problem lies; the problem is that a broken SSE link is not detected.

I have implemented my option 2 but have not created a PR yet. It is not fully finished and I am not fully satisfied with my code.

fwolter · 2021-02-05T08:31:19Z

OK, great that you have a solution! Maybe my thoughts were too simple. I meant that you trigger something on the remote OH so that an event is sent over the SSE link; if it is not received in time, the connection is re-established.

lolodomo · 2021-02-05T18:24:33Z

The JAX-RS SSE API allows registering an onComplete callback. And you know what? It is triggered when the remote server is stopped.
I can't believe it. How is it possible that I did not find that before!

So I have a reliable way to detect that the remote server was shut down, and so I can open a new SSE connection when the remote server is alive again.
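As a rough illustration of the idea, here is a sketch that reopens the stream from the completion callback. To keep it runnable without a JAX-RS dependency it uses a hypothetical stand-in interface, `EventStream`, whose `register` method only mirrors the three-callback shape (onEvent, onError, onComplete) of `javax.ws.rs.sse.SseEventSource`; `ReconnectingListener` is likewise an invented name, and none of this is the binding's code.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical stand-in for an SSE client; it only mirrors the three-callback register shape. */
interface EventStream {
    void register(Consumer<String> onEvent, Consumer<Throwable> onError, Runnable onComplete);
    void open();
    void close();
}

/** Opens a fresh stream each time the current one completes or fails. */
final class ReconnectingListener {
    private final Supplier<EventStream> factory;
    private final Consumer<String> handler;
    private EventStream current;
    private int reconnects;

    ReconnectingListener(Supplier<EventStream> factory, Consumer<String> handler) {
        this.factory = factory;
        this.handler = handler;
    }

    synchronized void start() {
        EventStream stream = factory.get();
        // onComplete is the callback reported above to fire when the remote server stops.
        stream.register(handler, error -> restart(stream), () -> restart(stream));
        current = stream;
        stream.open();
    }

    private synchronized void restart(EventStream ended) {
        if (ended != current) {
            return; // callback from a stream that was already replaced
        }
        ended.close();
        reconnects++;
        start();
    }

    synchronized int reconnectCount() {
        return reconnects;
    }
}
```

The sketch reopens immediately; a real client would more likely wait until the remote server answers again before opening the new stream.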

I still have to check whether this covers the case where the server or the client is just disconnected from the network.

fwolter · 2021-02-05T18:49:47Z

Great that I could help by accident :)

lolodomo · 2021-02-05T20:05:22Z

In fact you did not help ;) but it made me think that I should test this feature!

More good news: if the server and the client cannot reach each other for a certain time because I disconnected my network cable, then when I reconnect the cable, events are received by the client again without any additional code.

lolodomo · 2021-02-05T20:21:16Z

Regarding the network cut, unfortunately it depends on the duration. If I cut the network for around 45 s, no problem: events come back after reconnection. If the network is cut for 3 minutes, events never come back when I reconnect the cable, and nothing tells me that this SSE connection is a phantom connection.
Maybe this is a connection setting, the connect timeout?

lolodomo · 2021-02-05T21:43:27Z

When the network is cut for a certain amount of time (around 2 minutes in my case), it looks like the OH server fails to send data and the sink is closed.

2021-02-05 22:19:09.149 [DEBUG] [.openhab.core.io.rest.SseBroadcaster] - Sending event to sink failed
java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
        at org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:234) ~[bundleFile:9.4.20.v20190813]
        at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:227) ~[bundleFile:9.4.20.v20190813]
        at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:546) ~[bundleFile:9.4.20.v20190813]
        at java.io.OutputStream.write(OutputStream.java:122) ~[?:?]
        at org.openhab.core.io.rest.core.internal.GsonMessageBodyWriter.writeTo(GsonMessageBodyWriter.java:86) ~[?:?]
        at org.openhab.core.io.rest.core.internal.MediaTypeExtension.writeTo(MediaTypeExtension.java:84) ~[?:?]
        at org.apache.cxf.jaxrs.sse.OutboundSseEventBodyWriter.writePayloadTo(OutboundSseEventBodyWriter.java:133) ~[bundleFile:1.0.9]
        at org.apache.cxf.jaxrs.sse.OutboundSseEventBodyWriter.writeTo(OutboundSseEventBodyWriter.java:112) ~[bundleFile:1.0.9]
        at org.apache.cxf.jaxrs.sse.OutboundSseEventBodyWriter.writeTo(OutboundSseEventBodyWriter.java:40) ~[bundleFile:1.0.9]
        at org.apache.cxf.jaxrs.sse.SseEventSinkImpl.dequeue(SseEventSinkImpl.java:238) [bundleFile:1.0.9]
        at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1392) [bundleFile:9.4.20.v20190813]
        at org.eclipse.jetty.server.AsyncContextState$1.run(AsyncContextState.java:149) [bundleFile:9.4.20.v20190813]
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:782) [bundleFile:9.4.20.v20190813]
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:918) [bundleFile:9.4.20.v20190813]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
        at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:171) ~[bundleFile:9.4.20.v20190813]
        at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:113) ~[bundleFile:9.4.20.v20190813]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        ... 1 more
2021-02-05 22:19:09.174 [DEBUG] [.openhab.core.io.rest.SseBroadcaster] - Closing SSE event sink

So a timeout is reached on the server side because the client is not reading data, and as a consequence the OH server closes the SSE sink. But as the network is still cut, the client is never informed of this closure.

As you can see in the server stack trace, this is the Jetty idle timeout.

At least now, I know that short network cuts are handled transparently and correctly.


fwolter · 2021-02-06T10:00:53Z

#10060 works for me if the remote is shut down gracefully. "idleTimeout" sounds like client and server are exchanging data constantly. So, theoretically, the client should be able to detect the silence, too.

I'm wondering why the client doesn't detect that the TCP connection is closed. I'd assume that the client's OS should close the TCP connection after some time when the remote is gone.

lolodomo · 2021-02-06T10:15:14Z

For me this is logical. In case of a long network failure (because I unplug the network cable for several minutes), the server is blocked and can no longer send data to the client; it detects this and decides to close the connection (an OH server decision). As the client is not reachable, it is not informed. When I plug the cable in again, the client is still waiting for events, but it will never receive any because the server has already closed this connection.

I worked this morning to add an advanced setting to restart the SSE connection in case there is no activity on the SSE link. During my tests, I discovered a new exception, not in SSE but when I run the first new HTTP request to the server after I plug my cable in again: EOFException! The next request is then OK. Maybe a retry mechanism is required.
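A retry of that kind could look roughly like the sketch below. It is only an illustration under the assumption that a single immediate retry is enough, as observed above; `RetryOnce` and `IoCall` are hypothetical names, and the request itself is left abstract.

```java
import java.io.EOFException;
import java.io.IOException;

final class RetryOnce {
    /** A request that may fail with an I/O error. */
    @FunctionalInterface
    interface IoCall<T> {
        T call() throws IOException;
    }

    /** Runs the request and repeats it exactly once if the first attempt ends with EOFException. */
    static <T> T withOneRetry(IoCall<T> request) throws IOException {
        try {
            return request.call();
        } catch (EOFException firstFailure) {
            // First request after the outage failed; the next one was observed to succeed.
            return request.call();
        }
    }
}
```

Only `EOFException` is retried, so other I/O errors, and a second `EOFException`, still reach the caller.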

lolodomo · 2021-02-06T10:17:23Z

But the most important thing was the proper detection of a remote server shutdown or restart, and this case is now fixed ;)

lolodomo · 2021-02-06T10:19:25Z

As I mentioned yesterday, if your network failure is short, something like 30 s for example, everything is OK.
The only remaining problem is the case of a long network failure (several minutes).

fwolter · 2021-02-06T10:47:41Z

to restart the SSE connection in case there is no activity on the SSE link

Are you going to have the client generate some dummy traffic, so that the connection is not closed when there is simply no data to be transmitted? I think you mentioned this case in an earlier post.

fwolter · 2021-02-06T10:51:00Z

In fact, if the client sends periodic keepalives you should be able to rely on the client's TCP stack to close the connection. I'd assume that onComplete is called, then.

lolodomo · 2021-02-06T11:01:01Z

SSE is one-directional, server to client. The client cannot send to the server.
But yes, the proper solution would be a keepalive message sent by the OH server to its clients, but this is not yet implemented and never will be in OH2.

lolodomo · 2021-02-06T11:23:36Z

New setting implemented. Disabled by default. I recommend enabling it if your remote server generates regular events.
With no keepalive mechanism implemented on the server side, I am not able to distinguish normal inactivity from abnormal inactivity.
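The behaviour of such a setting can be sketched as a small watchdog. This is an assumption-laden illustration, not the code from the PR: `InactivityWatchdog`, its parameters, and the injected clock are all hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Triggers a restart action when no event has been seen for a configured time. */
final class InactivityWatchdog {
    private final long maxSilenceMillis;
    private final LongSupplier clock;
    private final Runnable restartAction;
    private volatile long lastEventMillis;

    InactivityWatchdog(long maxSilenceMillis, LongSupplier clock, Runnable restartAction) {
        this.maxSilenceMillis = maxSilenceMillis;
        this.clock = clock;
        this.restartAction = restartAction;
        this.lastEventMillis = clock.getAsLong();
    }

    /** Call from the SSE event handler for every received event. */
    void eventReceived() {
        lastEventMillis = clock.getAsLong();
    }

    /** Restarts the connection if the link was silent for too long; returns true if it did. */
    boolean check() {
        long now = clock.getAsLong();
        if (now - lastEventMillis < maxSilenceMillis) {
            return false;
        }
        lastEventMillis = now; // give the new connection a full window
        restartAction.run();
        return true;
    }

    /** Runs check() periodically on a background thread; the caller shuts the executor down. */
    ScheduledExecutorService schedule(long periodMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::check, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

Because the watchdog cannot tell a quiet server from a dead link, it only makes sense when the remote server is expected to emit events more often than `maxSilenceMillis`.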

fwolter · 2021-02-06T11:42:08Z

What a shame that the client can't send anything... I'm wondering if you could utilize the TCP stack's keepalive ability by setting SO_KEEPALIVE. There's a great article explaining this: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/ under "Idle ESTAB is forever".

I peeked into the source and saw the flag: https://github.com/apache/cxf/search?q=keepalive
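For reference, this is what requesting SO_KEEPALIVE looks like on a plain `java.net.Socket`. It is only a sketch of the socket option itself; it does not show how to make Apache CXF's HTTP client use it, and the helper name `KeepAliveSockets` is made up.

```java
import java.io.IOException;
import java.net.Socket;

final class KeepAliveSockets {
    /** Creates an unconnected socket with SO_KEEPALIVE requested. */
    static Socket newKeepAliveSocket() throws IOException {
        Socket socket = new Socket();
        // Asks the OS to probe the peer when the connection has been idle for a while.
        socket.setKeepAlive(true);
        return socket;
    }
}
```

When and how often the probes are sent is decided by the operating system; on Linux the default idle time before the first probe is typically two hours unless it has been tuned, so the option alone would not detect a dead link quickly.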

fwolter · 2021-02-06T14:16:37Z

I tested your code in #10063 successfully. Nevertheless, it would be nice if the OS could handle a dead TCP connection, but I didn't figure out how to tell Apache CXF to use SO_KEEPALIVE.


lolodomo added the bug An unexpected problem or unintended behavior of an add-on label Jan 3, 2021

lolodomo changed the title ~~[remoteopenhab] Reliable reconnection mechanism with SSE~~ [remoteopenhab] Reliable SSE reconnection mechanism Jan 3, 2021

lolodomo added a commit to lolodomo/openhab-addons that referenced this issue Feb 5, 2021

[remoteopenhab] Detect a remote server shutdown and reconnect properly

83d32a3

when the remote server is alive again Related to openhab#9680 Signed-off-by: Laurent Garnier <lg.hc@free.fr>

lolodomo mentioned this issue Feb 5, 2021

[remoteopenhab] Detect a remote server shutdown and reconnect properly #10060

Merged

fwolter pushed a commit that referenced this issue Feb 6, 2021

[remoteopenhab] Detect a remote server shutdown and reconnect properly (

3e7ecbf

#10060) when the remote server is alive again Related to #9680 Signed-off-by: Laurent Garnier <lg.hc@free.fr>

lolodomo added a commit to lolodomo/openhab-addons that referenced this issue Feb 6, 2021

[remoteopenhab] New setting to restart the SSE connection after inact…

832b8d3

…ivity Fix openhab#9680 Signed-off-by: Laurent Garnier <lg.hc@free.fr>

lolodomo added a commit to lolodomo/openhab-addons that referenced this issue Feb 6, 2021

[remoteopenhab] New setting to restart the SSE connection after inact…

a615d5d

…ivity Fix openhab#9680 Signed-off-by: Laurent Garnier <lg.hc@free.fr>

lolodomo mentioned this issue Feb 6, 2021

[remoteopenhab] New setting to restart the SSE connection after inact… #10063

Merged

fwolter closed this as completed in #10063 Feb 6, 2021
