[remoteopenhab] Reliable SSE reconnection mechanism #9680

Closed
lolodomo opened this issue Jan 3, 2021 · 18 comments · Fixed by #10063
Labels
bug An unexpected problem or unintended behavior of an add-on

Comments

@lolodomo
Contributor

lolodomo commented Jan 3, 2021

A mechanism is already implemented to detect remote servers that are no longer reachable and to reconnect automatically when the remote server is reachable again.
Unfortunately, this mechanism is not yet 100% reliable: the binding sometimes fails to notice that the remote server was down for a while, in particular if the remote server was restarted on a fast machine. The SSE connection then does not fail with an error, but it no longer receives events once the remote server is alive again.

One option could be to restart the SSE connection frequently, but I think there is a risk of losing some events, and it seems a bit excessive just to cover a potential remote server restart (which almost never happens in normal operation).

Another option I just thought of would be to frequently open a new SSE connection and, once it is open, close the old one. That way, we are sure to re-establish the SSE link without losing events in normal situations. We would just have to avoid handling some events twice (once for each live SSE link). I will try to implement this solution.

If someone has a better option for detecting server disconnection on an SSE link, they are welcome to propose it.

PS: Ideally, the best solution would be for the openHAB server to send an ALIVE message to SSE clients. That way, any client could detect a dead server. This could be implemented in OH3, but of course it would be missing in OH2.x servers.
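
For the overlapping-connections option described above, the part that needs care is dropping events that arrive once on the old link and once on the new one. The following is only a minimal sketch of that idea, not the binding's code: the class name `RecentEventFilter`, the `topic`/`payload` key and the fixed capacity are all assumptions made for illustration. Note that keying on topic plus payload would also suppress a genuine repeat of the same event, so such a filter would only make sense during the short overlap window, or with a better key if the events carry one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical helper: remembers the most recent event keys so that an event
 * delivered by both the old and the new SSE connection is handled only once.
 */
final class RecentEventFilter {
    private final Map<String, Boolean> seen;

    RecentEventFilter(final int capacity) {
        // Insertion order (accessOrder = false): the oldest key is evicted first.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true the first time a key is offered, false for a remembered duplicate. */
    synchronized boolean firstSeen(String topic, String payload) {
        // Assumption: topic + payload is enough to identify an event.
        String key = topic + "\n" + payload;
        return seen.put(key, Boolean.TRUE) == null;
    }
}
```

An event callback shared by both connections would call `firstSeen(...)` and process the event only when it returns `true`.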

@lolodomo lolodomo added the bug An unexpected problem or unintended behavior of an add-on label Jan 3, 2021
@lolodomo lolodomo changed the title [remoteopenhab] Reliable reconnection mechanism with SSE [remoteopenhab] Reliable SSE reconnection mechanism Jan 3, 2021
@fwolter
Member

fwolter commented Feb 4, 2021

Couldn't you poll some state that is available in all OH instances?

@lolodomo
Contributor Author

lolodomo commented Feb 5, 2021

This is what is already done, but it is not reliable enough.
The problem is not there; the problem is that a broken SSE link is not detected as broken.

I have implemented my option 2 but have not yet created a PR. It is not fully finished and I am not fully satisfied with my code.

@fwolter
Member

fwolter commented Feb 5, 2021

OK, great that you have a solution! Maybe my thoughts were too simple. I meant that you poke the remote OH so that an event is sent via the SSE link; if it's not received in time, the connection is re-established.

@lolodomo
Contributor Author

lolodomo commented Feb 5, 2021

The JAX-RS SSE API allows registering an onComplete callback. And you know what? It is triggered when the remote server is stopped.
I can't believe it. How is it possible that I did not find that before!

So I have a reliable way to detect that the remote server was shut down, and so to open a new SSE connection when the remote server is alive again.

I still have to check whether this covers the case where the server or the client is just disconnected from the network.

@fwolter
Member

fwolter commented Feb 5, 2021

Great that I could help by accident :)

@lolodomo
Contributor Author

lolodomo commented Feb 5, 2021

In fact you did not help ;) but it made me think that I should test this feature!

More good news: if the server and the client cannot reach each other for a while because I disconnected my network cable, then when I reconnect the cable, the client receives events again without any additional code.

@lolodomo
Contributor Author

lolodomo commented Feb 5, 2021

Regarding the network cut, unfortunately it depends on the duration. If I cut the network for around 45 s, no problem: events come back after reconnection. If the network is cut for 3 minutes, events never arrive again after I reconnect the cable, and nothing tells me that this SSE connection is a phantom connection.
Maybe this is a connection setting, such as the connect timeout?

@lolodomo
Contributor Author

lolodomo commented Feb 5, 2021

When the network is cut for a certain amount of time (around 2 minutes in my case), it looks like the OH server fails to send data and the sink is closed.

2021-02-05 22:19:09.149 [DEBUG] [.openhab.core.io.rest.SseBroadcaster] - Sending event to sink failed
java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
        at org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:234) ~[bundleFile:9.4.20.v20190813]
        at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:227) ~[bundleFile:9.4.20.v20190813]
        at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:546) ~[bundleFile:9.4.20.v20190813]
        at java.io.OutputStream.write(OutputStream.java:122) ~[?:?]
        at org.openhab.core.io.rest.core.internal.GsonMessageBodyWriter.writeTo(GsonMessageBodyWriter.java:86) ~[?:?]
        at org.openhab.core.io.rest.core.internal.MediaTypeExtension.writeTo(MediaTypeExtension.java:84) ~[?:?]
        at org.apache.cxf.jaxrs.sse.OutboundSseEventBodyWriter.writePayloadTo(OutboundSseEventBodyWriter.java:133) ~[bundleFile:1.0.9]
        at org.apache.cxf.jaxrs.sse.OutboundSseEventBodyWriter.writeTo(OutboundSseEventBodyWriter.java:112) ~[bundleFile:1.0.9]
        at org.apache.cxf.jaxrs.sse.OutboundSseEventBodyWriter.writeTo(OutboundSseEventBodyWriter.java:40) ~[bundleFile:1.0.9]
        at org.apache.cxf.jaxrs.sse.SseEventSinkImpl.dequeue(SseEventSinkImpl.java:238) [bundleFile:1.0.9]
        at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1392) [bundleFile:9.4.20.v20190813]
        at org.eclipse.jetty.server.AsyncContextState$1.run(AsyncContextState.java:149) [bundleFile:9.4.20.v20190813]
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:782) [bundleFile:9.4.20.v20190813]
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:918) [bundleFile:9.4.20.v20190813]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
        at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:171) ~[bundleFile:9.4.20.v20190813]
        at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:113) ~[bundleFile:9.4.20.v20190813]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        ... 1 more
2021-02-05 22:19:09.174 [DEBUG] [.openhab.core.io.rest.SseBroadcaster] - Closing SSE event sink

So a timeout is reached on the server side because the client is not reading data, and as a consequence the OH server closes the SSE sink. But as the network is still cut, the client is never informed of this closure.

As you can see in the server stack trace, this is the Jetty idle timeout.

At least now, I know that short network cuts are handled transparently and correctly.

lolodomo added a commit to lolodomo/openhab-addons that referenced this issue Feb 5, 2021
when the remote server is alive again

Related to openhab#9680

Signed-off-by: Laurent Garnier <lg.hc@free.fr>
fwolter pushed a commit that referenced this issue Feb 6, 2021
#10060)

when the remote server is alive again

Related to #9680

Signed-off-by: Laurent Garnier <lg.hc@free.fr>
@fwolter
Member

fwolter commented Feb 6, 2021

#10060 works for me if the remote is shut down gracefully. "idleTimeout" sounds like client and server are exchanging data constantly. So, theoretically, the client should be able to detect the silence, too.

I'm wondering why the client doesn't detect that the TCP connection is closed. I'd assume that the client's OS should close the TCP connection after some time when the remote is gone.

@lolodomo
Contributor Author

lolodomo commented Feb 6, 2021

For me this is logical. In case of a long network failure (because I unplug the network cable for several minutes), the server is blocked and can no longer send data to the client; it detects this and decides to close the connection (an OH server decision). As the client is not reachable, it is not informed. When I plug the cable in again, the client is still waiting for events, but it will never receive any because the server has already closed this connection.

This morning I worked on adding an advanced setting to restart the SSE connection when there is no activity on the SSE link. During my tests, I discovered a new exception, not in SSE but when I run the first new HTTP request to the server after plugging my cable in again: EOFException! The next request is then OK. Maybe a retry mechanism is required.
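
The retry idea mentioned above could look roughly like the following. This is a hedged sketch only: `EofRetry` and its `call` method are hypothetical names, and it assumes the `EOFException` is thrown directly by the request rather than wrapped in another exception, which may not hold for a given HTTP client. Only requests that are safe to repeat (such as a GET) should be retried this way.

```java
import java.io.EOFException;
import java.util.concurrent.Callable;

/** Hypothetical helper: repeats a request when it fails with EOFException. */
final class EofRetry {
    private EofRetry() {
    }

    /**
     * Runs the request up to maxAttempts times, retrying only on EOFException.
     * Any other exception is propagated immediately.
     */
    static <T> T call(Callable<T> request, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        EOFException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (EOFException e) {
                // Typically a stale connection after the network came back; try again.
                last = e;
            }
        }
        throw last;
    }
}
```

With `maxAttempts` set to 2, the behaviour described in the comment (first request fails, the next one succeeds) would be hidden from the caller.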

@lolodomo
Contributor Author

lolodomo commented Feb 6, 2021

But the most important thing was the proper detection of the remote server shutdown or restart, and this case is now fixed ;)

@lolodomo
Contributor Author

lolodomo commented Feb 6, 2021

As I mentioned yesterday, if your network failure is short, something like 30 s for example, everything is OK.
The only remaining problem is the case of a long network failure (several minutes).

@fwolter
Member

fwolter commented Feb 6, 2021

to restart the SSE connection in case there is no activity on the SSE link

Are you going to generate some dummy traffic from the client, so that the connection is not closed if there is simply no data to be transmitted? I think you mentioned this case in an older post.

@fwolter
Member

fwolter commented Feb 6, 2021

In fact, if the client sends periodic keepalives you should be able to rely on the client's TCP stack to close the connection. I'd assume that onComplete is called, then.

@lolodomo
Contributor Author

lolodomo commented Feb 6, 2021

SSE works in one direction only, server to client. The client cannot send to the server.
But yes, the proper solution would be a keepalive message sent by the OH server to its clients, but this is not yet implemented and never will be in OH2.
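
To illustrate what a server-side keepalive could look like on the wire: in the SSE format, a line starting with a colon is a comment, and comment frames are commonly used as heartbeats because clients do not treat them as event data. The sketch below is hypothetical (`SseKeepAlive` is not part of openHAB or JAX-RS) and only shows the framing. Whether a given client library surfaces comment-only frames to application code would need to be verified.

```java
/** Hypothetical helpers for an SSE keepalive; not part of openHAB or JAX-RS. */
final class SseKeepAlive {
    private SseKeepAlive() {
    }

    /** Formats an SSE comment frame: a line starting with ':' followed by a blank line. */
    static String commentFrame(String text) {
        if (text.contains("\n") || text.contains("\r")) {
            throw new IllegalArgumentException("comment must be a single line");
        }
        return ":" + text + "\n\n";
    }

    /** True if a received line is an SSE comment, which carries no event data. */
    static boolean isComment(String line) {
        return line.startsWith(":");
    }
}
```

A server writing such a frame at a fixed interval would let a client treat prolonged silence as a dead link.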

lolodomo added a commit to lolodomo/openhab-addons that referenced this issue Feb 6, 2021
…ivity

Fix openhab#9680

Signed-off-by: Laurent Garnier <lg.hc@free.fr>
lolodomo added a commit to lolodomo/openhab-addons that referenced this issue Feb 6, 2021
…ivity

Fix openhab#9680

Signed-off-by: Laurent Garnier <lg.hc@free.fr>
@lolodomo
Contributor Author

lolodomo commented Feb 6, 2021

New setting implemented. Disabled by default. I recommend enabling it if your remote server generates regular events.
With no keepalive mechanism implemented on the server side, I am not able to distinguish normal inactivity from abnormal inactivity.

@fwolter
Member

fwolter commented Feb 6, 2021

What a shame that the client can't send anything... I'm wondering if you could utilize the TCP stack's keepalive ability by setting SO_KEEPALIVE. There's a great article explaining this: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/ under "Idle ESTAB is forever".

I peeked into the source and saw the flag: https://github.com/apache/cxf/search?q=keepalive
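
For reference, this is what enabling SO_KEEPALIVE looks like on a plain `java.net.Socket`. It is only a sketch at the socket level: the class name `KeepAliveSocketDemo` is made up, and it does not show how to pass the option through Apache CXF or the JAX-RS client, which is the open question here. The probe timing is decided by the operating system (on Linux the default idle time before the first probe is typically two hours), so the option alone would not give fast detection.

```java
import java.io.IOException;
import java.net.Socket;

/** Hypothetical demo: enabling TCP keepalive on a plain socket. */
final class KeepAliveSocketDemo {
    private KeepAliveSocketDemo() {
    }

    /** Creates an unconnected socket with SO_KEEPALIVE enabled. */
    static Socket newKeepAliveSocket() throws IOException {
        Socket socket = new Socket();
        // Ask the OS to probe an idle connection and report it as dead if the peer is gone.
        socket.setKeepAlive(true);
        return socket;
    }
}
```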

@fwolter
Member

fwolter commented Feb 6, 2021

I tested your code in #10063 successfully. Nevertheless, it would be nice if the OS could handle a dead TCP connection, but I didn't figure out how to tell Apache CXF to use SO_KEEPALIVE.

fwolter pushed a commit that referenced this issue Feb 6, 2021
#10063)

* [remoteopenhab] New setting to restart the SSE connection after inactivity

Fix #9680

Signed-off-by: Laurent Garnier <lg.hc@free.fr>

* Review comments: doc

Signed-off-by: Laurent Garnier <lg.hc@free.fr>
themillhousegroup pushed a commit to themillhousegroup/openhab2-addons that referenced this issue May 10, 2021
openhab#10060)

when the remote server is alive again

Related to openhab#9680

Signed-off-by: Laurent Garnier <lg.hc@free.fr>
Signed-off-by: John Marshall <john.marshall.au@gmail.com>
themillhousegroup pushed a commit to themillhousegroup/openhab2-addons that referenced this issue May 10, 2021
openhab#10063)

* [remoteopenhab] New setting to restart the SSE connection after inactivity

Fix openhab#9680

Signed-off-by: Laurent Garnier <lg.hc@free.fr>

* Review comments: doc

Signed-off-by: Laurent Garnier <lg.hc@free.fr>
Signed-off-by: John Marshall <john.marshall.au@gmail.com>
thinkingstone pushed a commit to thinkingstone/openhab-addons that referenced this issue Nov 7, 2021
openhab#10060)

when the remote server is alive again

Related to openhab#9680

Signed-off-by: Laurent Garnier <lg.hc@free.fr>
thinkingstone pushed a commit to thinkingstone/openhab-addons that referenced this issue Nov 7, 2021
openhab#10063)

* [remoteopenhab] New setting to restart the SSE connection after inactivity

Fix openhab#9680

Signed-off-by: Laurent Garnier <lg.hc@free.fr>

* Review comments: doc

Signed-off-by: Laurent Garnier <lg.hc@free.fr>
marcfischerboschio pushed a commit to bosch-io/openhab-addons that referenced this issue May 5, 2022
openhab#10060)

when the remote server is alive again

Related to openhab#9680

Signed-off-by: Laurent Garnier <lg.hc@free.fr>
marcfischerboschio pushed a commit to bosch-io/openhab-addons that referenced this issue May 5, 2022
openhab#10063)

* [remoteopenhab] New setting to restart the SSE connection after inactivity

Fix openhab#9680

Signed-off-by: Laurent Garnier <lg.hc@free.fr>

* Review comments: doc

Signed-off-by: Laurent Garnier <lg.hc@free.fr>