[homematic] Improve (re)connect handling to Homematic gateways #11429

MHerbst · 2021-10-23T14:48:07Z

The PR improves the connect and especially the re-connect handling to Homematic gateways through the following measures:

by using a global unique id for each bridge installation it is possible to use two or more OH installation with the same Homematic gateway in parallel.
the HmIP daemon normally need much longer than the HM daemon until it accepts event registration. Therefore a configurable retry mechanismn has been implemented.

Signed-off-by: Martin Herbst develop@mherbst.de

Flole998 · 2021-10-30T23:07:38Z

I don't really like the idea of limiting the amount of reconnects. I want it to attempt to reconnect forever in case the connection is dropped. If my gateway crashes and I'm not at home and don't notice it right away then after some time a simple restart of the gateway won't help, did I understand that correctly? So in my opinion there should be a way to specify "forever" in some way, even if it's just as "65535 means retry forever".

MHerbst · 2021-10-31T09:04:30Z

I think my description and probably the naming of the new options is a bit misleading. After the connection to the CCU was lost, the binding attempts to reconnect forever. This has not been changes. I have added a retry for the registration of callbacks. After the binding finds again a running CCU it immediately tries to register itself for events. For HM-RF components, this worked without problems. But for HmIP components this did not work because the HmIP daemon on the CCU needs much longer to start (on a CCU2 about 2 minutes). Therefore, I have added the retry for the registration of the callbacks.
I will improve the naming of the options and the description to make this clearer. I don't think that it makes sense to retry the callback registration forever. If it does not work within some minutes, it will no work until the CCU is restarted.

fwolter · 2021-11-20T11:47:32Z

Is this ready to be reviewed?

MHerbst · 2021-11-20T12:27:42Z

Yes, would be nice if you could have a look.

fwolter · 2021-11-21T14:58:15Z

...atic/src/main/java/org/openhab/binding/homematic/internal/communicator/client/RpcClient.java

+                    throw (e);
+                }
+                try {
+                    Thread.sleep(waitTime);


The ThingHandler's initialize() method must return fast. As this code is called via that method, it must also return fast. Despite that, sleep should be avoided, because it blocks a thread in the schedulers thread pool. You could schedule another try via scheduler.scheduleWithFixedDelay(), if sendMessage() fails.

Thanks for the review and the tips.
The "fast return" is guaranteed in my opinion, because this method is called via scheduler.submit:

openhab-addons/bundles/org.openhab.binding.homematic/src/main/java/org/openhab/binding/homematic/internal/handler/HomematicBridgeHandler.java

Line 95 in e24d43a

initializeFuture = scheduler.submit(this::initializeInternal);

I will replace the sleep, but unfortunately this will require some refactoring because the current implementation depends on a "synchronous" execution of RpcClient#init.

@fwolter I have changed the implementation and replaced the sleep and implemented it as you suggested.

fwolter · 2021-11-21T14:58:48Z

...atic/src/main/java/org/openhab/binding/homematic/internal/communicator/client/RpcClient.java

+                } catch (InterruptedException e1) {
+                    // ignore


The InterruptedException is thrown when OH is shut-down. That means the current operation should be finished as fast as possible and no further operations should be done. You can achieve this by not catching the exception here but throw the InterruptedException and catch it at the highest level you can.

If I replace the sleep, the catch is no longer needed

@fwolter changed it

fwolter · 2021-11-21T14:59:23Z

bundles/org.openhab.binding.homematic/src/main/resources/OH-INF/thing/bridge.xml

@@ -95,6 +95,18 @@
 				<advanced>true</advanced>
 				<default>9292</default>
 			</parameter>
+			<parameter name="callbackRegistrationRetries" type="integer" min="1">
+				<label>Number of Callback Registration Retries </label>


Labels are expected to be as short as possible. Guideline is 2-3 words with max 20-25 chars. See https://www.openhab.org/docs/developer/bindings/thing-xml.html#formatting-labels-and-descriptions

Also there seems to be a space at the end that should be removed IMO

OK, will find a shorter label name.

I have changed the configuration and this parameter has been removed

fwolter · 2021-12-11T11:25:01Z

...atic/src/main/java/org/openhab/binding/homematic/internal/communicator/client/RpcClient.java

+                }
+            }, INITIAL_CALLBACK_REG_DELAY, CALLBACK_REG_DELAY, TimeUnit.SECONDS);
+            try {
+                future.get(config.getCallbackRegTimeout(), TimeUnit.SECONDS);


get() will block forever when called on a future returned by scheduleWithFixedDelay(). Excerpt from the Javadoc:

Returns: a ScheduledFuture representing pending completion ofthe series of repeated tasks. The future's get() method will never return normally,and will throw an exception upon task cancellation orabnormal termination of a task execution.

This is called from the handler's initialize(), thus initialize() will block forever. Why not removing the entire try block?

You need to cancel the task when the handler is disposed.

Are you registering the callbacks repeatedly by intention?

This is called from the handler's initialize(), thus initialize() will block forever. Why not removing the entire try block?
I have tried it, but without the blocking get, there can be up to 4 parallel requests to different ports (and services) on the Homematic gateway and the gateway simply does not like this. I have tried it this way and it did not work. In some tests I even had to hard reset the gateway because it was completely stuck.

The current implementation of the binding needs a more or less synchronous execution.
I am not happy with the solution, but with the current implementation, I really see no other way to implement it :-(.

This is called from the handler's initialize(), thus initialize()

As far as I can see the initialize in the Bridge handler (HomematicBridgeHandler#initialize) is not blocked and finishes fast (I have verified it in my tests).

Are you registering the callbacks repeatedly by intention?

The callback registration is executed after the binding was started or after the connection to the HM gateway was lost (e.g. if the gateway was restarted). In the first case, the first registration request will be successful and the scheduler is not used.

In the latter case, the binding performs a complete reconnect to the Homematic gateway. This includes a new registration of the callbacks. For each possible connection type (HM-RF, HmIP, Groups, CuxD) a separate callback must be registered because on the gateway, a service (daemon) is executed for each connection type listening on different ports. The HmIP service e.g. needs up to 2 min. until it is ready for callback registration requests. Therefore, the binding needs to perform these retries. There is no other way to figure out when a service on the gateway is ready to accept requests.

It is even worse. In the first seconds after the gateway restart you even don't get a correct reply to the XML RPC requests. Instead the gateway returns a HTML document with a message "please wait ...".

In my opinion, major parts of the binding's implementation need a complete overhaul. I would also be better to create a bridge for each of the possible connection types (would be a breaking change :-()
But such an overhaul requires time especially for testing.

You need to cancel the task when the handler is disposed.

OK, will check this.

In my opinion, major parts of the binding's implementation need a complete overhaul.

I agree on that one. Also the channel creation stuff that is "fixed" in another PR is kinda broken by design and fixing that properly would require huge changes. So now we seem to have 2 changes/fixes that would ideally be handled with a complete rewrite of most of the binding....

After looking into the code again, I saw that the Future cancels itself when the connection has been established successfully. So, initialize() will block as long as the remote side is unavailable. The current implementation calls sendMessage() only once (which is also bad in initialize()).

So, if I see correctly your code will block initialize() forever when the remote side is not available.

@fwolter A sendMessage during the Bridge initialization it is not blocking the handler initialization, as you can see in HomematicBridgeHandler#initialize:

public void initialize() { synchronized (initDisposeLock) { isDisposed = false; initializeFuture = scheduler.submit(this::initializeInternal); } }

You can believe me, I have tested it (several times).

The binding works correctly. If the gateway is not accessible when the binding is started, all things are set to state "ERROR:BRIDGE". After the gateway comes up this is correctly detected and the thing's state changes to ONLINE.

I know, this solution is at least in parts "ugly as hell". But I have invested really hours of testing and trying to find a better solution without having to rewrite larger part of the binding. Now am I am really fed up with these connection problems and I for the next couple of weeks I don't have the time to work on it.

If you can't accept the PR in its current state, it is OK to me. But in this case, it will remain in this state for at least a month. Or someone else continues to work on this problems.

I didn't notice that initialize() wasn't the initialize() from the ThingHandler, but from a custom class. Would be awesome if this code could be re-written sometime!

the connection of multiple OH installations with one CCU. The bridge id is not sufficient for this purpose because it is same in all OH installations. Signed-off-by: Martin Herbst <develop@mherbst.de>

Some services on the CCU need longer to start and are not available immediately after the connection to the CCU has been resumed. Improves the solution for #8808 Fixes #10439 Signed-off-by: Martin Herbst <develop@mherbst.de>

Signed-off-by: Martin Herbst <develop@mherbst.de>

Instead of configuring separate values for retry delays and number of retries only the maximum time for retries can be configured. The init method uses fixed delays. Signed-off-by: Martin Herbst <develop@mherbst.de>

Signed-off-by: Martin Herbst <develop@mherbst.de>

- unregister callback not necessary if connection is lost - wait 30s until clients and servers are restarted to give the gateway some time to recover Signed-off-by: Martin Herbst <develop@mherbst.de>

Signed-off-by: Martin Herbst <develop@mherbst.de>

…ab#11429) * Use globally unique id for registration of callback to allow ... the connection of multiple OH installations with one CCU. The bridge id is not sufficient for this purpose because it is same in all OH installations. Signed-off-by: Martin Herbst <develop@mherbst.de> * Retry callback re-registration after connection is resumed Some services on the CCU need longer to start and are not available immediately after the connection to the CCU has been resumed. Improves the solution for openhab#8808 Fixes openhab#10439 Signed-off-by: Martin Herbst <develop@mherbst.de> * Description was missing. Signed-off-by: Martin Herbst <develop@mherbst.de> * Changed setting name and description to avoid confusion Signed-off-by: Martin Herbst <develop@mherbst.de> * Added a troubleshooting tip to solve a communication problem Signed-off-by: Martin Herbst <develop@mherbst.de> * Shortened the label name to follow the guide lines Signed-off-by: Martin Herbst <develop@mherbst.de> * Print more information about the reason for the failure Signed-off-by: Martin Herbst <develop@mherbst.de> * Using scheduler thread pool and simplified configuration Instead of configuring separate values for retry delays and number of retries only the maximum time for retries can be configured. The init method uses fixed delays. Signed-off-by: Martin Herbst <develop@mherbst.de> * Don't retry to send if gateway does not answer at all Signed-off-by: Martin Herbst <develop@mherbst.de> * Improved reconnect handling - unregister callback not necessary if connection is lost - wait 30s until clients and servers are restarted to give the gateway some time to recover Signed-off-by: Martin Herbst <develop@mherbst.de> * Spotless Signed-off-by: Martin Herbst <develop@mherbst.de> * Cancel an active future if the binding is stopped Signed-off-by: Martin Herbst <develop@mherbst.de> Signed-off-by: Nick Waterton <n.waterton@outlook.com>

…ab#11429) * Use globally unique id for registration of callback to allow ... the connection of multiple OH installations with one CCU. The bridge id is not sufficient for this purpose because it is same in all OH installations. Signed-off-by: Martin Herbst <develop@mherbst.de> * Retry callback re-registration after connection is resumed Some services on the CCU need longer to start and are not available immediately after the connection to the CCU has been resumed. Improves the solution for openhab#8808 Fixes openhab#10439 Signed-off-by: Martin Herbst <develop@mherbst.de> * Description was missing. Signed-off-by: Martin Herbst <develop@mherbst.de> * Changed setting name and description to avoid confusion Signed-off-by: Martin Herbst <develop@mherbst.de> * Added a troubleshooting tip to solve a communication problem Signed-off-by: Martin Herbst <develop@mherbst.de> * Shortened the label name to follow the guide lines Signed-off-by: Martin Herbst <develop@mherbst.de> * Print more information about the reason for the failure Signed-off-by: Martin Herbst <develop@mherbst.de> * Using scheduler thread pool and simplified configuration Instead of configuring separate values for retry delays and number of retries only the maximum time for retries can be configured. The init method uses fixed delays. Signed-off-by: Martin Herbst <develop@mherbst.de> * Don't retry to send if gateway does not answer at all Signed-off-by: Martin Herbst <develop@mherbst.de> * Improved reconnect handling - unregister callback not necessary if connection is lost - wait 30s until clients and servers are restarted to give the gateway some time to recover Signed-off-by: Martin Herbst <develop@mherbst.de> * Spotless Signed-off-by: Martin Herbst <develop@mherbst.de> * Cancel an active future if the binding is stopped Signed-off-by: Martin Herbst <develop@mherbst.de> Signed-off-by: Michael Schmidt <mi.schmidt.83@gmail.com>

…ab#11429) * Use globally unique id for registration of callback to allow ... the connection of multiple OH installations with one CCU. The bridge id is not sufficient for this purpose because it is same in all OH installations. Signed-off-by: Martin Herbst <develop@mherbst.de> * Retry callback re-registration after connection is resumed Some services on the CCU need longer to start and are not available immediately after the connection to the CCU has been resumed. Improves the solution for openhab#8808 Fixes openhab#10439 Signed-off-by: Martin Herbst <develop@mherbst.de> * Description was missing. Signed-off-by: Martin Herbst <develop@mherbst.de> * Changed setting name and description to avoid confusion Signed-off-by: Martin Herbst <develop@mherbst.de> * Added a troubleshooting tip to solve a communication problem Signed-off-by: Martin Herbst <develop@mherbst.de> * Shortened the label name to follow the guide lines Signed-off-by: Martin Herbst <develop@mherbst.de> * Print more information about the reason for the failure Signed-off-by: Martin Herbst <develop@mherbst.de> * Using scheduler thread pool and simplified configuration Instead of configuring separate values for retry delays and number of retries only the maximum time for retries can be configured. The init method uses fixed delays. Signed-off-by: Martin Herbst <develop@mherbst.de> * Don't retry to send if gateway does not answer at all Signed-off-by: Martin Herbst <develop@mherbst.de> * Improved reconnect handling - unregister callback not necessary if connection is lost - wait 30s until clients and servers are restarted to give the gateway some time to recover Signed-off-by: Martin Herbst <develop@mherbst.de> * Spotless Signed-off-by: Martin Herbst <develop@mherbst.de> * Cancel an active future if the binding is stopped Signed-off-by: Martin Herbst <develop@mherbst.de>

…ab#11429) * Use globally unique id for registration of callback to allow ... the connection of multiple OH installations with one CCU. The bridge id is not sufficient for this purpose because it is same in all OH installations. Signed-off-by: Martin Herbst <develop@mherbst.de> * Retry callback re-registration after connection is resumed Some services on the CCU need longer to start and are not available immediately after the connection to the CCU has been resumed. Improves the solution for openhab#8808 Fixes openhab#10439 Signed-off-by: Martin Herbst <develop@mherbst.de> * Description was missing. Signed-off-by: Martin Herbst <develop@mherbst.de> * Changed setting name and description to avoid confusion Signed-off-by: Martin Herbst <develop@mherbst.de> * Added a troubleshooting tip to solve a communication problem Signed-off-by: Martin Herbst <develop@mherbst.de> * Shortened the label name to follow the guide lines Signed-off-by: Martin Herbst <develop@mherbst.de> * Print more information about the reason for the failure Signed-off-by: Martin Herbst <develop@mherbst.de> * Using scheduler thread pool and simplified configuration Instead of configuring separate values for retry delays and number of retries only the maximum time for retries can be configured. The init method uses fixed delays. Signed-off-by: Martin Herbst <develop@mherbst.de> * Don't retry to send if gateway does not answer at all Signed-off-by: Martin Herbst <develop@mherbst.de> * Improved reconnect handling - unregister callback not necessary if connection is lost - wait 30s until clients and servers are restarted to give the gateway some time to recover Signed-off-by: Martin Herbst <develop@mherbst.de> * Spotless Signed-off-by: Martin Herbst <develop@mherbst.de> * Cancel an active future if the binding is stopped Signed-off-by: Martin Herbst <develop@mherbst.de> Signed-off-by: Andras Uhrin <andras.uhrin@gmail.com>

MHerbst requested a review from gerrieg as a code owner October 23, 2021 14:48

fwolter requested changes Nov 21, 2021

View reviewed changes

wborn added the enhancement An enhancement or new feature for an existing add-on label Nov 27, 2021

MHerbst mentioned this pull request Dec 4, 2021

[homematic] Automatic reconnection after CCU to OpenHAB connection interruption #11427

Open

fwolter requested changes Dec 11, 2021

View reviewed changes

MHerbst added 12 commits December 11, 2021 15:18

Use globally unique id for registration of callback to allow ...

f3765b1

the connection of multiple OH installations with one CCU. The bridge id is not sufficient for this purpose because it is same in all OH installations. Signed-off-by: Martin Herbst <develop@mherbst.de>

Retry callback re-registration after connection is resumed

f147aa9

Some services on the CCU need longer to start and are not available immediately after the connection to the CCU has been resumed. Improves the solution for #8808 Fixes #10439 Signed-off-by: Martin Herbst <develop@mherbst.de>

Description was missing.

f28ba2f

Signed-off-by: Martin Herbst <develop@mherbst.de>

Changed setting name and description to avoid confusion

d5f46c2

Signed-off-by: Martin Herbst <develop@mherbst.de>

Added a troubleshooting tip to solve a communication problem

9faec99

Signed-off-by: Martin Herbst <develop@mherbst.de>

Shortened the label name to follow the guide lines

3c9150b

Signed-off-by: Martin Herbst <develop@mherbst.de>

Print more information about the reason for the failure

30f3d0e

Signed-off-by: Martin Herbst <develop@mherbst.de>

Using scheduler thread pool and simplified configuration

98a2d76

Instead of configuring separate values for retry delays and number of retries only the maximum time for retries can be configured. The init method uses fixed delays. Signed-off-by: Martin Herbst <develop@mherbst.de>

Don't retry to send if gateway does not answer at all

68bd4a0

Signed-off-by: Martin Herbst <develop@mherbst.de>

Improved reconnect handling

671231a

- unregister callback not necessary if connection is lost - wait 30s until clients and servers are restarted to give the gateway some time to recover Signed-off-by: Martin Herbst <develop@mherbst.de>

Spotless

4d63703

Signed-off-by: Martin Herbst <develop@mherbst.de>

Cancel an active future if the binding is stopped

0a9fe99

Signed-off-by: Martin Herbst <develop@mherbst.de>

fwolter approved these changes Dec 11, 2021

View reviewed changes

fwolter merged commit adecb53 into openhab:main Dec 11, 2021

fwolter added this to the 3.2 milestone Dec 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[homematic] Improve (re)connect handling to Homematic gateways #11429

[homematic] Improve (re)connect handling to Homematic gateways #11429

MHerbst commented Oct 23, 2021

Flole998 commented Oct 30, 2021

MHerbst commented Oct 31, 2021 •

edited

Loading

fwolter commented Nov 20, 2021

MHerbst commented Nov 20, 2021

fwolter Nov 21, 2021

MHerbst Nov 21, 2021 •

edited

Loading

MHerbst Dec 4, 2021

fwolter Nov 21, 2021

MHerbst Nov 21, 2021

MHerbst Dec 4, 2021

fwolter Nov 21, 2021

Flole998 Nov 21, 2021

MHerbst Nov 21, 2021

MHerbst Dec 4, 2021

fwolter Dec 11, 2021

MHerbst Dec 11, 2021 •

edited

Loading

Flole998 Dec 11, 2021

fwolter Dec 11, 2021

MHerbst Dec 11, 2021

fwolter Dec 11, 2021

[homematic] Improve (re)connect handling to Homematic gateways #11429

[homematic] Improve (re)connect handling to Homematic gateways #11429

Conversation

MHerbst commented Oct 23, 2021

Flole998 commented Oct 30, 2021

MHerbst commented Oct 31, 2021 • edited Loading

fwolter commented Nov 20, 2021

MHerbst commented Nov 20, 2021

Choose a reason for hiding this comment

MHerbst Nov 21, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MHerbst Dec 11, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MHerbst commented Oct 31, 2021 •

edited

Loading

MHerbst Nov 21, 2021 •

edited

Loading

MHerbst Dec 11, 2021 •

edited

Loading