Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[homematic] Improve (re)connect handling to Homematic gateways #11429

Merged
merged 12 commits into from
Dec 11, 2021
Merged

[homematic] Improve (re)connect handling to Homematic gateways #11429

merged 12 commits into from
Dec 11, 2021

Conversation

MHerbst
Copy link
Contributor

@MHerbst MHerbst commented Oct 23, 2021

The PR improves the connect and especially the re-connect handling to Homematic gateways through the following measures:

  1. by using a global unique id for each bridge installation it is possible to use two or more OH installation with the same Homematic gateway in parallel.
  2. the HmIP daemon normally need much longer than the HM daemon until it accepts event registration. Therefore a configurable retry mechanismn has been implemented.

Signed-off-by: Martin Herbst develop@mherbst.de

@MHerbst MHerbst requested a review from gerrieg as a code owner October 23, 2021 14:48
@Flole998
Copy link
Member

I don't really like the idea of limiting the amount of reconnects. I want it to attempt to reconnect forever in case the connection is dropped. If my gateway crashes and I'm not at home and don't notice it right away then after some time a simple restart of the gateway won't help, did I understand that correctly? So in my opinion there should be a way to specify "forever" in some way, even if it's just as "65535 means retry forever".

@MHerbst
Copy link
Contributor Author

MHerbst commented Oct 31, 2021

I think my description and probably the naming of the new options is a bit misleading. After the connection to the CCU was lost, the binding attempts to reconnect forever. This has not been changes. I have added a retry for the registration of callbacks. After the binding finds again a running CCU it immediately tries to register itself for events. For HM-RF components, this worked without problems. But for HmIP components this did not work because the HmIP daemon on the CCU needs much longer to start (on a CCU2 about 2 minutes). Therefore, I have added the retry for the registration of the callbacks.
I will improve the naming of the options and the description to make this clearer. I don't think that it makes sense to retry the callback registration forever. If it does not work within some minutes, it will no work until the CCU is restarted.

@fwolter
Copy link
Member

fwolter commented Nov 20, 2021

Is this ready to be reviewed?

@MHerbst
Copy link
Contributor Author

MHerbst commented Nov 20, 2021

Yes, would be nice if you could have a look.

throw (e);
}
try {
Thread.sleep(waitTime);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ThingHandler's initialize() method must return fast. As this code is called via that method, it must also return fast. Despite that, sleep should be avoided, because it blocks a thread in the schedulers thread pool. You could schedule another try via scheduler.scheduleWithFixedDelay(), if sendMessage() fails.

Copy link
Contributor Author

@MHerbst MHerbst Nov 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review and the tips.
The "fast return" is guaranteed in my opinion, because this method is called via scheduler.submit:

I will replace the sleep, but unfortunately this will require some refactoring because the current implementation depends on a "synchronous" execution of RpcClient#init.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fwolter I have changed the implementation and replaced the sleep and implemented it as you suggested.

Comment on lines 112 to 113
} catch (InterruptedException e1) {
// ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The InterruptedException is thrown when OH is shut-down. That means the current operation should be finished as fast as possible and no further operations should be done. You can achieve this by not catching the exception here but throw the InterruptedException and catch it at the highest level you can.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I replace the sleep, the catch is no longer needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fwolter changed it

@@ -95,6 +95,18 @@
<advanced>true</advanced>
<default>9292</default>
</parameter>
<parameter name="callbackRegistrationRetries" type="integer" min="1">
<label>Number of Callback Registration Retries </label>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Labels are expected to be as short as possible. Guideline is 2-3 words with max 20-25 chars. See https://www.openhab.org/docs/developer/bindings/thing-xml.html#formatting-labels-and-descriptions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also there seems to be a space at the end that should be removed IMO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will find a shorter label name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the configuration and this parameter has been removed

@wborn wborn added the enhancement An enhancement or new feature for an existing add-on label Nov 27, 2021
}
}, INITIAL_CALLBACK_REG_DELAY, CALLBACK_REG_DELAY, TimeUnit.SECONDS);
try {
future.get(config.getCallbackRegTimeout(), TimeUnit.SECONDS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get() will block forever when called on a future returned by scheduleWithFixedDelay(). Excerpt from the Javadoc:

Returns: a ScheduledFuture representing pending completion ofthe series of repeated tasks. The future's get() method will never return normally,and will throw an exception upon task cancellation orabnormal termination of a task execution.

This is called from the handler's initialize(), thus initialize() will block forever. Why not removing the entire try block?

You need to cancel the task when the handler is disposed.

Are you registering the callbacks repeatedly by intention?

Copy link
Contributor Author

@MHerbst MHerbst Dec 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called from the handler's initialize(), thus initialize() will block forever. Why not removing the entire try block?
I have tried it, but without the blocking get, there can be up to 4 parallel requests to different ports (and services) on the Homematic gateway and the gateway simply does not like this. I have tried it this way and it did not work. In some tests I even had to hard reset the gateway because it was completely stuck.

The current implementation of the binding needs a more or less synchronous execution.
I am not happy with the solution, but with the current implementation, I really see no other way to implement it :-(.

This is called from the handler's initialize(), thus initialize()

As far as I can see the initialize in the Bridge handler (HomematicBridgeHandler#initialize) is not blocked and finishes fast (I have verified it in my tests).

Are you registering the callbacks repeatedly by intention?

The callback registration is executed after the binding was started or after the connection to the HM gateway was lost (e.g. if the gateway was restarted). In the first case, the first registration request will be successful and the scheduler is not used.

In the latter case, the binding performs a complete reconnect to the Homematic gateway. This includes a new registration of the callbacks. For each possible connection type (HM-RF, HmIP, Groups, CuxD) a separate callback must be registered because on the gateway, a service (daemon) is executed for each connection type listening on different ports. The HmIP service e.g. needs up to 2 min. until it is ready for callback registration requests. Therefore, the binding needs to perform these retries. There is no other way to figure out when a service on the gateway is ready to accept requests.

It is even worse. In the first seconds after the gateway restart you even don't get a correct reply to the XML RPC requests. Instead the gateway returns a HTML document with a message "please wait ...".

In my opinion, major parts of the binding's implementation need a complete overhaul. I would also be better to create a bridge for each of the possible connection types (would be a breaking change :-()
But such an overhaul requires time especially for testing.

You need to cancel the task when the handler is disposed.

OK, will check this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, major parts of the binding's implementation need a complete overhaul.

I agree on that one. Also the channel creation stuff that is "fixed" in another PR is kinda broken by design and fixing that properly would require huge changes. So now we seem to have 2 changes/fixes that would ideally be handled with a complete rewrite of most of the binding....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking into the code again, I saw that the Future cancels itself when the connection has been established successfully. So, initialize() will block as long as the remote side is unavailable. The current implementation calls sendMessage() only once (which is also bad in initialize()).

So, if I see correctly your code will block initialize() forever when the remote side is not available.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fwolter A sendMessage during the Bridge initialization it is not blocking the handler initialization, as you can see in HomematicBridgeHandler#initialize:

    public void initialize() {
        synchronized (initDisposeLock) {
            isDisposed = false;
            initializeFuture = scheduler.submit(this::initializeInternal);
        }
    }

You can believe me, I have tested it (several times).

The binding works correctly. If the gateway is not accessible when the binding is started, all things are set to state "ERROR:BRIDGE". After the gateway comes up this is correctly detected and the thing's state changes to ONLINE.

I know, this solution is at least in parts "ugly as hell". But I have invested really hours of testing and trying to find a better solution without having to rewrite larger part of the binding. Now am I am really fed up with these connection problems and I for the next couple of weeks I don't have the time to work on it.

If you can't accept the PR in its current state, it is OK to me. But in this case, it will remain in this state for at least a month. Or someone else continues to work on this problems.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't notice that initialize() wasn't the initialize() from the ThingHandler, but from a custom class. Would be awesome if this code could be re-written sometime!

the connection of multiple OH installations with one CCU.

The bridge id is not sufficient for this purpose because it is same in
all OH installations.


Signed-off-by: Martin Herbst <develop@mherbst.de>
Some services on the CCU need longer to start and are not available
immediately after the connection to the CCU has been resumed.

Improves the solution for #8808

Fixes #10439

Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Martin Herbst <develop@mherbst.de>
Instead of configuring separate values for retry delays and number of
retries only the maximum time for retries can be configured.
The init method uses fixed delays.

Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Martin Herbst <develop@mherbst.de>
- unregister callback not necessary if connection is lost
- wait 30s until clients and servers are restarted to give the gateway
some time to recover

Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Martin Herbst <develop@mherbst.de>
@fwolter fwolter merged commit adecb53 into openhab:main Dec 11, 2021
@fwolter fwolter added this to the 3.2 milestone Dec 11, 2021
NickWaterton pushed a commit to NickWaterton/openhab-addons that referenced this pull request Dec 30, 2021
…ab#11429)

* Use globally unique id for registration of callback to allow ...

the connection of multiple OH installations with one CCU.

The bridge id is not sufficient for this purpose because it is same in
all OH installations.


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Retry callback re-registration after connection is resumed

Some services on the CCU need longer to start and are not available
immediately after the connection to the CCU has been resumed.

Improves the solution for openhab#8808

Fixes openhab#10439

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Description was missing.

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Changed setting name and description to avoid confusion


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Added a troubleshooting tip to solve a communication problem


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Shortened the label name to follow the guide lines

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Print more information about the reason for the failure

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Using scheduler thread pool and simplified configuration

Instead of configuring separate values for retry delays and number of
retries only the maximum time for retries can be configured.
The init method uses fixed delays.

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Don't retry to send if gateway does not answer at all

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Improved reconnect handling

- unregister callback not necessary if connection is lost
- wait 30s until clients and servers are restarted to give the gateway
some time to recover

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Spotless

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Cancel an active future if the binding is stopped

Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Nick Waterton <n.waterton@outlook.com>
mischmidt83 pushed a commit to mischmidt83/openhab-addons that referenced this pull request Jan 9, 2022
…ab#11429)

* Use globally unique id for registration of callback to allow ...

the connection of multiple OH installations with one CCU.

The bridge id is not sufficient for this purpose because it is same in
all OH installations.


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Retry callback re-registration after connection is resumed

Some services on the CCU need longer to start and are not available
immediately after the connection to the CCU has been resumed.

Improves the solution for openhab#8808

Fixes openhab#10439

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Description was missing.

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Changed setting name and description to avoid confusion


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Added a troubleshooting tip to solve a communication problem


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Shortened the label name to follow the guide lines

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Print more information about the reason for the failure

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Using scheduler thread pool and simplified configuration

Instead of configuring separate values for retry delays and number of
retries only the maximum time for retries can be configured.
The init method uses fixed delays.

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Don't retry to send if gateway does not answer at all

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Improved reconnect handling

- unregister callback not necessary if connection is lost
- wait 30s until clients and servers are restarted to give the gateway
some time to recover

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Spotless

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Cancel an active future if the binding is stopped

Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Michael Schmidt <mi.schmidt.83@gmail.com>
nemerdaud pushed a commit to nemerdaud/openhab-addons that referenced this pull request Jan 28, 2022
…ab#11429)

* Use globally unique id for registration of callback to allow ...

the connection of multiple OH installations with one CCU.

The bridge id is not sufficient for this purpose because it is same in
all OH installations.


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Retry callback re-registration after connection is resumed

Some services on the CCU need longer to start and are not available
immediately after the connection to the CCU has been resumed.

Improves the solution for openhab#8808

Fixes openhab#10439

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Description was missing.

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Changed setting name and description to avoid confusion


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Added a troubleshooting tip to solve a communication problem


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Shortened the label name to follow the guide lines

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Print more information about the reason for the failure

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Using scheduler thread pool and simplified configuration

Instead of configuring separate values for retry delays and number of
retries only the maximum time for retries can be configured.
The init method uses fixed delays.

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Don't retry to send if gateway does not answer at all

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Improved reconnect handling

- unregister callback not necessary if connection is lost
- wait 30s until clients and servers are restarted to give the gateway
some time to recover

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Spotless

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Cancel an active future if the binding is stopped

Signed-off-by: Martin Herbst <develop@mherbst.de>
marcfischerboschio pushed a commit to bosch-io/openhab-addons that referenced this pull request May 5, 2022
…ab#11429)

* Use globally unique id for registration of callback to allow ...

the connection of multiple OH installations with one CCU.

The bridge id is not sufficient for this purpose because it is same in
all OH installations.


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Retry callback re-registration after connection is resumed

Some services on the CCU need longer to start and are not available
immediately after the connection to the CCU has been resumed.

Improves the solution for openhab#8808

Fixes openhab#10439

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Description was missing.

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Changed setting name and description to avoid confusion


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Added a troubleshooting tip to solve a communication problem


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Shortened the label name to follow the guide lines

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Print more information about the reason for the failure

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Using scheduler thread pool and simplified configuration

Instead of configuring separate values for retry delays and number of
retries only the maximum time for retries can be configured.
The init method uses fixed delays.

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Don't retry to send if gateway does not answer at all

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Improved reconnect handling

- unregister callback not necessary if connection is lost
- wait 30s until clients and servers are restarted to give the gateway
some time to recover

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Spotless

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Cancel an active future if the binding is stopped

Signed-off-by: Martin Herbst <develop@mherbst.de>
andan67 pushed a commit to andan67/openhab-addons that referenced this pull request Nov 6, 2022
…ab#11429)

* Use globally unique id for registration of callback to allow ...

the connection of multiple OH installations with one CCU.

The bridge id is not sufficient for this purpose because it is same in
all OH installations.


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Retry callback re-registration after connection is resumed

Some services on the CCU need longer to start and are not available
immediately after the connection to the CCU has been resumed.

Improves the solution for openhab#8808

Fixes openhab#10439

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Description was missing.

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Changed setting name and description to avoid confusion


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Added a troubleshooting tip to solve a communication problem


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Shortened the label name to follow the guide lines

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Print more information about the reason for the failure

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Using scheduler thread pool and simplified configuration

Instead of configuring separate values for retry delays and number of
retries only the maximum time for retries can be configured.
The init method uses fixed delays.

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Don't retry to send if gateway does not answer at all

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Improved reconnect handling

- unregister callback not necessary if connection is lost
- wait 30s until clients and servers are restarted to give the gateway
some time to recover

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Spotless

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Cancel an active future if the binding is stopped

Signed-off-by: Martin Herbst <develop@mherbst.de>
andrasU pushed a commit to andrasU/openhab-addons that referenced this pull request Nov 12, 2022
…ab#11429)

* Use globally unique id for registration of callback to allow ...

the connection of multiple OH installations with one CCU.

The bridge id is not sufficient for this purpose because it is same in
all OH installations.


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Retry callback re-registration after connection is resumed

Some services on the CCU need longer to start and are not available
immediately after the connection to the CCU has been resumed.

Improves the solution for openhab#8808

Fixes openhab#10439

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Description was missing.

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Changed setting name and description to avoid confusion


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Added a troubleshooting tip to solve a communication problem


Signed-off-by: Martin Herbst <develop@mherbst.de>

* Shortened the label name to follow the guide lines

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Print more information about the reason for the failure

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Using scheduler thread pool and simplified configuration

Instead of configuring separate values for retry delays and number of
retries only the maximum time for retries can be configured.
The init method uses fixed delays.

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Don't retry to send if gateway does not answer at all

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Improved reconnect handling

- unregister callback not necessary if connection is lost
- wait 30s until clients and servers are restarted to give the gateway
some time to recover

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Spotless

Signed-off-by: Martin Herbst <develop@mherbst.de>

* Cancel an active future if the binding is stopped

Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Andras Uhrin <andras.uhrin@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement An enhancement or new feature for an existing add-on
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants