-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[homematic] Improve (re)connect handling to Homematic gateways #11429
Conversation
I don't really like the idea of limiting the amount of reconnects. I want it to attempt to reconnect forever in case the connection is dropped. If my gateway crashes and I'm not at home and don't notice it right away then after some time a simple restart of the gateway won't help, did I understand that correctly? So in my opinion there should be a way to specify "forever" in some way, even if it's just as "65535 means retry forever". |
I think my description and probably the naming of the new options is a bit misleading. After the connection to the CCU was lost, the binding attempts to reconnect forever. This has not been changes. I have added a retry for the registration of callbacks. After the binding finds again a running CCU it immediately tries to register itself for events. For HM-RF components, this worked without problems. But for HmIP components this did not work because the HmIP daemon on the CCU needs much longer to start (on a CCU2 about 2 minutes). Therefore, I have added the retry for the registration of the callbacks. |
Is this ready to be reviewed? |
Yes, would be nice if you could have a look. |
throw (e); | ||
} | ||
try { | ||
Thread.sleep(waitTime); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ThingHandler's initialize()
method must return fast. As this code is called via that method, it must also return fast. Despite that, sleep should be avoided, because it blocks a thread in the schedulers thread pool. You could schedule another try via scheduler.scheduleWithFixedDelay()
, if sendMessage()
fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review and the tips.
The "fast return" is guaranteed in my opinion, because this method is called via scheduler.submit
:
Line 95 in e24d43a
initializeFuture = scheduler.submit(this::initializeInternal); |
I will replace the sleep
, but unfortunately this will require some refactoring because the current implementation depends on a "synchronous" execution of RpcClient#init
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fwolter I have changed the implementation and replaced the sleep and implemented it as you suggested.
} catch (InterruptedException e1) { | ||
// ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The InterruptedException
is thrown when OH is shut-down. That means the current operation should be finished as fast as possible and no further operations should be done. You can achieve this by not catching the exception here but throw the InterruptedException and catch it at the highest level you can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I replace the sleep
, the catch is no longer needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fwolter changed it
@@ -95,6 +95,18 @@ | |||
<advanced>true</advanced> | |||
<default>9292</default> | |||
</parameter> | |||
<parameter name="callbackRegistrationRetries" type="integer" min="1"> | |||
<label>Number of Callback Registration Retries </label> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Labels are expected to be as short as possible. Guideline is 2-3 words with max 20-25 chars. See https://www.openhab.org/docs/developer/bindings/thing-xml.html#formatting-labels-and-descriptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also there seems to be a space at the end that should be removed IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will find a shorter label name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed the configuration and this parameter has been removed
} | ||
}, INITIAL_CALLBACK_REG_DELAY, CALLBACK_REG_DELAY, TimeUnit.SECONDS); | ||
try { | ||
future.get(config.getCallbackRegTimeout(), TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get()
will block forever when called on a future returned by scheduleWithFixedDelay()
. Excerpt from the Javadoc:
Returns: a ScheduledFuture representing pending completion ofthe series of repeated tasks. The future's get() method will never return normally,and will throw an exception upon task cancellation orabnormal termination of a task execution.
This is called from the handler's initialize()
, thus initialize()
will block forever. Why not removing the entire try
block?
You need to cancel the task when the handler is disposed.
Are you registering the callbacks repeatedly by intention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is called from the handler's initialize(), thus initialize() will block forever. Why not removing the entire try block?
I have tried it, but without the blocking get, there can be up to 4 parallel requests to different ports (and services) on the Homematic gateway and the gateway simply does not like this. I have tried it this way and it did not work. In some tests I even had to hard reset the gateway because it was completely stuck.
The current implementation of the binding needs a more or less synchronous execution.
I am not happy with the solution, but with the current implementation, I really see no other way to implement it :-(.
This is called from the handler's initialize(), thus initialize()
As far as I can see the initialize in the Bridge handler (HomematicBridgeHandler#initialize) is not blocked and finishes fast (I have verified it in my tests).
Are you registering the callbacks repeatedly by intention?
The callback registration is executed after the binding was started or after the connection to the HM gateway was lost (e.g. if the gateway was restarted). In the first case, the first registration request will be successful and the scheduler is not used.
In the latter case, the binding performs a complete reconnect to the Homematic gateway. This includes a new registration of the callbacks. For each possible connection type (HM-RF, HmIP, Groups, CuxD) a separate callback must be registered because on the gateway, a service (daemon) is executed for each connection type listening on different ports. The HmIP service e.g. needs up to 2 min. until it is ready for callback registration requests. Therefore, the binding needs to perform these retries. There is no other way to figure out when a service on the gateway is ready to accept requests.
It is even worse. In the first seconds after the gateway restart you even don't get a correct reply to the XML RPC requests. Instead the gateway returns a HTML document with a message "please wait ...".
In my opinion, major parts of the binding's implementation need a complete overhaul. I would also be better to create a bridge for each of the possible connection types (would be a breaking change :-()
But such an overhaul requires time especially for testing.
You need to cancel the task when the handler is disposed.
OK, will check this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, major parts of the binding's implementation need a complete overhaul.
I agree on that one. Also the channel creation stuff that is "fixed" in another PR is kinda broken by design and fixing that properly would require huge changes. So now we seem to have 2 changes/fixes that would ideally be handled with a complete rewrite of most of the binding....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After looking into the code again, I saw that the Future cancels itself when the connection has been established successfully. So, initialize()
will block as long as the remote side is unavailable. The current implementation calls sendMessage()
only once (which is also bad in initialize()
).
So, if I see correctly your code will block initialize()
forever when the remote side is not available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fwolter A sendMessage
during the Bridge initialization it is not blocking the handler initialization, as you can see in HomematicBridgeHandler#initialize
:
public void initialize() {
synchronized (initDisposeLock) {
isDisposed = false;
initializeFuture = scheduler.submit(this::initializeInternal);
}
}
You can believe me, I have tested it (several times).
The binding works correctly. If the gateway is not accessible when the binding is started, all things are set to state "ERROR:BRIDGE". After the gateway comes up this is correctly detected and the thing's state changes to ONLINE.
I know, this solution is at least in parts "ugly as hell". But I have invested really hours of testing and trying to find a better solution without having to rewrite larger part of the binding. Now am I am really fed up with these connection problems and I for the next couple of weeks I don't have the time to work on it.
If you can't accept the PR in its current state, it is OK to me. But in this case, it will remain in this state for at least a month. Or someone else continues to work on this problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't notice that initialize()
wasn't the initialize()
from the ThingHandler, but from a custom class. Would be awesome if this code could be re-written sometime!
the connection of multiple OH installations with one CCU. The bridge id is not sufficient for this purpose because it is same in all OH installations. Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Martin Herbst <develop@mherbst.de>
Instead of configuring separate values for retry delays and number of retries only the maximum time for retries can be configured. The init method uses fixed delays. Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Martin Herbst <develop@mherbst.de>
- unregister callback not necessary if connection is lost - wait 30s until clients and servers are restarted to give the gateway some time to recover Signed-off-by: Martin Herbst <develop@mherbst.de>
Signed-off-by: Martin Herbst <develop@mherbst.de>
…ab#11429) * Use globally unique id for registration of callback to allow ... the connection of multiple OH installations with one CCU. The bridge id is not sufficient for this purpose because it is same in all OH installations. Signed-off-by: Martin Herbst <develop@mherbst.de> * Retry callback re-registration after connection is resumed Some services on the CCU need longer to start and are not available immediately after the connection to the CCU has been resumed. Improves the solution for openhab#8808 Fixes openhab#10439 Signed-off-by: Martin Herbst <develop@mherbst.de> * Description was missing. Signed-off-by: Martin Herbst <develop@mherbst.de> * Changed setting name and description to avoid confusion Signed-off-by: Martin Herbst <develop@mherbst.de> * Added a troubleshooting tip to solve a communication problem Signed-off-by: Martin Herbst <develop@mherbst.de> * Shortened the label name to follow the guide lines Signed-off-by: Martin Herbst <develop@mherbst.de> * Print more information about the reason for the failure Signed-off-by: Martin Herbst <develop@mherbst.de> * Using scheduler thread pool and simplified configuration Instead of configuring separate values for retry delays and number of retries only the maximum time for retries can be configured. The init method uses fixed delays. Signed-off-by: Martin Herbst <develop@mherbst.de> * Don't retry to send if gateway does not answer at all Signed-off-by: Martin Herbst <develop@mherbst.de> * Improved reconnect handling - unregister callback not necessary if connection is lost - wait 30s until clients and servers are restarted to give the gateway some time to recover Signed-off-by: Martin Herbst <develop@mherbst.de> * Spotless Signed-off-by: Martin Herbst <develop@mherbst.de> * Cancel an active future if the binding is stopped Signed-off-by: Martin Herbst <develop@mherbst.de> Signed-off-by: Nick Waterton <n.waterton@outlook.com>
…ab#11429) * Use globally unique id for registration of callback to allow ... the connection of multiple OH installations with one CCU. The bridge id is not sufficient for this purpose because it is same in all OH installations. Signed-off-by: Martin Herbst <develop@mherbst.de> * Retry callback re-registration after connection is resumed Some services on the CCU need longer to start and are not available immediately after the connection to the CCU has been resumed. Improves the solution for openhab#8808 Fixes openhab#10439 Signed-off-by: Martin Herbst <develop@mherbst.de> * Description was missing. Signed-off-by: Martin Herbst <develop@mherbst.de> * Changed setting name and description to avoid confusion Signed-off-by: Martin Herbst <develop@mherbst.de> * Added a troubleshooting tip to solve a communication problem Signed-off-by: Martin Herbst <develop@mherbst.de> * Shortened the label name to follow the guide lines Signed-off-by: Martin Herbst <develop@mherbst.de> * Print more information about the reason for the failure Signed-off-by: Martin Herbst <develop@mherbst.de> * Using scheduler thread pool and simplified configuration Instead of configuring separate values for retry delays and number of retries only the maximum time for retries can be configured. The init method uses fixed delays. Signed-off-by: Martin Herbst <develop@mherbst.de> * Don't retry to send if gateway does not answer at all Signed-off-by: Martin Herbst <develop@mherbst.de> * Improved reconnect handling - unregister callback not necessary if connection is lost - wait 30s until clients and servers are restarted to give the gateway some time to recover Signed-off-by: Martin Herbst <develop@mherbst.de> * Spotless Signed-off-by: Martin Herbst <develop@mherbst.de> * Cancel an active future if the binding is stopped Signed-off-by: Martin Herbst <develop@mherbst.de> Signed-off-by: Michael Schmidt <mi.schmidt.83@gmail.com>
…ab#11429) * Use globally unique id for registration of callback to allow ... the connection of multiple OH installations with one CCU. The bridge id is not sufficient for this purpose because it is same in all OH installations. Signed-off-by: Martin Herbst <develop@mherbst.de> * Retry callback re-registration after connection is resumed Some services on the CCU need longer to start and are not available immediately after the connection to the CCU has been resumed. Improves the solution for openhab#8808 Fixes openhab#10439 Signed-off-by: Martin Herbst <develop@mherbst.de> * Description was missing. Signed-off-by: Martin Herbst <develop@mherbst.de> * Changed setting name and description to avoid confusion Signed-off-by: Martin Herbst <develop@mherbst.de> * Added a troubleshooting tip to solve a communication problem Signed-off-by: Martin Herbst <develop@mherbst.de> * Shortened the label name to follow the guide lines Signed-off-by: Martin Herbst <develop@mherbst.de> * Print more information about the reason for the failure Signed-off-by: Martin Herbst <develop@mherbst.de> * Using scheduler thread pool and simplified configuration Instead of configuring separate values for retry delays and number of retries only the maximum time for retries can be configured. The init method uses fixed delays. Signed-off-by: Martin Herbst <develop@mherbst.de> * Don't retry to send if gateway does not answer at all Signed-off-by: Martin Herbst <develop@mherbst.de> * Improved reconnect handling - unregister callback not necessary if connection is lost - wait 30s until clients and servers are restarted to give the gateway some time to recover Signed-off-by: Martin Herbst <develop@mherbst.de> * Spotless Signed-off-by: Martin Herbst <develop@mherbst.de> * Cancel an active future if the binding is stopped Signed-off-by: Martin Herbst <develop@mherbst.de>
…ab#11429) * Use globally unique id for registration of callback to allow ... the connection of multiple OH installations with one CCU. The bridge id is not sufficient for this purpose because it is same in all OH installations. Signed-off-by: Martin Herbst <develop@mherbst.de> * Retry callback re-registration after connection is resumed Some services on the CCU need longer to start and are not available immediately after the connection to the CCU has been resumed. Improves the solution for openhab#8808 Fixes openhab#10439 Signed-off-by: Martin Herbst <develop@mherbst.de> * Description was missing. Signed-off-by: Martin Herbst <develop@mherbst.de> * Changed setting name and description to avoid confusion Signed-off-by: Martin Herbst <develop@mherbst.de> * Added a troubleshooting tip to solve a communication problem Signed-off-by: Martin Herbst <develop@mherbst.de> * Shortened the label name to follow the guide lines Signed-off-by: Martin Herbst <develop@mherbst.de> * Print more information about the reason for the failure Signed-off-by: Martin Herbst <develop@mherbst.de> * Using scheduler thread pool and simplified configuration Instead of configuring separate values for retry delays and number of retries only the maximum time for retries can be configured. The init method uses fixed delays. Signed-off-by: Martin Herbst <develop@mherbst.de> * Don't retry to send if gateway does not answer at all Signed-off-by: Martin Herbst <develop@mherbst.de> * Improved reconnect handling - unregister callback not necessary if connection is lost - wait 30s until clients and servers are restarted to give the gateway some time to recover Signed-off-by: Martin Herbst <develop@mherbst.de> * Spotless Signed-off-by: Martin Herbst <develop@mherbst.de> * Cancel an active future if the binding is stopped Signed-off-by: Martin Herbst <develop@mherbst.de>
…ab#11429) * Use globally unique id for registration of callback to allow ... the connection of multiple OH installations with one CCU. The bridge id is not sufficient for this purpose because it is same in all OH installations. Signed-off-by: Martin Herbst <develop@mherbst.de> * Retry callback re-registration after connection is resumed Some services on the CCU need longer to start and are not available immediately after the connection to the CCU has been resumed. Improves the solution for openhab#8808 Fixes openhab#10439 Signed-off-by: Martin Herbst <develop@mherbst.de> * Description was missing. Signed-off-by: Martin Herbst <develop@mherbst.de> * Changed setting name and description to avoid confusion Signed-off-by: Martin Herbst <develop@mherbst.de> * Added a troubleshooting tip to solve a communication problem Signed-off-by: Martin Herbst <develop@mherbst.de> * Shortened the label name to follow the guide lines Signed-off-by: Martin Herbst <develop@mherbst.de> * Print more information about the reason for the failure Signed-off-by: Martin Herbst <develop@mherbst.de> * Using scheduler thread pool and simplified configuration Instead of configuring separate values for retry delays and number of retries only the maximum time for retries can be configured. The init method uses fixed delays. Signed-off-by: Martin Herbst <develop@mherbst.de> * Don't retry to send if gateway does not answer at all Signed-off-by: Martin Herbst <develop@mherbst.de> * Improved reconnect handling - unregister callback not necessary if connection is lost - wait 30s until clients and servers are restarted to give the gateway some time to recover Signed-off-by: Martin Herbst <develop@mherbst.de> * Spotless Signed-off-by: Martin Herbst <develop@mherbst.de> * Cancel an active future if the binding is stopped Signed-off-by: Martin Herbst <develop@mherbst.de>
…ab#11429) * Use globally unique id for registration of callback to allow ... the connection of multiple OH installations with one CCU. The bridge id is not sufficient for this purpose because it is same in all OH installations. Signed-off-by: Martin Herbst <develop@mherbst.de> * Retry callback re-registration after connection is resumed Some services on the CCU need longer to start and are not available immediately after the connection to the CCU has been resumed. Improves the solution for openhab#8808 Fixes openhab#10439 Signed-off-by: Martin Herbst <develop@mherbst.de> * Description was missing. Signed-off-by: Martin Herbst <develop@mherbst.de> * Changed setting name and description to avoid confusion Signed-off-by: Martin Herbst <develop@mherbst.de> * Added a troubleshooting tip to solve a communication problem Signed-off-by: Martin Herbst <develop@mherbst.de> * Shortened the label name to follow the guide lines Signed-off-by: Martin Herbst <develop@mherbst.de> * Print more information about the reason for the failure Signed-off-by: Martin Herbst <develop@mherbst.de> * Using scheduler thread pool and simplified configuration Instead of configuring separate values for retry delays and number of retries only the maximum time for retries can be configured. The init method uses fixed delays. Signed-off-by: Martin Herbst <develop@mherbst.de> * Don't retry to send if gateway does not answer at all Signed-off-by: Martin Herbst <develop@mherbst.de> * Improved reconnect handling - unregister callback not necessary if connection is lost - wait 30s until clients and servers are restarted to give the gateway some time to recover Signed-off-by: Martin Herbst <develop@mherbst.de> * Spotless Signed-off-by: Martin Herbst <develop@mherbst.de> * Cancel an active future if the binding is stopped Signed-off-by: Martin Herbst <develop@mherbst.de> Signed-off-by: Andras Uhrin <andras.uhrin@gmail.com>
The PR improves the connect and especially the re-connect handling to Homematic gateways through the following measures:
Signed-off-by: Martin Herbst develop@mherbst.de