Improve command retry #12

d3m3trius · 2021-09-30T10:31:14Z

At the moment the integration query the washer every minute and if it does not get a reply it makes several (max_tries=10?) connection attempts. This behaviors is not good when the device is not available, for example because it is off or in sleep.

For example, in my case, the machine goes in sleep mode (wifi is disconnected) after 1-2 minutes it complete the washing cycle. This results in flooding the log with rows like this:

2021-09-22 08:57:05 INFO (MainThread) [backoff] Backing off status_with_retry(...) for 2.7s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 192.168.1.45:80 ssl:default [Connect call failed ('192.168.1.45', 80)])

In few days my home assistant log has grow to several megabytes with 30k+ errors as above.

--

Another related problem experienced is that the retry mechanism does not really avoid connection problem. For example, in the following chart, it is possible to see that the machine became unavailable several time during the washing cycles.

This "unavailable time" is exactly 1 minute, the time between a query and the next one.
So the retry mechanism does not actually improve the connection reliability.

--

Possible improvements:

use another reply mechanism or function, e.g. make a single retry attempt after a certain time (10-30 seconds?), or just give up and wait the normal 1 minute cycle.
avoid logging all failed attempts; if necessary, just log the first the time the device became unavailable
if the device last query was "active", do not mark it unavailable the first time it does not reply, give it a "grace period" (this maybe not necessary if the retry mechanism is improved to be really effective)
if the device is disconnected, avoid additional retry

Thanks for the great job you are doing with the integration.

ofalvai · 2021-10-02T19:11:05Z

Thank you for the detailed post, I agree with most points, to be fair I am not happy with the current implementation either.

avoid logging all failed attempts; if necessary, just log the first the time the device became unavailable

This is simple to fix, the backoff library that I use for retries has its own logger. I'll fix this by only logging retries if debug logging is enabled for the component.

For the other points, I'm not sure what would be the ideal solution. First, I added a retry strategy to the requests because my washing machine was rejecting requests too, for no apparent reason. But as you can see, even with this retry mechanism sometimes the device becomes unavailable during a wash cycle. I still don't understand why this happens, I'm afraid I need to tweak the retry logic a bit.

At the moment the integration polls the device every minute (plus a few extra retries if a request fails), and I don't necessarily want to make this less frequent because the delayed status updates wouldn't be a good UX. I also thought of doing exponentially increasing refresh intervals. This would make less unnecessary polls when the device is truly offline, but the interval would be so big after a few days of being turned off (which is not uncommon for a washing machine) that when it would be finally turned on the next update would have a huge delay.

I think the ultimate solution will be something similar to what you mentioned as a "grace period". I'd make the updater "stateful" and store the last successful request's timestamp. Based on this, the updater would poll more frequently initially, but after a certain amount of time passed since the last update, it would switch to a less frequent interval.

rogerlz · 2021-11-16T11:58:16Z

not sure if it helps, but I noticed that the http server (mine is a wine cooler) only supports one connection, so:

1 - every connection must be closed before trying to open another one
modifying the ClientSession to use force_close might improve this, but I haven't tested it yet

aiohttp.ClientSession(connector=aiohttp.TCPConnector(force_close=True))

2 - if you are using the app at the same time, it will consume the only connection allowed

ofalvai · 2021-11-19T08:12:34Z

Thank you @rogerlz for the suggestion, I will try running this modification locally for a few days to see how stable it is.

ofalvai · 2021-11-19T17:23:06Z

Creating my own ClientSession instead of calling async_get_clientsession() prints this error to the logs every time the update fails (practically every time the device is turned off)

ERROR (MainThread) [homeassistant] Error doing job: Unclosed client session

rogerlz · 2021-11-25T13:47:04Z

I will get back to you in a few days with this tested. I'm currently adding support to the Wine Cooler.

gee-jay-bee · 2022-02-18T13:33:47Z

FWIW, I modified the component to specify None as the update_interval so that there is no polling at all.

I then created an automation that triggers against the motion sensor I have in my kitchen and then simply polls the machine every 2 minutes until the machine becomes unavailable or is idle.

I'm not so bothered about the UI accurately reflecting the state of the machine but more about generating notifications when I need to go back to the machine to do something.

My plan is to modify the automation to do something roughly like this:

waits for 5 minutes (to debounce)
Calls homeassistant.update_entity() on the sensor to see whether the washing machine is available
if it is not or it is idle, exit the automation
if it is, poll the machine (via update_entity()) every couple of minutes for the first 10 minutes as it works out how long to run the program (it does some measurements and adapts)
then, update a date time helper with the end time
then poll every 10 minutes of the reported program duration to correct any drift and detect when the program actually ends
trigger an event and exit the automation

I think it would be quite useful to have the option to disable polling in the config flow and choose to observe state via automation only. I imagine that automation blueprints could be used to define behaviours according to how people want to use the sensor. Or just use polling!

gee-jay-bee · 2022-02-18T13:36:51Z

Creating my own ClientSession instead of calling async_get_clientsession() prints this error to the logs every time the update fails (practically every time the device is turned off)
ERROR (MainThread) [homeassistant] Error doing job: Unclosed client session

I wonder if this is because of this bug:

https://docs.aiohttp.org/en/stable/client_advanced.html#graceful-shutdown

ofalvai mentioned this issue Oct 8, 2021

Log retries with correct logger #15

Merged

ofalvai closed this as completed in #15 Oct 8, 2021

ofalvai reopened this Oct 8, 2021

ofalvai added the help wanted Extra attention is needed label Nov 22, 2021

ofalvai mentioned this issue Nov 13, 2022

Retry configuration: Error communicating with API: TimeoutError() #60

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve command retry #12

Improve command retry #12

d3m3trius commented Sep 30, 2021

ofalvai commented Oct 2, 2021 •

edited

Loading

rogerlz commented Nov 16, 2021

ofalvai commented Nov 19, 2021

ofalvai commented Nov 19, 2021

rogerlz commented Nov 25, 2021

gee-jay-bee commented Feb 18, 2022

gee-jay-bee commented Feb 18, 2022

Improve command retry #12

Improve command retry #12

Comments

d3m3trius commented Sep 30, 2021

ofalvai commented Oct 2, 2021 • edited Loading

rogerlz commented Nov 16, 2021

ofalvai commented Nov 19, 2021

ofalvai commented Nov 19, 2021

rogerlz commented Nov 25, 2021

gee-jay-bee commented Feb 18, 2022

gee-jay-bee commented Feb 18, 2022

ofalvai commented Oct 2, 2021 •

edited

Loading