Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve command retry #12

Open
d3m3trius opened this issue Sep 30, 2021 · 7 comments · Fixed by #15
Open

Improve command retry #12

d3m3trius opened this issue Sep 30, 2021 · 7 comments · Fixed by #15
Labels
help wanted Extra attention is needed

Comments

@d3m3trius
Copy link

At the moment the integration query the washer every minute and if it does not get a reply it makes several (max_tries=10?) connection attempts. This behaviors is not good when the device is not available, for example because it is off or in sleep.

For example, in my case, the machine goes in sleep mode (wifi is disconnected) after 1-2 minutes it complete the washing cycle. This results in flooding the log with rows like this:

2021-09-22 08:57:05 INFO (MainThread) [backoff] Backing off status_with_retry(...) for 2.7s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 192.168.1.45:80 ssl:default [Connect call failed ('192.168.1.45', 80)])

In few days my home assistant log has grow to several megabytes with 30k+ errors as above.

--

Another related problem experienced is that the retry mechanism does not really avoid connection problem. For example, in the following chart, it is possible to see that the machine became unavailable several time during the washing cycles.
Cattura
This "unavailable time" is exactly 1 minute, the time between a query and the next one.
So the retry mechanism does not actually improve the connection reliability.

--

Possible improvements:

  • use another reply mechanism or function, e.g. make a single retry attempt after a certain time (10-30 seconds?), or just give up and wait the normal 1 minute cycle.
  • avoid logging all failed attempts; if necessary, just log the first the time the device became unavailable
  • if the device last query was "active", do not mark it unavailable the first time it does not reply, give it a "grace period" (this maybe not necessary if the retry mechanism is improved to be really effective)
  • if the device is disconnected, avoid additional retry

Thanks for the great job you are doing with the integration.

@ofalvai
Copy link
Owner

ofalvai commented Oct 2, 2021

Thank you for the detailed post, I agree with most points, to be fair I am not happy with the current implementation either.

avoid logging all failed attempts; if necessary, just log the first the time the device became unavailable

This is simple to fix, the backoff library that I use for retries has its own logger. I'll fix this by only logging retries if debug logging is enabled for the component.

For the other points, I'm not sure what would be the ideal solution. First, I added a retry strategy to the requests because my washing machine was rejecting requests too, for no apparent reason. But as you can see, even with this retry mechanism sometimes the device becomes unavailable during a wash cycle. I still don't understand why this happens, I'm afraid I need to tweak the retry logic a bit.

At the moment the integration polls the device every minute (plus a few extra retries if a request fails), and I don't necessarily want to make this less frequent because the delayed status updates wouldn't be a good UX. I also thought of doing exponentially increasing refresh intervals. This would make less unnecessary polls when the device is truly offline, but the interval would be so big after a few days of being turned off (which is not uncommon for a washing machine) that when it would be finally turned on the next update would have a huge delay.

I think the ultimate solution will be something similar to what you mentioned as a "grace period". I'd make the updater "stateful" and store the last successful request's timestamp. Based on this, the updater would poll more frequently initially, but after a certain amount of time passed since the last update, it would switch to a less frequent interval.

@rogerlz
Copy link

rogerlz commented Nov 16, 2021

not sure if it helps, but I noticed that the http server (mine is a wine cooler) only supports one connection, so:

1 - every connection must be closed before trying to open another one
modifying the ClientSession to use force_close might improve this, but I haven't tested it yet

aiohttp.ClientSession(connector=aiohttp.TCPConnector(force_close=True))

2 - if you are using the app at the same time, it will consume the only connection allowed

@ofalvai
Copy link
Owner

ofalvai commented Nov 19, 2021

Thank you @rogerlz for the suggestion, I will try running this modification locally for a few days to see how stable it is.

@ofalvai
Copy link
Owner

ofalvai commented Nov 19, 2021

Creating my own ClientSession instead of calling async_get_clientsession() prints this error to the logs every time the update fails (practically every time the device is turned off)

ERROR (MainThread) [homeassistant] Error doing job: Unclosed client session

@ofalvai ofalvai added the help wanted Extra attention is needed label Nov 22, 2021
@rogerlz
Copy link

rogerlz commented Nov 25, 2021

I will get back to you in a few days with this tested. I'm currently adding support to the Wine Cooler.

@gee-jay-bee
Copy link

FWIW, I modified the component to specify None as the update_interval so that there is no polling at all.

I then created an automation that triggers against the motion sensor I have in my kitchen and then simply polls the machine every 2 minutes until the machine becomes unavailable or is idle.

I'm not so bothered about the UI accurately reflecting the state of the machine but more about generating notifications when I need to go back to the machine to do something.

My plan is to modify the automation to do something roughly like this:

  • waits for 5 minutes (to debounce)
  • Calls homeassistant.update_entity() on the sensor to see whether the washing machine is available
  • if it is not or it is idle, exit the automation
  • if it is, poll the machine (via update_entity()) every couple of minutes for the first 10 minutes as it works out how long to run the program (it does some measurements and adapts)
  • then, update a date time helper with the end time
  • then poll every 10 minutes of the reported program duration to correct any drift and detect when the program actually ends
  • trigger an event and exit the automation

I think it would be quite useful to have the option to disable polling in the config flow and choose to observe state via automation only. I imagine that automation blueprints could be used to define behaviours according to how people want to use the sensor. Or just use polling!

@gee-jay-bee
Copy link

Creating my own ClientSession instead of calling async_get_clientsession() prints this error to the logs every time the update fails (practically every time the device is turned off)


ERROR (MainThread) [homeassistant] Error doing job: Unclosed client session

I wonder if this is because of this bug:

https://docs.aiohttp.org/en/stable/client_advanced.html#graceful-shutdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants