Consider making DB retry timeout configurable #794
Comments
Hi, how long did that maintenance window last? Icinga DB actually retries such temporary errors, but only up to a fixed timeout.
Looks like there were 8 minutes between the last two log lines (offline operation) - I suspect that is the timeframe during which the DB was not reachable at all, yes. Do I have config options to prolong the retry timeout?
I also added the following to both systems as a countermeasure:
Edit: I am not sure whether this is okay to do, or whether it may lead to other problems.
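The snippet itself was not captured in this thread; from the follow-up it appears to have been a systemd unit change. A hypothetical drop-in along these lines would restart the service after a failure (assumed for illustration, not the poster's actual config):

```
# /etc/systemd/system/icingadb.service.d/override.conf
# Hypothetical countermeasure: restart Icinga DB automatically if it
# exits with an error, e.g. after the database retry timeout is exceeded.
[Service]
Restart=on-failure
RestartSec=30s
```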
@louiswebdev: Your systemd unit fix should work. Restarting the daemon shouldn't lead to errors. The HA logic would prevent the restarted Icinga DB daemon from taking over directly; it would let the lease time out and only then allow it to become the leader again. For more time-intensive maintenance jobs, one should consider temporarily stopping Icinga DB, because it cannot work without a database. As @yhabteab wrote, it will retry the queries, but for a longer database absence this will eventually fail. Furthermore, I am unsure whether the "invalid connection" error is even retried.
No, you can't override the timeout at the moment!
See https://github.com/Icinga/icinga-go-library/blob/main/retry/retry.go#L190.
Hm 🤔 The problem with this is: maintenance on RDS databases always occurs during defined timeframes, preferably outside peak hours. We never know when exactly it will happen, and even if we knew, it would most probably be at a time when we're not awake to stop services accordingly. We might consider scheduling maintenance during daytime now, but we prefer not to. With the ido-mysql feature we are coming from, this was never a problem; the system has been up and running since mid-2019.
Maybe that's a feature we would like to see, then. We don't think it's super urgent, though, since maintenance that requires offline operations does not occur very often, and the aforementioned systemd config is in effect now.
Oh, one more thing: only one of the nodes exited with the "HA exited with an error" message.
"HA exited with an error" also only happens if something seemingly unrecoverable has happened. However, could you please supply longer logs, including the events before the crash? This would help us understand whether Icinga DB works as intended or whether we are dealing with a bug.
Would these logs help? Both nodes seem to have logged that another instance was active. Apart from that, it does not look suspicious to me.
Thanks for supplying those logs, @louiswebdev. But it seems you have uploaded the same file twice; the file content is identical.
Unfortunately, the log for node B is still missing.
Yeah, sorry, my bad - I re-uploaded icingadb_logs_node_b.txt |
Thank you for the updated log file. The node …
Attached: icingadb_verbose_logs_node_b.txt. Sure, the timeframe is the same as in the previous log dumps.
Thanks again for your fast response and the logs.
The HA handover reason was the timeout of the realization method. As no debug logging was enabled and no further information was logged, this was most likely caused by a long-blocking SQL command. I would guess this was the first effect of the database maintenance. I am unsure what else to do here. Your crash is unfortunate, and generally this is something that should not happen. Thus, I would be open to discussing whether the retry timeout can be made configurable, whether we want to consider some "database is absent, so let's idle a bit" mode, or something else.
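If the timeout were made configurable, one could imagine a hypothetical entry in Icinga DB's config.yml. To be clear, this option does not exist today; the name and placement are invented purely for illustration:

```
database:
  # Hypothetical, not implemented: how long Icinga DB keeps retrying
  # failed queries before giving up and exiting.
  retry-timeout: 30m
```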
Yeah, I understand - thanks for hearing me out, looking at the logs, and contextualising the problem for me. I will of course closely watch the service, and especially the RDS maintenance windows, in the future. I also think this must have been a particularly long maintenance, especially considering we have booked a multi-AZ database to prevent long outages. Besides that, I'd like to express how highly I regard Icinga2 and related projects like IcingaDB. The documentation is very good, and so are the reactions to issues. Thanks for that, and keep up the good work.
Describe the bug
HA Icinga2 setup, two nodes in master zone. Database is AWS RDS MySQL, Multi-AZ. During/after scheduled maintenance of the database service we found both IcingaDB systemd units failed.
Expected behavior
I would have expected the connection loss to be handled gracefully, without the IcingaDB services failing. We had to restart them manually.
Your Environment
Additional context
Time in logs is UTC, time in screenshot is CEST (==UTC+2), so the events do actually match: