
Consider making DB retry timeout configurable #794

Open
louiswebdev opened this issue Sep 2, 2024 · 16 comments
Labels
enhancement New feature or request

Comments

@louiswebdev

Describe the bug

HA Icinga2 setup, two nodes in master zone. Database is AWS RDS MySQL, Multi-AZ. During/after scheduled maintenance of the database service we found both IcingaDB systemd units failed.

Expected behavior

I would have expected the connection loss to be handled gracefully, without the IcingaDB services failing. We had to restart them manually.

Your Environment

  • Icinga DB version: 1.2.0-1+debian11
  • Icinga 2 version: 2.14.2-1+debian11
  • Operating System and version: Debian GNU/Linux 11 (bullseye)

Additional context

Time in the logs is UTC, time in the screenshot is CEST (UTC+2), so the events do actually match:

Screenshot 2024-09-02 at 09 07 48

Sep 02 00:33:32 ip-10-7-10-85 icingadb[121007]: invalid connection
                                                can't commit transaction
                                                github.com/icinga/icingadb/pkg/icingadb.(*HA).realize.func1
                                                        github.com/icinga/icingadb/pkg/icingadb/ha.go:381
                                                github.com/icinga/icingadb/pkg/retry.WithBackoff
                                                        github.com/icinga/icingadb/pkg/retry/retry.go:60
                                                github.com/icinga/icingadb/pkg/icingadb.(*HA).realize
                                                        github.com/icinga/icingadb/pkg/icingadb/ha.go:284
                                                github.com/icinga/icingadb/pkg/icingadb.(*HA).controller
                                                        github.com/icinga/icingadb/pkg/icingadb/ha.go:221
                                                runtime.goexit
                                                        runtime/asm_amd64.s:1695
                                                HA aborted
                                                github.com/icinga/icingadb/pkg/icingadb.(*HA).abort.func1
                                                        github.com/icinga/icingadb/pkg/icingadb/ha.go:131
                                                sync.(*Once).doSlow
                                                        sync/once.go:74
                                                sync.(*Once).Do
                                                        sync/once.go:65
                                                github.com/icinga/icingadb/pkg/icingadb.(*HA).abort
                                                        github.com/icinga/icingadb/pkg/icingadb/ha.go:129
                                                github.com/icinga/icingadb/pkg/icingadb.(*HA).controller
                                                        github.com/icinga/icingadb/pkg/icingadb/ha.go:233
                                                runtime.goexit
                                                        runtime/asm_amd64.s:1695
                                                HA exited with an error
                                                main.run
                                                        github.com/icinga/icingadb/cmd/icingadb/main.go:335
                                                main.main
                                                        github.com/icinga/icingadb/cmd/icingadb/main.go:37
                                                runtime.main
                                                        runtime/proc.go:271
                                                runtime.goexit
                                                        runtime/asm_amd64.s:1695
Sep 02 00:33:32 ip-10-7-10-85 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE

Sep 02 00:38:11 ip-10-7-11-6 icingadb[110917]: invalid connection
                                               can't perform "INSERT INTO \"state_history\" (\"max_check_attempts\", \"scheduling_source\", \"event_time\", \"hard_state\", \"previous_hard_state\", \"check_attempt\", \"output\", \"long_output\", \"check_source\", \"environment_id\", \"state_type\", \"soft_state\", \"id\", \"endpoint_id\", \"host_id\", \"service_id\", \"previous_soft_state\", \"object_type\") VALUES (:max_check_attempts,:scheduling_source,:event_time,:hard_state,:previous_hard_state,:check_attempt,:output,:long_output,:check_source,:environment_id,:state_type,:soft_state,:id,:endpoint_id,:host_id,:service_id,:previous_soft_state,:object_type) ON DUPLICATE KEY UPDATE \"id\" = VALUES(\"id\")"
                                               github.com/icinga/icingadb/internal.CantPerformQuery
                                                       github.com/icinga/icingadb/internal/internal.go:30
                                               github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2.1
                                                       github.com/icinga/icingadb/pkg/icingadb/db.go:412
                                               github.com/icinga/icingadb/pkg/retry.WithBackoff
                                                       github.com/icinga/icingadb/pkg/retry/retry.go:60
                                               github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
                                                       github.com/icinga/icingadb/pkg/icingadb/db.go:407
                                               golang.org/x/sync/errgroup.(*Group).Go.func1
                                                       golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
                                               runtime.goexit
                                                       runtime/asm_amd64.s:1695
                                               retry deadline exceeded
                                               github.com/icinga/icingadb/pkg/retry.WithBackoff
                                                       github.com/icinga/icingadb/pkg/retry/retry.go:95
                                               github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.2
                                                       github.com/icinga/icingadb/pkg/icingadb/db.go:407
                                               golang.org/x/sync/errgroup.(*Group).Go.func1
                                                       golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
                                               runtime.goexit
                                                       runtime/asm_amd64.s:1695
Sep 02 00:38:11 ip-10-7-11-6 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
@yhabteab
Member

yhabteab commented Sep 2, 2024

During/after scheduled maintenance of the database service we found both IcingaDB systemd units failed.

Hi, how long did that maintenance window last? Icinga DB actually retries such temporary errors, but only for 5m; if the database does not return to an operational state within that window, it ultimately gives up and crashes with retry deadline exceeded.
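
For illustration, here is a minimal, self-contained Go sketch of such a retry loop with an overall deadline (hypothetical helper names; this is not the actual WithBackoff implementation from pkg/retry seen in the stack traces above):

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithDeadline illustrates the principle: transient errors are retried
// with a growing backoff, but once the overall deadline has passed, the last
// error is wrapped and returned, and the caller gives up.
func retryWithDeadline(ctx context.Context, deadline time.Duration, op func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, deadline)
	defer cancel()

	backoff := 100 * time.Millisecond
	for {
		err := op(ctx)
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Deadline exceeded: stop retrying and surface the error.
			return fmt.Errorf("retry deadline exceeded: %w", err)
		case <-time.After(backoff):
			if backoff < 10*time.Second {
				backoff *= 2
			}
		}
	}
}

func main() {
	// A short deadline just for demonstration; Icinga DB uses 5 minutes.
	err := retryWithDeadline(context.Background(), 2*time.Second, func(ctx context.Context) error {
		return errors.New("invalid connection") // stand-in for a failing DB query
	})
	fmt.Println(err)
}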

@louiswebdev
Author

Looks like there were 8 minutes between the last two log lines (offline operation) - I suspect that is the timeframe during which the DB was not reachable at all, yes. Do I have config options to prolong the 5m period?

@louiswebdev
Author

louiswebdev commented Sep 2, 2024

I also added the following to both systems as a countermeasure:

# cat /etc/systemd/system/icingadb.service.d/override.conf
[Unit]
# Disable the start rate-limiting window so restarts are never blocked
StartLimitInterval=0
StartLimitBurst=30

[Service]
# Restart the daemon 30 seconds after any failed exit
Restart=on-failure
RestartSec=30s

Edit: but I am not sure if this is okay to do, or if it may lead to other problems.

@oxzi
Member

oxzi commented Sep 2, 2024

@louiswebdev: Your systemd unit fix should work. Restarting the daemon shouldn't lead to errors. The HA logic would prevent the restarted Icinga DB daemon from taking over directly; it has to wait for the other instance's lease to time out before it is allowed to take the lead again.
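
A simplified illustration of that takeover rule (the lease duration and function are made up for illustration and are not the actual Icinga DB HA code):

package main

import (
	"fmt"
	"time"
)

// leaseTimeout is an assumed lease duration, for illustration only.
const leaseTimeout = 60 * time.Second

// mayTakeOver sketches the takeover rule: a freshly restarted instance does
// not become the active one as long as the other instance's last heartbeat is
// still within the lease window.
func mayTakeOver(lastHeartbeatOfOther, now time.Time) bool {
	return now.Sub(lastHeartbeatOfOther) > leaseTimeout
}

func main() {
	fmt.Println(mayTakeOver(time.Now().Add(-30*time.Second), time.Now())) // false: lease still valid
	fmt.Println(mayTakeOver(time.Now().Add(-2*time.Minute), time.Now()))  // true: lease expired, take over
}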

For more time-intensive maintenance jobs, one should consider temporarily stopping Icinga DB because it cannot work without a database. As @yhabteab wrote, it will retry the queries, but for a longer database absence this will eventually fail. Furthermore, I am unsure whether the "invalid connection" error is even retried.

@yhabteab
Member

yhabteab commented Sep 2, 2024

Do I have config options to prolong the 5m period?

No, you can't override the timeout at the moment!

Furthermore, I am unsure if the "invalid connection" error will even be retried.

See https://github.com/Icinga/icinga-go-library/blob/main/retry/retry.go#L190.
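
For illustration, a check along those lines could look like this simplified sketch (hypothetical code, not the actual icinga-go-library implementation; it assumes the github.com/go-sql-driver/mysql package): connection-level errors are treated as temporary and retried, everything else fails immediately.

package main

import (
	"database/sql/driver"
	"errors"
	"fmt"

	"github.com/go-sql-driver/mysql"
)

// isRetryable is a simplified stand-in for the linked Retryable check:
// connection-level errors such as driver.ErrBadConn or the MySQL driver's
// "invalid connection" (mysql.ErrInvalidConn) are considered temporary.
func isRetryable(err error) bool {
	return errors.Is(err, driver.ErrBadConn) || errors.Is(err, mysql.ErrInvalidConn)
}

func main() {
	fmt.Println(isRetryable(mysql.ErrInvalidConn))       // true, will be retried
	fmt.Println(isRetryable(errors.New("syntax error"))) // false, fails immediately
}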

@louiswebdev
Author

@louiswebdev: Your systemd unit fix should work. Restarting the daemon shouldn't lead to errors. The HA logic would prevent the restarted Icinga DB daemon from taking over directly; it has to wait for the other instance's lease to time out before it is allowed to take the lead again.

For more time-intensive maintenance jobs, one should consider temporarily stopping Icinga DB because it cannot work without a database. As @yhabteab wrote, it will retry the queries, but for a longer database absence this will eventually fail. Furthermore, I am unsure whether the "invalid connection" error is even retried.

Hm 🤔

The problem with this is: maintenance on RDS databases always occurs during defined timeframes, preferably outside peak hours. We never know when exactly it will happen, and even if we knew, it would most probably be at a time when we're not awake to stop services accordingly. We might now consider scheduling maintenance during daytime, but we prefer not to.

With the ido-mysql feature we are coming from, this never was a problem; the system has been up and running since mid-2019.

@louiswebdev
Author

louiswebdev commented Sep 2, 2024

Do I have config options to prolong the 5m period?

No, you can't override the timeout at the moment!

Maybe that's a feature we would like to see then - but we think it's not super urgent, since those maintenance windows that require offline operations do not occur very often and the aforementioned systemd config is in effect now.

@louiswebdev
Author

During/after scheduled maintenance of the database service we found both IcingaDB systemd units failed.

Hi, how long did that maintenance window last? Icinga DB actually retries such temporary errors, but only for 5m; if the database does not return to an operational state within that window, it ultimately gives up and crashes with retry deadline exceeded.

Oh, one more thing: only one of the nodes exited with retry deadline exceeded, the other one bailed out with HA aborted and HA exited with an error five minutes before the first node.

@oxzi
Member

oxzi commented Sep 3, 2024

"HA exited with an error" also only happens if something seemed unrecoverable happened. However, could you please supply longer logs including the events before the crash? This would ease us to understand if Icinga DB works as intended or if we are dealing with a bug.

@louiswebdev
Author

louiswebdev commented Sep 4, 2024

Would journalctl -u icingadb.service --since "2024-09-01 23:55:00" --until "2024-09-02 00:55:00" suffice?

Both nodes seem to have logged that another instance was active. Apart from that it does not look suspicious to me.

icingadb_logs_node_a.txt
icingadb_logs_node_b.txt

@oxzi
Member

oxzi commented Sep 4, 2024

Thanks for supplying those logs, @louiswebdev. But it seems you have uploaded the same file twice; the file content is identical.

$ sha256sum icingadb_logs_node_{a,b}.txt
92ff2299ef441eb8cd5cb2d2fd611a881b909d4b87f08aee039e81f22749ff29  icingadb_logs_node_a.txt
92ff2299ef441eb8cd5cb2d2fd611a881b909d4b87f08aee039e81f22749ff29  icingadb_logs_node_b.txt

Unfortunately, the log for ip-10-7-10-85 does not offer any further insights. Based on this information and what you wrote earlier, I would say that the issue was "just" the database being absent longer than the retry timeout of five minutes.

@louiswebdev
Author

Yeah, sorry, my bad - I re-uploaded icingadb_logs_node_b.txt

@oxzi
Member

oxzi commented Sep 5, 2024

Thank you for the updated log file. The node ip-10-7-11-6 had a flapping HA status at 00:32:01. Could you please share the "Handing over" log message including all additional fields, as described in https://github.com/Icinga/icingadb/blob/main/doc/03-Configuration.md#systemd-journald-fields?

@louiswebdev
Author

icingadb_verbose_logs_node_b.txt

Sure, the timeframe is the same as in the previous log dumps.

@oxzi
Member

oxzi commented Sep 5, 2024

Thanks again for your fast response and the logs.

    MESSAGE=Handing over
    ICINGADB_REASON=instance update/insert deadline exceeded heartbeat expiry time

The HA handover reason was the timeout of the realization method. As no debug logging was enabled and no further information was logged, this was most likely due to a long blocking SQL command. I would guess this was the first effect of the database maintenance.

I am unsure what else to do here. Your crash is unfortunate and generally this is something which should not happen.

Thus, I would be open to discussing whether the retry timeout can be made configurable, whether we want to consider some "database is absent, so let's idle a bit" mode, or something else.
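
To make the first option more concrete, here is a hypothetical sketch of a configurable retry timeout (the option name retry-timeout is made up for illustration and does not exist in Icinga DB today; the sketch assumes gopkg.in/yaml.v3 for parsing):

package main

import (
	"fmt"
	"time"

	"gopkg.in/yaml.v3"
)

// DatabaseOptions sketches a hypothetical config extension; the field
// "retry-timeout" is illustrative only.
type DatabaseOptions struct {
	// RetryTimeout is how long failed queries are retried before the daemon
	// gives up, given as a Go duration string ("5m", "15m", ...).
	RetryTimeout string `yaml:"retry-timeout"`
}

func main() {
	raw := []byte("retry-timeout: 15m\n")

	// Default to the current hard-coded five minutes if the option is absent.
	opts := DatabaseOptions{RetryTimeout: "5m"}
	if err := yaml.Unmarshal(raw, &opts); err != nil {
		panic(err)
	}

	timeout, err := time.ParseDuration(opts.RetryTimeout)
	if err != nil {
		panic(err)
	}
	fmt.Println("retry timeout:", timeout) // retry timeout: 15m0s
}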

@oxzi oxzi added the crash label Sep 5, 2024
@louiswebdev
Author

Yeah, I understand - thanks for hearing me out, looking at the logs, and contextualising the problem for me.

I will of course closely watch the service and especially the RDS maintenance windows in the future.

I also think this must have been a particularly long maintenance, especially considering we have booked a multi-AZ database to prevent long outages.

Besides that, I'd like to express how highly I regard Icinga 2 and related projects like Icinga DB. The documentation is very good, and so are the reactions to issues. Thanks for that, and keep up the good work.

@yhabteab yhabteab added enhancement New feature or request and removed crash labels Oct 18, 2024
@yhabteab yhabteab changed the title systemd units failed after "invalid connection" - manual recovery needed Consider making DB retry timeout configurable Oct 18, 2024