Fix status override and charm stuck on Waiting status after network issue #116

marceloneppel · 2023-03-21T20:00:06Z

Issue

Jira ticket: DPE-789
After a network issue, like the one reported by Maksim through Mattermost DM, the PostgreSQL charm gets stuck in a Waiting status with the message "awaiting for primary endpoint to be ready". This happens because the k8s services created by the charm (postgresql-k8s-primary and postgresql-k8s-replicas) no longer exist.

Also, the update_status_hook is overriding the status and the message with an Active status.

Solution

Recreate the resources when the charm starts again (it happens after a network issue or after a server reboot).
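For readers who want the idea in code, here is a minimal sketch of "create the services only if they are missing". It is not the charm's actual implementation: `FakeK8sClient`, `AlreadyExists`, and `ensure_services` are hypothetical names used only for illustration, and the in-memory client stands in for a real Kubernetes client. Only the two service names come from this PR.

```python
from dataclasses import dataclass, field


class AlreadyExists(Exception):
    """Raised by the stand-in client when a service is already present."""


@dataclass
class FakeK8sClient:
    """In-memory stand-in for a Kubernetes client (hypothetical)."""

    services: set = field(default_factory=set)

    def create_service(self, name: str) -> None:
        if name in self.services:
            raise AlreadyExists(name)
        self.services.add(name)


def ensure_services(client, names):
    """Create every missing service and return the names that were created.

    Services that already exist are left untouched, so the function is safe
    to call on every start (for example, after a reboot).
    """
    created = []
    for name in names:
        try:
            client.create_service(name)
        except AlreadyExists:
            continue  # nothing to do, the service survived
        created.append(name)
    return created


# Service names taken from the PR description.
SERVICES = ["postgresql-k8s-primary", "postgresql-k8s-replicas"]

# Simulate a cluster where only the replicas service was lost.
client = FakeK8sClient(services={"postgresql-k8s-primary"})
print(ensure_services(client, SERVICES))  # ['postgresql-k8s-replicas']
print(ensure_services(client, SERVICES))  # []
```

Making the call idempotent is what lets it run from a hook that fires on every restart without failing when the services are still there.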

Context

This issue is intermittent (doesn't happen every time we have a network issue).

This solves the initial issue from #54 (from Maksim). The screenshots related to the initial test that reproduced the issue are attached to the Jira ticket.

The issue that Arturo faced (getting the same waiting state and message after a clean installation, multiple times) has not been reproduced yet. Another ticket was created to investigate it further: DPE-1533

About the code:

Added missing exception handling to _on_leader_elected.
_initialize_cluster was created to remove some logic from the _on_postgresql_pebble_ready method, which became too long/complex.
In _initialize_cluster, the k8s services are recreated if they were deleted (like after a network issue).
_on_update_status doesn't override a waiting status anymore.
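
The last point can be illustrated with a small sketch of the decision an update-status handler makes. This is an assumption-laden illustration, not the code from this PR: `Status` and `next_status_on_update` are hypothetical names, and the real charm works with the status objects from its framework rather than this stand-in class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Status:
    """Minimal stand-in for a unit status (hypothetical)."""

    name: str
    message: str = ""


def next_status_on_update(current: Status, healthy: bool) -> Status:
    """Return the status an update-status handler should leave behind."""
    if current.name == "waiting":
        # Another hook is still waiting on something (for example, the
        # primary endpoint); do not hide that behind an Active status.
        return current
    if healthy:
        return Status("active")
    return current


waiting = Status("waiting", "awaiting for primary endpoint to be ready")
print(next_status_on_update(waiting, healthy=True).name)                # waiting
print(next_status_on_update(Status("maintenance"), healthy=True).name)  # active
```

The early return is the whole fix in this sketch: the handler only promotes the unit to Active when no other hook has left a Waiting status in place.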

Testing

Tested the fix manually by using some iptables rules to break the microk8s network (and consequently lose the services created by the charm).

I could also test it by deleting the k8s services manually with microk8s.kubectl and then rebooting the host machine with sudo reboot. After the machine restarts, some hooks are fired, and the services are now recreated (because pebble ready is one of those hooks).

Release Notes

Fix status override and charm stuck on Waiting status after network issue.

kbaccar-core · 2024-07-04T14:08:23Z

This is still an issue for me reproduced at issue.

dragomirp · 2024-07-04T14:39:34Z

Hi, @kbaccar-core, can you open a new issue to track this and get Postgresql's juju debug logs so that we can investigate?

kbaccar-core · 2024-07-05T08:36:53Z

Hello, done in #552.

Add resource creation right before check about k8s primary service

bc79e99

github-actions bot added the Libraries: OK label Mar 21, 2023

marceloneppel changed the title ~~Add resource creation right before check about k8s primary service~~ Fix status override and charm stuck on Waiting status after network issue Mar 22, 2023

marceloneppel marked this pull request as ready for review March 22, 2023 18:32

marceloneppel requested review from dragomirp and taurus-forever March 22, 2023 18:47

dragomirp approved these changes Mar 22, 2023

taurus-forever approved these changes Mar 23, 2023

marceloneppel merged commit 17ecbf4 into main Mar 23, 2023

marceloneppel deleted the fix-stuck-on-awaiting-primary-endpoint branch March 23, 2023 12:02

marceloneppel mentioned this pull request Mar 23, 2023

'awaiting for primary endpoint to be ready' takes forever #54

Closed

BON4 pushed a commit to BON4/postgresql-k8s-operator that referenced this pull request May 20, 2024

Add resource creation right before check about k8s primary service (c…

5dd72b7

…anonical#116)

github-actions bot added a commit to canonical/test-runners-2-is-arm64-postgresql-k8s-operator that referenced this pull request Jul 18, 2024

Allure report canonical#116

9bb2c54

github-actions bot added a commit to canonical/test-runners-2-github-x64-postgresql-k8s-operator that referenced this pull request Jul 18, 2024

Allure report canonical#116

ff33550
