Fix status override and charm stuck on Waiting status after network issue #116
Issue
Jira ticket: DPE-789
After a network issue, like the one reported by Maksim through a Mattermost DM, the PostgreSQL charm gets stuck in a Waiting status with the message `awaiting for primary endpoint to be ready`. This happens because the k8s services created by the charm (`postgresql-k8s-primary` and `postgresql-k8s-replicas`) no longer exist. Also, the `update-status` hook overrides that status and message with an Active status.
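The missing-services diagnosis can be confirmed against the k8s API with lightkube (the client library the charm uses); this is a minimal sketch, and the namespace value is an assumption for illustration:

```python
from lightkube import Client
from lightkube.core.exceptions import ApiError
from lightkube.resources.core_v1 import Service

namespace = "postgresql-model"  # assumption: the Juju model name, i.e. the k8s namespace

client = Client()
for name in ("postgresql-k8s-primary", "postgresql-k8s-replicas"):
    try:
        client.get(Service, name=name, namespace=namespace)
        print(f"{name}: present")
    except ApiError as e:
        # After the network issue, the API answers 404 for both services,
        # so the primary endpoint the charm waits for can never become ready.
        if e.status.code == 404:
            print(f"{name}: gone")
        else:
            raise
```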
Solution
Recreate the resources when the charm starts again (which happens after a network issue or after a server reboot).
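A minimal sketch of what the recreation can look like with lightkube; the function name, port, and label selectors are illustrative assumptions, not necessarily the charm's exact code:

```python
from lightkube import Client
from lightkube.core.exceptions import ApiError
from lightkube.models.core_v1 import ServicePort, ServiceSpec
from lightkube.models.meta_v1 import ObjectMeta
from lightkube.resources.core_v1 import Service

def create_services(app_name: str, namespace: str) -> None:
    """(Re)create the primary and replicas services if they are missing."""
    client = Client()
    for suffix, role in (("primary", "master"), ("replicas", "replica")):
        service = Service(
            metadata=ObjectMeta(name=f"{app_name}-{suffix}", namespace=namespace),
            spec=ServiceSpec(
                ports=[ServicePort(name="pgsql", port=5432, targetPort=5432)],
                # Assumed labels: Patroni tags each pod with its current role,
                # so each service only targets the matching pods.
                selector={"application": "patroni", "role": role},
            ),
        )
        try:
            client.create(service)
        except ApiError as e:
            if e.status.code != 409:  # 409 Conflict: service already exists
                raise

# E.g. called when the charm starts again (pebble ready) after a reboot:
# create_services("postgresql-k8s", "postgresql-model")
```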
Context
This issue is intermittent (doesn't happen every time we have a network issue).
This solves the initial issue from #54 (from Maksim). The screenshots related to the initial test that reproduced the issue are attached to the Jira ticket.
The issue that Arturo faced (getting the same waiting status and message after a clean installation, multiple times) hasn't been reproduced yet. Another ticket, DPE-1533, was created to investigate it further.
About the code:
- `_initialize_cluster` was created to remove some logic from the `_on_postgresql_pebble_ready` method, which had become too long/complex.
- In `_on_leader_elected` and `_initialize_cluster`, the k8s services are recreated if they were deleted (like after a network issue).
- `_on_update_status` doesn't override a waiting status anymore (see the sketch below).
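The status-override part of the fix boils down to an early return in the update-status handler; a minimal sketch of the method body only, not the charm's full logic:

```python
from ops.model import ActiveStatus, WaitingStatus

def _on_update_status(self, event) -> None:
    # Sketch of the guard: if the unit is still waiting (e.g. "awaiting
    # for primary endpoint to be ready"), keep that status instead of
    # blindly overriding it with Active.
    if isinstance(self.unit.status, WaitingStatus):
        return
    self.unit.status = ActiveStatus()
```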
Testing
Tested the fix manually by using some iptables rules to break the microk8s network (and consequently lose the services created by the charm).
I could also test it by deleting the k8s services manually with `microk8s.kubectl` and then rebooting the host machine with `sudo reboot`. After the machine restarts, some hooks fire, and the services are now created again (because pebble ready is one of those hooks).
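The manual `microk8s.kubectl` deletion step can also be scripted; a sketch with lightkube, where the namespace value is again an assumption:

```python
from lightkube import Client
from lightkube.resources.core_v1 import Service

namespace = "postgresql-model"  # assumption: the Juju model / k8s namespace

client = Client()
for name in ("postgresql-k8s-primary", "postgresql-k8s-replicas"):
    # Remove the charm-managed services to simulate the aftermath of the
    # network issue before rebooting the host.
    client.delete(Service, name=name, namespace=namespace)
```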
Release Notes
Fix status override and charm stuck on Waiting status after network issue.