Member

@marceloneppel marceloneppel commented Mar 21, 2023

Issue

Jira ticket: DPE-789
After a network issue, like the one reported by Maksim through Mattermost DM, the PostgreSQL charm gets stuck in a Waiting status with the message awaiting for primary endpoint to be ready. This happens because the k8s services created by the charm (postgresql-k8s-primary and postgresql-k8s-replicas) no longer exist.

Also, the update_status_hook is overriding the status and the message with an Active status.

Solution

Recreate the resources when the charm starts again (which happens after a network issue or a server reboot).
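As a rough illustration of the idea (not the charm's actual code), the recreation step can be thought of as an idempotent "create if missing" check that is safe to run on every start. In the sketch below, `FakeCluster` and `ensure_services` are hypothetical names, and the in-memory set stands in for the real Kubernetes API; only the two service names come from this PR.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """In-memory stand-in for the Kubernetes API (hypothetical, for illustration only)."""

    services: set = field(default_factory=set)

    def service_exists(self, name: str) -> bool:
        return name in self.services

    def create_service(self, name: str) -> None:
        self.services.add(name)


def ensure_services(cluster: FakeCluster, app_name: str) -> list:
    """Create the primary/replicas services only if they are missing.

    Returns the names that had to be recreated, so repeated calls are no-ops.
    """
    created = []
    for suffix in ("primary", "replicas"):
        name = f"{app_name}-{suffix}"
        if not cluster.service_exists(name):
            cluster.create_service(name)
            created.append(name)
    return created


# Simulate a cluster that lost the primary service after a network issue.
cluster = FakeCluster(services={"postgresql-k8s-replicas"})
first_run = ensure_services(cluster, "postgresql-k8s")
second_run = ensure_services(cluster, "postgresql-k8s")
```

Because the check is idempotent, it can run from any hook that fires when the charm starts, without needing to know why the services disappeared.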

Context

This issue is intermittent (doesn't happen every time we have a network issue).

This solves the initial issue from #54 (from Maksim). The screenshots related to the initial test that reproduced the issue are attached to the Jira ticket.

The issue that Arturo faced (getting the same waiting state and message after a clean installation, multiple times) hasn't been reproduced yet. Another ticket was created to investigate it further: DPE-1533.

About the code:

  • Added missing exception handling to _on_leader_elected.
  • _initialize_cluster was created to move some logic out of the _on_postgresql_pebble_ready method, which had become too long and complex.
  • In _initialize_cluster, the k8s services are recreated if they were deleted (for example, after a network issue).
  • _on_update_status no longer overrides a waiting status.
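The last point can be pictured as a small guard at the top of the update-status logic. This is only a sketch under assumptions: `Status` and `next_status_on_update` are hypothetical names and do not reflect the charm's real handler or the ops library types; the point is just that a waiting status is returned unchanged instead of being replaced by an active one.

```python
from dataclasses import dataclass


@dataclass
class Status:
    """Minimal stand-in for a unit status (hypothetical, not the ops library type)."""

    name: str
    message: str = ""


def next_status_on_update(current: Status, cluster_healthy: bool) -> Status:
    """Decide the status after an update-status event."""
    # Guard: a waiting status signals an unresolved problem, so keep it as-is.
    if current.name == "waiting":
        return current
    # Otherwise, report active only when the cluster looks healthy.
    if cluster_healthy:
        return Status("active")
    return current


stuck = Status("waiting", "awaiting for primary endpoint to be ready")
kept = next_status_on_update(stuck, cluster_healthy=True)
promoted = next_status_on_update(Status("maintenance"), cluster_healthy=True)
```

With this guard, the waiting message stays visible to the operator until whatever set it is resolved.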

Testing

Tested the fix manually by using some iptables rules to break the microk8s network (and consequently lose the services created by the charm).

I also tested it by deleting the k8s services manually with microk8s.kubectl and then rebooting the host machine with sudo reboot. After the machine restarts, some hooks fire, and the services are now recreated (because pebble ready is one of those hooks).

Release Notes

Fix status override and charm stuck on Waiting status after network issue.

@marceloneppel marceloneppel changed the title from "Add resource creation right before check about k8s primary service" to "Fix status override and charm stuck on Waiting status after network issue" Mar 22, 2023
@marceloneppel marceloneppel marked this pull request as ready for review March 22, 2023 18:32
@marceloneppel marceloneppel merged commit 17ecbf4 into main Mar 23, 2023
@marceloneppel marceloneppel deleted the fix-stuck-on-awaiting-primary-endpoint branch March 23, 2023 12:02
BON4 pushed a commit to BON4/postgresql-k8s-operator that referenced this pull request May 20, 2024
@kbaccar-core

This is still an issue for me reproduced at issue.

@dragomirp
Contributor

Hi, @kbaccar-core, can you open a new issue to track this and get Postgresql's juju debug logs so that we can investigate?

@kbaccar-core

Hello, done in #552.

github-actions bot added a commit to canonical/test-runners-2-is-arm64-postgresql-k8s-operator that referenced this pull request Jul 18, 2024
github-actions bot added a commit to canonical/test-runners-2-github-x64-postgresql-k8s-operator that referenced this pull request Jul 18, 2024