[DPE-7726] Use Patroni API for is_restart_pending() (instead of SQL select from pg_settings) #1049
Codecov Report
❌ Patch coverage is
❌ Your project status has failed because the head coverage (65.10%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## 16/edge #1049 +/- ##
===========================================
+ Coverage 64.91% 65.10% +0.18%
===========================================
Files 17 17
Lines 4275 4281 +6
Branches 656 655 -1
===========================================
+ Hits 2775 2787 +12
+ Misses 1333 1319 -14
- Partials 167 175 +8
☔ View full report in Codecov by Sentry.
The previous is_restart_pending() waited for a long time due to Patroni's loop_wait default value (10 seconds), which defines how long Patroni waits before re-checking the configuration file in order to reload it. Instead of checking PostgreSQL's pending_restart from pg_settings, let's check the Patroni API pending_restart=True flag.
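A minimal sketch of the new check, assuming the local Patroni member endpoint on port 8008 over plain HTTP (the real charm handles TLS and authentication); only the is_restart_pending name is taken from the PR:

```python
import requests

PATRONI_URL = "http://localhost:8008/patroni"  # assumption: local member endpoint, plain HTTP

def is_restart_pending(timeout: int = 10) -> bool:
    """Ask Patroni itself whether a restart is pending, instead of
    polling pg_settings over SQL and waiting for loop_wait to elapse."""
    try:
        response = requests.get(PATRONI_URL, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException:
        # If the member API is unreachable, report no pending restart.
        return False
    return response.json().get("pending_restart", False)
```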
The current Patroni 3.2.2 has weird/flickering behaviour:
it temporarily flags pending_restart=True on many changes made via the REST API;
the flag is gone within a second, but that is long enough to be caught by the charm.
Sleeping a bit is a necessary evil until the Patroni 3.3.0 upgrade.
The previous code slept for 15 seconds waiting for the pg_settings update.
Also, unnecessary restarts could be triggered by a mismatch between the
Patroni config file and in-memory changes coming from the REST API,
e.g. the slots were undefined in the YAML file but set as an empty JSON {} => None.
Updating the default template to match the default API PATCHes avoids these restarts.
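Building on the sketch above, a short grace period before trusting the flag tolerates the Patroni 3.2.2 flicker; the 2-second value is an assumption, not taken from the PR:

```python
import time

import requests

PATRONI_URL = "http://localhost:8008/patroni"  # assumption: local member endpoint
FLICKER_GRACE_SECONDS = 2  # assumption: enough time for the transient flag to clear

def is_restart_pending() -> bool:
    """Wait out the transient pending_restart flag before reading it."""
    time.sleep(FLICKER_GRACE_SECONDS)
    response = requests.get(PATRONI_URL, timeout=10)
    response.raise_for_status()
    return response.json().get("pending_restart", False)
```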
On a topology observer event, the primary unit used to lose its Primary label.
Also:
* use a common logger everywhere
* add several useful log messages (e.g. DB connection)
* remove the no longer necessary 'Init class PostgreSQL' debug message
* align the Patroni API request style everywhere
* add the Patroni API request duration to debug logs (see the sketch below)
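A minimal sketch of the duration logging, assuming a shared module-level logger; patroni_get is an illustrative name:

```python
import logging
import time

import requests

logger = logging.getLogger(__name__)

def patroni_get(url: str, timeout: int = 10) -> requests.Response:
    """GET a Patroni API endpoint and log how long the request took."""
    start = time.monotonic()
    response = requests.get(url, timeout=timeout)
    duration = time.monotonic() - start
    # Mirrors the debug format seen in the logs quoted later in this PR,
    # i.e. the response object followed by the elapsed seconds.
    logger.debug("API %s: %s (%.6f)", url, response, duration)
    return response
```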
The list of IPs was randomly ordered, causing unnecessary Patroni configuration re-generation followed by a Patroni restart/reload.
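A sketch of the fix: sorting the addresses before rendering the configuration so the output is stable across hook invocations (the function name is illustrative):

```python
def peer_addresses(raw_addresses: set[str]) -> list[str]:
    """Return peer IPs in a stable order so the rendered Patroni
    configuration only changes when membership actually changes."""
    return sorted(raw_addresses)

# Both orderings now render identical configuration, so no restart/reload is triggered.
assert peer_addresses({"10.0.0.2", "10.0.0.1"}) == peer_addresses({"10.0.0.1", "10.0.0.2"})
```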
Housekeeping cleanup.
…anged
Those defers are necessary to support scale-up/scale-down during the refresh, but they significantly slow down the PostgreSQL 16 bootstrap (and other daily maintenance tasks, like re-scaling, full node reboot/recovery, etc.). Muting them for now, with a proper documentation record forbidding rescaling during the refresh, until we minimise the amount of defers in PG16. Throw a warning for us to recall this promise (see the sketch below).
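A minimal sketch of the muted defer, with a simplified stand-in for the charm class; _refresh_in_progress is a hypothetical helper:

```python
import logging

logger = logging.getLogger(__name__)

class PostgresqlOperatorCharm:  # simplified stand-in for the real charm class
    def _refresh_in_progress(self) -> bool:
        """Hypothetical helper; the real charm tracks refresh state elsewhere."""
        return False

    def _on_config_changed(self, event) -> None:
        if self._refresh_in_progress():
            # Previously this path called event.defer(); deferring made the PG16
            # bootstrap and re-scaling noticeably slower, so for now we only warn
            # and document that rescaling during a refresh is unsupported.
            logger.warning(
                "Refresh in progress: skipping defer; do not rescale until it completes."
            )
            return
        # ... normal config-changed handling ...
```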
The current PG16 logic relies on Juju update-status or on_topology_change observer events, but in some cases we start Patroni without the observer, causing a long wait until the next update-status arrives.
It is hard (impossible?) to catch Juju Primary label manipulations from the Juju debug-log. Logging them simplifies troubleshooting.
We had to wait 30 seconds when the connection could not be established, which is unnecessarily long. Also, add details about the reason for the failed connection (Retry/CannotConnect).
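A sketch of a shorter retry that also surfaces the failure reason, using tenacity; the 10-second budget and the CannotConnectError name are assumptions:

```python
import logging

import psycopg2
from tenacity import retry, stop_after_delay, wait_fixed

logger = logging.getLogger(__name__)

class CannotConnectError(Exception):
    """Illustrative exception carrying the underlying connection failure reason."""

@retry(stop=stop_after_delay(10), wait=wait_fixed(1), reraise=True)
def connect(dsn: str):
    """Retry for up to ~10 seconds instead of the previous 30-second wait."""
    try:
        return psycopg2.connect(dsn)
    except psycopg2.OperationalError as error:
        # Keep the underlying reason so a Retry/CannotConnect failure is debuggable.
        logger.debug("Connection attempt failed: %s", error)
        raise CannotConnectError(str(error)) from error
```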
It speeds up single-unit app deployments.
…sses()
Otherwise the update-status event fails:
> unit-postgresql-0: relations.async_replication:Partner addresses: []
> unit-postgresql-0: cluster:Unable to get the state of the cluster
> Traceback (most recent call last):
> File "/var/lib/juju/agents/unit-postgresql-0/charm/src/cluster.py", line 619, in online_cluster_members
> cluster_status = self.cluster_status()
> ^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-0/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
> return callable(*args, **kwargs) # type: ignore
> ^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-0/charm/src/cluster.py", line 279, in cluster_status
> raise RetryError(
> tenacity.RetryError: RetryError[<Future at 0xffddafe01160 state=finished raised Exception>]
…: Host not set.
Exception:
> 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 cluster:API get_patroni_health: <Response [200]> (0.057417)
> 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 cluster:API cluster_status: [{'name': 'postgresql-0', 'role': 'leader', 'state': 'running', 'api_url': 'https://10.182.246.123:8008/patroni', 'host': '10.182.246.123', 'port': 5432, 'timeline': 1}, {'name': 'postgresql-1', 'role': 'sync_standby', 'state': 'running', 'api_url': 'https://10.182.246.163:8008/patroni', 'host': '10.182.246.163', 'port': 5432, 'timeline': 1, 'lag': 0}, {'name': 'postgresql-2', 'role': 'sync_standby', 'state': 'running', 'api_url': 'https://10.182.246.246:8008/patroni', 'host': '10.182.246.246', 'port': 5432, 'timeline': 1, 'lag': 0}]
> 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 __main__:Early exit primary_endpoint: Primary IP not in cached peer list
> 2025-08-19 20:49:40 ERROR unit.postgresql/2.juju-log server.go:406 root:Uncaught exception while in charm code:
> Traceback (most recent call last):
> File "/var/lib/juju/agents/unit-postgresql-2/charm/src/charm.py", line 2736, in <module>
> main(PostgresqlOperatorCharm)
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/__init__.py", line 356, in __call__
> return _main.main(charm_class=charm_class, use_juju_for_storage=use_juju_for_storage)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 502, in main
> manager.run()
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 486, in run
> self._emit()
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 421, in _emit
> self._emit_charm_event(self.dispatcher.event_name)
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 465, in _emit_charm_event
> event_to_emit.emit(*args, **kwargs)
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 351, in emit
> framework._emit(event)
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 924, in _emit
> self._reemit(event_path)
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 1030, in _reemit
> custom_handler(event)
> File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
> return callable(*args, **kwargs) # type: ignore
> ^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-2/charm/src/charm.py", line 1942, in _on_update_status
> self.postgresql_client_relation.oversee_users()
> File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
> return callable(*args, **kwargs) # type: ignore
> ^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-2/charm/src/relations/postgresql_provider.py", line 172, in oversee_users
> user for user in self.charm.postgresql.list_users() if user.startswith("relation-")
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
> return callable(*args, **kwargs) # type: ignore
> ^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/postgresql_k8s/v1/postgresql.py", line 959, in list_users
> with self._connect_to_database(
> ^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
> return callable(*args, **kwargs) # type: ignore
> ^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/postgresql_k8s/v1/postgresql.py", line 273, in _connect_to_database
> raise PostgreSQLUndefinedHostError("Host not set")
> charms.postgresql_k8s.v1.postgresql.PostgreSQLUndefinedHostError: Host not set
> 2025-08-19 20:49:40 ERROR juju.worker.uniter.operation runhook.go:180 hook "update-status" (via hook dispatching script: dispatch) failed: exit status 1
Thanks to dragomir.penev@ for the unit test fixes here!
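The fix itself is not quoted above; a minimal sketch of one way to guard against this crash, assuming a primary_endpoint attribute that can be None while the cached peer list is stale (all names are illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def oversee_users(charm) -> None:
    """Skip user management when no primary host is known yet, instead of
    letting list_users() raise PostgreSQLUndefinedHostError deep in the stack."""
    if not charm.primary_endpoint:  # assumption: None until the peer list is refreshed
        logger.debug("Early exit oversee_users: primary endpoint not available yet")
        return
    relation_users = {
        user for user in charm.postgresql.list_users() if user.startswith("relation-")
    }
    # ... reconcile relation_users with the users requested by relations ...
```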
It backports data-platform-workflow commit f1f8d27 to local integration test:
> patch(integration_test_charm.yaml): Increase disk space step timeout (#301)
Otherwise:
> Disk usage before cleanup
> Filesystem Size Used Avail Use% Mounted on
> /dev/root 72G 46G 27G 64% /
> tmpfs 7.9G 84K 7.9G 1% /dev/shm
> tmpfs 3.2G 1.1M 3.2G 1% /run
> tmpfs 5.0M 0 5.0M 0% /run/lock
> /dev/sdb16 881M 60M 760M 8% /boot
> /dev/sdb15 105M 6.2M 99M 6% /boot/efi
> /dev/sda1 74G 4.1G 66G 6% /mnt
> tmpfs 1.6G 12K 1.6G 1% /run/user/1001
> Error: The action 'Free up disk space' has timed out after 1 minutes.
marceloneppel left a comment:
LGTM! Thanks for the detailed description in each commit!
Issue
The previous is_restart_pending() waited for 15 seconds due to Patroni's
loop_wait default value (10 seconds), which defines how long Patroni waits
before re-checking the configuration file in order to reload it.
Solution
Instead of checking PostgreSQL's pending_restart from pg_settings,
check the Patroni API pending_restart=True/undefined flag.
While testing the new PR, a lot of instabilities were noticed, e.g.
the pending_restart=True flag flickering, or
unsorted IPs in the Patroni config triggering unnecessary Patroni rolling restarts. So, all those fixes are included here (consider reviewing each commit independently).
The controversial commit here is the refresh v3 one, which significantly slows down the PostgreSQL charm...
See Carl's remarks below. We can extract it from this PR (if necessary).
Checklist