[DPE-7726] Use Patroni API for is_restart_pending() (instead of SQL select from pg_settings) (#1049)
* DPE-7726 Use Patroni API for is_restart_pending()
The previous is_restart_pending() waited a long time because of Patroni's
default loop_wait value (10 seconds), which defines how long
Patroni waits before checking the configuration file again to reload it.
Instead of reading PostgreSQL pending_restart from pg_settings,
check the pending_restart=True flag reported by the Patroni API.
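
A minimal sketch of that check, not the charm's actual implementation. It assumes the member status comes from a GET on the Patroni REST API /patroni endpoint (the api_url form seen in the cluster_status log further down) and that the pending_restart key is simply absent when no restart is pending. The function names are hypothetical, and TLS/CA handling is left out:

```python
import json
import urllib.request


def restart_pending_from_status(status: dict) -> bool:
    """Return True only if the member status carries pending_restart=True."""
    # Assumption: the key is missing (not False) when nothing is pending.
    return status.get("pending_restart") is True


def is_restart_pending(api_url: str, timeout: float = 5.0) -> bool:
    """Fetch the member status from the Patroni REST API and read the flag."""
    # api_url is expected to look like https://<unit-ip>:8008/patroni
    with urllib.request.urlopen(api_url, timeout=timeout) as response:
        return restart_pending_from_status(json.load(response))
```

The parsing is split from the HTTP call so that the flag logic can be unit-tested without a running Patroni.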
* DPE-7726 Avoid pending_restart=True flag flickering
The current Patroni 3.2.2 has weird/flickering behaviour:
it temporarily sets the pending_restart=True flag on many changes made via the REST API.
The flag is gone within a second, but that is long enough to be caught by the charm.
Sleeping a bit is a necessary evil until the upgrade to Patroni 3.3.0.
The previous code slept for 15 seconds waiting for the pg_settings update.
Also, unnecessary restarts could be triggered by a mismatch between the
Patroni config file and in-memory changes coming from the REST API,
e.g. the slots were undefined in the yaml file but set as an empty JSON {} => None.
Update the default template to match the default API PATCHes and avoid restarts.
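
The "sleep a bit" workaround can be sketched as a double read of the flag. This is an illustration under assumptions, not the charm code: is_flag_stable, read_flag and the 1.5 second settle time are hypothetical, chosen only because the flicker is described as lasting under a second.

```python
import time
from typing import Callable


def is_flag_stable(
    read_flag: Callable[[], bool],
    settle_seconds: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Report True only if the flag is set on two reads separated by a pause."""
    if not read_flag():
        # Nothing pending: return immediately, no sleep needed.
        return False
    # The flag may be a short-lived flicker; wait and read it again.
    sleep(settle_seconds)
    return read_flag()
```

The sleep function is injectable so that tests do not have to wait.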
* DPE-7726 Fix topology observer Primary status removal
On a topology observer event, the primary unit used to lose its Primary label.
* DPE-7726 Add Patroni API logging
Also:
* use a common logger everywhere
* and add several useful log messages (e.g. DB connection)
* remove no longer necessary debug 'Init class PostgreSQL'
* align Patroni API requests style everywhere
* add Patroni API duration to debug logs
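
A rough sketch of how a call duration could be appended to a debug log line, imitating the "API get_patroni_health: <Response [200]> (0.057417)" format shown in the logs below. The helper name timed_api_call is hypothetical and the real charm may structure this differently:

```python
import logging
import time
from typing import Callable, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


def timed_api_call(name: str, call: Callable[[], T]) -> T:
    """Run call(), then log its result and duration in seconds at debug level."""
    start = time.monotonic()
    result = call()
    # %f renders six decimal places, e.g. (0.057417)
    logger.debug("API %s: %s (%f)", name, result, time.monotonic() - start)
    return result
```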
* DPE-7726 Avoid unnecessary Patroni reloads
The list of IPs was randomly ordered, causing unnecessary Patroni
configuration re-generation followed by a Patroni restart/reload.
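
To illustrate why ordering matters, here is a small sketch (hypothetical helper names, not the charm's template code): if the peer list is rendered in a stable sorted order, the same set of IPs always yields the same text, so a comparison against the previously rendered value does not report a spurious change.

```python
from typing import Iterable


def render_peer_list(peer_ips: Iterable[str]) -> str:
    """Render peers in a deterministic order, independent of input order."""
    # Plain string sorting is enough here: the goal is stability, not numeric order.
    return ",".join(sorted(set(peer_ips)))


def needs_reload(previous_rendered: str, peer_ips: Iterable[str]) -> bool:
    """Return True only when the rendered peer list really differs."""
    return render_peer_list(peer_ips) != previous_rendered
```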
* DPE-7726 Remove unnecessary properties app_units() and scoped_peer_data()
Housekeeping cleanup.
* DPE-7726 Stop deferring for non-joined peers on on_start/on_config_changed
Those defers are necessary to support scale-up/scale-down during the refresh,
but they significantly slow down the PostgreSQL 16 bootstrap (and other
day-to-day maintenance tasks, like re-scaling, full node reboot/recovery, etc.).
Muting them for now, with a proper documentation record to
forbid rescaling during the refresh, until we minimise the number of defers in PG16.
Throw a warning for us to recall this promise.
* DPE-7726 Start observer on non-Primary Patroni start to speed up re-join
The current PG16 logic relies on Juju update-status or on_topology_change
observer events, while in some cases we start Patroni without the Observer,
causing a long wait until the next update-status arrives.
* DPE-7726 Log Patroni start/stop/restart (to understand charm behavior)
* DPE-7726 Log unit status change to notice Primary label loss
It is hard (impossible?) to catch the Juju Primary label
manipulations from Juju debug-log. Logging them simplifies troubleshooting.
* DPE-7726 Fixup logs polishing
* DPE-7726 Decrease waiting for DB connection timeout
We had to wait 30 seconds when there was no connection, which is unnecessarily long.
Also, add details on the reason for a failed connection (Retry/CannotConnect).
* DPE-7726 Stop propagating primary_endpoint=None for single unit app
It speeds up single-unit app deployments.
* DPE-7726 Handle get primary cluster RetryError on get_partner_addresses()
Otherwise update-status event fails:
> unit-postgresql-0: relations.async_replication:Partner addresses: []
> unit-postgresql-0: cluster:Unable to get the state of the cluster
> Traceback (most recent call last):
> File "/var/lib/juju/agents/unit-postgresql-0/charm/src/cluster.py", line 619, in online_cluster_members
> cluster_status = self.cluster_status()
> ^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-0/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
> return callable(*args, **kwargs) # type: ignore
> ^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-0/charm/src/cluster.py", line 279, in cluster_status
> raise RetryError(
> tenacity.RetryError: RetryError[<Future at 0xffddafe01160 state=finished raised Exception>]
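
A hedged sketch of the kind of guard this describes, not the actual charm code. The local RetryError class is a stand-in for tenacity.RetryError so that the example runs without tenacity installed. The callables fetch_primary_cluster and fetch_addresses are hypothetical, and returning an empty list on failure is an assumption:

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


class RetryError(Exception):
    """Stand-in for tenacity.RetryError (assumption, keeps the sketch self-contained)."""


def get_partner_addresses(
    fetch_primary_cluster: Callable[[], str],
    fetch_addresses: Callable[[str], List[str]],
) -> List[str]:
    """Return partner addresses, or an empty list if the primary lookup gave up."""
    try:
        primary_cluster = fetch_primary_cluster()
    except RetryError:
        # Do not let the hook fail: report no partners for this run.
        logger.debug("Unable to get the primary cluster, returning no partner addresses")
        return []
    return fetch_addresses(primary_cluster)
```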
* DPE-7726 Fix exception on update-status. PostgreSQLUndefinedHostError: Host not set.
Exception:
> 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 cluster:API get_patroni_health: <Response [200]> (0.057417)
> 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 cluster:API cluster_status: [{'name': 'postgresql-0', 'role': 'leader', 'state': 'running', 'api_url': 'https://10.182.246.123:8008/patroni', 'host': '10.182.246.123', 'port': 5432, 'timeline': 1}, {'name': 'postgresql-1', 'role': 'sync_standby', 'state': 'running', 'api_url': 'https://10.182.246.163:8008/patroni', 'host': '10.182.246.163', 'port': 5432, 'timeline': 1, 'lag': 0}, {'name': 'postgresql-2', 'role': 'sync_standby', 'state': 'running', 'api_url': 'https://10.182.246.246:8008/patroni', 'host': '10.182.246.246', 'port': 5432, 'timeline': 1, 'lag': 0}]
> 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 __main__:Early exit primary_endpoint: Primary IP not in cached peer list
> 2025-08-19 20:49:40 ERROR unit.postgresql/2.juju-log server.go:406 root:Uncaught exception while in charm code:
> Traceback (most recent call last):
> File "/var/lib/juju/agents/unit-postgresql-2/charm/src/charm.py", line 2736, in <module>
> main(PostgresqlOperatorCharm)
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/__init__.py", line 356, in __call__
> return _main.main(charm_class=charm_class, use_juju_for_storage=use_juju_for_storage)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 502, in main
> manager.run()
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 486, in run
> self._emit()
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 421, in _emit
> self._emit_charm_event(self.dispatcher.event_name)
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 465, in _emit_charm_event
> event_to_emit.emit(*args, **kwargs)
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 351, in emit
> framework._emit(event)
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 924, in _emit
> self._reemit(event_path)
> File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 1030, in _reemit
> custom_handler(event)
> File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
> return callable(*args, **kwargs) # type: ignore
> ^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-2/charm/src/charm.py", line 1942, in _on_update_status
> self.postgresql_client_relation.oversee_users()
> File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
> return callable(*args, **kwargs) # type: ignore
> ^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-2/charm/src/relations/postgresql_provider.py", line 172, in oversee_users
> user for user in self.charm.postgresql.list_users() if user.startswith("relation-")
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
> return callable(*args, **kwargs) # type: ignore
> ^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/postgresql_k8s/v1/postgresql.py", line 959, in list_users
> with self._connect_to_database(
> ^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
> return callable(*args, **kwargs) # type: ignore
> ^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/postgresql_k8s/v1/postgresql.py", line 273, in _connect_to_database
> raise PostgreSQLUndefinedHostError("Host not set")
> charms.postgresql_k8s.v1.postgresql.PostgreSQLUndefinedHostError: Host not set
> 2025-08-19 20:49:40 ERROR juju.worker.uniter.operation runhook.go:180 hook "update-status" (via hook dispatching script: dispatch) failed: exit status 1
* DPE-7726 Adapt unit tests for the new code
Thanks to dragomir.penev@ for the unit test fixes here!
* DPE-7726 Increase free disk space cleanup timeout (1->3 minutes)
This backports data-platform-workflow commit f1f8d27 to the local integration test:
> patch(integration_test_charm.yaml): Increase disk space step timeout (#301)
Otherwise:
> Disk usage before cleanup
> Filesystem Size Used Avail Use% Mounted on
> /dev/root 72G 46G 27G 64% /
> tmpfs 7.9G 84K 7.9G 1% /dev/shm
> tmpfs 3.2G 1.1M 3.2G 1% /run
> tmpfs 5.0M 0 5.0M 0% /run/lock
> /dev/sdb16 881M 60M 760M 8% /boot
> /dev/sdb15 105M 6.2M 99M 6% /boot/efi
> /dev/sda1 74G 4.1G 66G 6% /mnt
> tmpfs 1.6G 12K 1.6G 1% /run/user/1001
> Error: The action 'Free up disk space' has timed out after 1 minutes.