
Commit f9c4aa4

[DPE-7726] Use Patroni API for is_restart_pending() (instead of SQL select from pg_settings) (#1049)
* DPE-7726 Use Patroni API for is_restart_pending()

  The previous is_restart_pending() waited a long time due to Patroni's loop_wait default value (10 seconds), which tells how long Patroni waits before re-reading its configuration file to reload it. Instead of checking PostgreSQL's pending_restart in pg_settings, check the Patroni API's pending_restart=True flag.

* DPE-7726 Avoid pending_restart=True flag flickering

  The current Patroni 3.2.2 has weird, flickering behaviour: it temporarily flags pending_restart=True on many REST API changes. The flag is gone within a second, but that is long enough to be caught by the charm. Sleeping a bit is a necessary evil until the Patroni 3.3.0 upgrade. The previous code slept for 15 seconds waiting for pg_settings to update.

  Unnecessary restarts could also be triggered by a mismatch between the Patroni config file and in-memory changes coming from the REST API, e.g. the slots were undefined in the YAML file but set as an empty JSON {} => None. The default template is updated to match the default API PATCHes and avoid restarts.

* DPE-7726 Fix topology observer Primary status removal

  On a topology observer event, the primary unit used to lose its Primary label.

* DPE-7726 Add Patroni API logging

  Also: use a common logger everywhere, add several useful log messages (e.g. DB connection), remove the no-longer-necessary debug message 'Init class PostgreSQL', align the Patroni API request style everywhere, and add the Patroni API call duration to debug logs.

* DPE-7726 Avoid unnecessary Patroni reloads

  The list of IPs was randomly sorted, causing unnecessary Patroni configuration re-generation followed by a Patroni restart/reload.

* DPE-7726 Remove unnecessary property app_units() and scoped_peer_data()

  Housekeeping cleanup.

* DPE-7726 Stop deferring for non-joined peers on on_start/on_config_changed

  Those defers are necessary to support scale-up/scale-down during a refresh, but they significantly slow down the PostgreSQL 16 bootstrap (and other routine maintenance tasks such as re-scaling and full node reboot/recovery). They are muted for now, with a proper documentation record forbidding rescaling during a refresh, until we minimise the number of defers in PG16. A warning is logged for us to recall this promise.

* DPE-7726 Start the observer on non-primary Patroni start to speed up re-join

  The current PG16 logic relies on Juju update-status or on_topology_change observer events, while in some cases we start Patroni without the observer, causing a long wait until the next update-status arrives.

* DPE-7726 Log Patroni start/stop/restart (to understand charm behavior)

* DPE-7726 Log unit status changes to notice Primary label loss

  It is hard (impossible?) to catch Juju Primary label manipulations from the Juju debug-log. Logging them simplifies troubleshooting.

* DPE-7726 Fixup: log polishing

* DPE-7726 Decrease the waiting-for-DB-connection timeout

  We had to wait 30 seconds when there was no connection, which is unnecessarily long. Also, add details on the reason for a failed connection (RetryError/CannotConnectError).

* DPE-7726 Stop propagating primary_endpoint=None for a single-unit app

  It speeds up single-unit app deployments.

* DPE-7726 Handle the "get primary cluster" RetryError in get_partner_addresses()

  Otherwise the update-status event fails:

  > unit-postgresql-0: relations.async_replication:Partner addresses: []
  > unit-postgresql-0: cluster:Unable to get the state of the cluster
  > Traceback (most recent call last):
  >   File "/var/lib/juju/agents/unit-postgresql-0/charm/src/cluster.py", line 619, in online_cluster_members
  >     cluster_status = self.cluster_status()
  >   File "/var/lib/juju/agents/unit-postgresql-0/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
  >     return callable(*args, **kwargs)  # type: ignore
  >   File "/var/lib/juju/agents/unit-postgresql-0/charm/src/cluster.py", line 279, in cluster_status
  >     raise RetryError(
  > tenacity.RetryError: RetryError[<Future at 0xffddafe01160 state=finished raised Exception>]

* DPE-7726 Fix exception on update-status: PostgreSQLUndefinedHostError: Host not set

  Exception:

  > 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 cluster:API get_patroni_health: <Response [200]> (0.057417)
  > 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 cluster:API cluster_status: [{'name': 'postgresql-0', 'role': 'leader', 'state': 'running', 'api_url': 'https://10.182.246.123:8008/patroni', 'host': '10.182.246.123', 'port': 5432, 'timeline': 1}, {'name': 'postgresql-1', 'role': 'sync_standby', 'state': 'running', 'api_url': 'https://10.182.246.163:8008/patroni', 'host': '10.182.246.163', 'port': 5432, 'timeline': 1, 'lag': 0}, {'name': 'postgresql-2', 'role': 'sync_standby', 'state': 'running', 'api_url': 'https://10.182.246.246:8008/patroni', 'host': '10.182.246.246', 'port': 5432, 'timeline': 1, 'lag': 0}]
  > 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 __main__:Early exit primary_endpoint: Primary IP not in cached peer list
  > 2025-08-19 20:49:40 ERROR unit.postgresql/2.juju-log server.go:406 root:Uncaught exception while in charm code:
  > Traceback (most recent call last):
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/src/charm.py", line 2736, in <module>
  >     main(PostgresqlOperatorCharm)
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/__init__.py", line 356, in __call__
  >     return _main.main(charm_class=charm_class, use_juju_for_storage=use_juju_for_storage)
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 502, in main
  >     manager.run()
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 486, in run
  >     self._emit()
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 421, in _emit
  >     self._emit_charm_event(self.dispatcher.event_name)
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 465, in _emit_charm_event
  >     event_to_emit.emit(*args, **kwargs)
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 351, in emit
  >     framework._emit(event)
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 924, in _emit
  >     self._reemit(event_path)
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 1030, in _reemit
  >     custom_handler(event)
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
  >     return callable(*args, **kwargs)  # type: ignore
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/src/charm.py", line 1942, in _on_update_status
  >     self.postgresql_client_relation.oversee_users()
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
  >     return callable(*args, **kwargs)  # type: ignore
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/src/relations/postgresql_provider.py", line 172, in oversee_users
  >     user for user in self.charm.postgresql.list_users() if user.startswith("relation-")
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
  >     return callable(*args, **kwargs)  # type: ignore
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/postgresql_k8s/v1/postgresql.py", line 959, in list_users
  >     with self._connect_to_database(
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
  >     return callable(*args, **kwargs)  # type: ignore
  >   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/postgresql_k8s/v1/postgresql.py", line 273, in _connect_to_database
  >     raise PostgreSQLUndefinedHostError("Host not set")
  > charms.postgresql_k8s.v1.postgresql.PostgreSQLUndefinedHostError: Host not set
  > 2025-08-19 20:49:40 ERROR juju.worker.uniter.operation runhook.go:180 hook "update-status" (via hook dispatching script: dispatch) failed: exit status 1

* DPE-7726 Adopt unit tests for the new code

  Thanks to dragomir.penev@ for the unit test fixes here!

* DPE-7726 Increase the free disk space cleanup timeout (1 -> 3 minutes)

  This backports data-platform-workflow commit f1f8d27 to the local integration test:

  > patch(integration_test_charm.yaml): Increase disk space step timeout (#301)

  Otherwise:

  > Disk usage before cleanup
  > Filesystem      Size  Used Avail Use% Mounted on
  > /dev/root        72G   46G   27G  64% /
  > tmpfs           7.9G   84K  7.9G   1% /dev/shm
  > tmpfs           3.2G  1.1M  3.2G   1% /run
  > tmpfs           5.0M     0  5.0M   0% /run/lock
  > /dev/sdb16      881M   60M  760M   8% /boot
  > /dev/sdb15      105M  6.2M   99M   6% /boot/efi
  > /dev/sda1        74G  4.1G   66G   6% /mnt
  > tmpfs           1.6G   12K  1.6G   1% /run/user/1001
  > Error: The action 'Free up disk space' has timed out after 1 minutes.
1 parent 49d8bb5 commit f9c4aa4

File tree

10 files changed (+164, -109 lines changed)

.github/workflows/integration_test.yaml

Lines changed: 2 additions & 2 deletions

@@ -84,10 +84,10 @@ jobs:
     needs:
       - collect-integration-tests
     runs-on: ${{ matrix.job.runner }}
-    timeout-minutes: 217 # Sum of steps `timeout-minutes` + 5
+    timeout-minutes: 219 # Sum of steps `timeout-minutes` + 5
     steps:
       - name: Free up disk space
-        timeout-minutes: 1
+        timeout-minutes: 3
         run: |
           printf '\nDisk usage before cleanup\n'
           df --human-readable

lib/charms/postgresql_k8s/v1/postgresql.py

Lines changed: 4 additions & 18 deletions

@@ -273,8 +273,11 @@ def _connect_to_database(
             raise PostgreSQLUndefinedHostError("Host not set")
         if not self.password:
             raise PostgreSQLUndefinedPasswordError("Password not set")
+
+        dbname = database if database else self.database
+        logger.debug(f"New DB connection: dbname='{dbname}' user='{self.user}' host='{host}' connect_timeout=1")
         connection = psycopg2.connect(
-            f"dbname='{database if database else self.database}' user='{self.user}' host='{host}'"
+            f"dbname='{dbname}' user='{self.user}' host='{host}'"
             f"password='{self.password}' connect_timeout=1"
         )
         connection.autocommit = True
@@ -1322,23 +1325,6 @@ def update_user_password(
             if connection is not None:
                 connection.close()

-    def is_restart_pending(self) -> bool:
-        """Query pg_settings for pending restart."""
-        connection = None
-        try:
-            with self._connect_to_database() as connection, connection.cursor() as cursor:
-                cursor.execute("SELECT COUNT(*) FROM pg_settings WHERE pending_restart=True;")
-                return cursor.fetchone()[0] > 0
-        except psycopg2.OperationalError:
-            logger.warning("Failed to connect to PostgreSQL.")
-            return False
-        except psycopg2.Error as e:
-            logger.error(f"Failed to check if restart is pending: {e}")
-            return False
-        finally:
-            if connection:
-                connection.close()
-
     def database_exists(self, db: str) -> bool:
         """Check whether specified database exists."""
         connection = None

src/charm.py

Lines changed: 34 additions & 57 deletions

@@ -177,7 +177,7 @@ def unit_number(unit_name: str):
         # Lowest unit number is last to refresh
         last_unit_to_refresh = sorted(all_units, key=unit_number)[0].replace("/", "-")
         if self._charm._patroni.get_primary() == last_unit_to_refresh:
-            logging.info(
+            logger.info(
                 f"Unit {last_unit_to_refresh} was already primary during pre-refresh check"
             )
         else:
@@ -187,7 +187,7 @@ def unit_number(unit_name: str):
                 logger.warning(f"switchover failed with reason: {e}")
                 raise charm_refresh.PrecheckFailed("Unable to switch primary")
             else:
-                logging.info(
+                logger.info(
                     f"Switched primary to unit {last_unit_to_refresh} during pre-refresh check"
                 )

@@ -477,21 +477,6 @@ def patroni_scrape_config(self) -> list[dict]:
             }
         ]

-    @property
-    def app_units(self) -> set[Unit]:
-        """The peer-related units in the application."""
-        if not self._peers:
-            return set()
-
-        return {self.unit, *self._peers.units}
-
-    def scoped_peer_data(self, scope: SCOPES) -> dict | None:
-        """Returns peer data based on scope."""
-        if scope == APP_SCOPE:
-            return self.app_peer_data
-        elif scope == UNIT_SCOPE:
-            return self.unit_peer_data
-
     @property
     def app_peer_data(self) -> dict:
         """Application peer relation data object."""
@@ -628,7 +613,6 @@ def postgresql(self) -> PostgreSQL:
         """Returns an instance of the object used to interact with the database."""
         password = str(self.get_secret(APP_SCOPE, f"{USER}-password"))
         if self._postgresql is None or self._postgresql.primary_host is None:
-            logger.debug("Init class PostgreSQL")
             self._postgresql = PostgreSQL(
                 primary_host=self.primary_endpoint,
                 current_host=self._unit_ip,
@@ -655,15 +639,17 @@ def primary_endpoint(self) -> str | None:
                     # Force a retry if there is no primary or the member that was
                     # returned is not in the list of the current cluster members
                     # (like when the cluster was not updated yet after a failed switchover).
-                    if not primary_endpoint or primary_endpoint not in self._units_ips:
-                        # TODO figure out why peer data is not available
-                        if primary_endpoint and len(self._units_ips) == 1 and len(self._peers.units) > 1:
-                            logger.warning(
-                                "Possibly incomplete peer data: Will not map primary IP to unit IP"
-                            )
-                            return primary_endpoint
-                        logger.debug("primary endpoint early exit: Primary IP not in cached peer list.")
+                    if not primary_endpoint:
+                        logger.warning(f"Missing primary IP for {primary}")
                         primary_endpoint = None
+                    elif primary_endpoint not in self._units_ips:
+                        if len(self._peers.units) == 0:
+                            logger.info(f"The unit didn't join {PEER} relation? Using {primary_endpoint}")
+                        elif len(self._units_ips) == 1 and len(self._peers.units) > 1:
+                            logger.warning(f"Possibly incomplete peer data, keep using {primary_endpoint}")
+                        else:
+                            logger.debug("Early exit primary_endpoint: Primary IP not in cached peer list")
+                            primary_endpoint = None
         except RetryError:
             return None
         else:
@@ -1348,7 +1334,7 @@ def _on_cluster_topology_change(self, _):
         logger.info("Cluster topology changed")
         if self.primary_endpoint:
             self._update_relation_endpoints()
-            self.set_unit_status(ActiveStatus())
+            self._set_primary_status_message()

     def _on_install(self, event: InstallEvent) -> None:
         """Install prerequisites for the application."""
@@ -1460,10 +1446,8 @@ def _on_config_changed(self, event) -> None: # noqa: C901
             return

         if self.refresh is None:
-            logger.debug("Defer on_config_changed: Refresh could be in progress")
-            event.defer()
-            return
-        if self.refresh.in_progress:
+            logger.warning("Warning _on_config_changed: Refresh could be in progress")
+        elif self.refresh.in_progress:
             logger.debug("Defer on_config_changed: Refresh in progress")
             event.defer()
             return
@@ -1585,10 +1569,8 @@ def _can_start(self, event: StartEvent) -> bool:

         # Safeguard against starting while refreshing.
         if self.refresh is None:
-            logger.debug("Defer on_start: Refresh could be in progress")
-            event.defer()
-            return False
-        if self.refresh.in_progress:
+            logger.warning("Warning on_start: Refresh could be in progress")
+        elif self.refresh.in_progress:
             # TODO: we should probably start workload if scale up while refresh in progress
             logger.debug("Defer on_start: Refresh in progress")
             event.defer()
@@ -1657,7 +1639,7 @@ def _restart_metrics_service(self, postgres_snap: snap.Snap) -> None:
         try:
             snap_password = postgres_snap.get("exporter.password")
         except snap.SnapError:
-            logger.warning("Early exit: Trying to reset metrics service with no configuration set")
+            logger.warning("Early exit: skipping exporter setup (no configuration set)")
             return None

         if snap_password != self.get_secret(APP_SCOPE, MONITORING_PASSWORD_KEY):
@@ -1963,6 +1945,7 @@ def _on_update_status(self, _) -> None:

         if not self._patroni.member_started and self._patroni.is_member_isolated:
             self._patroni.restart_patroni()
+            self._observer.start_observer()
             return

         # Update the sync-standby endpoint in the async replication data.
@@ -2092,8 +2075,9 @@ def _handle_processes_failures(self) -> bool:
             logger.info("PostgreSQL data directory was not empty. Moved pg_wal")
             return True
         try:
-            self._patroni.restart_patroni()
             logger.info("restarted PostgreSQL because it was not running")
+            self._patroni.restart_patroni()
+            self._observer.start_observer()
             return True
         except RetryError:
             logger.error("failed to restart PostgreSQL after checking that it was not running")
@@ -2124,11 +2108,8 @@ def _set_primary_status_message(self) -> None:
                     danger_state = " (read-only)"
                 elif len(self._patroni.get_running_cluster_members()) < self.app.planned_units():
                     danger_state = " (degraded)"
-                self.set_unit_status(
-                    ActiveStatus(
-                        f"{'Standby' if self.is_standby_leader else 'Primary'}{danger_state}"
-                    )
-                )
+                unit_status = "Standby" if self.is_standby_leader else "Primary"
+                self.set_unit_status(ActiveStatus(f"{unit_status}{danger_state}"))
             elif self._patroni.member_started:
                 self.set_unit_status(ActiveStatus())
         except (RetryError, ConnectionError) as e:
@@ -2335,12 +2316,13 @@ def _can_connect_to_postgresql(self) -> bool:
         if not self.postgresql.password or not self.postgresql.current_host:
             return False
         try:
-            for attempt in Retrying(stop=stop_after_delay(30), wait=wait_fixed(3)):
+            for attempt in Retrying(stop=stop_after_delay(10), wait=wait_fixed(3)):
                 with attempt:
                     if not self.postgresql.get_postgresql_timezones():
+                        logger.debug("Cannot connect to database (CannotConnectError)")
                         raise CannotConnectError
         except RetryError:
-            logger.debug("Cannot connect to database")
+            logger.debug("Cannot connect to database (RetryError)")
             return False
         return True

@@ -2381,7 +2363,7 @@ def update_config(
             parameters=pg_parameters,
             no_peers=no_peers,
             user_databases_map=self.relations_user_databases_map,
-            slots=replication_slots or None,
+            slots=replication_slots,
         )
         if no_peers:
             return True
@@ -2489,18 +2471,13 @@ def _handle_postgresql_restart_need(self) -> None:
                 self._patroni.reload_patroni_configuration()
             except Exception as e:
                 logger.error(f"Reload patroni call failed! error: {e!s}")
-        # Wait for some more time than the Patroni's loop_wait default value (10 seconds),
-        # which tells how much time Patroni will wait before checking the configuration
-        # file again to reload it.
-        try:
-            for attempt in Retrying(stop=stop_after_attempt(5), wait=wait_fixed(3)):
-                with attempt:
-                    restart_postgresql = restart_postgresql or self.postgresql.is_restart_pending()
-                    if not restart_postgresql:
-                        raise Exception
-        except RetryError:
-            # Ignore the error, as it happens only to indicate that the configuration has not changed.
-            pass
+
+        restart_pending = self._patroni.is_restart_pending()
+        logger.debug(
+            f"Checking if restart pending: TLS={restart_postgresql} or API={restart_pending}"
+        )
+        restart_postgresql = restart_postgresql or restart_pending
+
         self.unit_peer_data.update({"tls": "enabled" if self.is_tls_enabled else ""})
         self.postgresql_client_relation.update_endpoints()

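The commit message also mentions adopting unit tests for the new code. Below is a hedged sketch of how such a test might stub the Patroni REST response for a pending-restart check like the one sketched after the commit message; the function and test names are illustrative and are not the repository's actual tests.

```python
from unittest.mock import MagicMock, patch

import requests


def is_restart_pending(patroni_url: str) -> bool:
    """Hypothetical helper mirroring the sketch shown after the commit message."""
    response = requests.get(f"{patroni_url}/patroni", verify=False, timeout=10)
    return response.json().get("pending_restart", False)


@patch("requests.get")
def test_restart_pending_reported(mock_get):
    # Patroni reports pending_restart=True until the member is restarted.
    mock_get.return_value = MagicMock(
        json=lambda: {"state": "running", "pending_restart": True}
    )
    assert is_restart_pending("https://10.0.0.1:8008") is True


@patch("requests.get")
def test_restart_pending_absent(mock_get):
    # Assume the field is omitted when no restart is needed.
    mock_get.return_value = MagicMock(json=lambda: {"state": "running"})
    assert is_restart_pending("https://10.0.0.1:8008") is False
```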