
Conversation

@taurus-forever
Contributor

@taurus-forever taurus-forever commented Jul 15, 2025

Issue

The previous is_restart_pending() waited for 15 seconds due to Patroni's
loop_wait default value (10 seconds), which defines how long Patroni waits
before re-checking the configuration file in order to reload it.

Solution

Instead of checking the PostgreSQL pending_restart flag in pg_settings,
check the Patroni API's pending_restart flag (True or undefined).
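For illustration, a minimal sketch of the new-style check, assuming a Patroni REST API member endpoint reachable on port 8008 (the URL, TLS verification, and retry handling are placeholders, not the charm's actual code):

```python
# Minimal sketch, assuming a locally reachable Patroni member endpoint;
# TLS verification, auth and retries are simplified compared to the charm.
import requests

def is_restart_pending(member_url: str = "http://localhost:8008/patroni") -> bool:
    """Return True if Patroni reports a pending restart for this member."""
    response = requests.get(member_url, timeout=10)
    response.raise_for_status()
    # Patroni only includes the key while a restart is pending, so a missing
    # key means no restart is needed (the "undefined" case mentioned above).
    return response.json().get("pending_restart", False)
```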

While testing the new PR, a lot of instabilities were noticed, e.g. the pending_restart=True flag flickering
or unsorted IPs in the Patroni config triggering unnecessary Patroni rolling restarts.
So, all those fixes are included here (consider reviewing each commit independently).

The controversial commit here is refresh v3, which significantly slows down the PostgreSQL charm...
See Carl's remarks below. We can extract it from this PR (if necessary).

Checklist

  • I have added or updated any relevant documentation.
  • I have cleaned any remaining cloud resources from my accounts.

@codecov

codecov bot commented Jul 15, 2025

Codecov Report

❌ Patch coverage is 50.00000% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.10%. Comparing base (63846ad) to head (2a81476).
⚠️ Report is 4 commits behind head on 16/edge.

Files with missing lines Patch % Lines
src/charm.py 45.83% 11 Missing and 2 partials ⚠️
src/cluster.py 63.63% 8 Missing ⚠️
src/relations/async_replication.py 0.00% 6 Missing ⚠️

❌ Your project status has failed because the head coverage (65.10%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           16/edge    #1049      +/-   ##
===========================================
+ Coverage    64.91%   65.10%   +0.18%     
===========================================
  Files           17       17              
  Lines         4275     4281       +6     
  Branches       656      655       -1     
===========================================
+ Hits          2775     2787      +12     
+ Misses        1333     1319      -14     
- Partials       167      175       +8     

☔ View full report in Codecov by Sentry.

@taurus-forever taurus-forever changed the title Use Patroni API for is_restart_pending() [DPE-7726] Use Patroni API for is_restart_pending() (instead of SQL select from pg_settings) Jul 15, 2025
@taurus-forever taurus-forever force-pushed the alutay/is_restart_pending branch 5 times, most recently from 9c17d76 to e31de74 Compare August 2, 2025 00:40
@taurus-forever taurus-forever force-pushed the alutay/is_restart_pending branch 2 times, most recently from 1703639 to ee8e44b Compare August 13, 2025 23:20
@taurus-forever taurus-forever force-pushed the alutay/is_restart_pending branch 2 times, most recently from 710efc1 to 0a7b549 Compare August 19, 2025 21:46
The previous is_restart_pending() waited for a long time due to Patroni's
loop_wait default value (10 seconds), which defines how long Patroni waits
before re-checking the configuration file in order to reload it.

Instead of checking the PostgreSQL pending_restart flag in pg_settings,
let's check the Patroni API's pending_restart=True flag.
@taurus-forever taurus-forever force-pushed the alutay/is_restart_pending branch from 17e5454 to ae846e9 Compare August 20, 2025 12:50
@taurus-forever taurus-forever added the bug Something isn't working as expected label Aug 20, 2025
The current Patroni 3.2.2 has weird/flickering behaviour:
it temporarily flags pending_restart=True on many changes made through the REST API,
which is gone within a second but lasts long enough to be caught by the charm.
Sleeping a bit is a necessary evil until the Patroni 3.3.0 upgrade.

The previous code slept for 15 seconds waiting for the pg_settings update.
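Roughly, the workaround looks like this (a sketch only; the function name and the settle delay are assumptions, not the charm's code):

```python
import time
from typing import Callable

# Sketch of the workaround: treat pending_restart=True as real only if it is
# still set after a short settle period, so Patroni 3.2.2's transient flag
# raised during REST API PATCHes does not trigger an unnecessary restart.
def restart_really_pending(read_flag: Callable[[], bool], settle: float = 3.0) -> bool:
    if not read_flag():
        return False
    time.sleep(settle)  # necessary evil until the Patroni 3.3.0 upgrade
    return read_flag()
```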

Also, unnecessary restarts could be triggered by a mismatch between the
Patroni config file and in-memory changes coming from the REST API,
e.g. the slots were undefined in the YAML file but set as an empty JSON {} => None.
Update the default template to match the default API PATCHes and avoid restarts.
On a topology observer event, the primary unit used to lose the Primary label.
Also:
* use a common logger everywhere
* add several useful log messages (e.g. DB connection)
* remove the no longer necessary debug 'Init class PostgreSQL'
* align the Patroni API request style everywhere
* add the Patroni API request duration to debug logs
The list of IPs was randomly sorted, causing unnecessary Patroni
configuration re-generation followed by a Patroni restart/reload.
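The fix boils down to rendering the peer list deterministically; a hedged sketch (names are illustrative, not the charm's helpers):

```python
# Sketch: sort peer IPs before rendering the Patroni config so that the same
# set of members always produces identical output and no spurious reload.
def render_member_ips(peer_ips: set[str]) -> list[str]:
    return sorted(peer_ips)

# The same members in any arrival order now yield one stable rendering.
assert render_member_ips({"10.0.0.3", "10.0.0.1"}) == ["10.0.0.1", "10.0.0.3"]
```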
…anged

Those defers are necessary to support scale-up/scale-down during a refresh,
but they significantly slow down the PostgreSQL 16 bootstrap (and other
daily maintenance tasks, like re-scaling, full node reboot/recovery, etc.).

Muting them for now, with a proper documentation record forbidding rescaling
during a refresh, until we minimise the amount of defers in PG16.
A warning is thrown for us to recall this promise (see the sketch below).
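A rough illustration of the intent (handler and attribute names here are hypothetical, not the charm's actual code):

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical handler: instead of deferring the event during a refresh,
# log a warning (the "promise" to revisit once PG16 needs fewer defers)
# and return, keeping bootstrap and re-scaling fast.
def handle_peers_changed(refresh_in_progress: bool, reconfigure) -> None:
    if refresh_in_progress:
        logger.warning(
            "Refresh in progress: skipping defer; rescaling during refresh is unsupported"
        )
        return
    reconfigure()
```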
The current PG16 logic relies on Juju update-status or on_topology_change
observer events, while in some cases we start Patroni without the observer,
causing a long wait until the next update-status arrives.
It is hard (impossible?) to catch the Juju Primary label
manipulations from the Juju debug-log. Logging it simplifies troubleshooting.
We had to wait 30 seconds in case of a lack of connection, which is unnecessarily long.

Also, add details on the reason for a failed connection (Retry/CannotConnect).
It speeds up single-unit app deployments.
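Something along these lines (a sketch with assumed timings; the real stop/wait values and logging live in the charm):

```python
import logging

import requests
from tenacity import retry, stop_after_delay, wait_fixed

logger = logging.getLogger(__name__)

# Sketch: fail fast (well under the old 30 s) and record why the connection
# failed, so Retry/CannotConnect situations are easy to diagnose in debug logs.
@retry(stop=stop_after_delay(10), wait=wait_fixed(2), reraise=True)
def get_member_status(url: str) -> dict:
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        return response.json()
    except requests.RequestException as error:
        logger.debug("Patroni API connection failed: %s", error)
        raise
```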
…sses()

Otherwise the update-status event fails:

> unit-postgresql-0: relations.async_replication:Partner addresses: []
> unit-postgresql-0: cluster:Unable to get the state of the cluster
> Traceback (most recent call last):
>   File "/var/lib/juju/agents/unit-postgresql-0/charm/src/cluster.py", line 619, in online_cluster_members
>     cluster_status = self.cluster_status()
>                      ^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-0/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
>     return callable(*args, **kwargs)  # type: ignore
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-0/charm/src/cluster.py", line 279, in cluster_status
>     raise RetryError(
> tenacity.RetryError: RetryError[<Future at 0xffddafe01160 state=finished raised Exception>]
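The guard this traceback calls for could look roughly like the following (function and parameter names are assumptions, not the actual fix):

```python
from typing import Callable

# Sketch: skip the cluster-status lookup when no partner addresses are known,
# instead of letting cluster_status() exhaust its retries and raise RetryError
# during update-status.
def online_cluster_members(
    partner_addresses: list[str],
    cluster_status: Callable[[], list[dict]],
) -> list[dict]:
    if not partner_addresses:
        return []
    return [m for m in cluster_status() if m.get("state") == "running"]
```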
…: Host not set.

Exception:

> 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 cluster:API get_patroni_health: <Response [200]> (0.057417)
> 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 cluster:API cluster_status: [{'name': 'postgresql-0', 'role': 'leader', 'state': 'running', 'api_url': 'https://10.182.246.123:8008/patroni', 'host': '10.182.246.123', 'port': 5432, 'timeline': 1}, {'name': 'postgresql-1', 'role': 'sync_standby', 'state': 'running', 'api_url': 'https://10.182.246.163:8008/patroni', 'host': '10.182.246.163', 'port': 5432, 'timeline': 1, 'lag': 0}, {'name': 'postgresql-2', 'role': 'sync_standby', 'state': 'running', 'api_url': 'https://10.182.246.246:8008/patroni', 'host': '10.182.246.246', 'port': 5432, 'timeline': 1, 'lag': 0}]
> 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 __main__:Early exit primary_endpoint: Primary IP not in cached peer list
> 2025-08-19 20:49:40 ERROR unit.postgresql/2.juju-log server.go:406 root:Uncaught exception while in charm code:
> Traceback (most recent call last):
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/src/charm.py", line 2736, in <module>
>     main(PostgresqlOperatorCharm)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/__init__.py", line 356, in __call__
>     return _main.main(charm_class=charm_class, use_juju_for_storage=use_juju_for_storage)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 502, in main
>     manager.run()
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 486, in run
>     self._emit()
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 421, in _emit
>     self._emit_charm_event(self.dispatcher.event_name)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 465, in _emit_charm_event
>     event_to_emit.emit(*args, **kwargs)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 351, in emit
>     framework._emit(event)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 924, in _emit
>     self._reemit(event_path)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 1030, in _reemit
>     custom_handler(event)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
>     return callable(*args, **kwargs)  # type: ignore
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/src/charm.py", line 1942, in _on_update_status
>     self.postgresql_client_relation.oversee_users()
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
>     return callable(*args, **kwargs)  # type: ignore
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/src/relations/postgresql_provider.py", line 172, in oversee_users
>     user for user in self.charm.postgresql.list_users() if user.startswith("relation-")
>                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
>     return callable(*args, **kwargs)  # type: ignore
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/postgresql_k8s/v1/postgresql.py", line 959, in list_users
>     with self._connect_to_database(
>          ^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
>     return callable(*args, **kwargs)  # type: ignore
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/postgresql_k8s/v1/postgresql.py", line 273, in _connect_to_database
>     raise PostgreSQLUndefinedHostError("Host not set")
> charms.postgresql_k8s.v1.postgresql.PostgreSQLUndefinedHostError: Host not set
> 2025-08-19 20:49:40 ERROR juju.worker.uniter.operation runhook.go:180 hook "update-status" (via hook dispatching script: dispatch) failed: exit status 1
Thanks to dragomir.penev@ for the unit test fixes here!
@taurus-forever taurus-forever force-pushed the alutay/is_restart_pending branch from ae846e9 to 0bd81cd Compare August 20, 2025 13:54
@taurus-forever taurus-forever marked this pull request as ready for review August 20, 2025 20:35
It backports data-platform-workflow commit f1f8d27 to the local integration tests:
> patch(integration_test_charm.yaml): Increase disk space step timeout (#301)

Otherwise:

> Disk usage before cleanup
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/root        72G   46G   27G  64% /
> tmpfs           7.9G   84K  7.9G   1% /dev/shm
> tmpfs           3.2G  1.1M  3.2G   1% /run
> tmpfs           5.0M     0  5.0M   0% /run/lock
> /dev/sdb16      881M   60M  760M   8% /boot
> /dev/sdb15      105M  6.2M   99M   6% /boot/efi
> /dev/sda1        74G  4.1G   66G   6% /mnt
> tmpfs           1.6G   12K  1.6G   1% /run/user/1001
> Error: The action 'Free up disk space' has timed out after 1 minutes.
Member

@marceloneppel marceloneppel left a comment

LGTM! Thanks for the detailed description in each commit!

@taurus-forever taurus-forever merged commit f9c4aa4 into 16/edge Aug 21, 2025
150 of 159 checks passed
@taurus-forever taurus-forever deleted the alutay/is_restart_pending branch August 21, 2025 18:39