
Conversation

@dragomirp
Contributor

@dragomirp dragomirp commented Nov 29, 2023

Backup tests were missed when switching to self-hosted runners and are likely failing due to a proxy issue.

This PR re-enables all missed tests and switches back to the reusable workflow.

@dragomirp dragomirp changed the title [] Reenable backup tests and revert to reusable workflow [DPE-2904] Reenable backup tests and revert to reusable workflow Nov 29, 2023
@dragomirp dragomirp marked this pull request as ready for review November 30, 2023 10:36
@dragomirp
Contributor Author

If I manage to get backups to work by EOD, I'll switch this PR back to the self-hosted workflow; if not, we should merge this to have all tests running.

Member

@marceloneppel marceloneppel left a comment


LGTM for now, in case we need more investigation or help from IS.

@dragomirp dragomirp merged commit b2889e0 into main Nov 30, 2023
@dragomirp dragomirp deleted the revert_workflow branch November 30, 2023 17:34
BON4 pushed a commit to BON4/postgresql-operator that referenced this pull request Apr 23, 2024
…onical#301)

* Move dashboard legends to the bottom of the graph

* Missed test marks
taurus-forever added a commit that referenced this pull request Aug 20, 2025
This backports data-platform-workflow commit f1f8d27 to the local integration test:
> patch(integration_test_charm.yaml): Increase disk space step timeout (#301)

Otherwise:

> Disk usage before cleanup
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/root        72G   46G   27G  64% /
> tmpfs           7.9G   84K  7.9G   1% /dev/shm
> tmpfs           3.2G  1.1M  3.2G   1% /run
> tmpfs           5.0M     0  5.0M   0% /run/lock
> /dev/sdb16      881M   60M  760M   8% /boot
> /dev/sdb15      105M  6.2M   99M   6% /boot/efi
> /dev/sda1        74G  4.1G   66G   6% /mnt
> tmpfs           1.6G   12K  1.6G   1% /run/user/1001
> Error: The action 'Free up disk space' has timed out after 1 minutes.
taurus-forever added a commit that referenced this pull request Aug 21, 2025
…elect from pg_settings) (#1049)

* DPE-7726 Use Patroni API for is_restart_pending()

The previous is_restart_pending() waited too long due to Patroni's
loop_wait default value (10 seconds), which defines how long Patroni
waits before re-checking the configuration file to reload it.

Instead of checking the PostgreSQL pending_restart flag from pg_settings,
check the pending_restart=True flag from the Patroni API.
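A minimal sketch of the idea, assuming a hypothetical unit endpoint URL and helper name (the actual charm code differs):

```python
# Sketch: read pending_restart straight from the Patroni REST API instead of
# polling pg_settings and waiting out Patroni's loop_wait (10s by default).
import requests

PATRONI_URL = "https://10.0.0.1:8008/patroni"  # hypothetical unit endpoint

def is_restart_pending() -> bool:
    # Patroni's member status includes "pending_restart": true when a
    # configuration change requires a PostgreSQL restart.
    response = requests.get(PATRONI_URL, verify=False, timeout=10)
    response.raise_for_status()
    return response.json().get("pending_restart", False)
```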

* DPE-7726 Avoid pending_restart=True flag flickering

The current Patroni 3.2.2 has weird/flickering behaviour: it temporarily
flags pending_restart=True on many changes to the REST API, which is gone
within a second but long enough to be caught by the charm.
Sleeping a bit is a necessary evil until the Patroni 3.3.0 upgrade.
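A sketch of that mitigation, reusing the hypothetical is_restart_pending() helper above: only report a restart as pending if the flag survives a short debounce window.

```python
import time

def is_restart_pending_debounced(checks: int = 3, delay: float = 2.0) -> bool:
    # The transient pending_restart=True flicker disappears within a second,
    # so require the flag to stay set across a few spaced-out checks.
    for _ in range(checks):
        if not is_restart_pending():
            return False
        time.sleep(delay)
    return True
```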

The previous code slept for 15 seconds waiting for the pg_settings update.

Also, unnecessary restarts could be triggered by a mismatch between the
Patroni config file and in-memory changes coming from the REST API,
e.g. the slots were undefined in the YAML file but set as an empty JSON {} => None.
Update the default template to match the default API PATCHes and avoid restarts.

* DPE-7726 Fix topology observer Primary status removal

On a topology observer event, the primary unit used to lose the Primary label.

* DPE-7726 Add Patroni API logging

Also:
* use a common logger everywhere
* add several useful log messages (e.g. DB connection)
* remove the no-longer-necessary debug 'Init class PostgreSQL'
* align Patroni API request style everywhere
* add Patroni API duration to debug logs

* DPE-7726 Avoid unnecessary Patroni reloads

The list of IPs was randomly ordered, causing unnecessary Patroni
configuration re-generation with a subsequent Patroni restart/reload.
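For illustration, a tiny sketch of the fix (helper name hypothetical): sort the addresses before rendering, so the same membership always produces byte-identical config.

```python
def render_member_ips(ips: set[str]) -> list[str]:
    # Sets iterate in arbitrary order; sorting makes the rendered Patroni
    # config deterministic, so an unchanged cluster no longer looks
    # "modified" and no longer triggers a reload/restart.
    return sorted(ips)

assert render_member_ips({"10.0.0.3", "10.0.0.1"}) == ["10.0.0.1", "10.0.0.3"]
```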

* DPE-7726 Remove unnecessary properties app_units() and scoped_peer_data()

Housekeeping cleanup.

* DPE-7726 Stop deferring for non-joined peers on on_start/on_config_changed

Those defers are necessary to support scale-up/scale-down during a refresh,
but they significantly slow down the PostgreSQL 16 bootstrap (and other
daily maintenance tasks, like re-scaling, full node reboot/recovery, etc.).

Mute them for now, with a proper documentation record forbidding
rescaling during a refresh, until we minimise the amount of defers in PG16.
Throw a warning for us to recall this promise, as sketched below.
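A rough sketch of the shape of that change (names hypothetical, not the charm's actual code):

```python
import logging

logger = logging.getLogger(__name__)

def _on_config_changed(self, event) -> None:
    if not self._all_peers_joined():  # hypothetical helper
        # Previously: event.defer(). Correct for rescale-during-refresh,
        # but it significantly slowed down the PostgreSQL 16 bootstrap.
        logger.warning(
            "Peers not joined; skipping defer (rescaling during refresh is unsupported)"
        )
        return
    ...
```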

* DPE-7726 Start observer on non-Primary Patroni start to speedup re-join

The current PG16 logic relies on Juju update-status or on_topology_change
observer events, while in some cases we start Patroni without the observer,
causing a long wait until the next update-status arrives.

* DPE-7726 Log Patroni start/stop/restart (to understand charm behavior)

* DPE-7726 Log unit status change to notice Primary label loss

It is hard (impossible?) to catch Juju Primary label
manipulations from Juju debug-log. Logging it simplifies troubleshooting.

* DPE-7726 Fixup logs polishing

* DPE-7726 Decrease waiting for DB connection timeout

We had to wait 30 seconds when there was no connection, which is unnecessarily long.

Also, add details about the reason for a failed connection (Retry/CannotConnect).
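A hedged sketch of fast-failing the connection with psycopg2 (credentials and logger are placeholders, not the charm's actual values):

```python
import logging

import psycopg2

logger = logging.getLogger(__name__)

def connect_to_database(host: str, password: str):
    try:
        return psycopg2.connect(
            host=host,
            port=5432,
            user="operator",      # placeholder credentials
            password=password,
            dbname="postgres",
            connect_timeout=5,    # fail fast instead of waiting ~30s
        )
    except psycopg2.OperationalError as e:
        # Surface the underlying reason instead of a bare Retry/CannotConnect.
        logger.error("DB connection to %s failed: %s", host, e)
        raise
```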

* DPE-7726 Stop propagating primary_endpoint=None for single-unit app

It speeds up single-unit app deployments.

* DPE-7726 Handle get primary cluster RetryError in get_partner_addresses()

Otherwise update-status event fails:

> unit-postgresql-0: relations.async_replication:Partner addresses: []
> unit-postgresql-0: cluster:Unable to get the state of the cluster
> Traceback (most recent call last):
>   File "/var/lib/juju/agents/unit-postgresql-0/charm/src/cluster.py", line 619, in online_cluster_members
>     cluster_status = self.cluster_status()
>                      ^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-0/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
>     return callable(*args, **kwargs)  # type: ignore
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-0/charm/src/cluster.py", line 279, in cluster_status
>     raise RetryError(
> tenacity.RetryError: RetryError[<Future at 0xffddafe01160 state=finished raised Exception>]
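A sketch of the guard, assuming a hypothetical accessor for the online members: tenacity.RetryError is caught so update-status degrades to an empty partner list instead of crashing.

```python
import logging

from tenacity import RetryError

logger = logging.getLogger(__name__)

def get_partner_addresses(patroni) -> list[str]:
    try:
        members = patroni.online_cluster_members()  # hypothetical accessor
    except RetryError:
        logger.warning("Unable to get cluster state; returning no partner addresses")
        return []
    return [member["host"] for member in members]
```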

* DPE-7726 Fix exception on update-status. PostgreSQLUndefinedHostError: Host not set.

Exception:

> 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 cluster:API get_patroni_health: <Response [200]> (0.057417)
> 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 cluster:API cluster_status: [{'name': 'postgresql-0', 'role': 'leader', 'state': 'running', 'api_url': 'https://10.182.246.123:8008/patroni', 'host': '10.182.246.123', 'port': 5432, 'timeline': 1}, {'name': 'postgresql-1', 'role': 'sync_standby', 'state': 'running', 'api_url': 'https://10.182.246.163:8008/patroni', 'host': '10.182.246.163', 'port': 5432, 'timeline': 1, 'lag': 0}, {'name': 'postgresql-2', 'role': 'sync_standby', 'state': 'running', 'api_url': 'https://10.182.246.246:8008/patroni', 'host': '10.182.246.246', 'port': 5432, 'timeline': 1, 'lag': 0}]
> 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 __main__:Early exit primary_endpoint: Primary IP not in cached peer list
> 2025-08-19 20:49:40 ERROR unit.postgresql/2.juju-log server.go:406 root:Uncaught exception while in charm code:
> Traceback (most recent call last):
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/src/charm.py", line 2736, in <module>
>     main(PostgresqlOperatorCharm)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/__init__.py", line 356, in __call__
>     return _main.main(charm_class=charm_class, use_juju_for_storage=use_juju_for_storage)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 502, in main
>     manager.run()
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 486, in run
>     self._emit()
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 421, in _emit
>     self._emit_charm_event(self.dispatcher.event_name)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 465, in _emit_charm_event
>     event_to_emit.emit(*args, **kwargs)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 351, in emit
>     framework._emit(event)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 924, in _emit
>     self._reemit(event_path)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 1030, in _reemit
>     custom_handler(event)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
>     return callable(*args, **kwargs)  # type: ignore
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/src/charm.py", line 1942, in _on_update_status
>     self.postgresql_client_relation.oversee_users()
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
>     return callable(*args, **kwargs)  # type: ignore
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/src/relations/postgresql_provider.py", line 172, in oversee_users
>     user for user in self.charm.postgresql.list_users() if user.startswith("relation-")
>                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
>     return callable(*args, **kwargs)  # type: ignore
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/postgresql_k8s/v1/postgresql.py", line 959, in list_users
>     with self._connect_to_database(
>          ^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
>     return callable(*args, **kwargs)  # type: ignore
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/postgresql_k8s/v1/postgresql.py", line 273, in _connect_to_database
>     raise PostgreSQLUndefinedHostError("Host not set")
> charms.postgresql_k8s.v1.postgresql.PostgreSQLUndefinedHostError: Host not set
> 2025-08-19 20:49:40 ERROR juju.worker.uniter.operation runhook.go:180 hook "update-status" (via hook dispatching script: dispatch) failed: exit status 1
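One plausible shape of the fix (property and method names here are illustrative, not the charm's exact API): bail out of the hook early when the primary host is not resolvable yet, rather than letting list_users() raise.

```python
def _on_update_status(self, event) -> None:
    # The "Host not set" case above: the primary endpoint is not resolved
    # yet, so skip user oversight this round instead of failing the hook.
    if not self.primary_endpoint:  # hypothetical property
        logger.debug("Primary endpoint not resolved; skipping oversee_users()")
        return
    self.postgresql_client_relation.oversee_users()
```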

* DPE-7726 Adapt unit tests for the new code

Thanks to dragomir.penev@ for the unit test fixes here!

* DPE-7726 Increase free disk space cleanup timeout (1->3 minutes)

This backports data-platform-workflow commit f1f8d27 to the local integration test:
> patch(integration_test_charm.yaml): Increase disk space step timeout (#301)

Otherwise:

> Disk usage before cleanup
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/root        72G   46G   27G  64% /
> tmpfs           7.9G   84K  7.9G   1% /dev/shm
> tmpfs           3.2G  1.1M  3.2G   1% /run
> tmpfs           5.0M     0  5.0M   0% /run/lock
> /dev/sdb16      881M   60M  760M   8% /boot
> /dev/sdb15      105M  6.2M   99M   6% /boot/efi
> /dev/sda1        74G  4.1G   66G   6% /mnt
> tmpfs           1.6G   12K  1.6G   1% /run/user/1001
> Error: The action 'Free up disk space' has timed out after 1 minutes.