
Conversation

@dragomirp
Contributor

@dragomirp dragomirp commented Nov 29, 2023

Backup tests were missed when switching to self-hosted runners and are likely failing due to a proxy issue.

This PR re-enables all missed tests and switches back to the reusable workflow.

@dragomirp dragomirp changed the title [] Reenable backup tests and revert to reusable workflow [DPE-2904] Reenable backup tests and revert to reusable workflow Nov 29, 2023
@dragomirp dragomirp marked this pull request as ready for review November 30, 2023 10:36
@dragomirp
Contributor Author

If I manage to get backups to work by EOD, I'll switch this PR back to the self-hosted workflow; if not, we should merge this to have all tests running.

Member

@marceloneppel marceloneppel left a comment


LGTM for now, in case we need more investigation or help from IS.

@dragomirp dragomirp merged commit b2889e0 into main Nov 30, 2023
@dragomirp dragomirp deleted the revert_workflow branch November 30, 2023 17:34
BON4 pushed a commit to BON4/postgresql-operator that referenced this pull request Apr 23, 2024
…onical#301)

* Move dashboard legends to the bottom of the graph

* Missed test marks
taurus-forever added a commit that referenced this pull request Aug 20, 2025
This backports data-platform-workflow commit f1f8d27 to the local integration test:
> patch(integration_test_charm.yaml): Increase disk space step timeout (#301)

Otherwise:

> Disk usage before cleanup
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/root        72G   46G   27G  64% /
> tmpfs           7.9G   84K  7.9G   1% /dev/shm
> tmpfs           3.2G  1.1M  3.2G   1% /run
> tmpfs           5.0M     0  5.0M   0% /run/lock
> /dev/sdb16      881M   60M  760M   8% /boot
> /dev/sdb15      105M  6.2M   99M   6% /boot/efi
> /dev/sda1        74G  4.1G   66G   6% /mnt
> tmpfs           1.6G   12K  1.6G   1% /run/user/1001
> Error: The action 'Free up disk space' has timed out after 1 minutes.
taurus-forever added a commit that referenced this pull request Aug 21, 2025
…elect from pg_settings) (#1049)

* DPE-7726 Use Patroni API for is_restart_pending()

The previous is_restart_pending() waited too long due to Patroni's
loop_wait default value (10 seconds), which defines how long Patroni
waits before re-checking the configuration file to reload it.

Instead of checking the PostgreSQL pending_restart flag from pg_settings,
check the pending_restart=True flag from the Patroni API.
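A minimal sketch of the idea, assuming a hypothetical unit endpoint URL and helper name (the actual charm code differs):

```python
# Sketch: read pending_restart straight from the Patroni REST API instead of
# polling pg_settings and waiting out Patroni's loop_wait (10s by default).
import requests

PATRONI_URL = "https://10.0.0.1:8008/patroni"  # hypothetical unit endpoint

def is_restart_pending() -> bool:
    # Patroni's member status includes "pending_restart": true when a
    # configuration change requires a PostgreSQL restart.
    response = requests.get(PATRONI_URL, verify=False, timeout=10)
    response.raise_for_status()
    return response.json().get("pending_restart", False)
```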

* DPE-7726 Avoid pending_restart=True flag flickering

The current Patroni 3.2.2 has weird/flickering behaviour: it temporarily
flags pending_restart=True on many changes to the REST API, which is gone
within a second but long enough to be caught by the charm.
Sleeping a bit is a necessary evil until the Patroni 3.3.0 upgrade.
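A sketch of that mitigation, reusing the hypothetical is_restart_pending() helper above: only report a restart as pending if the flag survives a short debounce window.

```python
import time

def is_restart_pending_debounced(checks: int = 3, delay: float = 2.0) -> bool:
    # The transient pending_restart=True flicker disappears within a second,
    # so require the flag to stay set across a few spaced-out checks.
    for _ in range(checks):
        if not is_restart_pending():
            return False
        time.sleep(delay)
    return True
```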

The previous code slept for 15 seconds waiting for the pg_settings update.

Also, unnecessary restarts could be triggered by a mismatch between the
Patroni config file and in-memory changes coming from the REST API,
e.g. the slots were undefined in the YAML file but set as an empty JSON {} => None.
Update the default template to match the default API PATCHes and avoid restarts.

* DPE-7726 Fix topology observer Primary status removal

On a topology observer event, the primary unit used to lose the Primary label.

* DPE-7726 Add Patroni API logging

Also:
* use a common logger everywhere
* add several useful log messages (e.g. DB connection)
* remove the no-longer-necessary debug 'Init class PostgreSQL'
* align Patroni API request style everywhere
* add Patroni API duration to debug logs

* DPE-7726 Avoid unnecessary Patroni reloads

The list of IPs was randomly ordered, causing unnecessary Patroni
configuration re-generation with a subsequent Patroni restart/reload.
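For illustration, a tiny sketch of the fix (helper name hypothetical): sort the addresses before rendering, so the same membership always produces byte-identical config.

```python
def render_member_ips(ips: set[str]) -> list[str]:
    # Sets iterate in arbitrary order; sorting makes the rendered Patroni
    # config deterministic, so an unchanged cluster no longer looks
    # "modified" and no longer triggers a reload/restart.
    return sorted(ips)

assert render_member_ips({"10.0.0.3", "10.0.0.1"}) == ["10.0.0.1", "10.0.0.3"]
```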

* DPE-7726 Remove unnecessary properties app_units() and scoped_peer_data()

Housekeeping cleanup.

* DPE-7726 Stop deferring for non-joined peers on on_start/on_config_changed

Those defers are necessary to support scale-up/scale-down during a refresh,
but they significantly slow down the PostgreSQL 16 bootstrap (and other
daily maintenance tasks, like re-scaling, full node reboot/recovery, etc.).

Mute them for now, with a proper documentation record forbidding
rescaling during a refresh, until we minimise the amount of defers in PG16.
Throw a warning for us to recall this promise, as sketched below.
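A rough sketch of the shape of that change (names hypothetical, not the charm's actual code):

```python
import logging

logger = logging.getLogger(__name__)

def _on_config_changed(self, event) -> None:
    if not self._all_peers_joined():  # hypothetical helper
        # Previously: event.defer(). Correct for rescale-during-refresh,
        # but it significantly slowed down the PostgreSQL 16 bootstrap.
        logger.warning(
            "Peers not joined; skipping defer (rescaling during refresh is unsupported)"
        )
        return
    ...
```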

* DPE-7726 Start observer on non-Primary Patroni start to speedup re-join

The current PG16 logic relies on Juju update-status or on_topology_change
observer events, while in some cases we start Patroni without the observer,
causing a long wait until the next update-status arrives.

* DPE-7726 Log Patroni start/stop/restart (to understand charm behavior)

* DPE-7726 Log unit status change to notice Primary label loss

It is hard (impossible?) to catch Juju Primary label
manipulations from Juju debug-log. Logging it simplifies troubleshooting.

* DPE-7726 Fixup logs polishing

* DPE-7726 Decrease waiting for DB connection timeout

We had to wait 30 seconds when there was no connection, which is unnecessarily long.

Also, add details about the reason for a failed connection (Retry/CannotConnect).
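A hedged sketch of fast-failing the connection with psycopg2 (credentials and logger are placeholders, not the charm's actual values):

```python
import logging

import psycopg2

logger = logging.getLogger(__name__)

def connect_to_database(host: str, password: str):
    try:
        return psycopg2.connect(
            host=host,
            port=5432,
            user="operator",      # placeholder credentials
            password=password,
            dbname="postgres",
            connect_timeout=5,    # fail fast instead of waiting ~30s
        )
    except psycopg2.OperationalError as e:
        # Surface the underlying reason instead of a bare Retry/CannotConnect.
        logger.error("DB connection to %s failed: %s", host, e)
        raise
```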

* DPE-7726 Stop propagating primary_endpoint=None for single-unit app

It speeds up single-unit app deployments.

* DPE-7726 Handle get primary cluster RetryError in get_partner_addresses()

Otherwise update-status event fails:

> unit-postgresql-0: relations.async_replication:Partner addresses: []
> unit-postgresql-0: cluster:Unable to get the state of the cluster
> Traceback (most recent call last):
>   File "/var/lib/juju/agents/unit-postgresql-0/charm/src/cluster.py", line 619, in online_cluster_members
>     cluster_status = self.cluster_status()
>                      ^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-0/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
>     return callable(*args, **kwargs)  # type: ignore
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-0/charm/src/cluster.py", line 279, in cluster_status
>     raise RetryError(
> tenacity.RetryError: RetryError[<Future at 0xffddafe01160 state=finished raised Exception>]
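A sketch of the guard, assuming a hypothetical accessor for the online members: tenacity.RetryError is caught so update-status degrades to an empty partner list instead of crashing.

```python
import logging

from tenacity import RetryError

logger = logging.getLogger(__name__)

def get_partner_addresses(patroni) -> list[str]:
    try:
        members = patroni.online_cluster_members()  # hypothetical accessor
    except RetryError:
        logger.warning("Unable to get cluster state; returning no partner addresses")
        return []
    return [member["host"] for member in members]
```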

* DPE-7726 Fix exception on update-status. PostgreSQLUndefinedHostError: Host not set.

Exception:

> 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 cluster:API get_patroni_health: <Response [200]> (0.057417)
> 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 cluster:API cluster_status: [{'name': 'postgresql-0', 'role': 'leader', 'state': 'running', 'api_url': 'https://10.182.246.123:8008/patroni', 'host': '10.182.246.123', 'port': 5432, 'timeline': 1}, {'name': 'postgresql-1', 'role': 'sync_standby', 'state': 'running', 'api_url': 'https://10.182.246.163:8008/patroni', 'host': '10.182.246.163', 'port': 5432, 'timeline': 1, 'lag': 0}, {'name': 'postgresql-2', 'role': 'sync_standby', 'state': 'running', 'api_url': 'https://10.182.246.246:8008/patroni', 'host': '10.182.246.246', 'port': 5432, 'timeline': 1, 'lag': 0}]
> 2025-08-19 20:49:40 DEBUG unit.postgresql/2.juju-log server.go:406 __main__:Early exit primary_endpoint: Primary IP not in cached peer list
> 2025-08-19 20:49:40 ERROR unit.postgresql/2.juju-log server.go:406 root:Uncaught exception while in charm code:
> Traceback (most recent call last):
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/src/charm.py", line 2736, in <module>
>     main(PostgresqlOperatorCharm)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/__init__.py", line 356, in __call__
>     return _main.main(charm_class=charm_class, use_juju_for_storage=use_juju_for_storage)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 502, in main
>     manager.run()
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 486, in run
>     self._emit()
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 421, in _emit
>     self._emit_charm_event(self.dispatcher.event_name)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/_main.py", line 465, in _emit_charm_event
>     event_to_emit.emit(*args, **kwargs)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 351, in emit
>     framework._emit(event)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 924, in _emit
>     self._reemit(event_path)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/lib/python3.12/site-packages/ops/framework.py", line 1030, in _reemit
>     custom_handler(event)
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
>     return callable(*args, **kwargs)  # type: ignore
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/src/charm.py", line 1942, in _on_update_status
>     self.postgresql_client_relation.oversee_users()
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
>     return callable(*args, **kwargs)  # type: ignore
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/src/relations/postgresql_provider.py", line 172, in oversee_users
>     user for user in self.charm.postgresql.list_users() if user.startswith("relation-")
>                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
>     return callable(*args, **kwargs)  # type: ignore
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/postgresql_k8s/v1/postgresql.py", line 959, in list_users
>     with self._connect_to_database(
>          ^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py", line 1116, in wrapped_function
>     return callable(*args, **kwargs)  # type: ignore
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/var/lib/juju/agents/unit-postgresql-2/charm/lib/charms/postgresql_k8s/v1/postgresql.py", line 273, in _connect_to_database
>     raise PostgreSQLUndefinedHostError("Host not set")
> charms.postgresql_k8s.v1.postgresql.PostgreSQLUndefinedHostError: Host not set
> 2025-08-19 20:49:40 ERROR juju.worker.uniter.operation runhook.go:180 hook "update-status" (via hook dispatching script: dispatch) failed: exit status 1
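One plausible shape of the fix (property and method names here are illustrative, not the charm's exact API): bail out of the hook early when the primary host is not resolvable yet, rather than letting list_users() raise.

```python
def _on_update_status(self, event) -> None:
    # The "Host not set" case above: the primary endpoint is not resolved
    # yet, so skip user oversight this round instead of failing the hook.
    if not self.primary_endpoint:  # hypothetical property
        logger.debug("Primary endpoint not resolved; skipping oversee_users()")
        return
    self.postgresql_client_relation.oversee_users()
```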

* DPE-7726 Adapt unit tests for the new code

Thanks to dragomir.penev@ for the unit test fixes here!

* DPE-7726 Increase free disk space cleanup timeout (1->3 minutes)

This backports data-platform-workflow commit f1f8d27 to the local integration test:
> patch(integration_test_charm.yaml): Increase disk space step timeout (#301)

Otherwise:

> Disk usage before cleanup
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/root        72G   46G   27G  64% /
> tmpfs           7.9G   84K  7.9G   1% /dev/shm
> tmpfs           3.2G  1.1M  3.2G   1% /run
> tmpfs           5.0M     0  5.0M   0% /run/lock
> /dev/sdb16      881M   60M  760M   8% /boot
> /dev/sdb15      105M  6.2M   99M   6% /boot/efi
> /dev/sda1        74G  4.1G   66G   6% /mnt
> tmpfs           1.6G   12K  1.6G   1% /run/user/1001
> Error: The action 'Free up disk space' has timed out after 1 minutes.