@marceloneppel marceloneppel commented Apr 10, 2023

Issue

Jira tickets: DPE-544, DPE-551 and DPE-1654

There were no tests simulating a DB process being killed on one instance, or the DB processes on all instances being restarted or killed at the same time.

Solution

Add the tests.

In tests/integration/ha_tests/application-charm/src/continuous_writes.py:

  • Changed the continuous writes logic to use a timeout that closes the current connection and opens a new one to continue the writes. This was needed because, when the PostgreSQL DB process is frozen (in the freeze DB process test), the application sometimes did not close the connection until the DB process was unfrozen, which prevented new writes to the newly elected primary.
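
The timeout-and-reconnect behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in continuous_writes.py: `connect`, `write` and `close` are hypothetical names standing in for whatever database client the application uses, and a real client would catch its driver's own error types rather than `OSError`.

```python
import threading
import time


def continuous_writes(connect, stop_event, op_timeout=5.0, retry_delay=0.0):
    """Write increasing integers until stop_event is set.

    `connect` is a hypothetical factory: connect(timeout=...) returns an
    object with write(value) and close(). Any timeout or connection error
    drops the current connection, and the same value is retried on a new one.
    """
    value = 1
    written = []
    conn = None
    while not stop_event.is_set():
        try:
            if conn is None:
                conn = connect(timeout=op_timeout)
            conn.write(value)
        except OSError:  # TimeoutError and ConnectionError are OSError subclasses
            if conn is not None:
                try:
                    conn.close()
                except OSError:
                    pass  # the old connection may already be unusable
                conn = None
            time.sleep(retry_delay)
            continue  # retry the same value on a fresh connection
        written.append(value)
        value += 1
    if conn is not None:
        conn.close()
    return written
```

Retrying the same value after a failure keeps the written sequence gap-free, which is what lets a test compare the last written number against what each instance holds.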

In tests/integration/ha_tests/conftest.py:

  • Added a fixture to change the loop wait setting (this prevents Patroni from restarting PostgreSQL too fast in the tests).
  • Added a fixture to change the restart values on pebble and reset them after each test that uses it (full cluster restart and full cluster crash tests).
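
A change-then-reset fixture of this kind might look like the sketch below, written as a plain context manager so it runs without pytest. Everything named here is an assumption for illustration: `apply_layer` is a hypothetical callable that pushes a layer to pebble, the service name `postgresql` and the extended values are made up, and the restored values are what I understand pebble's default `backoff-delay` and `backoff-factor` to be.

```python
from contextlib import contextmanager


def restart_layer(service, backoff_delay, backoff_factor):
    """Build a pebble layer (as a dict) overriding a service's restart backoff."""
    return {
        "services": {
            service: {
                "override": "merge",
                "backoff-delay": backoff_delay,
                "backoff-factor": backoff_factor,
            }
        }
    }


@contextmanager
def extended_restart_delay(apply_layer, service="postgresql"):
    """Slow down pebble restarts for the duration of a test, then reset them.

    `apply_layer` is a hypothetical callable taking a layer dict.
    """
    apply_layer(restart_layer(service, "300s", 1))  # illustrative values
    try:
        yield
    finally:
        # Reset even if the test body raised (assumed pebble defaults).
        apply_layer(restart_layer(service, "500ms", 2.0))
```

In a pytest fixture the same shape applies: change the values, `yield` to the test, and restore them in the teardown part so a failing test cannot leave the extended delay in place.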

In tests/integration/ha_tests/helpers.py:

  • Added helper functions to update the restart values on pebble and to check whether the OS process is not running.
  • Added check_cluster_is_updated helper function to avoid duplicate code in the tests.
  • Changed pkill to use -x (exact process name match) instead of -f (full command line match), which enables the test to use SIGTERM instead of SIGINT.
  • Improved the function that sends a signal to the OS processes to retry it when the process doesn't handle the signal correctly.
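
As a rough sketch of the last two points, the snippet below builds a pkill invocation that matches the exact process name and retries a signal until a check confirms it took effect. The function names and the `send` / `took_effect` callables are hypothetical and are not the helpers in helpers.py; in the tests the check would be something like "the process is no longer running".

```python
import time


def pkill_command(process_name, signal_name):
    """Build a pkill invocation that matches the exact process name (-x)."""
    return ["pkill", "--signal", signal_name, "-x", process_name]


def send_signal_with_retry(send, took_effect, attempts=5, delay=1.0):
    """Call send() until took_effect() returns True.

    Returns the number of attempts used; raises RuntimeError if the signal
    never takes effect within `attempts` tries.
    """
    for attempt in range(1, attempts + 1):
        send()
        if took_effect():
            return attempt
        time.sleep(delay)
    raise RuntimeError(f"signal had no effect after {attempts} attempts")
```

For example, `send` could run `pkill_command("postgres", "SIGTERM")` inside the unit's container and `took_effect` could check that no such process remains.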

In tests/integration/ha_tests/manifests/*:

  • Added pebble layers to support the full cluster restart and full cluster crash tests.

In tests/integration/ha_tests/test_self_healing.py:

  • Changed the OS process identifiers to the program names (which pkill matches when using -x).
  • Added kill DB process tests (PostgreSQL process).
  • Removed SIGINT and put SIGTERM back in the restart DB process test.
  • Added full cluster restart and full cluster crash tests (Patroni and PostgreSQL processes).

poetry.lock, pyproject.toml, renovate.json and requirements.txt were updated to pin the version of the packaging package (the latest version was causing a failure when installing the test dependencies through poetry).

@marceloneppel marceloneppel marked this pull request as ready for review April 16, 2023 14:44

@taurus-forever taurus-forever left a comment

Exceptional!

@marceloneppel

Renamed the boolean helper functions to follow best practices in f7f52b8, as suggested by Mykola.

@marceloneppel marceloneppel merged commit 09dcfc7 into main Apr 17, 2023
@marceloneppel marceloneppel deleted the ha-shakedown-tests branch April 17, 2023 20:17
BON4 pushed a commit to BON4/postgresql-k8s-operator that referenced this pull request May 20, 2024
…rash tests (canonical#136)

* Enable sync mode

* Remove custom conf

* Add new expected k8s endpoint to helper function

* Fix TLS test

* Add pebble health check

* Fix freeze db test

* Enable restart db test

* Remove commented code

* Fix unit tests

* Add missing tests

* Improve TLS management

* Add health check update

* Remove health check

* Add health check

* Prevent multiple primaries

* Remove unused code

* Check max number written

* Check writes on all instances

* Remove master_start_timeout setting

* Improve logs retrieval

* Add CA chain to trusted certificates

* Revert "Remove master_start_timeout setting"

This reverts commit 84d810d.

* Use the right PG process in the HA tests

* Fix order in test

* Remove unused call to SIGCONT

* Enable restart DB test

* Add initial full cluster crash test

* Add remaining cluster restart and crash tests

* Kill right process

* Create helper function for duplicate code

* Add telemetry step

* Increase replan timeout

* Stabilize tests

* Change tester application connection

* Add timeout to the database operations

* Remove telemetry step

* Add prints

* Add back the last written value file

* Pin packaging version

* Fix PostgreSQL DB process full cluster restart/crash

* Fix helper for freeze DB process test

* Add sleep

* Reduce extended pebble restart delay

* Remove prints

* Remove sleep

* Fix poetry lock

* Separate signal logic

* Improve loop wait setting change

* Change calls order

* Improve description about timeout

* Rename boolean functions according the best practices

---------

Co-authored-by: Dragomir Penev <dragomir.penev@canonical.com>
github-actions bot added a commit to canonical/test-runners-2-is-x64-postgresql-k8s-operator that referenced this pull request Aug 6, 2024
github-actions bot added a commit to canonical/test-runners-2-github-x64-postgresql-k8s-operator that referenced this pull request Aug 7, 2024
github-actions bot added a commit to canonical/test-runners-2-is-arm64-postgresql-k8s-operator that referenced this pull request Aug 7, 2024