@marceloneppel marceloneppel commented Oct 18, 2022

Issue

  • Jira issues: DPE-546
  • The PostgreSQL charm should self-heal the workload when the DB process is restarted without data or transaction logs (SST test).

Solution

  • Add a test that deletes the PostgreSQL data directory files (including transaction logs), restarts the DB process, and then checks that the workload recovers from that situation.
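
The "checks that the workload recovers" part usually comes down to polling a health check until a deadline. A minimal sketch of that idea in plain Python follows; `wait_until` and its parameters are hypothetical names chosen for illustration and are not taken from this PR's test code.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Call check() repeatedly until it returns True or the timeout expires.

    Hypothetical helper: in a real test, check() might query the database
    or the cluster state; here it is any zero-argument callable.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            # The workload did not recover in time.
            return False
        time.sleep(interval)
```

A test could then assert `wait_until(lambda: is_member_healthy(unit))`, where `is_member_healthy` is again an assumed helper rather than something defined in this PR.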

Context

  • This test can also be run on an existing cluster.

Testing

  • The test was added to tests/integration/ha_tests/test_self_healing.py.

Release Notes

  • Add SST test.

codecov bot commented Oct 18, 2022

Codecov Report

Merging #43 (3497cf3) into main (855dab5) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main      #43   +/-   ##
=======================================
  Coverage   64.05%   64.05%           
=======================================
  Files           6        6           
  Lines         818      818           
  Branches      121      121           
=======================================
  Hits          524      524           
  Misses        264      264           
  Partials       30       30           


# Copy data dir content removal script.
await ops_test.juju(
    "scp", "tests/integration/ha_tests/clean-data-dir.sh", f"{primary_name}:/tmp"
)
Contributor

I believe that for this test, we need to remove the data directory while the postgres process is down (we need to extend the systemd restart timeout, like mongodb does here)

an excerpt of a message from mykola:

for SST test (in MySQL):
1) stop pebble/systemd on one member, remove all files in /var/lib/mysql [data directory] (simulate HDD failure)
2) write data to new primary
3) rotate binlog and remove rotated binlog ON ALL alive members. literally remove data written on step 2 (with such we simulate looong period of downtime)
4) run mysql on member from step 1)
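
Step 1 above (emptying the data directory while the process is stopped) could be sketched as follows. This is only an illustration in Python: `clean_data_dir` is a hypothetical helper, not the contents of the clean-data-dir.sh script used in this PR, and it assumes the database service has already been stopped.

```python
import shutil
from pathlib import Path


def clean_data_dir(data_dir: Path) -> int:
    """Remove every entry inside data_dir but keep the directory itself.

    Returns the number of top-level entries removed. Assumes the database
    process is already stopped (hypothetical helper, for illustration only).
    """
    removed = 0
    for entry in data_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            # Real subdirectory: remove it recursively.
            shutil.rmtree(entry)
        else:
            # Regular file or symlink: remove just the entry.
            entry.unlink()
        removed += 1
    return removed
```

Keeping the directory itself (and only removing its contents) leaves ownership and permissions on the mount point untouched, which is closer to a "disk was wiped" scenario than deleting the directory.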

Member Author

Thanks for the additional details, Shayan! I'll update the PR to have a similar approach on PostgreSQL.

@marceloneppel marceloneppel Nov 30, 2022

Hey @shayancanonical! I updated the code to have the right steps that simulate the needed scenario.

I haven't changed the systemd service restart timeout because, after the service is stopped, it is not restarted until I request it to start again.

I also added a check to ensure the WAL files (the equivalent of the MySQL binlog) are correctly rotated (a new one is created; in fact, more than one new WAL file is kept, due to some settings that allow the old ones to be removed).
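
A check like the one described above can be approximated by comparing directory listings of the WAL directory taken before and after the rotation. The sketch below is an assumption about how such a comparison might look, not the code from this PR; the function names are hypothetical. It relies only on the fact that PostgreSQL WAL segment file names are 24 hexadecimal characters.

```python
import re
from typing import Iterable, List

# WAL segment file names are 24 uppercase hexadecimal characters
# (timeline, log, and segment numbers, 8 characters each).
WAL_SEGMENT_RE = re.compile(r"^[0-9A-F]{24}$")


def wal_segments(names: Iterable[str]) -> List[str]:
    """Keep only WAL segment names, dropping entries such as
    'archive_status' or timeline '.history' files."""
    return sorted(name for name in names if WAL_SEGMENT_RE.match(name))


def new_wal_segments(before: Iterable[str], after: Iterable[str]) -> List[str]:
    """Return the segments present in the second listing but not in the first."""
    return sorted(set(wal_segments(after)) - set(wal_segments(before)))
```

A test could list the WAL directory, trigger the rotation, list it again, and assert that `new_wal_segments(before, after)` is not empty.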

@shayancanonical shayancanonical left a comment

Looks great!

@marceloneppel marceloneppel merged commit d94460d into main Dec 1, 2022
@marceloneppel marceloneppel deleted the sst-test branch December 1, 2022 09:20
BON4 pushed a commit to BON4/postgresql-operator that referenced this pull request Apr 23, 2024
* Add SST test

* Enable previous tests

* fix early tls deployment by only reloading patroni config if it's already running

* Improve code

* Remove duplicate check

* Remove unused import

* added unit test for reloading patroni

* lint

* removing postgres restart check

* Pin Juju agent version on CI

* adding series flags to test apps

* adding series flags to test apps

* made series into a list

* Update test_new_relations.py

* Add retrying

* updating test to better emulate bundle deployment

* Remove unused code

* Change processes list

* Add logic for ensuring all units down

* Change delay to only one unit

* Add WAL switch

* Updates related to WAL removal

* Small improvements

* Add comments

* Change the way service is stopped

* Remove slot removal

* Small fixes

* Remove unused parameter

Co-authored-by: WRFitch <will.fitch@canonical.com>
Co-authored-by: Will Fitch <WRFitch@outlook.com>
github-actions bot added a commit to canonical/test-runners-2-github-x64-postgresql-operator that referenced this pull request May 22, 2024
github-actions bot added a commit to canonical/test-runners-2-azure-arm64-postgresql-operator that referenced this pull request May 23, 2024
github-actions bot added a commit to canonical/test-runners-2-is-x64-postgresql-operator that referenced this pull request May 23, 2024