SST test #43
Codecov Report
@@ Coverage Diff @@
## main #43 +/- ##
=======================================
Coverage 64.05% 64.05%
=======================================
Files 6 6
Lines 818 818
Branches 121 121
=======================================
Hits 524 524
Misses 264 264
Partials 30 30
…al/postgresql-operator into fix-early-tls-deployment
da3656a
to
6772702
# Copy data dir content removal script.
await ops_test.juju(
    "scp", "tests/integration/ha_tests/clean-data-dir.sh", f"{primary_name}:/tmp"
)
I believe that for this test we need to remove the data directory while the PostgreSQL process is down (we need to extend the systemd restart timeout, like mongodb here).
An excerpt of a message from Mykola:
for the SST test (in MySQL):
1) stop pebble/systemd on one member and remove all files in /var/lib/mysql [data directory] (simulates an HDD failure)
2) write data to the new primary
3) rotate the binlog and remove the rotated binlog ON ALL alive members, literally removing the data written in step 2 (this simulates a long period of downtime)
4) start mysql on the member from step 1
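The four steps above can be modelled with a toy, in-memory cluster to show why step 3 matters. This is only a sketch under stated assumptions: `Member`, `rejoin` and `run_sst_scenario` are hypothetical names, nothing here talks to Juju, systemd, MySQL or PostgreSQL, and the "log" is a simplified stand-in for the binlog. The assumption being illustrated is that a member can catch up incrementally only while the donor still retains every change it is missing; once those entries are purged, a full copy is the only option.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """Toy cluster member: applied rows plus the change-log entries it still retains."""

    name: str
    rows: list = field(default_factory=list)
    log: list = field(default_factory=list)  # retained (position, row) entries
    running: bool = True

    def write(self, row):
        position = len(self.rows)
        self.rows.append(row)
        self.log.append((position, row))

    def purge_log(self):
        # Stand-in for "rotate the log and remove the rotated files".
        self.log.clear()


def rejoin(member, donor):
    """Restart `member`, resync it from `donor`, and return the sync method used."""
    member.running = True
    next_position = len(member.rows)
    retained = {position for position, _ in donor.log}
    missing = range(next_position, len(donor.rows))
    if all(position in retained for position in missing):
        # Every missing change is still in the donor's log: replay it.
        for position, row in donor.log:
            if position >= next_position:
                member.rows.append(row)
        method = "incremental"
    else:
        # Part of the history is gone: copy the donor's whole state.
        member.rows = list(donor.rows)
        method = "full"
    member.log = list(donor.log)
    return method


def run_sst_scenario():
    donor, failed = Member("donor"), Member("failed")
    for member in (donor, failed):
        member.write("before-failure")

    # Step 1: stop one member and wipe its data directory.
    failed.running = False
    failed.rows.clear()
    failed.log.clear()

    # Step 2: write data while that member is down.
    donor.write("during-downtime")

    # Step 3: remove the log entries on every alive member.
    donor.purge_log()

    # Step 4: start the stopped member again.
    return rejoin(failed, donor), failed.rows


if __name__ == "__main__":
    print(run_sst_scenario())
```

In this model, skipping step 3 lets the wiped member replay the retained log and come back through the incremental path, so the full-transfer path would never be exercised.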
Thanks for the additional details, Shayan! I'll update the PR to have a similar approach on PostgreSQL.
Hey @shayancanonical! I updated the code so that it follows the right steps to simulate the needed scenario.
I haven't changed the systemd service restart timeout, because after stopping the service it was not restarted until I requested it to start again.
I also added a check to ensure the WAL files (the PostgreSQL equivalent of the MySQL binlog) are correctly rotated, i.e. that a new one is created. In fact, more than one new WAL file is kept, due to some settings that enabled the old ones to be removed.
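A rotation check of this kind can be sketched as a comparison of two directory listings taken before and after the WAL switch. This is not the code from the PR: `wal_segments` and `new_wal_segments` are hypothetical helpers, and the file names in the usage example are made up. The only PostgreSQL fact relied on is that WAL segment files are named with 24 uppercase hexadecimal characters.

```python
import re

# WAL segment file names are 24 uppercase hex characters; other entries in
# the WAL directory (e.g. archive_status, *.history) are ignored.
WAL_SEGMENT = re.compile(r"^[0-9A-F]{24}$")


def wal_segments(filenames):
    """Return only the names that look like WAL segment files."""
    return {name for name in filenames if WAL_SEGMENT.match(name)}


def new_wal_segments(before, after):
    """Return the sorted WAL segments present in `after` but not in `before`."""
    return sorted(wal_segments(after) - wal_segments(before))


if __name__ == "__main__":
    # Hypothetical listings captured before and after the switch.
    before = ["000000010000000000000001", "archive_status"]
    after = [
        "000000010000000000000002",
        "000000010000000000000003",
        "archive_status",
    ]
    created = new_wal_segments(before, after)
    assert created, "expected at least one new WAL segment after the switch"
    print(created)
```

Asserting on "at least one new segment" rather than "exactly one" matches the observation above that more than one new WAL file can be present after the switch.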
Looks great!
* Add SST test
* Enable previous tests
* fix early tls deployment by only reloading patroni config if it's already running
* Improve code
* Remove duplicate check
* Remove unused import
* added unit test for reloading patroni
* lint
* removing postgres restart check
* Pin Juju agent version on CI
* adding series flags to test apps
* adding series flags to test apps
* made series into a list
* Update test_new_relations.py
* Add retrying
* updating test to better emulate bundle deploymen
* Remove unused code
* Change processes list
* Add logic for ensuring all units down
* Change delay to only one unit
* Add WAL switch
* Updates related to WAL removal
* Small improvements
* Add comments
* Change the way service is stopped
* Remove slot removal
* Small fixes
* Remove unussed parameter

Co-authored-by: WRFitch <will.fitch@canonical.com>
Co-authored-by: Will Fitch <WRFitch@outlook.com>
Issue
Solution
Context
Testing
tests/integration/ha_tests/test_self_healing.py
Release Notes