Member

@marceloneppel marceloneppel commented Oct 10, 2022

Issue

  • Jira issues: DPE-538
  • The PostgreSQL charm should self-heal the workload when the DB process is frozen (and later unfrozen).

Solution

  • Add a test that freezes the DB processes (Patroni and PostgreSQL), unfreezes them after some time, and then checks that the workload recovers on its own.

Context

  • Like the previously implemented test from Add kill db process test #35, this test can also be run on an existing cluster.
  • A mechanism to get the REST API URL of the other instances was added to src/cluster.py. It lets the charm find the primary (which is used to update the relation application databag with the primary and the list of replicas) when the unit whose process was frozen is the leader.

Testing

  • The test was added to tests/integration/ha_tests/test_self_healing.py. It tests the charm in much the same way as the Charmed MongoDB Operator is tested.
  • The test freezes both the Patroni and the PostgreSQL OS processes, one at a time.
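
For readers unfamiliar with freezing a process, here is a minimal local sketch of the idea, assuming a Linux host with /proc available. It is not the integration test from this PR (which sends signals to processes inside the units); the helper names `process_state`, `wait_for_state` and `freeze_and_unfreeze` are hypothetical and exist only for illustration.

```python
import os
import signal
import subprocess
import time
from typing import Set, Tuple


def process_state(pid: int) -> str:
    """Return the one-letter Linux process state read from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as stat_file:
        # The line looks like "pid (comm) state ..."; comm may contain spaces,
        # so split on the last ")" before reading the state field.
        return stat_file.read().rsplit(")", 1)[1].split()[0]


def wait_for_state(pid: int, expected: Set[str], timeout: float = 5.0) -> bool:
    """Poll until the process reaches one of the expected states or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if process_state(pid) in expected:
            return True
        time.sleep(0.05)
    return False


def freeze_and_unfreeze(pid: int, frozen_seconds: float = 0.2) -> Tuple[bool, bool]:
    """Freeze a process with SIGSTOP, wait a little, then resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    was_frozen = wait_for_state(pid, {"T"})  # "T" means stopped by a signal
    time.sleep(frozen_seconds)
    os.kill(pid, signal.SIGCONT)
    resumed = wait_for_state(pid, {"S", "R"})  # sleeping or running again
    return was_frozen, resumed
```

A frozen process keeps its PID and its open sockets but does no work, which is why the charm has to notice the situation from the outside rather than from a crash or an exit code.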

Release Notes

  • Add freeze db process test.

codecov bot commented Oct 10, 2022

Codecov Report

Merging #39 (29eae5b) into main (c611ab6) will increase coverage by 2.13%.
The diff coverage is 90.90%.

@@            Coverage Diff             @@
##             main      #39      +/-   ##
==========================================
+ Coverage   60.32%   62.45%   +2.13%     
==========================================
  Files           6        6              
  Lines         809      815       +6     
  Branches      119      122       +3     
==========================================
+ Hits          488      509      +21     
+ Misses        293      276      -17     
- Partials       28       30       +2     
Impacted Files Coverage Δ
src/cluster.py 67.78% <90.90%> (+11.84%) ⬆️

@marceloneppel marceloneppel changed the title from "Freeze db process test" to "Freeze DB process test" Oct 11, 2022
@marceloneppel marceloneppel marked this pull request as ready for review October 11, 2022 15:20
Contributor

@shayancanonical shayancanonical left a comment

looks great!

Contributor

@MiaAltieri MiaAltieri left a comment

wow so clean and very well implemented 🤩! Great job Marcelo! Minor comments only

src/cluster.py Outdated
Comment on lines 172 to 175
    if member["name"] == member_name:
        ip = member["host"]
        break
return ip
Contributor

Suggested change
-    if member["name"] == member_name:
-        ip = member["host"]
-        break
-return ip
+    if member["name"] == member_name:
+        return member["host"]

Member Author

Updated on 29eae5b.

src/cluster.py Outdated
Comment on lines 198 to 199
break
return primary
Contributor

Suggested change
-    break
-return primary
+    return primary

Member Author

Updated on 29eae5b.

src/cluster.py Outdated
return primary

def _get_alternative_server_url(self, attempt: AttemptManager) -> str:
    """Get an alternative URL from another member each time."""
Contributor

@MiaAltieri MiaAltieri Oct 11, 2022

what is an alternative URL and why is it needed? Sorry I am a PG noob 😅

Member Author

Sorry Mia. I have now added an improved description. The alternative server URL method is just a way to call the REST API from another cluster member when, for example, the current member's DB process is frozen (not able to receive requests). In that case the API is still working on the other members, so the URL from one of them is used. Updated on 2108c4a.
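
The fallback described here can be sketched roughly as follows. This is not the code from src/cluster.py: it uses a plain loop instead of the retry library the charm relies on, the names `alternative_server_url` and `query_cluster` are hypothetical, and it assumes the caller-supplied `fetch` raises `ConnectionError` when a member does not answer in time.

```python
from typing import Callable, Optional, Sequence


def alternative_server_url(peer_urls: Sequence[str], attempt_number: int) -> str:
    """Pick a different peer URL on each attempt (1-based), wrapping around."""
    if not peer_urls:
        raise ValueError("no other cluster members are known")
    return peer_urls[(attempt_number - 1) % len(peer_urls)]


def query_cluster(
    current_url: str,
    peer_urls: Sequence[str],
    fetch: Callable[[str], dict],
    max_attempts: int = 3,
) -> dict:
    """Ask the current member first, then fall back to the other members."""
    last_error: Optional[Exception] = None
    for attempt_number in range(1, max_attempts + 1):
        if attempt_number == 1:
            url = current_url
        else:
            # Attempts 2, 3, ... rotate through the peers, starting at the first.
            url = alternative_server_url(peer_urls, attempt_number - 1)
        try:
            return fetch(url)
        except ConnectionError as error:
            last_error = error  # this member did not answer; try the next one
    raise ConnectionError("no cluster member answered") from last_error
```

Rotating by attempt number means a single unresponsive peer does not block the lookup: each retry lands on a different member until one of them answers or the attempts run out.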

Contributor

fascinating! Great descriptions thanks Marcelo! :)

await send_signal_to_process(ops_test, primary_name, process, "SIGCONT")

# Verify that the database service got restarted and is ready in the old primary.
assert await postgresql_ready(ops_test, primary_name)
Contributor

add checks that verify:

  • that the old primary is now the secondary
  • all units are in the same replica set

Member Author

Great suggestion. Added both on 6a1e39b. I also added those checks on the kill DB process test.
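
As an illustration only, the two checks could look something like the sketch below. It assumes a parsed cluster status whose "members" entries carry "name" and "role" keys and that the primary's role value is "leader"; that shape and the helper names `is_replica` and `has_all_members` are assumptions for this example, not the assertions added in 6a1e39b.

```python
from typing import Iterable


def is_replica(cluster: dict, member_name: str) -> bool:
    """True when the named member is listed in the cluster and is not the leader."""
    for member in cluster["members"]:
        if member["name"] == member_name:
            return member["role"] != "leader"
    return False  # an unknown member is not a replica of this cluster


def has_all_members(cluster: dict, expected_names: Iterable[str]) -> bool:
    """True when the cluster lists exactly the expected member names."""
    listed = {member["name"] for member in cluster["members"]}
    return listed == set(expected_names)
```

Checking the member list from the cluster's point of view, rather than asking each unit about itself, catches the case where the old primary comes back but never rejoins.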

Contributor

@MiaAltieri MiaAltieri left a comment

really awesome work here Marcelo!

    else:
        raise MemberNotUpdatedOnClusterError()
except RetryError:
    return False
Contributor

great function 🤩
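
The pattern in the snippet above (raise while the member is not yet updated, and turn exhausted retries into `False`) can be sketched with the standard library alone. The loop below stands in for the retry library used in the PR, and `check_member_updated`, `wait_for_member_updated` and their parameters are hypothetical names chosen for this example.

```python
import time
from typing import Callable


class MemberNotUpdatedOnClusterError(Exception):
    """Raised when a member has not yet caught up with the cluster."""


def check_member_updated(is_updated: Callable[[], bool]) -> None:
    """Raise if the member is not updated yet, so the caller can retry."""
    if not is_updated():
        raise MemberNotUpdatedOnClusterError()


def wait_for_member_updated(
    is_updated: Callable[[], bool], attempts: int = 3, delay_seconds: float = 0.01
) -> bool:
    """Retry the check a few times; report False instead of raising on failure."""
    for attempt in range(attempts):
        try:
            check_member_updated(is_updated)
            return True
        except MemberNotUpdatedOnClusterError:
            if attempt < attempts - 1:
                time.sleep(delay_seconds)  # wait before the next attempt
    return False
```

Returning a boolean keeps the caller simple: it can defer or retry later without having to know which exception the retry loop uses internally.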

@marceloneppel marceloneppel merged commit 4994516 into main Oct 12, 2022
@marceloneppel marceloneppel deleted the freeze-db-process-test branch October 12, 2022 10:55
BON4 pushed a commit to BON4/postgresql-operator that referenced this pull request Apr 23, 2024
* Add alternative servers for primary and members retrieval

* Test working

* Test working

* Cleanup the code

* More cleanup

* Small adjustments

* Add unit tests

* Improve comments

* Use down unit

* Improve alternative URL description

* Add additional checks

* Improve returns
github-actions bot added a commit to canonical/test-runners-2-github-x64-postgresql-operator that referenced this pull request May 21, 2024
github-actions bot added a commit to canonical/test-runners-2-is-x64-postgresql-operator that referenced this pull request May 22, 2024
github-actions bot added a commit to canonical/test-runners-2-azure-arm64-postgresql-operator that referenced this pull request May 22, 2024