Skip to content

Conversation

dragomirp
Copy link
Contributor

@dragomirp dragomirp commented Jul 25, 2024

Mostly a port of canonical/postgresql-operator#534

  • Pauses patroni while trying to trigger rewind
  • Split the TLS test so that the rewind testing will also be executed on arm64

Copy link

codecov bot commented Jul 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.77%. Comparing base (3e48255) to head (691133e).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #588      +/-   ##
==========================================
+ Coverage   68.60%   68.77%   +0.17%     
==========================================
  Files          10       10              
  Lines        2978     2985       +7     
  Branches      564      565       +1     
==========================================
+ Hits         2043     2053      +10     
+ Misses        816      814       -2     
+ Partials      119      118       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dragomirp dragomirp marked this pull request as ready for review July 26, 2024 12:02
Copy link

@lucasgameiroborges lucasgameiroborges left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Copy link
Member

@marceloneppel marceloneppel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent!

Comment on lines +197 to +200

@pytest.mark.group(1)
async def test_remove_tls(ops_test: OpsTest) -> None:
async with ops_test.fast_forward():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting off TLS removal as well.

Comment on lines 188 to 205
await ops_test.model.wait_for_idle(apps=[DATABASE_APP_NAME], status="active", timeout=1000)
await ops_test.model.wait_for_idle(apps=[DATABASE_APP_NAME], status="active", timeout=1500)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timeouted 3 times in a row for Juju 2

Copy link
Contributor

@taurus-forever taurus-forever left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tnx!

@dragomirp dragomirp marked this pull request as draft August 1, 2024 10:02
Comment on lines +1333 to +1336
try:
container.restart(self._postgresql_service)
except ChangeError:
logger.exception("Failed to restart patroni")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's a dangling Postgresql process, Patroni will be unable to start

Comment on lines +155 to +161
try:
await run_command_on_unit(ops_test, primary, "/charm/bin/pebble stop postgresql")
except Exception as e:
# pebble stop on juju 2 errors out and leaves dangling PG processes
if juju_major_version > 2:
raise e
await run_command_on_unit(ops_test, primary, "pkill --signal SIGTERM -x postgres")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Juju 2 if the pebble stop command fails, the Patroni process is terminated, but a dangling Postgresql process remains and prevents Patroni from starting again.

Alternatively, we can do this check and termination in the charm code, but I would prefer not to add Juju 2 specific live code. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On 2.9 if patroni stop failed: kill -SIGTERM $postgresql_pid ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On 2.9 if patroni stop failed: kill -SIGTERM $postgresql_pid ?

Pretty much. We can do this in the ChangeError in the restart check. Question is do we want to add that code for Juju 2's sake?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer that we keep this SIGTERM call in the test code and not in the charm code. Even if it is a SIGTERM, I'd prefer either to wait for the database process to gracefully finish by itself or analyze the situation manually before doing any action to avoid any data corruption.

@dragomirp dragomirp marked this pull request as ready for review August 2, 2024 11:21
@dragomirp
Copy link
Contributor Author

Copy link
Contributor

@taurus-forever taurus-forever left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one educational question.

primary_name = await get_primary(ops_test, app)
unit_ip = await get_unit_address(ops_test, primary_name)
configuration_info = requests.get(f"http://{unit_ip}:8008/config")
configuration_info = requests.get(f"{schema}://{unit_ip}:8008/config", verify=not tls)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please share the reason for verify=not tls?

Comment on lines +155 to +161
try:
await run_command_on_unit(ops_test, primary, "/charm/bin/pebble stop postgresql")
except Exception as e:
# pebble stop on juju 2 errors out and leaves dangling PG processes
if juju_major_version > 2:
raise e
await run_command_on_unit(ops_test, primary, "pkill --signal SIGTERM -x postgres")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On 2.9 if patroni stop failed: kill -SIGTERM $postgresql_pid ?

Copy link

@lucasgameiroborges lucasgameiroborges left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the changes, but why so many failures on install hook on CI? :(

@dragomirp
Copy link
Contributor Author

LGTM for the changes, but why so many failures on install hook on CI? :(

Test app is borked due to charmcraft 3 release.

Comment on lines +155 to +161
try:
await run_command_on_unit(ops_test, primary, "/charm/bin/pebble stop postgresql")
except Exception as e:
# pebble stop on juju 2 errors out and leaves dangling PG processes
if juju_major_version > 2:
raise e
await run_command_on_unit(ops_test, primary, "pkill --signal SIGTERM -x postgres")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer that we keep this SIGTERM call in the test code and not in the charm code. Even if it is a SIGTERM, I'd prefer either to wait for the database process to gracefully finish by itself or analyze the situation manually before doing any action to avoid any data corruption.

@dragomirp dragomirp merged commit 978d0ec into main Aug 2, 2024
@dragomirp dragomirp deleted the dpe-4533-flaky-test branch August 2, 2024 22:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants