Conversation

@BON4 BON4 (Contributor) commented Mar 29, 2024

Issue #441

All these cases should be covered (a sketch of the force-removal case follows the list):

  • Test Application. VM resources are removed after application removal. Install the app, remove the app (removal test), and check that no garbage is left.
  • Test Unit. Re-installation. Remove a unit with --force --no-wait; the storage should be detached.
  • Test Unit. Garbage ignorance. The charm should deploy in a dirty environment with garbage storage.
  • Test Application. The application should deploy in a dirty environment with garbage storage from another application.
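As a rough illustration, the force-removal case could look like the pytest-operator sketch below. The decorators and test signature mirror the snippets quoted later in this review; APPLICATION_NAME, the deploy parameters, and the is_storage_exists helper are assumptions, not the PR's actual code.

import pytest
from pytest_operator.plugin import OpsTest

APPLICATION_NAME = "postgresql"  # hypothetical; mirrors the snippets in the review


@pytest.mark.group(1)
@pytest.mark.abort_on_fail
async def test_app_force_removal(ops_test: OpsTest, charm: str):
    """Deploy the charm, force-remove it, and expect the storage to stay detached."""
    await ops_test.model.deploy(charm, application_name=APPLICATION_NAME, num_units=1)
    await ops_test.model.wait_for_idle(apps=[APPLICATION_NAME], status="active", timeout=1000)

    # Force removal skips the charm's teardown hooks, so the storage volume
    # should be left behind detached rather than destroyed.
    await ops_test.model.remove_application(
        APPLICATION_NAME, block_until_done=True, force=True, no_wait=True
    )

    # A follow-up assertion (with a storage id captured before the removal)
    # would verify the volume still exists, e.g.:
    # assert await is_storage_exists(ops_test, storage_id)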

marceloneppel and others added 30 commits August 29, 2022 13:41
* Change charm database user

* Fix unit tests

* Fix integration test call

* Fix user name in library

* Fix user

* Add default postgres user creation

* Change action name

* Rework secrets management

* Add password rotation logic

* Add user to the action parameters

* Add pytest marks

* Fix action description

* Fix method docstring

* Fix pytest mark

* Update src/charm.py

Co-authored-by: Raúl Zamora Martínez <76525382+zmraul@users.noreply.github.com>

ops 1.5.2 introduced a fix to the harness's handling of departing-unit tracking for events. This uncovered a subtle error in the tests here: remove_relation was triggering the "departing" relation-data key to be inserted into the local charm's data, because the local charm had been added to the relation's units list (which is not really how that list is intended to be used). This caused the delete_user method to be called earlier than intended and produced assertion failures in the on_relation_broken test. The fix is to not add the local unit to the relation units. But then the relation-departed test that checks for the "departing" data key no longer worked, so for now we add the local unit to the relation in that test only.
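A minimal sketch of the resulting harness setup, using the standard ops testing API; the charm class, relation name, and unit names are placeholders, not this repository's actual code:

from ops.testing import Harness

from charm import PostgreSQLCharm  # hypothetical charm class

harness = Harness(PostgreSQLCharm)
harness.begin()

# Add only the *remote* unit to the relation. Adding the local unit here is
# what triggered the spurious "departing" key under ops >= 1.5.2.
rel_id = harness.add_relation("db", "application")
harness.add_relation_unit(rel_id, "application/0")

# Only the relation-departed test adds the local unit as well, so that the
# "departing" data key is populated for its assertion.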
* Change charm database user

* Fix unit tests

* Fix integration test call

* Fix user name in library

* Fix user

* Add default postgres user creation

* Change action name

* Rework secrets management

* Add password rotation logic

* Add user to the action parameters

* Add separate environments for different tests

* Add all dependencies

* Add pytest marks

* Fix action description

* Fix method docstring

* Fix pytest mark

* Fix pytest markers

* Register pytest markers

* Import files

* Update libraries

* Update library

* Fix PostgreSQL library

* Add jsonschema as a binary dependency

* Reorganize dependencies and add f-string

* Reorganize user definition
* Add TLS implementation

* Delete file

* Update library

* Fix PostgreSQL library

* Add jsonschema as a binary dependency

* Change hostname to unit ip

* Add unit test dependency

* Call certificate update on config change

* Fix docstring

* Change log call
* Add integration test

* Update library

* Fix PostgreSQL library

* Add relation broken test

* Add jsonschema as a binary dependency

* Change hostname to unit ip

* Add unit test dependency

* Add check for encrypted connections

* Fix space and None check
* Add workaround for Patroni REST API TLS

* Improve TLS status retrieval

* Readd checks

* Add additional check
* Add TLS to README

* Add password rotation to README
* Add continuous writes test charm

* Add comment
* Add kill db process test

* Change timeout

* Ignore PyCharm folder and LXD profile file

* Add continuous writes test charm

* Improve workflow files
Avoid running the long integration tests when the gatekeeping (lint/unit) tests fail. This slightly increases the runtime of the complete test scope, but it will save (a lot of?) electricity/money for Canonical, as new pull requests often have some initial typos/issues to be polished.
DPE-781 Run integration tests for passed lint/unit tests only
* Add alternative servers for primary and members retrieval

* Test working

* Test working

* Cleanup the code

* More cleanup

* Small adjustments

* Add unit tests

* Improve comments

* Use down unit

* Improve alternative URL description

* Add additional checks

* Improve returns
* fix early tls deployment by only reloading patroni config if it's already running

* added unit test for reloading patroni

* lint

* removing postgres restart check

* adding series flags to test apps

* adding series flags to test apps

* made series into a list

* Update test_new_relations.py

* updating test to better emulate bundle deployment

* updated tls lib

* Fix TLS IP address on CSR

* removed failing test for now. This will be fixed before merge

* Fix TLS

* Update TLS lib

* Remove unneeded series

* Add comment

* Add Juju agent version bootstrap constraint

* Add test for restart method

* Update TLS lib

* Add test for update config method

* Update TLS lib

* Improve comment

Co-authored-by: WRFitch <will.fitch@canonical.com>
Co-authored-by: Will Fitch <WRFitch@outlook.com>
* Add restart db process test

* Change restart delay

* Update test code

* Change processes that should be tested

* Re enable tests

* Remove unused code

* Remove unused import

* Remove unused code

* Add docstring

* Remove additional tests

* Rollback patroni version
* Add SST test

* Enable previous tests

* fix early tls deployment by only reloading patroni config if it's already running

* Improve code

* Remove duplicate check

* Remove unused import

* added unit test for reloading patroni

* lint

* removing postgres restart check

* Pin Juju agent version on CI

* adding series flags to test apps

* adding series flags to test apps

* made series into a list

* Update test_new_relations.py

* Add retrying

* updating test to better emulate bundle deployment

* Remove unused code

* Change processes list

* Add logic for ensuring all units down

* Change delay to only one unit

* Add WAL switch

* Updates related to WAL removal

* Small improvements

* Add comments

* Change the way service is stopped

* Remove slot removal

* Small fixes

* Remove unused parameter

Co-authored-by: WRFitch <will.fitch@canonical.com>
Co-authored-by: Will Fitch <WRFitch@outlook.com>
* Add restart db process test

* Change restart delay

* Pin Juju agent version on CI

* Update test code

* Change processes that should be tested

* Re enable tests

* Remove unused code

* Remove unused import

* Remove unused code

* Add docstring

* Fix Patroni API calls

* Fix kill db process test error

* Improve comments

* Remove unused code

* Revert change

* Unify tests

* Replace ps aux | grep with pgrep

* Readd changes removed on merge
* Prevent relation extensions

* Integration test
* Pin OS on release workflow

* Change whitelist_externals to allowlist_externals
* Pin OS on release workflow

* Change whitelist_externals to allowlist_externals

* Update PostgreSQL library

* Add integration test for the fix from the library
…cal#32)

* Add pgbackrest

* Add initial backup settings

* Add additional backup settings

* Remove settings

* Rename user

* Add test for TLS being used on pg_rewind connections

* Remove table creation

* Readd write to the database

* Change bootstrap constraints

* Change the way the service is stopped

* Add one more call to service stop

* Change the way the service is stopped

* Change systemd unit

* Increase test timeout

* Remove test code

* Read code

* Readd code

* Change WAL trigger mechanism

* Fix check

* Improve code

* Remove instance promotion

* Readd instance promotion

* Change test retry logic

* Remove debug calls

* Add replica reinitialization

* Change checks order

* Add reinitialize call

* Improve reinitialize call

* Remove unused code

* Pin OS on release workflow

* Change whitelist_externals to allowlist_externals

* Change whitelist_externals to allowlist_externals

* Add API request timeout

* Add unit tests

* Remove log

* Revert timeout

* Extract endpoint from URL to constant
* Unpinning libjuju

* Update pinned flake8 version

* Test tweak

* Pinning juju 2.9.38.1

* Tweaking CSR check
…cal#52)

* More resilient database name check and clearing of extensions blocked
status

* Unpinning libjuju test

* Revert "Unpinning libjuju test"

This reverts commit 7c6ed45.

* Move clearing of blocked status to _on_relation_broken

* Unit tests
* Fix service restart at boot

* Add fix for TLS files and postgres user

* Add integration test

* Fix unit test

* Fix ubuntu version and update charming actions

* Move test

* Remove previous test

* Revert change on unit test

* Minor fixes

* Fix unit tests

* Modify condition

* Update Juju version

* Setup tmate session

* Add Waiting Status

* Add defer call

* Change reboot command

* Add waiting status

* Remove incorrect defer call

* Add replica related changes

* Simplify test

* Simplify code and test

* Add comment about latest juju 2

* Update charming actions

* Make test optional

* Add unit tests

* Add docstring and type hints

* Some fixes

* Minor fixes

* Remove prints

* Fix bundle deployment

* Add additional unit tests
* Fix unit tests

* Update postgresql_tls lib

* Readding test_on_relation_broken_extensions_keep_block

* Code review improvements
* Fix error that is raised when calling change WAL settings

* Remove unused import

* Add reformatted file
* ops and charm libs version bump

* Update flake8
renovate bot and others added 2 commits March 28, 2024 09:16
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
@marceloneppel (Member)

Hi @BON4! Thanks for the contribution.

Please add a PR description so we can understand what you're proposing.

Thanks.

@taurus-forever taurus-forever (Contributor) left a comment

As discussed today, please add a link to the GH/Jira issue. We must always link commits to the bug tracker.

Please add a PR description: What? Why? How?

Please improve the test to check the charm and not Juju :-)

Thanks!

# Check that the storage exists after the application is deployed
assert await is_storage_exists(ops_test, storage_id_str)

await ops_test.model.remove_application(APPLICATION_NAME, block_until_done=True, force=True, no_wait=True)
Contributor

AFAIK, this will remove the storage (as it is removed by default on LXD with the hostpath storage type, which is the default), so the next test will be deployed again without any storage. Consider adding destroy_storage=False (but I am not sure whether it helps).
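For reference, a hedged sketch of that suggestion, assuming the installed python-libjuju exposes a destroy_storage keyword on remove_application:

# Keep the volumes around as detached storage instead of destroying them,
# so the follow-up "dirty environment" tests have garbage storage to reuse.
await ops_test.model.remove_application(
    APPLICATION_NAME,
    block_until_done=True,
    force=True,
    no_wait=True,
    destroy_storage=False,
)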

# Create a test database to check that there are no resource conflicts
await create_db(ops_test, APPLICATION_NAME, TEST_DATABASE_RELATION_NAME)

# Deploy a duplicate charm
Contributor

As discussed on the sync today: we are testing Juju here, but we need to test the charm.
Juju guarantees that independent applications will have independent storages, so there is no need to test that here, IMHO. But you can keep it, since it is already created and should be fast.

Our goal is to test the charm (see the sketch after this list):

  • Charm A can start with storage from A (cleanly), even if there are some garbage lock files on the storage.
  • Charm A can start with storage from B (with a warning in the log about the foreign storage, but deploying anyway, as this can be part of an app recovery process).

P.S. The goal of a smoke test is a fast result, so try to keep the test fast (in time).
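A rough sketch of the second bullet, assuming python-libjuju's deploy accepts an attach_storage list of detached storage IDs; DUP_APPLICATION_NAME and get_any_detached_storage are placeholders modeled on the snippets below:

# Reuse a detached volume left behind by another application, simulating a
# dirty environment / recovery scenario.
garbage_storage = await get_any_detached_storage(ops_test)  # hypothetical helper

await ops_test.model.deploy(
    charm,
    application_name=DUP_APPLICATION_NAME,
    num_units=1,
    attach_storage=[garbage_storage],
)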

@BON4 BON4 changed the title from "Smoke testing. Garbage ignorance. No resources conflicts." to "Add test_config_parameters test" on Apr 15, 2024
@BON4 BON4 changed the title from "Add test_config_parameters test" to "Add test_smoke test" on Apr 15, 2024
@BON4 BON4 requested a review from taurus-forever April 15, 2024 09:23
@taurus-forever taurus-forever (Contributor) left a comment

LGTM; one comment about polishing the code comments... triggering tests to see the test results.

Comment on lines 138 to 139
# Check that test database is not exists for duplicate application
assert not await check_db(ops_test, APPLICATION_NAME, TEST_DATABASE_RELATION_NAME)
Contributor

The comment says duplicate application, but APPLICATION_NAME is used and not DUP_APPLICATION_NAME. Is that OK? IMHO, the comment is outdated and confusing.

Contributor Author

Fixed

logger.info(f"Application is not in blocked state. Checking logs...")

# Since the application has a postgresql db in storage from an external application, it should not be able to connect due to the new password
assert not await check_password_auth(ops_test, ops_test.model.applications[DUP_APPLICATION_NAME].units[0].name)
Contributor

JFMI: does the charm start with the foreign storage/data, or does it just get stuck?

Contributor Author

The charm stalls forever in an Init:CrashLoopBackOff loop with "waiting, installing agent". As far as I know, this commit will fix the issue, and the deployment will go into a blocked state instead of falling into an endless CrashLoopBackOff due to the database from another application.

@BON4 BON4 (Contributor Author) commented Apr 17, 2024

@dragomirp @marceloneppel bumping

@dragomirp (Contributor)

Hi, @BON4, please sign the CLA and sync up with the main branch.


@pytest.mark.group(1)
@pytest.mark.abort_on_fail
async def test_app_force_removal(ops_test: OpsTest, charm: str):
Contributor

IMHO there is no need to test this either. With force removal the charm code shouldn't even fire, so this would be handled by Juju entirely.

Member

I agree. It makes more sense only to test whether the detached storage left from this test can be reused in a new deployment, as you did in the next tests.

@BON4 BON4 (Contributor Author) commented Apr 23, 2024

@taurus-forever suggested keeping test_app_force_removal just for the sake of testing the correctness of the storage detaching (the charm is removed correctly with force). Since this is a smoke test, it should be fast.

@marceloneppel marceloneppel (Member) left a comment

Thanks for the contribution, @BON4! I requested some changes. I also agree with Dragomir's comments.

storage_id,
"--format=json",
]
print(f"command: {complete_command}")
Member

Change to logger.info or logger.debug.

Contributor Author

This line is removed.

@pytest.mark.group(1)
@pytest.mark.abort_on_fail
async def test_app_removal(ops_test: OpsTest, charm: str):
"""Test all recoureces is removed after application removal."""
Member

Suggested change
"""Test all recoureces is removed after application removal."""
"""Test all resources are removed after application removal."""

Contributor Author

Fixed

garbadge_storage = await get_any_deatached_storage(ops_test)
assert garbadge_storage is not None

assert garbadge_storage is not None
Member

Won't it fail in L164 and stop the test? I think you can use reraise=True in the Retrying object to show the assertion error in the logs. I think the same applies to L127.
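A minimal sketch of that suggestion inside the async test, using tenacity's Retrying; the (typo-corrected) helper name from the diff is an assumption:

from tenacity import Retrying, stop_after_delay, wait_fixed

# reraise=True surfaces the final AssertionError in the logs instead of
# wrapping it in a tenacity.RetryError.
for attempt in Retrying(stop=stop_after_delay(60), wait=wait_fixed(3), reraise=True):
    with attempt:
        garbage_storage = await get_any_detached_storage(ops_test)  # hypothetical helper
        assert garbage_storage is not None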

Contributor Author

Fixed. Used reraise=True in all Retrying loops.


@pytest.mark.group(1)
@pytest.mark.abort_on_fail
async def test_app_recoures_conflicts(ops_test: OpsTest, charm: str):
Member

Suggested change
async def test_app_recoures_conflicts(ops_test: OpsTest, charm: str):
async def test_app_resources_conflicts(ops_test: OpsTest, charm: str):

Contributor Author

Fixed


try:
await ops_test.model.wait_for_idle(
apps=[DUP_APPLICATION_NAME], timeout=500, status="blocked"
Member

Out of curiosity, what's the message of this blocked status?

Contributor Author

It is not yet implemented in the operator, but this commit will bring changes so that the application falls into a blocked state with the message "Failed to start postgresql. The storage belongs to a third-party cluster". I added a check for operator auth down below just to double-check, and in case this test is merged before the commit for the blocked status.
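Once that operator change lands, the blocked state could be asserted directly; a sketch assuming libjuju's unit status properties and the message quoted above:

unit = ops_test.model.applications[DUP_APPLICATION_NAME].units[0]
assert unit.workload_status == "blocked"
# The exact message is an assumption based on the linked commit.
assert "third-party cluster" in unit.workload_status_message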

@BON4 BON4 (Contributor Author) commented Apr 20, 2024

@dragomirp @marceloneppel @taurus-forever Hello team, bumping for review and a re-run of the checks after the applied changes.

@BON4 BON4 closed this Apr 23, 2024
@BON4 BON4 force-pushed the test-charmed-postgresql-smoke-test branch from 383970a to db46208 on April 23, 2024 11:47
@BON4 BON4 mentioned this pull request Apr 23, 2024