Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MGDSTRM-10325: Delete Strimzi cluster operator leadership lease on deployment deletion #851

Conversation

robobario
Copy link
Contributor

@robobario robobario commented Jan 17, 2023

Why:
As part of enabling StrimziPodSets in the cluster operator we are also increasing availability of the operator by running 2 replicas and using a k8s lease to elect one of them as a leader, the other will be in standby in case it needs to take over reconciliation. The lease is not able to be automatically cleaned up when we tear down a cluster operator deployment, so we are making the Fleetshard Operator responsible for deleting the lease in reaction to a deployment deletion.

This is the upstream issue in the fabric8 client fabric8io/kubernetes-client#4638, once this is resolved we can remove this work from Fleetshard and configure an owner reference in the Strimzi cluster operator

Note that the lease is named ${deploymentName}-leadership-token by convention to enable fleetshard to clean it.

https://issues.redhat.com/browse/MGDSTRM-10325

@github-actions github-actions bot added the operator changes related to operator label Jan 17, 2023
@robobario robobario force-pushed the clean-up-cluster-operator-leadership-lease branch from 118acbc to 4e5eab2 Compare January 17, 2023 02:52
@@ -58,10 +66,11 @@ public class StrimziManagerTest {

@BeforeEach
public void beforeEach() {
// before each test clean Kubernetes server (no Deployments from other runs)
this.server.before();
try (NamespacedKubernetesClient client = this.server.getClient().inAnyNamespace()) {
Copy link
Contributor Author

@robobario robobario Jan 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this change the test was failing because the StrimziManagerTest instance and the StrimziManager instance had clients pointing at different mock kubernetes servers on different ports. The test created a lease in one mock server and the implementation tried to delete it from another which was a no-op.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that sounds like we've got really fragile test code. @shawkins are you aware of this problem? Is there a nicer way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to call before. Until we get to a later version of fabric8, you mostly need to clean the state of the mock server yourself in-between tests. Due to a quarkus / okhttp issue we cannot scope the mock server to test classes - it is scoped to the whole module. In fabric8 6.x there is a KuberentesMockServer reset method that should make things easy regardless.

assertNotNull(actual);
List<StrimziVersionStatus> strimziVersions = this.strimziManager.getStrimziVersions();
assertTrue(checkStrimziVersion(strimziVersions, "strimzi-cluster-operator.v1", true));
strimziManager.deleteLeadershipLeaseIfPresent(deployment);
Copy link
Contributor Author

@robobario robobario Jan 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure how to invoke the ResourceEventHandler which the test passes into a MockResourceInformerFactory, maybe with some re-working we could keep a reference somewhere in the mock. Maybe a future improvement to make them testable. For now we are calling the delete method directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went and had a shot at this, keeping references to all handlers created per type so that the test can access and call them directly. Maybe it would be better if StrimziManager implemented ResourceEventHandler<Deployment> and we didn't us an anonymous inner class, then we could call the methods directly on StrimziManager.

@robobario robobario force-pushed the clean-up-cluster-operator-leadership-lease branch 2 times, most recently from 798d011 to 81e41db Compare January 17, 2023 05:31
Why:
As part of enabling StrimziPodSets in the cluster operator we are also increasing
availability of the operator by running 2 replicas and using a k8s lease to elect
one of them as a leader, the other will be in standby in case it needs to take over
reconciliation. The lease is not able to be automatically cleaned up when we tear
down a cluster operator deployment, so we are making the Fleetshard Operator responsible
for deleting the lease in reaction to a deployment deletion.

Note that the lease is named ${deploymentName}-leadership-token by convention to
enable fleetshard to clean it.

To get the test working I noticed that the StrimziManager and StrimziManagerTest were
communicating with different instances of the mock server on different ports. The
StrimziManager is wired with a client before the `@BeforeEach` is invoked. The beforeEach
triggered a new mock server to be created and used in the tests.
@robobario robobario force-pushed the clean-up-cluster-operator-leadership-lease branch from 81e41db to 35c1271 Compare January 17, 2023 05:43
Copy link
Contributor

@katheris katheris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Only note is it looks like the delete DSL in the fabric8 client has a different return type in the latest version, so when kas-fleetshard upgrades their fabric8 client the code will likely need a slight change, but for now it looks good to me.

@robobario robobario changed the title Delete Strimzi cluster operator leadership lease on deployment deletion MGDSTRM-10325: Delete Strimzi cluster operator leadership lease on deployment deletion Jan 17, 2023
@k-wall k-wall requested a review from shawkins January 18, 2023 11:08
@k-wall
Copy link
Contributor

k-wall commented Jan 18, 2023

@robobario the approach looks good to me.

@k-wall k-wall added this to the 0.33.0 milestone Jan 18, 2023
@shawkins
Copy link
Contributor

Only note is it looks like the delete DSL in the fabric8 client has a different return type in the latest version, so when kas-fleetshard upgrades their fabric8 client the code will likely need a slight change, but for now it looks good to me.

Yes - the boolean return value isn't generally useful and confusing for some of the multi-delete scenarios. Later versions return a list of items marked for deletion and the ability to await all of those deletions to be processed.

Copy link
Contributor

@k-wall k-wall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

The MockResourceInformerFactory keeps references to all handlers added per
type so tests can retrieve them and call handler methods directly.
@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

83.3% 83.3% Coverage
0.0% 0.0% Duplication

@robobario robobario merged commit 2bdf80d into bf2fc6cc711aee1a0c2a:main Jan 26, 2023
@robobario robobario deleted the clean-up-cluster-operator-leadership-lease branch January 26, 2023 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
operator changes related to operator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants