
[DPE-5637, DPE-5276] Implement expose-external config option with values false (clusterip), nodeport and loadbalancer #328

Draft · wants to merge 16 commits into main

Conversation

shayancanonical (Contributor)

Issue

Solution

Comment on lines 166 to 178
# Delete and re-create until https://bugs.launchpad.net/juju/+bug/2084711 resolved
if service_exists:
logger.info(f"Deleting service {service_type=}")
self._lightkube_client.delete(
res=lightkube.resources.core_v1.Service,
name=self._service_name,
namespace=self.model.name,
)
logger.info(f"Deleted service {service_type=}")

logger.info(f"Applying service {service_type=}")
self._lightkube_client.apply(service, field_manager=self.app.name)
logger.info(f"Applied service {service_type=}")
Contributor
did you find more information about this? I don't think we should need to delete and re-create the service

Contributor Author (@shayancanonical) · Oct 17, 2024

Filed a Juju bug: https://bugs.launchpad.net/juju/+bug/2084711, which has been triaged.

Essentially, we have included deletion + re-creation of the service as a workaround until we get help from the Juju team to determine what may be happening.

Contributor

It seems like this might be a misconfiguration of MetalLB and not a Juju bug; I don't see how patching a K8s service not created by Juju would cause the Juju CLI to have issues.

Did you try the multiple IPs that @taurus-forever mentioned?

Contributor Author (@shayancanonical) · Oct 18, 2024

I did try using multiple IPs for MetalLB; unfortunately, that did not work. Additionally, in the bug report, I was able to confirm the issue using microk8s.kubectl without any charm code.

Contributor

did you test with EKS or GKE?

Contributor Author

No, I have not yet tested with EKS or GKE. I don't believe that testing on these platforms should necessarily be a blocker for this PR.

We tested on AKS, and this issue did not manifest itself in AKS.

Contributor

> this issue did not manifest itself in AKS

It sounds like it might be a MetalLB + MicroK8s issue then? In that case, I think we should consider patching the service instead of deleting and re-creating it.
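
For illustration, a minimal sketch of what that patch-based alternative could look like with lightkube, assuming the same attribute names as the snippet above (this is not the PR's actual code):

import lightkube
import lightkube.resources.core_v1
from lightkube.types import PatchType

def _patch_service(self, service: lightkube.resources.core_v1.Service) -> None:
    """Patch the managed Service in place instead of deleting and re-creating it."""
    # A merge patch preserves the existing ClusterIP and any already-allocated node ports
    self._lightkube_client.patch(
        res=lightkube.resources.core_v1.Service,
        name=self._service_name,
        obj=service,
        namespace=self.model.name,
        patch_type=PatchType.MERGE,
    )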

Contributor Author

I agree, but we are unable to run integration tests without deleting and re-creating (the tests experience a flicker of the Juju client and fail with a Bad file descriptor error). @paulomach, please share your thoughts when you are able.

Collaborator

Like @shayancanonical, I validated the flaky behavior independently of a charm, but saw no issue in AKS.
In any case, this does not seem to be an issue in the charm/lightkube.
So let's not block the PR, since this can be refactored once we have a better understanding or a fix.

Resolved review threads: src/kubernetes_charm.py (3), src/abstract_charm.py (2), tests/unit/conftest.py (2)
- uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@v22.0.0
+ uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@feature/metallb
Contributor

Guessing this is temporary for testing? Is there a dpw PR that needs review?

Contributor Author

Yes, will open a PR in dpw shortly.


def _get_node_hosts(self) -> list[str]:
"""Return the node ports of nodes where units of this app are scheduled."""
peer_relation = self.model.get_relation(self._PEER_RELATION_NAME)
Contributor

how was this done before without the peer relation?

should self._*endpoint be renamed to endpoints?

Contributor Author

Prior to this PR, we were only providing the host:node_port of one unit in the application. Upon discussion, we realized that we would need to provide all host:node_port pairs where the units are scheduled. We are unable to determine the nodes where units are deployed without the peer relation, which provides all the available/active units.
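
As a rough sketch of that approach, assuming pod names follow the usual "<app>-<unit number>" convention and that node addresses are what gets exposed (helper and attribute names are illustrative, not the PR's code):

import lightkube.resources.core_v1

def _get_node_hosts(self) -> list[str]:
    """Return the addresses of the nodes where units of this app are scheduled."""
    peer_relation = self.model.get_relation(self._PEER_RELATION_NAME)
    units = {self.unit, *peer_relation.units}  # peer relation lists the other active units
    hosts = set()
    for unit in units:
        pod_name = unit.name.replace("/", "-")
        pod = self._lightkube_client.get(
            lightkube.resources.core_v1.Pod, name=pod_name, namespace=self.model.name
        )
        node = self._lightkube_client.get(
            lightkube.resources.core_v1.Node, name=pod.spec.nodeName
        )
        for address in node.status.addresses:
            # Prefer an externally reachable address; fall back to the internal IP
            if address.type in ("ExternalIP", "InternalIP"):
                hosts.add(address.address)
    return sorted(hosts)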

@@ -96,7 +96,7 @@ jobs:
- lint
- unit-test
- build
- uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@v23.0.2
+ uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@feature/metallb
Contributor

Please revert before merging.

@@ -51,11 +61,18 @@
class KubernetesRouterCharm(abstract_charm.MySQLRouterCharm):
"""MySQL Router Kubernetes charm"""

_PEER_RELATION_NAME = "mysql-router-peers"
_SERVICE_PATCH_TIMEOUT = 5 * 60
Contributor

Sleeping up to 5 minutes might produce more issues.
Do we have other options here? Time for Pebble notices?

@paulomach (Collaborator) left a comment

I like the tests. If feasible, it would be nice to have proper config validation, but that can be done later, given the timing we want to achieve.
There are some other non-blocking comments.

expose-external:
description: |
String to determine how to expose the MySQLRouter externally from the Kubernetes cluster.
Possible values: 'false', 'nodeport', 'loadbalancer'
Collaborator

nit: a default of false may imply that true is a valid value. Change it to something else? e.g. no or none

Contributor Author

@paulomach any ideas for alternatives? expose-external: none? expose-external: clusterip?

Contributor Author

Additionally, the possible values are false and nodeport in kafka-k8s. It may not be a good idea to introduce inconsistencies.

Collaborator

oh bummer, ok maybe for another time.

Collaborator

Interestingly, it seems it was also a nitpick from Marc that Mykola did not respond to in the original Kafka spec.
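
For reference, an illustrative mapping from the expose-external values above to the Kubernetes Service type the charm would apply; the dict and helper name are assumptions, not the PR's actual code:

_EXPOSE_EXTERNAL_TO_SERVICE_TYPE = {
    "false": "ClusterIP",            # default: only reachable inside the cluster
    "nodeport": "NodePort",          # exposed on each node's IP at an allocated port
    "loadbalancer": "LoadBalancer",  # provisioned by the cloud provider or MetalLB
}

def _desired_service_type(self) -> str:
    """Translate the expose-external config option into a Kubernetes Service type."""
    return _EXPOSE_EXTERNAL_TO_SERVICE_TYPE[self.config["expose-external"]]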

@@ -80,11 +97,136 @@ def _upgrade(self) -> typing.Optional[kubernetes_upgrade.Upgrade]:
except upgrade.PeerRelationNotReady:
pass

@property
def _status(self) -> ops.StatusBase:
Collaborator

Although I understand (single config, router context) why this was done as input validation, I don't like it.
I prefer to do proper config validation as we are doing in the server and other charms.

Not a blocker for now, but it should get changed.

Contributor

IMO, this is "proper config validation".

If we had several config values or more complex validation, I agree it'd be worth adopting the approach in the server charm. But given how simple the validation is, I don't think the extra dependency and abstraction is worth it; IMO, KISS.

Collaborator

Yes, hence not blocking it. Once we reach (if ever) more config options (>3), we should do it.
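
A minimal sketch of the inline validation being discussed, assuming the invalid value is surfaced from the _status property shown earlier (everything beyond ops' status classes is an assumption):

import ops

@property
def _status(self) -> ops.StatusBase:
    # With a single config option, validation can live directly in the status check
    if self.config.get("expose-external", "false") not in ("false", "nodeport", "loadbalancer"):
        return ops.BlockedStatus(
            "Invalid expose-external value; must be 'false', 'nodeport' or 'loadbalancer'"
        )
    return ops.ActiveStatus()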

Resolved review threads: src/kubernetes_charm.py (2)
exposed_read_write_endpoint=self._exposed_read_write_endpoint,
exposed_read_only_endpoint=self._exposed_read_only_endpoint,
)
elif self._check_service_connectivity():
Contributor

Suggested change:
- elif self._check_service_connectivity():
+ if self._check_service_connectivity():

nit


Only applies to Kubernetes charm
"""

@abc.abstractmethod
def _check_service_connectivity(self) -> bool:
"""Check if the service is available (connectable with a socket)"""
Contributor

is this k8s only?
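
For illustration, a hedged sketch of what a socket-based implementation of _check_service_connectivity() could look like on the Kubernetes charm; the endpoint source is an assumed helper, not the PR's code:

import socket

def _check_service_connectivity(self) -> bool:
    """Check if the service is available (connectable with a socket)."""
    for host, port in self._service_host_port_pairs():  # hypothetical helper
        try:
            # Attempt a TCP connection with a short timeout
            with socket.create_connection((host, port), timeout=5):
                pass
        except OSError:
            return False
    return True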

Comment on lines -185 to -186
def external_connectivity(self, event) -> bool:
"""Whether any of the relations are marked as external."""
Contributor

f"{unit_name}.{self._charm.app.name}",
f"{unit_name}.{self._charm.app.name}.{self._charm.model_service_domain}",
f"{service_name}.{self._charm.app.name}",
f"{service_name}.{self._charm.app.name}.{self._charm.model_service_domain}",
Contributor

If a user tries to connect with Juju's service, should that be possible?

Also, can you double-check the changes to SANs with @delgod if you haven't already?

@@ -19,6 +19,12 @@ def model_service_domain(monkeypatch, request):
monkeypatch.setattr(
"kubernetes_charm.KubernetesRouterCharm.model_service_domain", request.param
)
monkeypatch.setattr(
Contributor

question: why is this being patched again here?

does it need to be?

@theoctober19th (Member)

Great work @shayancanonical!

I'm about to implement a similar feature in kyuubi-k8s, and going through this PR I had a few queries. Looking at the spec, it is my understanding that whenever the service needs to be created / deleted ("reconciled", in other words), the config-changed hook would simply return after kubectl.apply and not wait for the service to be completely created.

When do we check whether the "reconciliation" was successful? In the PR I saw this is checked in the config-changed hook itself, but does that mean the endpoints would not be updated until the config is changed again sometime in the future? Shouldn't we check whether the reconciliation was successful more frequently than that (possibly in other hooks as well as the update-status hook)?

@shayancanonical (Contributor Author)

@theoctober19th usually kubectl.apply (or lightkube apply) can take a long time (normally about 5-10 minutes) to create a service of type loadbalancer on various cloud providers, as the cloud provider needs to provision certain resources. As a result, the following is the intended behavior for this PR (the code as it currently exists will likely need to be modified to accomplish this):

  1. Upon config-changed, we will make a request to K8s to create the service of the desired type if necessary. We will likely set the charm to MaintenanceStatus and return from the hook so other hooks can run.
  2. Upon a future hook, the reconcile approach of this charm allows us to check if the K8s service created in (1) is reachable. If it is reachable, we will update the endpoints in the relation databag (see the sketch after this list).
  3. While the K8s service is being created, we will avoid touching the databag at all. If the K8s service is being created for the first time, the endpoints in the databag will be empty. If not the first time, the endpoints will remain stale until the new K8s service is reachable (we can optimize here by making use of Pebble notices for more frequent checks, or wait until the next hook - at most update-status-hook-interval).
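
A rough sketch of that reconcile flow under the stated assumptions (only _check_service_connectivity() and the exposed endpoint properties appear in the PR; the other method names here are illustrative):

import ops

def reconcile(self) -> None:
    # Step 1: request (or update) the K8s service for the configured expose-external value
    self._apply_expose_external_service()  # hypothetical helper wrapping lightkube apply
    if not self._check_service_connectivity():
        # Steps 1 & 3: service not reachable yet; leave the relation databag untouched
        self.unit.status = ops.MaintenanceStatus("Waiting for Kubernetes service")
        return
    # Step 2: service is reachable, publish (or refresh) the endpoints in the databag
    self._database_provides.update_endpoints(  # hypothetical relation-interface helper
        read_write=self._exposed_read_write_endpoint,
        read_only=self._exposed_read_only_endpoint,
    )
    self.unit.status = ops.ActiveStatus()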

@theoctober19th (Member) commented Nov 14, 2024

Thanks for the explanation, @shayancanonical.

We discussed this in our team and, thanks to @welpaolo, considered an additional possibility: set a flag to false in the peer relation databag whenever a service is being created / deleted, which would then trigger a peer-relation-changed event. In that event we check the availability of the service and either a) reset the flag and update the endpoints, or b) defer the peer-relation-changed hook and repeat this process when the deferred hook gets fired later.

This can also be combined with checking the status of the service in other event hooks (including the update-status hook). This will effectively check service availability during either peer-relation-changed or other event hooks, whichever occurs earlier.
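
An illustrative sketch of that flag-based approach (databag keys and helper names are assumptions, not an agreed design; note that writing the app databag requires leadership):

def _on_config_changed(self, _event) -> None:
    self._apply_expose_external_service()  # hypothetical helper wrapping lightkube apply
    if self.unit.is_leader():
        peer_data = self.model.get_relation("mysql-router-peers").data[self.app]
        peer_data["service-ready"] = "false"  # triggers peer-relation-changed on all units

def _on_peer_relation_changed(self, event) -> None:
    peer_data = self.model.get_relation("mysql-router-peers").data[self.app]
    if peer_data.get("service-ready") == "false":
        if not self._check_service_connectivity():
            event.defer()  # re-check when the deferred hook fires later
            return
        if self.unit.is_leader():
            peer_data["service-ready"] = "true"
            self._update_endpoints()  # hypothetical helper publishing the endpoints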
