
[DPE-5637, DPE-5276] Implement expose-external config option with values false (clusterip), nodeport and loadbalancer #328

Draft · wants to merge 16 commits into main

Conversation

shayancanonical (Contributor)

Issue

Solution

Comment on lines 166 to 178
# Delete and re-create until https://bugs.launchpad.net/juju/+bug/2084711 resolved
if service_exists:
logger.info(f"Deleting service {service_type=}")
self._lightkube_client.delete(
res=lightkube.resources.core_v1.Service,
name=self._service_name,
namespace=self.model.name,
)
logger.info(f"Deleted service {service_type=}")

logger.info(f"Applying service {service_type=}")
self._lightkube_client.apply(service, field_manager=self.app.name)
logger.info(f"Applied service {service_type=}")
Contributor
did you find more information about this? I don't think we should need to delete and re-create the service

Contributor Author (@shayancanonical) · Oct 17, 2024

Filed a Juju bug: https://bugs.launchpad.net/juju/+bug/2084711, which has been triaged.

Essentially, we have included deletion + re-creation of the service as a workaround until we get help from the Juju team to determine what may be happening.

Contributor

It seems like this might be a misconfiguration of MetalLB and not a Juju bug; I don't see how patching a K8s service not created by Juju would cause the Juju CLI to have issues.

Did you try the multiple IPs that @taurus-forever mentioned?

Contributor Author (@shayancanonical) · Oct 18, 2024

I did try using multiple IPs for MetalLB; unfortunately, that did not work. Additionally, in the bug report, I was able to confirm the issue using microk8s.kubectl without any charm code.

Contributor

did you test with EKS or GKE?

Contributor Author

No, I have not yet tested with EKS or GKE. I don't believe that testing on these platforms should necessarily be a blocker for this PR.

We tested on AKS, and this issue did not manifest itself in AKS.

Contributor

> this issue did not manifest itself in AKS

It sounds like it might be a MetalLB + MicroK8s issue then? In that case, I think we should consider patching the service instead of deleting and re-creating it.
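
For illustration, a minimal sketch of what that patch-based alternative could look like with lightkube, assuming the same attribute names as the snippet above (this is not the PR's actual code):

import lightkube
import lightkube.resources.core_v1
from lightkube.types import PatchType

def _patch_service(self, service: lightkube.resources.core_v1.Service) -> None:
    """Patch the managed Service in place instead of deleting and re-creating it."""
    # A merge patch preserves the existing ClusterIP and any already-allocated node ports
    self._lightkube_client.patch(
        res=lightkube.resources.core_v1.Service,
        name=self._service_name,
        obj=service,
        namespace=self.model.name,
        patch_type=PatchType.MERGE,
    )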

Contributor Author

I agree, but we are unable to run integration tests without deleting and re-creating (the tests experience a flicker of the Juju client and fail with a Bad file descriptor error). @paulomach, please share your thoughts when you are able.

Collaborator

Like @shayancanonical, I validated the flaky behavior independently of a charm, but saw no issue in AKS.
In any case, this does not seem to be an issue in the charm/lightkube.
So let's not block the PR, since this can be refactored once we have a better understanding or a fix.

Resolved review threads: src/kubernetes_charm.py (3), src/abstract_charm.py (2), tests/unit/conftest.py (2)
- uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@v22.0.0
+ uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@feature/metallb
Contributor

Guessing this is temporary for testing? Is there a dpw PR that needs review?

Contributor Author

Yes, will open a PR in dpw shortly.


def _get_node_hosts(self) -> list[str]:
"""Return the node ports of nodes where units of this app are scheduled."""
peer_relation = self.model.get_relation(self._PEER_RELATION_NAME)
Contributor

how was this done before without the peer relation?

should self._*endpoint be renamed to endpoints?

Contributor Author

Prior to this PR, we were only providing the host:node_port of one unit in the application. Upon discussion, we realized that we would need to provide all host:node_port pairs where the units are scheduled. We are unable to determine the nodes where units are deployed without the peer relation, which provides all the available/active units.
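
As a rough sketch of that approach, assuming pod names follow the usual "<app>-<unit number>" convention and that node addresses are what gets exposed (helper and attribute names are illustrative, not the PR's code):

import lightkube.resources.core_v1

def _get_node_hosts(self) -> list[str]:
    """Return the addresses of the nodes where units of this app are scheduled."""
    peer_relation = self.model.get_relation(self._PEER_RELATION_NAME)
    units = {self.unit, *peer_relation.units}  # peer relation lists the other active units
    hosts = set()
    for unit in units:
        pod_name = unit.name.replace("/", "-")
        pod = self._lightkube_client.get(
            lightkube.resources.core_v1.Pod, name=pod_name, namespace=self.model.name
        )
        node = self._lightkube_client.get(
            lightkube.resources.core_v1.Node, name=pod.spec.nodeName
        )
        for address in node.status.addresses:
            # Prefer an externally reachable address; fall back to the internal IP
            if address.type in ("ExternalIP", "InternalIP"):
                hosts.add(address.address)
    return sorted(hosts)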

@@ -96,7 +96,7 @@ jobs:
- lint
- unit-test
- build
- uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@v23.0.2
+ uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@feature/metallb
Contributor

Please revert before merging.

@@ -51,11 +61,18 @@
class KubernetesRouterCharm(abstract_charm.MySQLRouterCharm):
"""MySQL Router Kubernetes charm"""

_PEER_RELATION_NAME = "mysql-router-peers"
_SERVICE_PATCH_TIMEOUT = 5 * 60
Contributor

Sleeping up to 5 minutes might produce more issues.
Do we have other options here? Time for Pebble notices?

@paulomach (Collaborator) left a comment

I like the tests. If feasible, it would be nice to have proper config validation, but that can be done later, given the timing we want to achieve.
There are some other non-blocking comments.

expose-external:
description: |
String to determine how to expose the MySQLRouter externally from the Kubernetes cluster.
Possible values: 'false', 'nodeport', 'loadbalancer'
Collaborator

nit: a default of false may imply that true is a valid value. Change it to something else? e.g. no or none

Contributor Author

@paulomach any ideas for alternatives? expose-external: none? expose-external: clusterip?

Contributor Author

Additionally, the possible values are false and nodeport in kafka-k8s. It may not be a good idea to introduce inconsistencies.

Collaborator

oh bummer, ok maybe for another time.

Collaborator

Interestingly, it seems it was also a nitpick from Marc that Mykola did not respond to in the original Kafka spec.
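
For reference, an illustrative mapping from the expose-external values above to the Kubernetes Service type the charm would apply; the dict and helper name are assumptions, not the PR's actual code:

_EXPOSE_EXTERNAL_TO_SERVICE_TYPE = {
    "false": "ClusterIP",            # default: only reachable inside the cluster
    "nodeport": "NodePort",          # exposed on each node's IP at an allocated port
    "loadbalancer": "LoadBalancer",  # provisioned by the cloud provider or MetalLB
}

def _desired_service_type(self) -> str:
    """Translate the expose-external config option into a Kubernetes Service type."""
    return _EXPOSE_EXTERNAL_TO_SERVICE_TYPE[self.config["expose-external"]]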

@@ -80,11 +97,136 @@ def _upgrade(self) -> typing.Optional[kubernetes_upgrade.Upgrade]:
except upgrade.PeerRelationNotReady:
pass

@property
def _status(self) -> ops.StatusBase:
Collaborator

Although I understand (single config, router context) why this was done as input validation, I don't like it.
I prefer to do proper config validation as we are doing in the server and other charms.

Not a blocker for now, but it should get changed.

Contributor

IMO, this is "proper config validation".

If we had several config values or more complex validation, I agree it'd be worth adopting the approach in the server charm. But given how simple the validation is, I don't think the extra dependency and abstraction is worth it; IMO, KISS.

Collaborator

Yes, hence not blocking it. Once we reach (if ever) more config options (>3), we should do it.
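
A minimal sketch of the inline validation being discussed, assuming the invalid value is surfaced from the _status property shown earlier (everything beyond ops' status classes is an assumption):

import ops

@property
def _status(self) -> ops.StatusBase:
    # With a single config option, validation can live directly in the status check
    if self.config.get("expose-external", "false") not in ("false", "nodeport", "loadbalancer"):
        return ops.BlockedStatus(
            "Invalid expose-external value; must be 'false', 'nodeport' or 'loadbalancer'"
        )
    return ops.ActiveStatus()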

Resolved review threads: src/kubernetes_charm.py (2)
exposed_read_write_endpoint=self._exposed_read_write_endpoint,
exposed_read_only_endpoint=self._exposed_read_only_endpoint,
)
elif self._check_service_connectivity():
Contributor

Suggested change:
- elif self._check_service_connectivity():
+ if self._check_service_connectivity():

nit


Only applies to Kubernetes charm
"""

@abc.abstractmethod
def _check_service_connectivity(self) -> bool:
"""Check if the service is available (connectable with a socket)"""
Contributor

is this k8s only?
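
For illustration, a hedged sketch of what a socket-based implementation of _check_service_connectivity() could look like on the Kubernetes charm; the endpoint source is an assumed helper, not the PR's code:

import socket

def _check_service_connectivity(self) -> bool:
    """Check if the service is available (connectable with a socket)."""
    for host, port in self._service_host_port_pairs():  # hypothetical helper
        try:
            # Attempt a TCP connection with a short timeout
            with socket.create_connection((host, port), timeout=5):
                pass
        except OSError:
            return False
    return True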

Comment on lines -185 to -186
def external_connectivity(self, event) -> bool:
"""Whether any of the relations are marked as external."""
Contributor

f"{unit_name}.{self._charm.app.name}",
f"{unit_name}.{self._charm.app.name}.{self._charm.model_service_domain}",
f"{service_name}.{self._charm.app.name}",
f"{service_name}.{self._charm.app.name}.{self._charm.model_service_domain}",
Contributor

If a user tries to connect with Juju's service, should that be possible?

Also, can you double-check the changes to SANs with @delgod if you haven't already?

@@ -19,6 +19,12 @@ def model_service_domain(monkeypatch, request):
monkeypatch.setattr(
"kubernetes_charm.KubernetesRouterCharm.model_service_domain", request.param
)
monkeypatch.setattr(
Contributor

question: why is this being patched again here?

does it need to be?

@theoctober19th (Member)

Great work @shayancanonical!

I'm about to implement a similar feature in kyuubi-k8s, and going through this PR I had a few queries. Looking at the spec, it is my understanding that whenever the service needs to be created / deleted ("reconciled", in other words), the config-changed hook would simply return after kubectl.apply and not wait for the service to be completely created.

When do we check whether the "reconciliation" was successful? In the PR I saw this is checked in the config-changed hook itself, but does that mean the endpoints would not be updated until the config is changed again sometime in the future? Shouldn't we check whether the reconciliation was successful more frequently than that (possibly in other hooks as well as the update-status hook)?

@shayancanonical (Contributor Author)

@theoctober19th usually kubectl.apply (or lightkube apply) can take a long time (normally about 5-10 minutes) to create a service of type loadbalancer on various cloud providers, as the cloud provider needs to provision certain resources. As a result, the following is the intended behavior for this PR (the code as it currently exists will likely need to be modified to accomplish this):

  1. Upon config-changed, we will make a request to K8s to create the service of the desired type if necessary. We will likely set the charm to MaintenanceStatus and return from the hook so other hooks can run.
  2. Upon a future hook, the reconcile approach of this charm allows us to check if the K8s service created in (1) is reachable. If it is reachable, we will update the endpoints in the relation databag (see the sketch after this list).
  3. While the K8s service is being created, we will avoid touching the databag at all. If the K8s service is being created for the first time, the endpoints in the databag will be empty. If not the first time, the endpoints will remain stale until the new K8s service is reachable (we can optimize here by making use of Pebble notices for more frequent checks, or wait until the next hook - at most update-status-hook-interval).
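
A rough sketch of that reconcile flow under the stated assumptions (only _check_service_connectivity() and the exposed endpoint properties appear in the PR; the other method names here are illustrative):

import ops

def reconcile(self) -> None:
    # Step 1: request (or update) the K8s service for the configured expose-external value
    self._apply_expose_external_service()  # hypothetical helper wrapping lightkube apply
    if not self._check_service_connectivity():
        # Steps 1 & 3: service not reachable yet; leave the relation databag untouched
        self.unit.status = ops.MaintenanceStatus("Waiting for Kubernetes service")
        return
    # Step 2: service is reachable, publish (or refresh) the endpoints in the databag
    self._database_provides.update_endpoints(  # hypothetical relation-interface helper
        read_write=self._exposed_read_write_endpoint,
        read_only=self._exposed_read_only_endpoint,
    )
    self.unit.status = ops.ActiveStatus()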

@theoctober19th (Member) commented Nov 14, 2024

Thanks for the explanation, @shayancanonical.

We discussed this in our team and, thanks to @welpaolo, considered an additional possibility: set a flag to false in the peer relation databag whenever a service is being created / deleted, which would then trigger a peer-relation-changed event. In that event we check the availability of the service and either a) reset the flag and update the endpoints, or b) defer the peer-relation-changed hook and repeat this process when the deferred hook gets fired later.

This can also be combined with checking the status of the service in other event hooks (including the update-status hook). This will effectively check service availability during either peer-relation-changed or other event hooks, whichever occurs earlier.
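
An illustrative sketch of that flag-based approach (databag keys and helper names are assumptions, not an agreed design; note that writing the app databag requires leadership):

def _on_config_changed(self, _event) -> None:
    self._apply_expose_external_service()  # hypothetical helper wrapping lightkube apply
    if self.unit.is_leader():
        peer_data = self.model.get_relation("mysql-router-peers").data[self.app]
        peer_data["service-ready"] = "false"  # triggers peer-relation-changed on all units

def _on_peer_relation_changed(self, event) -> None:
    peer_data = self.model.get_relation("mysql-router-peers").data[self.app]
    if peer_data.get("service-ready") == "false":
        if not self._check_service_connectivity():
            event.defer()  # re-check when the deferred hook fires later
            return
        if self.unit.is_leader():
            peer_data["service-ready"] = "true"
            self._update_endpoints()  # hypothetical helper publishing the endpoints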
