[DPE-5637, DPE-5276] Implement expose-external config option with values false (clusterip), nodeport and loadbalancer #328
base: main
Conversation
src/kubernetes_charm.py
Outdated
# Delete and re-create until https://bugs.launchpad.net/juju/+bug/2084711 resolved
if service_exists:
    logger.info(f"Deleting service {service_type=}")
    self._lightkube_client.delete(
        res=lightkube.resources.core_v1.Service,
        name=self._service_name,
        namespace=self.model.name,
    )
    logger.info(f"Deleted service {service_type=}")

logger.info(f"Applying service {service_type=}")
self._lightkube_client.apply(service, field_manager=self.app.name)
logger.info(f"Applied service {service_type=}")
did you find more information about this? I don't think we should need to delete and re-create the service
filed a juju bug: https://bugs.launchpad.net/juju/+bug/2084711, which has been triaged
essentially, we have included deletion + re-creation of the service as a workaround until we get help from juju to determine what may be happening
it seems like this might be a misconfiguration of metallb and not a juju bug—I don't see how patching a k8s service not created by juju would cause the juju cli to have issues
did you try the multiple ips that @taurus-forever mentioned?
i did try using multiple IPs for metallb. unfortunately, that did not work. additionally, in the bug report, i was able to confirm the issue using microk8s.kubectl without any charm code
did you test with EKS or GKE?
no, i did not yet test with EKS or GKE. i don't believe that testing with these platforms should necessarily be a blocker for this PR
we tested on AKS, and this issue did not manifest itself in AKS
this issue did not manifest itself in AKS
it sounds like it might be a metallb+microk8s issue then? in that case I think we should consider patching the service instead of deleting + re-creating
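For reference, a minimal sketch of the patch-based alternative being suggested, assuming the same lightkube client and Service object used elsewhere in this PR (not the charm's actual code):

import lightkube
import lightkube.resources.core_v1
from lightkube.types import PatchType


def patch_service(
    client: lightkube.Client,
    service: lightkube.resources.core_v1.Service,
    service_name: str,
    namespace: str,
    field_manager: str,
) -> None:
    """Sketch: mutate the existing Service in place via server-side apply
    instead of deleting and re-creating it."""
    client.patch(
        res=lightkube.resources.core_v1.Service,
        name=service_name,
        obj=service,
        namespace=namespace,
        patch_type=PatchType.APPLY,  # server-side apply
        field_manager=field_manager,
        force=True,  # take ownership of conflicting fields
    )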
i agree, but we are unable to run integration tests without deleting and recreating (as tests experience flicker of the juju client and fail with a Bad file descriptor error). @paulomach please share your thoughts when you are able
As @shayancanonical did, I validated the flaky behavior independently of a charm, but saw no issue in AKS.
Independently, this does not seem to be an issue in the charm/lightkube.
So let's not block the PR, since this can be refactored once we have a better understanding or a fix.
.github/workflows/ci.yaml
Outdated
- uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@v22.0.0
+ uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@feature/metallb
guessing this is temporary for testing? is there a dpw pr that needs review?
yes, will open a PR in dpw shortly
def _get_node_hosts(self) -> list[str]:
    """Return the node ports of nodes where units of this app are scheduled."""
    peer_relation = self.model.get_relation(self._PEER_RELATION_NAME)
how was this done before without the peer relation?
should self._*endpoint be renamed to endpoints?
prior to this PR, we were only providing the host:node_port of one unit in the application. upon discussion, we realized that we would need to provide all host:node_ports where the units are scheduled. we are unable to determine the nodes where units are deployed without the peer relation, which provides all the available/active units
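To make that concrete, a rough sketch of collecting one host per scheduled unit via the peer relation could look like the following (hypothetical helper; the actual implementation in this PR may differ):

import lightkube
import lightkube.resources.core_v1


def get_node_hosts(charm) -> list[str]:
    """Sketch: one address per Kubernetes node that hosts a unit of this app,
    derived from the units listed in the peer relation."""
    peer_relation = charm.model.get_relation("mysql-router-peers")
    if not peer_relation:
        return []
    client = lightkube.Client()
    hosts = set()
    # peer_relation.units excludes the current unit, so include it explicitly
    for unit in set(peer_relation.units) | {charm.unit}:
        pod_name = unit.name.replace("/", "-")  # unit "app/0" -> pod "app-0"
        pod = client.get(
            lightkube.resources.core_v1.Pod,
            name=pod_name,
            namespace=charm.model.name,
        )
        node = client.get(lightkube.resources.core_v1.Node, name=pod.spec.nodeName)
        for address in node.status.addresses:
            if address.type in ("ExternalIP", "InternalIP", "Hostname"):
                hosts.add(address.address)
                break
    return sorted(hosts)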
@@ -96,7 +96,7 @@ jobs:
      - lint
      - unit-test
      - build
-     uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@v23.0.2
+     uses: canonical/data-platform-workflows/.github/workflows/integration_test_charm.yaml@feature/metallb
Please revert before merging.
src/kubernetes_charm.py
Outdated
@@ -51,11 +61,18 @@
 class KubernetesRouterCharm(abstract_charm.MySQLRouterCharm):
     """MySQL Router Kubernetes charm"""

+    _PEER_RELATION_NAME = "mysql-router-peers"
+    _SERVICE_PATCH_TIMEOUT = 5 * 60
Sleeping up to 5 minutes might produce more issues.
Do we have other options here? Time for Pebble notices?
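For context, the timeout presumably bounds a poll along these lines (a sketch with hypothetical names), which is what keeps the hook busy for up to five minutes:

import tenacity


class _ServiceNotReady(Exception):
    """Raised while the exposed Kubernetes service is not yet reachable."""


@tenacity.retry(
    retry=tenacity.retry_if_exception_type(_ServiceNotReady),
    stop=tenacity.stop_after_delay(5 * 60),  # _SERVICE_PATCH_TIMEOUT
    wait=tenacity.wait_fixed(5),
    reraise=True,
)
def wait_for_service(charm) -> None:
    """Sketch: poll connectivity instead of a single long sleep; give up after
    the timeout so the hook fails instead of hanging indefinitely."""
    if not charm._check_service_connectivity():
        raise _ServiceNotReady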
I like the tests. If feasible it would be nice to have proper config validation, but that can be done later, given the timing we want to achieve.
There are some other non-blocking comments.
expose-external:
  description: |
    String to determine how to expose the MySQLRouter externally from the Kubernetes cluster.
    Possible values: 'false', 'nodeport', 'loadbalancer'
nit: default false may imply true is a valid value. Change to something else? no? none?
@paulomach any ideas for alternatives? expose-external: none? expose-external: clusterip?
additionally, the possible values are false and nodeport in kafka-k8s. it may not be a good idea to introduce inconsistencies
oh bummer, ok maybe for another time.
Interestingly, it seems it was also a Marc nitpick that Mykola did not respond to in the original kafka spec.
@@ -80,11 +97,136 @@ def _upgrade(self) -> typing.Optional[kubernetes_upgrade.Upgrade]:
         except upgrade.PeerRelationNotReady:
             pass

+    @property
+    def _status(self) -> ops.StatusBase:
Although I understand (single config, router context) why this was done as input validation, I don't like it.
I prefer to do proper config validation as we are doing in the server and other charms.
Not a blocker for now, but it should get changed.
imo, this is "proper config validation"
if we had several config values or more complex validation, I agree it'd be worth adopting the approach in server
but given how simple the validation is I don't think the extra dependency & abstraction is worth it—imo, KISS
yes, hence not blocking it. Once we reach (if ever) more configs (>3) we should do it
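For illustration, the inline check being discussed is roughly of this shape (a sketch, not the PR's exact code; the class and status message are stand-ins):

import typing

import ops

_VALID_EXPOSE_EXTERNAL = ("false", "nodeport", "loadbalancer")


class _ExposeExternalValidation(ops.CharmBase):
    """Sketch of simple inline validation of the expose-external config."""

    @property
    def _status(self) -> typing.Optional[ops.StatusBase]:
        value = self.config.get("expose-external", "false")
        if value not in _VALID_EXPOSE_EXTERNAL:
            return ops.BlockedStatus(
                f"Invalid expose-external value {value!r}; "
                f"expected one of {', '.join(_VALID_EXPOSE_EXTERNAL)}"
            )
        return None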
    exposed_read_write_endpoint=self._exposed_read_write_endpoint,
    exposed_read_only_endpoint=self._exposed_read_only_endpoint,
)
elif self._check_service_connectivity():
Suggested change:
- elif self._check_service_connectivity():
+ if self._check_service_connectivity():

nit
    Only applies to Kubernetes charm
    """

@abc.abstractmethod
def _check_service_connectivity(self) -> bool:
    """Check if the service is available (connectable with a socket)"""
is this k8s only?
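For reference, a socket-based check of the kind the docstring describes might be implemented roughly like this (a sketch; how the host and port are resolved is assumed):

import socket


def check_service_connectivity(host: str, port: int, timeout: float = 3.0) -> bool:
    """Sketch: True if a TCP connection to the exposed service succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False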
def external_connectivity(self, event) -> bool:
    """Whether any of the relations are marked as external."""
is this still in use by vm charm? https://github.com/canonical/mysql-router-operator/blob/60ad0549c590d48d77d37f83fe8d105f5a182d4a/src/machine_charm.py#L114
f"{unit_name}.{self._charm.app.name}", | ||
f"{unit_name}.{self._charm.app.name}.{self._charm.model_service_domain}", | ||
f"{service_name}.{self._charm.app.name}", | ||
f"{service_name}.{self._charm.app.name}.{self._charm.model_service_domain}", |
if user tries to connect with juju's service, should that be possible?
also, can you double-check the changes to sans with @delgod if you haven't already?
@@ -19,6 +19,12 @@ def model_service_domain(monkeypatch, request):
     monkeypatch.setattr(
         "kubernetes_charm.KubernetesRouterCharm.model_service_domain", request.param
     )
+    monkeypatch.setattr(
question: why is this being patched again here?
does it need to be?
Great work @shayancanonical! I'm about to do a similar feature in … When do we check whether the "reconciliation" was successful? In the PR I saw this has been checked on the …
@theoctober19th usually …
Thanks for the explanation, @shayancanonical. We had discussed this in our team, and thanks to @welpaolo we also discussed the additional possibility of having a flag set to false in the peer relation databag whenever a service is being created / deleted. That would then trigger a peer-relation-changed event, and in that event we check for the availability of the service and either a) reset the flag and update the endpoints, or b) defer the peer-relation-changed event hook and basically repeat this process when the deferred hook gets fired later. This can also be combined with checking the status of the service in other event hooks (including the update-status hook). This will effectively check service availability during either peer-relation-changed or other event hooks, whichever occurs earlier.
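A minimal sketch of the flag-plus-defer pattern described above, under the assumption of hypothetical names (_check_service_connectivity comes from this PR, _update_endpoints is invented for illustration):

import ops


class ExposeReconciler(ops.Object):
    """Sketch of tracking pending service changes in the peer databag and
    deferring until the service is actually reachable."""

    def __init__(self, charm: ops.CharmBase, peer_relation_name: str):
        super().__init__(charm, "expose-reconciler")
        self._charm = charm
        self._peer_relation_name = peer_relation_name
        self.framework.observe(
            charm.on[peer_relation_name].relation_changed, self._reconcile
        )
        self.framework.observe(charm.on.update_status, self._reconcile)

    def mark_service_pending(self) -> None:
        """Call right after (re)creating or deleting the Kubernetes service.

        Writing to the app databag (leader only) triggers peer-relation-changed
        on the other units.
        """
        relation = self._charm.model.get_relation(self._peer_relation_name)
        relation.data[self._charm.app]["service-ready"] = "false"

    def _reconcile(self, event: ops.EventBase) -> None:
        relation = self._charm.model.get_relation(self._peer_relation_name)
        if not relation or not self._charm.unit.is_leader():
            return
        databag = relation.data[self._charm.app]
        if databag.get("service-ready") == "true":
            return
        if self._charm._check_service_connectivity():  # helper added in this PR
            databag["service-ready"] = "true"
            self._charm._update_endpoints()  # hypothetical helper
        else:
            event.defer()  # retried when the deferred hook fires later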
Issue
Solution