redesign the CoreDNS Pod Deployment in kubeadm #1931
Comments
/priority backlog
@ereslibre this is a duplicate of: please read the whole discussion there.
this is problematic because it would leave the second replica as pending. what should be done instead is to make CoreDNS a DS that lands Pods on all CP nodes.
This should not happen with
both replicas will land on the same CP node in the beginning and then the "join" process has to amend the Deployment?
Yes, the
please add this as an agenda item for next week. i'm really in favor of breaking users now, deploying it as a DS, and adding an action required note in the release notes (e.g. in 1.18).
/assign
@rajansandeep i proposed that we should chat about this in the next office hours on Wednesday. |
@neolit123 Yes, I agree. Would like to discuss what would be the potential direction for this. |
today i played with deploying CoreDNS as a DS with this manifest (based on the existing Deployment manifest). it works fine and targets CP nodes only.
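The manifest itself isn't attached here, but a rough sketch of the scheduling-relevant parts of such a DaemonSet could look like the following. This assumes the node-role.kubernetes.io/master label/taint that kubeadm applied to control-plane nodes at the time (newer clusters use node-role.kubernetes.io/control-plane), and the container spec is abbreviated to mirror the existing CoreDNS Deployment:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: coredns
  namespace: kube-system
  labels:
    k8s-app: kube-dns
spec:
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
    spec:
      # run one CoreDNS pod per control-plane node, and only there
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: CriticalAddonsOnly
        operator: Exists
      priorityClassName: system-cluster-critical
      serviceAccountName: coredns
      containers:
      - name: coredns
        image: k8s.gcr.io/coredns:1.6.5   # version illustrative
        args: ["-conf", "/etc/coredns/Corefile"]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/coredns
          readOnly: true
      volumes:
      - name: config-volume
        configMap:
          name: coredns
          items:
          - key: Corefile
            path: Corefile
```

With a DaemonSet the replica count follows the number of control-plane nodes automatically, which is what makes the "both replicas on the first CP node" problem go away for the common case.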
i think it's also time to deprecate kube-dns in 1.18 and apply a 3 cycle deprecation policy to it, unless we switch to an addon installer sooner than that.
my proposal is to treat the big scale clusters (e.g. 5k nodes) as second class. for the common use case the DaemonSet that targets only CP nodes seems fine to me. there were objections that we should not move to a DS, but i'm not seeing any good ideas on how to keep using a Deployment and fix the issue outlined in this ticket. i think @yastij spoke about something similar to this: yet if i recall correctly, from my tests it didn't work so well.
My proposal stands:
This ensures we are not messing with the replica number, and we are only taking action if all CoreDNS pods are scheduled on the same node. I can put a PR together with a proof of concept if you want to try it.
Does it make sense for kubeadm to support an autoscaler?
I would twist the question to: should the default deployment with kubeadm make it harder to use an autoscaler? Answering your question: we are not "supporting" it, but in my opinion we shouldn't make it deliberately harder to use if folks want to.
i'm going to test this again; if this works we don't have to make any other changes.
I'm fine if this is where we think kubeadm's bounds are. Note that even with what you propose, we will have a period of time in which no CoreDNS pods are answering internal DNS requests if the first control plane node goes down. Also, we should use ... In my opinion, for a proper solution we need to force the rescheduling upon a new node join. But I'm fine if that's not the expectation of the broader group.
even if we reschedule, if both of those Nodes become NotReady there will be no service.
i guess we can experiment with an aggressive timeout for this too - i.e. considering the Pods for coredns are critical. i think we might want to outline the different proposals and approaches in a google doc and get some votes going, otherwise this issue will not see any progress.
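As a rough illustration of that "aggressive timeout" idea (not something kubeadm does today): the pod-level tolerationSeconds for the standard not-ready/unreachable NoExecute taints controls how long a pod stays bound to a NotReady node before eviction. The default is 300 seconds, which matches the roughly 5 minute migration delay reported later in this thread. A minimal sketch for the CoreDNS pod template, with an illustrative 30s value:

```yaml
tolerations:
# evict CoreDNS from a broken node much sooner than the 300s default
- key: node.kubernetes.io/not-ready
  effect: NoExecute
  tolerationSeconds: 30
- key: node.kubernetes.io/unreachable
  effect: NoExecute
  tolerationSeconds: 30
```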
I wrote this patch as an alternative: kubernetes/kubernetes@master...ereslibre:even-dns-deployment-on-join. If you think it's reasonable I can open a PR for further discussion.
so adding the phase means binding more logic to the coredns deployment, yet the idea is to move the coredns deployment outside of kubeadm long term, which means that we have to deprecate the phase or just keep it "experimental".
That is correct, or it could even be a hidden phase.
yes, a hidden phase sounds fine.
i played with this patch:
results:
I created this document to follow up on the different alternatives and discuss all of them: https://docs.google.com/document/d/1shXvqFwcqH8hJF-tAYNlEnVn45QZChfTjxKzF_D3d8I/edit
Sorry for getting into this conversation, but I was just observing similar behavior: two coredns pods are crammed onto a single (first) control plane node until that node is powered off; then the pods are migrated after a certain timeout (about 5 minutes, I guess) and get more evenly spread after this migration. What I didn't follow is why coredns is treated differently compared to other kube master services, like the apiserver, proxy, or etcd. What is the purpose of having 2 pods on a single node? It looks like the most important service, since it has 2 pods by default, while actually it becomes the least fault-tolerant one because of the odd placement. Why not just make it the same as the other services and be done with it, especially if it is that important?
@rdxmb the only thing that is preserved in the Deployment i believe is the number of replicas:
@neolit123 thanks!
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
CoreDNS gets scheduled on a worker node after one of the control plane nodes comes down, although it doesn't become ready on the worker node. To fix that, it has to be manually cordoned off the worker node, which isn't an ideal scenario. Ideally it should not have attempted to schedule on non control plane nodes.
Scheduling on worker nodes works fine.
Depends. Targeting only CP nodes was discussed as not ideal either. For customizing that, you could patch the coredns deployment or skip kubeadm deploying coredns and do that yourself.
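For example (a sketch only, not an official kubeadm recommendation): if you do want CoreDNS pinned to control-plane nodes, a strategic-merge patch along these lines could be applied with kubectl patch, or folded into the coredns Deployment in kube-system with kubectl edit. The node-role.kubernetes.io/control-plane label/taint is assumed here; older clusters used node-role.kubernetes.io/master instead:

```yaml
# hypothetical patch for the coredns Deployment in kube-system
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        effect: NoSchedule
      - key: CriticalAddonsOnly
        operator: Exists
```

Note the earlier comment about kubeadm upgrades preserving only the replica count of the Deployment; a customization like this may need to be re-applied after an upgrade.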
Does the descheduler make sense in this case?
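Possibly; the descheduler's RemoveDuplicates strategy is the relevant piece, since it evicts pods that share an owner and landed on the same node, letting the scheduler spread them again. A rough sketch of its v1alpha1 policy is below; the descheduler itself would have to be deployed separately (e.g. as a CronJob), which is not something kubeadm sets up:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
```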
i think we should move the coredns deployment to the addon operator eventually: kops is in the process of implementing it. after that we can investigate it for kubeadm.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-contributor-experience at kubernetes/community.
@neolit123 I'm +1 for closing this, given that the goal should be to defer this to the Addons provider; what we should do from the kubeadm PoV instead is to start planning a roadmap to pluggable Addons...
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-contributor-experience at kubernetes/community.
we haven't seen recent requests about modifying the coredns deployment as is. long term we want to move this addon to be external and not hardcoded in kubeadm code.
/cc
Is this a BUG REPORT or FEATURE REQUEST?
/kind feature
What happened?
When deploying a Kubernetes cluster using kubeadm we are creating CoreDNS with a default replica count of 2. When performing the kubeadm init execution, the CoreDNS deployment will be fully scheduled on the only control plane node available at that time. When joining further control plane nodes or worker nodes, both CoreDNS instances will remain on the first control plane node where they were scheduled at the time of creation.

What you expected to happen?

This is to gather information about how we would feel about performing the following changes:

Add a preferredDuringSchedulingIgnoredDuringExecution anti-affinity rule based on node hostname, so the scheduler will favour nodes where there's not a CoreDNS pod running already. This on its own doesn't make any difference (also, note that it's preferred and not required: otherwise a single control plane node without workers would never succeed to create the CoreDNS deployment, since requirements would never be met).

The next step then would be that kubeadm, when kubeadm join has finished and the new node is registered, would perform the following check against the CoreDNS deployment pods from the Kube API: if all pods are running on the same node, perform a kill of one pod, forcing Kubernetes to reschedule (at this point, the preferredDuringSchedulingIgnoredDuringExecution rule mentioned before will get into the game, and the rescheduled pod will be placed on the new node).

Maybe this is something we don't want to do, as we would be making kubeadm tied to workload specific logic, and with the addon story coming it might not make sense. However, this is something that happens on every kubeadm based deployment, and it will make CoreDNS fail temporarily in an HA environment if the first control plane node goes away, until those pods are rescheduled somewhere else, just because they were initially scheduled on the only control plane node available.

What do you think? cc/ @neolit123 @rosti @fabriziopandini @yastij
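For illustration, a minimal sketch of what that preferred anti-affinity rule could look like in the CoreDNS pod template is shown below. The weight is illustrative; kubernetes.io/hostname is the standard node hostname label, and k8s-app: kube-dns is the label carried by the kubeadm-deployed CoreDNS pods:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            k8s-app: kube-dns
```

Because the rule is preferred rather than required, a single-node cluster can still schedule both replicas on the one available node, which is exactly the property described above.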
google doc proposal
https://docs.google.com/document/d/1shXvqFwcqH8hJF-tAYNlEnVn45QZChfTjxKzF_D3d8I/edit