redesign the CoreDNS Pod Deployment in kubeadm #1931

Closed
ereslibre opened this issue Nov 22, 2019 · 45 comments
Labels: kind/design, kind/feature, lifecycle/rotten, priority/backlog

Comments

@ereslibre (Contributor)

ereslibre commented Nov 22, 2019

Is this a BUG REPORT or FEATURE REQUEST?

/kind feature

What happened?

When deploying a Kubernetes cluster using kubeadm, we create CoreDNS with a default replica count of 2.

During the kubeadm init execution, the CoreDNS deployment is fully scheduled on the only control plane node available at that time. When joining further control plane nodes or worker nodes, both CoreDNS instances remain on the first control plane node where they were scheduled at creation time.

What you expected to happen?

This is to gather information about how we would feel about performing the following changes:

  • Introduce a preferredDuringSchedulingIgnoredDuringExecution anti-affinity rule based on node hostname, so the scheduler will favour nodes where there's not a CoreDNS pod running already.

This on its own doesn't make any difference (also, note that it's preferred and not required: otherwise a single control plane node without workers would never succeed in creating the CoreDNS deployment, since the requirements would never be met).
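
For illustration only, such a preferred anti-affinity rule could look roughly like the following in the CoreDNS pod template. This is a sketch, not the actual kubeadm manifest; the weight is an assumption, and the label selector assumes the k8s-app: kube-dns label kubeadm already applies to CoreDNS:

  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          # spread by node hostname: prefer nodes that don't already run a kube-dns pod
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              k8s-app: kube-dns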

The next step would then be for kubeadm, once kubeadm join has finished and the new node is registered, to perform the following check:

  • Read the CoreDNS deployment's pods from the Kube API. If all pods are running on the same node, delete one pod, forcing Kubernetes to reschedule it (at this point, the preferredDuringSchedulingIgnoredDuringExecution rule mentioned before comes into play, and the rescheduled pod will be placed on the new node).

Maybe this is something we don't want to do, as we would be tying kubeadm to workload-specific logic, and with the addon story coming it might not make sense. However, this happens on every kubeadm-based deployment, and in an HA environment it will make CoreDNS fail temporarily if the first control plane node goes away, until those pods are rescheduled somewhere else, just because they were initially scheduled on the only control plane node available.

What do you think? cc/ @neolit123 @rosti @fabriziopandini @yastij


google doc proposal
https://docs.google.com/document/d/1shXvqFwcqH8hJF-tAYNlEnVn45QZChfTjxKzF_D3d8I/edit

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 22, 2019
@ereslibre (Contributor Author)

/priority backlog

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Nov 22, 2019
@neolit123 (Member)

neolit123 commented Nov 22, 2019

@ereslibre this is a duplicate of:
#1657

please read the whole discussion there.

preferredDuringSchedulingIgnoredDuringExecution

this is problematic because it would leave the second replica as pending.
if someone is running the e2e test suite on a single CP node cluster it will fail.

what should be done instead is to make CoreDNS a DS that lands Pods on all CP nodes.
but this will break all users that currently patch the CoreDNS deployment.

@ereslibre (Contributor Author)

this is problematic because it would leave the second replica as pending.

This should not happen with preferredDuringSchedulingIgnoredDuringExecution; it would only happen with requiredDuringSchedulingIgnoredDuringExecution. I can check though, but I'm fairly sure with preferred it will just work even with a single node.

@neolit123 (Member)

neolit123 commented Nov 22, 2019

I can check though, but I'm fairly sure with preferred it will just work even with a single node.

both replicas will land on the same CP node in the beginning and then the "join" process has to amend the Deployment?

@ereslibre (Contributor Author)

ereslibre commented Nov 22, 2019

Yes, the join process would "kubectl delete" one of the pods through the API; I specify the details in the body of the issue.

@neolit123 (Member)

neolit123 commented Nov 22, 2019

please add this to the agenda for next week.
the future coredns operator (as kubeadm will, by default, no longer include coredns in the future) will run it as a DaemonSet.

i'm really in favor of breaking the users now, deploying it as a DS and adding an "action required" note in the release notes (e.g. in 1.18).

@neolit123 neolit123 added the kind/design Categorizes issue or PR as related to design. label Nov 22, 2019
@neolit123 neolit123 added this to the v1.18 milestone Nov 22, 2019
@rajansandeep

/assign

@neolit123 (Member)

@rajansandeep i proposed that we chat about this at the next office hours on Wednesday
and potentially decide if we want to proceed with changes for 1.18.

@rajansandeep

@neolit123 Yes, I agree. I would like to discuss what the potential direction for this would be.

@neolit123 (Member)

today i played with deploying CoreDNS as a DS with this manifest (based on the existing Deployment manifest). it works fine and targets CP nodes only.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: {{ .DeploymentName }}
  namespace: kube-system
  labels:
    k8s-app: kube-dns
spec:
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: coredns
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: {{ .ControlPlaneTaintKey }}
        effect: NoSchedule
      nodeSelector:
        beta.kubernetes.io/os: linux
        {{ .ControlPlaneTaintKey }}: ""
      containers:
      - name: coredns
        image: {{ .Image }}
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        args: [ "-conf", "/etc/coredns/Corefile" ]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/coredns
          readOnly: true
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9153
          name: metrics
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        readinessProbe:
          httpGet:
            path: /ready
            port: 8181
            scheme: HTTP
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - all
          readOnlyRootFilesystem: true
      dnsPolicy: Default
      volumes:
        - name: config-volume
          configMap:
            name: coredns
            items:
            - key: Corefile
              path: Corefile

i think it's also time to deprecate kube-dns in 1.18 and apply a 3-cycle deprecation policy to it, unless we switch to an addon installer sooner than that.

@neolit123 (Member)

neolit123 commented Nov 27, 2019

my proposal is to treat the big-scale clusters (e.g. 5k nodes) as second class.
if those have special demands for the coredns setup today, they already need to enable the autoscaler and possibly apply other customizations.

for the common use case the DaemonSet that targets only CP nodes seems fine to me.

there were objections that we should not move to a DS, but i'm not seeing any good ideas on how to continue using a Deployment and to fix the issue outlined in this ticket.

i think @yastij spoke about something similar to this:
kubernetes/kubernetes#85108 (comment)

yet if i recall correctly, from my tests it didn't work so well.

@ereslibre (Contributor Author)

ereslibre commented Nov 27, 2019

My proposal stands:

  • Set a preferredDuringSchedulingIgnoredDuringExecution anti-affinity rule on CoreDNS.
  • Default the Deployment to 2 replicas, as it is now.
  • When a node joins:
    • Check the CoreDNS pods: if all of them are running on the same node, kill half of the CoreDNS pods.

This ensures we are not messing with the replica number, and we are only taking action if all CoreDNS pods are scheduled on the same node.

I can put a PR together with a proof of concept if you want to try it.

@rajansandeep

Does it make sense for kubeadm to support an autoscaler?

@ereslibre (Contributor Author)

Does it make sense for kubeadm to support an autoscaler?

I would turn the question around: should the default deployment with kubeadm make it harder to use an autoscaler?

Answering your question: we are not "supporting" it, but in my opinion we shouldn't make it deliberately harder to use if folks want to.
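
For context, "the autoscaler" in this discussion usually refers to the cluster-proportional-autoscaler, which scales the CoreDNS Deployment with cluster size. A minimal sketch of its linear scaling parameters follows, assuming the autoscaler is deployed separately; the ConfigMap name and the values are illustrative, not kubeadm defaults:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: dns-autoscaler        # hypothetical name; must match the autoscaler's --configmap flag
    namespace: kube-system
  data:
    # one DNS replica per 16 nodes or per 256 cores, whichever yields more, with a floor of 2
    linear: |-
      {
        "coresPerReplica": 256,
        "nodesPerReplica": 16,
        "preventSinglePointFailure": true,
        "min": 2
      }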

@neolit123 (Member)

kubernetes/kubernetes#85108 (comment)
yet if i recall correctly, from my tests it didn't work so well.

i'm going to test this again, if this works we don't have to make any other changes.

@ereslibre (Contributor Author)

i'm going to test this again, if this works we don't have to make any other changes.

I'm fine with that if this is where we think kubeadm's boundaries are.

Note that even with what you propose, we will have a period of time in which no CoreDNS pods will be answering internal DNS requests if the first control plane node goes down.

Also, we should use preferredDuringSchedulingIgnoredDuringExecution instead of requiredDuringSchedulingIgnoredDuringExecution, for the issues mentioned earlier during the meeting.

In my opinion, for a proper solution we need to force the rescheduling upon a new node join. But I'm fine if that's not the expectation of the broader group.

@neolit123 (Member)

In my opinion, for a proper solution we need to force the rescheduling upon a new node join. But I'm fine if that's not the expectation of the broader group.

even if we reschedule, if both of those Nodes become NotReady there will be no service.

Note that even with what you propose, we will have a period of time in which no CoreDNS pods will be answering internal DNS requests if the first control plane node goes down.

i guess we can experiment with an aggressive timeout for this too, i.e. considering that the Pods for coredns are critical.

i think we might want to outline the different proposals and approaches in a google doc and get some votes going, otherwise this issue will not see any progress.

@ereslibre (Contributor Author)

I wrote this patch as an alternative: kubernetes/kubernetes@master...ereslibre:even-dns-deployment-on-join. If you think it's reasonable, I can open a PR for further discussion.

@neolit123 (Member)

so adding the phase means binding more logic to the coredns deployment, yet the idea is to move the coredns deployment outside of kubeadm long term, which means that we have to deprecate the phase or just keep it "experimental".

@ereslibre (Contributor Author)

ereslibre commented Nov 27, 2019

so adding the phase means binding more logic to the coredns deployment, yet the idea is to move the coredns deployment outside of kubeadm long term, which means that we have to deprecate the phase or just keep it "experimental".

That is correct, or it could even be a hidden phase.

@neolit123 (Member)

yes, a hidden phase sounds fine.
still, it's better to enumerate the different options before PRs.

@neolit123 (Member)

i played with this patch:

diff --git a/cmd/kubeadm/app/phases/addons/dns/manifests.go b/cmd/kubeadm/app/phases/addons/dns/manifests.go
index 0ff61431705..5442b29ed88 100644
--- a/cmd/kubeadm/app/phases/addons/dns/manifests.go
+++ b/cmd/kubeadm/app/phases/addons/dns/manifests.go
@@ -243,6 +243,14 @@ spec:
         operator: Exists
       - key: {{ .ControlPlaneTaintKey }}
         effect: NoSchedule
+      - key: "node.kubernetes.io/unreachable"
+        operator: "Exists"
+        effect: "NoExecute"
+        tolerationSeconds: 5
+      - key: "node.kubernetes.io/not-ready"
+        operator: "Exists"
+        effect: "NoExecute"
+        tolerationSeconds: 5
       nodeSelector:
         beta.kubernetes.io/os: linux
       containers:

results:

  • create a cluster with multiple nodes
  • "stop" (e.g. shutdown a VM) the primary CP node where coredns landed
  • the Node becomes NotReady after a few minutes (is there a way to control this?)
  • 5 seconds after the Node is NotReady, the coredns pods on this Node enter Terminating state
  • the Pods are scheduled on new Nodes
  • if the primary CP Node is brought back to Ready the Terminating Pods terminate and are deleted.
  • if the primary CP Node is deleted the same happens.
  • if no action is taken on the primary CP Node, the Pods remain Terminating indefinitely.
    (related to Pods stuck on terminating kubernetes#51835?)
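
On the "is there a way to control this?" question in the list above: how quickly a Node is marked NotReady is governed by the kube-controller-manager's --node-monitor-grace-period flag (and pod eviction timing by --pod-eviction-timeout, unless taint-based tolerations such as the ones in the patch above take over). A hedged sketch of tuning it through kubeadm's ClusterConfiguration; the 20s value is only an example, not a recommendation:

  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterConfiguration
  controllerManager:
    extraArgs:
      # example value only: mark Nodes NotReady faster than the 40s default
      node-monitor-grace-period: "20s"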

@ereslibre (Contributor Author)

I created this document to follow up on the different alternatives and discuss all of them: https://docs.google.com/document/d/1shXvqFwcqH8hJF-tAYNlEnVn45QZChfTjxKzF_D3d8I/edit

@non7top

non7top commented Jan 7, 2020

Sorry for jumping into this conversation, but I was just observing similar behavior: two coredns pods are crammed onto a single (the first) control plane node until that node is powered off; then the pods are migrated after a certain timeout (about 5 minutes, I guess) and end up more evenly spread after this migration.

What I don't follow is why coredns is treated differently compared to other kube master services, like the apiserver, proxy, or etcd. What is the purpose of having 2 pods on a single node? It looks like the most important service, since it has 2 pods by default, while in practice it becomes the least fault-tolerant one because of this odd placement.

Why not just make it the same as the other services and be done with it, especially if it is that important?

@neolit123 (Member)

@rdxmb
kubeadm imports a coredns migration library but that works only on the Corefile in the coredns ConfigMap during upgrade.
you still have to patch the Deployment or outer ConfigMap.

the only thing that is preserved in the Deployment i believe is the number of replicas:
#1954
kubernetes/kubernetes#85837

@rdxmb

rdxmb commented Sep 2, 2020

@neolit123 thanks!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 1, 2020
@rdxmb

rdxmb commented Dec 1, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 1, 2020
@neolit123 neolit123 modified the milestones: v1.20, v1.21 Dec 2, 2020
@ghost

ghost commented Jan 1, 2021

CoreDNS gets scheduled on a worker node after one of the control plane nodes goes down, although it doesn't become ready on the worker node. To fix that, it has to be manually cordoned off the worker node, which shouldn't be the ideal scenario. Ideally, it should not have attempted to be scheduled on non-control-plane nodes.

@neolit123 (Member)

neolit123 commented Jan 4, 2021 via email

@pacoxu (Member)

pacoxu commented Feb 5, 2021

Does the descheduler make sense in this case?
If we added descheduling logic for coredns (for instance, a CRD to declare which Deployments need to be descheduled), it would be nice to have. The descheduler would then handle similar things for coredns or other applications.
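
For reference, the descheduler's RemoveDuplicates strategy targets exactly this situation (multiple pods of the same Deployment on one node). A minimal policy sketch, assuming the descheduler is installed separately (kubeadm does not ship it):

  apiVersion: "descheduler/v1alpha1"
  kind: "DeschedulerPolicy"
  strategies:
    # evict duplicate pods so they get rescheduled onto other nodes
    "RemoveDuplicates":
      enabled: true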

@neolit123 (Member)

i think we should move the coredns deployment to the addon operator eventually:
https://github.com/kubernetes-sigs/cluster-addons/tree/master/coredns

kops is in the process of implementing it. after that we can investigate it for kubeadm.

@neolit123 neolit123 modified the milestones: v1.21, v1.22 Feb 8, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 9, 2021
@fabriziopandini (Member)

@neolit123 I'm +1 for closing this, given that the goal should be to defer this to an Addons provider; what we should do from the kubeadm PoV instead is start planning a roadmap towards pluggable Addons...

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 9, 2021
@neolit123 (Member)

we haven't seen recent requests to modify the coredns deployment as it is.
i have seen users skip the kubeadm addons and deploy their own custom ones, which allows them to configure them completely.

long term we want to move this addon to be external and not hardcoded in kubeadm code.

@wangyysde (Member)

/cc
