redesign the CoreDNS Pod Deployment in kubeadm #1931

Closed
ereslibre opened this issue Nov 22, 2019 · 45 comments
Labels: kind/design, kind/feature, lifecycle/rotten, priority/backlog

Comments

@ereslibre (Contributor)

ereslibre commented Nov 22, 2019

Is this a BUG REPORT or FEATURE REQUEST?

/kind feature

What happened?

When deploying a Kubernetes cluster using kubeadm, we create CoreDNS with a default replica count of 2.

During the kubeadm init execution, the CoreDNS deployment is fully scheduled on the only control plane node available at that time. When joining further control plane nodes or worker nodes, both CoreDNS instances remain on the first control plane node where they were scheduled at creation time.

What you expected to happen?

This is to gather information about how we would feel about performing the following changes:

  • Introduce a preferredDuringSchedulingIgnoredDuringExecution anti-affinity rule based on node hostname, so the scheduler will favour nodes where there's not a CoreDNS pod running already.

This on its own doesn't make any difference (also, note that it's preferred and not required: otherwise a single control plane node without workers would never succeed in creating the CoreDNS deployment, since the requirements would never be met).
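
For illustration only, such a preferred anti-affinity rule could look roughly like the following in the CoreDNS pod template. This is a sketch, not the actual kubeadm manifest; the weight is an assumption, and the label selector assumes the k8s-app: kube-dns label kubeadm already applies to CoreDNS:

  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          # spread by node hostname: prefer nodes that don't already run a kube-dns pod
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              k8s-app: kube-dns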

The next step would then be for kubeadm, once kubeadm join has finished and the new node is registered, to perform the following check:

  • Read the CoreDNS deployment's pods from the Kube API. If all pods are running on the same node, delete one pod, forcing Kubernetes to reschedule it (at this point, the preferredDuringSchedulingIgnoredDuringExecution rule mentioned before comes into play, and the rescheduled pod will be placed on the new node).

Maybe this is something we don't want to do, as we would be tying kubeadm to workload-specific logic, and with the addon story coming it might not make sense. However, this happens on every kubeadm-based deployment, and in an HA environment it will make CoreDNS fail temporarily if the first control plane node goes away, until those pods are rescheduled somewhere else, just because they were initially scheduled on the only control plane node available.

What do you think? cc/ @neolit123 @rosti @fabriziopandini @yastij


google doc proposal
https://docs.google.com/document/d/1shXvqFwcqH8hJF-tAYNlEnVn45QZChfTjxKzF_D3d8I/edit

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 22, 2019
@ereslibre (Contributor Author)

/priority backlog

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Nov 22, 2019
@neolit123 (Member)

neolit123 commented Nov 22, 2019

@ereslibre this is a duplicate of:
#1657

please read the whole discussion there.

preferredDuringSchedulingIgnoredDuringExecution

this is problematic because it would leave the second replica as pending.
if someone is running the e2e test suite on a single CP node cluster it will fail.

what should be done instead is to make CoreDNS a DS that lands Pods on all CP nodes.
but this will break all users that currently patch the CoreDNS deployment.

@ereslibre (Contributor Author)

this is problematic because it would leave the second replica as pending.

This should not happen with preferredDuringSchedulingIgnoredDuringExecution; it would only happen with requiredDuringSchedulingIgnoredDuringExecution. I can check though, but I'm fairly sure with preferred it will just work even with a single node.

@neolit123 (Member)

neolit123 commented Nov 22, 2019

I can check though, but I'm fairly sure with preferred it will just work even with a single node.

both replicas will land on the same CP node in the beginning and then the "join" process has to amend the Deployment?

@ereslibre (Contributor Author)

ereslibre commented Nov 22, 2019

Yes, the join process would "kubectl delete" one of the pods through the API; I specify the details in the body of the issue.

@neolit123 (Member)

neolit123 commented Nov 22, 2019

please add this to the agenda for next week.
the future coredns operator (as kubeadm will, by default, no longer include coredns in the future) will run it as a DaemonSet.

i'm really in favor of breaking the users now, deploying it as a DS and adding an "action required" note in the release notes (e.g. in 1.18).

@neolit123 neolit123 added the kind/design Categorizes issue or PR as related to design. label Nov 22, 2019
@neolit123 neolit123 added this to the v1.18 milestone Nov 22, 2019
@rajansandeep

/assign

@neolit123 (Member)

@rajansandeep i proposed that we chat about this at the next office hours on Wednesday
and potentially decide if we want to proceed with changes for 1.18.

@rajansandeep

@neolit123 Yes, I agree. I would like to discuss what the potential direction for this would be.

@neolit123 (Member)

today i played with deploying CoreDNS as a DS with this manifest (based on the existing Deployment manifest). it works fine and targets CP nodes only.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: {{ .DeploymentName }}
  namespace: kube-system
  labels:
    k8s-app: kube-dns
spec:
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: coredns
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: {{ .ControlPlaneTaintKey }}
        effect: NoSchedule
      nodeSelector:
        beta.kubernetes.io/os: linux
        {{ .ControlPlaneTaintKey }}: ""
      containers:
      - name: coredns
        image: {{ .Image }}
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        args: [ "-conf", "/etc/coredns/Corefile" ]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/coredns
          readOnly: true
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9153
          name: metrics
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        readinessProbe:
          httpGet:
            path: /ready
            port: 8181
            scheme: HTTP
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - all
          readOnlyRootFilesystem: true
      dnsPolicy: Default
      volumes:
        - name: config-volume
          configMap:
            name: coredns
            items:
            - key: Corefile
              path: Corefile

i think it's also time to deprecate kube-dns in 1.18 and apply a 3-cycle deprecation policy to it, unless we switch to an addon installer sooner than that.

@neolit123 (Member)

neolit123 commented Nov 27, 2019

my proposal is to treat the big-scale clusters (e.g. 5k nodes) as second class.
if those have special demands for the coredns setup today, they already need to enable the autoscaler and possibly apply other customizations.

for the common use case the DaemonSet that targets only CP nodes seems fine to me.

there were objections that we should not move to a DS, but i'm not seeing any good ideas on how to continue using a Deployment and to fix the issue outlined in this ticket.

i think @yastij spoke about something similar to this:
kubernetes/kubernetes#85108 (comment)

yet if i recall correctly, from my tests it didn't work so well.

@ereslibre (Contributor Author)

ereslibre commented Nov 27, 2019

My proposal stands:

  • Set a preferredDuringSchedulingIgnoredDuringExecution anti-affinity rule on CoreDNS.
  • Default the Deployment to 2 replicas, as it is now.
  • When a node joins:
    • Check the CoreDNS pods: if all of them are running on the same node, kill half of the CoreDNS pods.

This ensures we are not messing with the replica number, and we are only taking action if all CoreDNS pods are scheduled on the same node.

I can put a PR together with a proof of concept if you want to try it.

@rajansandeep

Does it make sense for kubeadm to support an autoscaler?

@ereslibre (Contributor Author)

Does it make sense for kubeadm to support an autoscaler?

I would turn the question around: should the default deployment with kubeadm make it harder to use an autoscaler?

Answering your question: we are not "supporting" it, but in my opinion we shouldn't make it deliberately harder to use if folks want to.
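
For context, "the autoscaler" in this discussion usually refers to the cluster-proportional-autoscaler, which scales the CoreDNS Deployment with cluster size. A minimal sketch of its linear scaling parameters follows, assuming the autoscaler is deployed separately; the ConfigMap name and the values are illustrative, not kubeadm defaults:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: dns-autoscaler        # hypothetical name; must match the autoscaler's --configmap flag
    namespace: kube-system
  data:
    # one DNS replica per 16 nodes or per 256 cores, whichever yields more, with a floor of 2
    linear: |-
      {
        "coresPerReplica": 256,
        "nodesPerReplica": 16,
        "preventSinglePointFailure": true,
        "min": 2
      }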

@neolit123 (Member)

kubernetes/kubernetes#85108 (comment)
yet if i recall correctly, from my tests it didn't work so well.

i'm going to test this again, if this works we don't have to make any other changes.

@ereslibre (Contributor Author)

i'm going to test this again, if this works we don't have to make any other changes.

I'm fine with that if this is where we think kubeadm's boundaries are.

Note that even with what you propose, we will have a period of time in which no CoreDNS pods will be answering internal DNS requests if the first control plane node goes down.

Also, we should use preferredDuringSchedulingIgnoredDuringExecution instead of requiredDuringSchedulingIgnoredDuringExecution, for the issues mentioned earlier during the meeting.

In my opinion, for a proper solution we need to force the rescheduling upon a new node join. But I'm fine if that's not the expectation of the broader group.

@neolit123 (Member)

In my opinion, for a proper solution we need to force the rescheduling upon a new node join. But I'm fine if that's not the expectation of the broader group.

even if we reschedule, if both of those Nodes become NotReady there will be no service.

Note that even with what you propose, we will have a period of time in which no CoreDNS pods will be answering internal DNS requests if the first control plane node goes down.

i guess we can experiment with an aggressive timeout for this too, i.e. considering that the Pods for coredns are critical.

i think we might want to outline the different proposals and approaches in a google doc and get some votes going, otherwise this issue will not see any progress.

@ereslibre (Contributor Author)

I wrote this patch as an alternative: kubernetes/kubernetes@master...ereslibre:even-dns-deployment-on-join. If you think it's reasonable, I can open a PR for further discussion.

@neolit123 (Member)

so adding the phase means binding more logic to the coredns deployment, yet the idea is to move the coredns deployment outside of kubeadm long term, which means that we have to deprecate the phase or just keep it "experimental".

@ereslibre (Contributor Author)

ereslibre commented Nov 27, 2019

so adding the phase means binding more logic to the coredns deployment, yet the idea is to move the coredns deployment outside of kubeadm long term, which means that we have to deprecate the phase or just keep it "experimental".

That is correct, or it could even be a hidden phase.

@neolit123 (Member)

yes, a hidden phase sounds fine.
still, it's better to enumerate the different options before PRs.

@neolit123 (Member)

i played with this patch:

diff --git a/cmd/kubeadm/app/phases/addons/dns/manifests.go b/cmd/kubeadm/app/phases/addons/dns/manifests.go
index 0ff61431705..5442b29ed88 100644
--- a/cmd/kubeadm/app/phases/addons/dns/manifests.go
+++ b/cmd/kubeadm/app/phases/addons/dns/manifests.go
@@ -243,6 +243,14 @@ spec:
         operator: Exists
       - key: {{ .ControlPlaneTaintKey }}
         effect: NoSchedule
+      - key: "node.kubernetes.io/unreachable"
+        operator: "Exists"
+        effect: "NoExecute"
+        tolerationSeconds: 5
+      - key: "node.kubernetes.io/not-ready"
+        operator: "Exists"
+        effect: "NoExecute"
+        tolerationSeconds: 5
       nodeSelector:
         beta.kubernetes.io/os: linux
       containers:

results:

  • create a cluster with multiple nodes
  • "stop" (e.g. shutdown a VM) the primary CP node where coredns landed
  • the Node becomes NotReady after a few minutes (is there a way to control this?)
  • 5 seconds after the Node is NotReady, the coredns pods on this Node enter Terminating state
  • the Pods are scheduled on new Nodes
  • if the primary CP Node is brought back to Ready the Terminating Pods terminate and are deleted.
  • if the primary CP Node is deleted the same happens.
  • if no action is taken on the primary CP Node, the Pods remain Terminating indefinitely.
    (related to Pods stuck on terminating kubernetes#51835?)
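
On the "is there a way to control this?" question in the list above: how quickly a Node is marked NotReady is governed by the kube-controller-manager's --node-monitor-grace-period flag (and pod eviction timing by --pod-eviction-timeout, unless taint-based tolerations such as the ones in the patch above take over). A hedged sketch of tuning it through kubeadm's ClusterConfiguration; the 20s value is only an example, not a recommendation:

  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterConfiguration
  controllerManager:
    extraArgs:
      # example value only: mark Nodes NotReady faster than the 40s default
      node-monitor-grace-period: "20s"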

@ereslibre (Contributor Author)

I created this document to follow up on the different alternatives and discuss all of them: https://docs.google.com/document/d/1shXvqFwcqH8hJF-tAYNlEnVn45QZChfTjxKzF_D3d8I/edit

@non7top

non7top commented Jan 7, 2020

Sorry for jumping into this conversation, but I was just observing similar behavior: two coredns pods are crammed onto a single (the first) control plane node until that node is powered off; then the pods are migrated after a certain timeout (about 5 minutes, I guess) and end up more evenly spread after this migration.

What I don't follow is why coredns is treated differently compared to other kube master services, like the apiserver, proxy, or etcd. What is the purpose of having 2 pods on a single node? It looks like the most important service, since it has 2 pods by default, while in practice it becomes the least fault-tolerant one because of this odd placement.

Why not just make it the same as the other services and be done with it, especially if it is that important?

@neolit123 (Member)

@rdxmb
kubeadm imports a coredns migration library but that works only on the Corefile in the coredns ConfigMap during upgrade.
you still have to patch the Deployment or outer ConfigMap.

the only thing that is preserved in the Deployment i believe is the number of replicas:
#1954
kubernetes/kubernetes#85837

@rdxmb

rdxmb commented Sep 2, 2020

@neolit123 thanks!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 1, 2020
@rdxmb

rdxmb commented Dec 1, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 1, 2020
@neolit123 neolit123 modified the milestones: v1.20, v1.21 Dec 2, 2020
@ghost

ghost commented Jan 1, 2021

CoreDNS gets scheduled on a worker node after one of the control plane nodes goes down, although it doesn't become ready on the worker node. To fix that, it has to be manually cordoned off the worker node, which shouldn't be the ideal scenario. Ideally, it should not have attempted to be scheduled on non-control-plane nodes.

@neolit123 (Member)

neolit123 commented Jan 4, 2021 via email

@pacoxu (Member)

pacoxu commented Feb 5, 2021

Does the descheduler make sense in this case?
If we added descheduling logic for coredns (for instance, a CRD to declare which Deployments need to be descheduled), it would be nice to have. The descheduler would then handle similar things for coredns or other applications.
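
For reference, the descheduler's RemoveDuplicates strategy targets exactly this situation (multiple pods of the same Deployment on one node). A minimal policy sketch, assuming the descheduler is installed separately (kubeadm does not ship it):

  apiVersion: "descheduler/v1alpha1"
  kind: "DeschedulerPolicy"
  strategies:
    # evict duplicate pods so they get rescheduled onto other nodes
    "RemoveDuplicates":
      enabled: true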

@neolit123 (Member)

i think we should move the coredns deployment to the addon operator eventually:
https://github.com/kubernetes-sigs/cluster-addons/tree/master/coredns

kops is in the process of implementing it. after that we can investigate it for kubeadm.

@neolit123 neolit123 modified the milestones: v1.21, v1.22 Feb 8, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 9, 2021
@fabriziopandini (Member)

@neolit123 I'm +1 for closing this, given that the goal should be to defer this to an Addons provider; what we should do from the kubeadm PoV instead is start planning a roadmap towards pluggable Addons...

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 9, 2021
@neolit123 (Member)

we haven't seen recent requests to modify the coredns deployment as it is.
i have seen users skip the kubeadm addons and deploy their own custom ones, which allows them to configure them completely.

long term we want to move this addon to be external and not hardcoded in kubeadm code.

@wangyysde (Member)

/cc
