
Should kubeadm reset remove the etcd member on the master? #1211

Closed
pytimer opened this issue Nov 3, 2018 · 7 comments · Fixed by kubernetes/kubernetes#74112
Labels: area/HA, help wanted, lifecycle/active, priority/important-soon, triage/needs-information



pytimer commented Nov 3, 2018

Is this a BUG REPORT or FEATURE REQUEST?

FEATURE REQUEST

Versions

kubeadm version (use kubeadm version): kubeadm master branch

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Kernel (e.g. uname -a):
    Linux master1 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Others:

What happened?

Hi, when I run the kubeadm reset command on one of the masters, it does not remove the etcd member from the etcd cluster. I used a local etcd in the init.

What you expected to happen?

I looked at the master branch code, but I could not find anything about this. I hope that when reset runs on a master, kubeadm can remove the etcd member.
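For example, here is a manual sketch of what I mean, run from a healthy control-plane node while the cluster still has quorum (assuming etcd v3 and the certificate paths kubeadm writes by default under /etc/kubernetes/pki/etcd/):

export ETCDCTL_API=3
# list the members and note the hex ID of the node that is about to be reset
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member list
# remove that member first, then run `kubeadm reset -f` on the node itself
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member remove <member-id>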

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

@fabriziopandini (Member)

@pytimer could you kindly provide more info about your cluster (how it was created, the kubeadm-gce-master.yaml)?

@fabriziopandini added priority/awaiting-more-evidence and triage/needs-information labels Nov 5, 2018

pytimer commented Nov 27, 2018

@fabriziopandini sorry, I was busy with other things during this time.

I used 1.13.0-beta.2 on virtual machines to test this issue. Init and joining the control plane succeeded, but after I ran kubeadm reset -f on one control-plane node, the cluster no longer worked.

The etcd container on the first init node kept restarting, and the logs showed that etcd was still trying to connect to the etcd member on the reset node.

etcd logs:

2018-11-27 09:22:12.828159 W | rafthttp: health check for peer fa6cc2324326d403 could not connect: dial tcp 10.33.46.213:2380: getsockopt: connection refused
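To confirm the stale registration, listing the members from the surviving node should still show the reset node (same assumptions about certificate paths as in the sketch above; member list is served from local membership data, so it works even without quorum):

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member list
# member fa6cc2324326d403 (the reset node) is still listed; at this point
# `member remove` would fail as well, because the cluster has lost quorum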

Reproduce:

  1. kubeadm init --config kubeadm.yaml
  2. kubectl apply -f flannel.yaml
  3. run kubeadm join --experimental-control-plane --config kubeadm.yaml on the other node.
  4. kubectl get nodes
[root@master213 ~]# kubectl get nodes
NAME        STATUS   ROLES    AGE   VERSION
master212   Ready    master   14h   v1.13.0-beta.2
master213   Ready    master   13h   v1.13.0-beta.2
  5. run kubeadm reset -f on master213
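To watch the failure on the remaining node, check the etcd static-pod logs (the pod name follows kubeadm's etcd-<node-name> convention; the docker fallback assumes the dockershim CRI socket used in the configs below):

kubectl -n kube-system logs etcd-master212 --tail=20
# once the API server itself loses etcd, fall back to the container runtime:
docker logs --tail 20 $(docker ps -aq --filter name=k8s_etcd | head -1)
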
kubeadm init yaml:
apiVersion: kubeadm.k8s.io/v1beta1
kind: InitConfiguration
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  ttl: 24h0m0s
  usages:
  - signing
  - authentication
localAPIEndpoint:
  advertiseAddress: 0.0.0.0
  bindPort: 6443
nodeRegistration:
  criSocket: /var/run/dockershim.sock
  name: master212
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
---
apiServer:
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controlPlaneEndpoint: "10.33.46.215"
controllerManager: {}
dns:
  type: CoreDNS
etcd:
  local:
    serverCertSANs:
    - "10.33.46.215"
    extraArgs:
      cipher-suites: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    dataDir: /var/lib/etcd
imageRepository: k8s.gcr.io
kubernetesVersion: v1.13.0-beta.2
networking:
  dnsDomain: cluster.local
  podSubnet: "10.244.0.0/16"
  serviceSubnet: 10.96.0.0/12

kubeadm join yaml:
apiVersion: kubeadm.k8s.io/v1beta1
kind: JoinConfiguration
caCertPath: /etc/kubernetes/pki/ca.crt
discovery:
  bootstrapToken:
    apiServerEndpoint: 10.33.46.215:6443
    token: 1jvhzl.37osma939vn5q1uh
    unsafeSkipCAVerification: true
  timeout: 5m0s
  tlsBootstrapToken: 1jvhzl.37osma939vn5q1uh
controlPlane:
  localAPIEndpoint:
    advertiseAddress: 0.0.0.0
    bindPort: 6443
nodeRegistration:
  criSocket: /var/run/dockershim.sock
  name: master213

@fabriziopandini (Member)

@pytimer
If I got it right, you create two control plane instances with a local etcd, so you get an etcd cluster with two etcd members. Then you reset one of the instances and the remaining etcd gets stuck.

The reason is that the etcd cluster loses quorum.
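
Concretely, with the standard Raft arithmetic quorum = floor(n/2) + 1:

members (n)   quorum   failures tolerated
1             1        0
2             2        0
3             2        1

A two-member cluster tolerates zero failures, so once the reset node's member disappears the survivor can no longer commit any write, including the MemberRemove that would shrink the cluster back to a healthy size. That is why the member has to be removed from the cluster before the node is reset.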

@timothysc opinions about if/how to handle this use case?

@pytimer
Copy link
Author

pytimer commented Nov 27, 2018

Yes, that's right.

So I think when kubeadm reset runs on a node hosting the control plane, kubeadm should first delete the member from the etcd cluster and then do the other cleanup steps.
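Something like this ordering, as a hypothetical sketch of the proposed flow (not the actual implementation):

# hypothetical `kubeadm reset` ordering on a control-plane node with local etcd:
# 1. detect that this node hosts a local etcd member
# 2. while the cluster still has quorum, remove that member via the etcd MemberRemove API
# 3. stop the control-plane static pods
# 4. wipe /var/lib/etcd and the manifests under /etc/kubernetes
# step 2 must come first: in a two-member cluster, once this member is gone
# the survivor has no quorum and MemberRemove can no longer succeed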


neolit123 commented Nov 29, 2018

we should try to make the remaining etcd nodes not get stuck; otherwise this breaks our HA guarantees.

edit: also having related tests in the future would be great.


pytimer commented Dec 23, 2018

I added a remove-etcd-member step when resetting the control plane node, and it works for me.
It is in my fork repository commit Remove etcd member when reset the control plane node.

I am not sure whether this workflow should become part of kubeadm reset?

@yagonobre (Member)

/remove-priority awaiting-more-evidence
/priority important-soon
/lifecycle active

@k8s-ci-robot added lifecycle/active and priority/important-soon labels and removed priority/awaiting-more-evidence Feb 19, 2019