[BUG] Promote fail, cluster stays in Provisioning phase #2191

Closed
bk201 opened this issue Apr 25, 2022 · 8 comments

Labels
  • area/backend: Harvester control plane
  • area/rancher: Rancher related including internal and external
  • kind/bug: Issues that are defects reported by users or that we know have reached a real release
  • not-require/test-plan: Skip to create a e2e automation test issue
  • priority/0: Must be fixed in this release
  • require/release-note

bk201 commented Apr 25, 2022

Describe the bug

This was spotted when debugging #2187.
After deleting a server node, a worker node can't become a control plane node.

To Reproduce
Steps to reproduce the behavior:

  1. Create a 4-node Harvester cluster.
  2. Wait for 3 nodes to become control plane nodes (role is control-plane,etcd,master).
  3. Find which node the rancher-webhook pod is on; assume nodeX (a lookup sketch follows the log below).
  4. Delete nodeX.
  5. Harvester should promote the remaining worker node, but the promotion job keeps waiting:
machine.cluster.x-k8s.io/custom-6bce219ef5d1 labeled
secret/custom-6bce219ef5d1-machine-plan labeled
rkebootstrap.rke.cattle.io/custom-6bce219ef5d1 labeled
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
Waiting for promotion...
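
For step 3, a minimal lookup sketch (assuming rancher-webhook runs in the cattle-system namespace, as in a stock Rancher install); the NODE column of the wide output identifies nodeX:

$ kubectl get pods -n cattle-system -o wide | grep rancher-webhook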

The CAPI cluster keeps staying in Provisioning phase:

$ kubectl get cluster -n fleet-local -o yaml
apiVersion: v1
items:
- apiVersion: provisioning.cattle.io/v1
  kind: Cluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","metadata":{"annotations":{},"labels":{"rke.cattle.io/init-node-machine-id":"xkhlp79g4cg8rgdgfsxsbm26ftvhglvzst28r9cr87spst2hcldxdq"},"name":"local","namespace":"fleet-local"},"spec":{"kubernetesVersion":"v1.21.11+rke2r1","rkeConfig":{"controlPlaneConfig":null}}}
      objectset.rio.cattle.io/applied: H4sIAAAAAAAA/4yQzU7DMBCEXwXt2Slt079Y4oAQ4sCVF9jYS2Ow15G9CYfK746SVqJC4udo78xovjlBIEGLgqBPgMxRUFzkPD1j+0ZGMskiubgwKOJp4eKts6ChT3F02UV2fKyMH7JQqkwiFAL1ozV+MKXqOL6DhoCMRwrEciUYa3Xz7NjePZwj/8xiDAQafDTo/yXOPZrJAUXB3NdFfnGBsmDoQfPgvQKPLflfR+gwd6Bhu9ztt3XdUGNwc7Crdr9u6jW1y/pg91vb2LXdbHarA6jzYpbSVwho6DCNNIMWBd9Yrtu+eiKpzpeiIPdkpnbzx2Wq+0G6R7Z9dCygT2WSCcpwwciURrJPxJRmZtDLUj4DAAD//5CVWGcAAgAA
      objectset.rio.cattle.io/id: provisioning-cluster-create
      objectset.rio.cattle.io/owner-gvk: management.cattle.io/v3, Kind=Cluster
      objectset.rio.cattle.io/owner-name: local
      objectset.rio.cattle.io/owner-namespace: ""
    creationTimestamp: "2022-04-11T08:15:00Z"
    finalizers:
    - wrangler.cattle.io/provisioning-cluster-remove
    - wrangler.cattle.io/rke-cluster-remove
    generation: 2
    labels:
      objectset.rio.cattle.io/hash: 50675339e9ca48d1b72932eb038d75d9d2d44618
      provider.cattle.io: harvester
      rke.cattle.io/init-node-machine-id: xkhlp79g4cg8rgdgfsxsbm26ftvhglvzst28r9cr87spst2hcldxdq
    name: local
    namespace: fleet-local
    resourceVersion: "20689"
    uid: 45e02df5-5f70-4845-ae77-0954a4b68fa8
  spec:
    kubernetesVersion: v1.21.11+rke2r1
    localClusterAuthEndpoint: {}
    rkeConfig: {}
  status:
    clientSecretName: local-kubeconfig
    clusterName: local
    conditions:
    - status: "True"
      type: Ready
    - status: Unknown
      type: DefaultProjectCreated
    - status: Unknown
      type: SystemProjectCreated
    - lastUpdateTime: "2022-04-11T08:15:00Z"
      status: "False"
      type: Reconciling
    - lastUpdateTime: "2022-04-11T08:15:00Z"
      status: "False"
      type: Stalled
    - lastUpdateTime: "2022-04-11T08:15:49Z"
      status: "True"
      type: Created
    - lastUpdateTime: "2022-04-25T08:04:12Z"
      status: "True"
      type: RKECluster
    - lastUpdateTime: "2022-04-25T08:04:12Z"
      message: 'Operation cannot be fulfilled on secrets "custom-6aa860f10259-machine-plan":
        the object has been modified; please apply your changes to the latest version
        and try again'
      reason: Error
      status: "False"
      type: Provisioned
    observedGeneration: 2
    ready: true
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

node2:~ # kubectl get machines -A
NAMESPACE     NAME                  CLUSTER   NODENAME   PROVIDERID     PHASE     AGE   VERSION
fleet-local   custom-6aa860f10259   local     node2      rke2://node2   Running   14d
fleet-local   custom-6bce219ef5d1   local     node4      rke2://node4   Running   14d
fleet-local   custom-78e6431db553   local     node3      rke2://node3   Running   14d
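
A quicker check than dumping the full YAML above is reading the phase straight off the CAPI cluster object (a minimal sketch; while the bug is present this should report Provisioning):

$ kubectl get clusters.cluster.x-k8s.io local -n fleet-local -o jsonpath='{.status.phase}'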

Expected behavior

The worker node should be promoted.

Support bundle

Environment:

  • Harvester ISO version: v1.0.1, Rancher v2.6.4
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): KVM VMs


@bk201 bk201 added kind/bug Issues that are defects reported by users or that we know have reached a real release area/backend Harvester control plane labels Apr 25, 2022

bk201 commented Apr 25, 2022

Create a Rancher issue: rancher/rancher#37462


bk201 commented May 3, 2022

Workaround

  • Say we have 4 nodes and delete the first control plane node

    $ kubectl get nodes
    NAME    STATUS   ROLES                       AGE   VERSION
    node1   Ready    control-plane,etcd,master   21d   v1.21.11+rke2r1
    node2   Ready    control-plane,etcd,master   21d   v1.21.11+rke2r1
    node3   Ready    control-plane,etcd,master   21d   v1.21.11+rke2r1
    node4   Ready    <none>                      21d   v1.21.11+rke2r1
    
    $ kubectl delete node node1
    
  • If the promotion fails, the promoting node should be stuck with SchedulingDisabled:

    $ kubectl get nodes
    NAME    STATUS                     ROLES                       AGE   VERSION
    node2   Ready                      control-plane,etcd,master   21d   v1.21.11+rke2r1
    node3   Ready                      control-plane,etcd,master   21d   v1.21.11+rke2r1
    node4   Ready,SchedulingDisabled   <none>
    

    And the cluster should have conditions like:

    $ kubectl describe clusters.cluster.x-k8s.io local -n fleet-local
    
    ...
    Status:
      Conditions:
        Last Transition Time:  2022-05-03T06:16:18Z
        Message:               Operation cannot be fulfilled on secrets "custom-6aa860f10259-machine-plan": the object has been modified; please apply your changes to the latest version and try again
        Reason:                Error
        Status:                False
        Type:                  Ready
        Last Transition Time:  2022-04-11T08:15:49Z
        Status:                True
        Type:                  ControlPlaneInitialized
        Last Transition Time:  2022-05-03T06:16:18Z
        Message:               Operation cannot be fulfilled on secrets "custom-6aa860f10259-machine-plan": the object has been modified; please apply your changes to the latest version and try again
        Reason:                Error
        Status:                False
        Type:                  ControlPlaneReady
        Last Transition Time:  2022-05-03T06:16:18Z
        Message:               Operation cannot be fulfilled on secrets "custom-6aa860f10259-machine-plan": the object has been modified; please apply your changes to the latest version and try again
        Reason:                Error
        Status:                False
        Type:                  InfrastructureReady
      Control Plane Ready:     true
      Infrastructure Ready:    false
      Observed Generation:     19
      Phase:                   Provisio
    
  • Find the machine ID of a remaining control plane node

    $ kubectl get machines.cluster.x-k8s.io -A
    NAMESPACE     NAME                  CLUSTER   NODENAME   PROVIDERID     PHASE     AGE   VERSION
    fleet-local   custom-6aa860f10259   local     node2      rke2://node2   Running   21d
    fleet-local   custom-6bce219ef5d1   local     node4      rke2://node4   Running   21d
    fleet-local   custom-78e6431db553   local     node3      rke2://node3   Running   21d
    
    # let's use `node2`
    $ kubectl get secret custom-6aa860f10259-machine-plan -n fleet-local -o yaml | yq e '.metadata.labels."rke.cattle.io/machine-id"'
    744fd38865034267bb7c582ea8d11aed6b124bea2159c1d440d65b1c895e888
    
  • Set the machine ID on the CAPI cluster object:

    kubectl label clusters.provisioning.cattle.io local -n fleet-local rke.cattle.io/init-node-machine-id=744fd38865034267bb7c582ea8d11aed6b124bea2159c1d440d65b1c895e888 --overwrite
    
  • The cluster should eventually reconcile and the promoting node should become a control plane node (a verification sketch follows the output below).

    $ kubectl get nodes
    NAME    STATUS   ROLES                       AGE   VERSION
    node2   Ready    control-plane,etcd,master   21d   v1.21.11+rke2r1
    node3   Ready    control-plane,etcd,master   21d   v1.21.11+rke2r1
    node4   Ready    control-plane,etcd,master   21d   v1.21.11+rke2r1
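
  • To double-check the workaround took effect, a minimal verification sketch (not part of the original workaround; the backslashes escape the dots in the label key for kubectl's jsonpath):

    # Should print the machine ID that was just applied.
    $ kubectl get clusters.provisioning.cattle.io local -n fleet-local \
        -o jsonpath='{.metadata.labels.rke\.cattle\.io/init-node-machine-id}'

    # Watch the CAPI cluster leave the Provisioning phase while the node is promoted.
    $ kubectl get clusters.cluster.x-k8s.io local -n fleet-local -w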
    


bk201 commented Aug 24, 2022

Hit the issue when bumping to v2.6.7: rancher/rancher#38706


bk201 commented Sep 30, 2022

Note we also need to handle systems upgraded from previous versions.
The Rancherd change doesn't take effect in those systems.

@bk201 bk201 added the not-require/test-plan Skip to create a e2e automation test issue label Oct 20, 2022

bk201 commented Oct 20, 2022

The upgrade path is handled in #2918


harvesterhci-io-github-bot commented Oct 20, 2022

Pre Ready-For-Testing Checklist

  • [ ] If labeled: require/HEP Has the Harvester Enhancement Proposal PR been submitted?
    The HEP PR is at:

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [BUG] Promote fail, cluster stays in Provisioning phase #2191 (comment)

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at: [BUG] Promote fail, cluster stays in Provisioning phase #2191 (comment)

  • Has the backend code been merged (harvester, harvester-installer, etc.) (including backport-needed/*)?
    The PR is at:

    • Does the PR include the explanation for the fix or the feature?

    • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
      The PR for the YAML change is at:
      The PR for the chart change is at:

  • If labeled: area/ui Has the UI issue filed or ready to be merged?
    The UI issue/PR is at:

  • If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged?
    The documentation/KB PR is at:

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:


bk201 commented Oct 20, 2022

Test plan

  • Create a 4-node Harvester cluster.
  • Wait for three nodes to become control plane nodes (role is control-plane,etcd,master).
  • Delete one of the control plane nodes.
  • The remaining worker node should be promoted to a control plane node (role is control-plane,etcd,master).
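
A minimal polling sketch for the last step (not from the issue; it counts nodes reporting the control-plane role, echoing the promote job's own "Waiting for promotion..." message):

# Wait until three nodes carry the control-plane role again, then show the final state.
while [ "$(kubectl get nodes --no-headers | grep -c 'control-plane,etcd,master')" -lt 3 ]; do
  echo "Waiting for promotion..."
  sleep 10
done
kubectl get nodes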

@TachunLin TachunLin self-assigned this Oct 20, 2022
@TachunLin

Verified fixed on master-f96827b2-head (10/21). Closing this issue.

Result

After deleting one of the control-plane node machines in a 4-node Harvester cluster, the remaining worker node is correctly promoted to the control-plane role.

  • 4-node Harvester cluster status, before deleting one of the control-plane nodes

    n1-221021:/etc # kubectl get nodes
    NAME        STATUS   ROLES                       AGE     VERSION
    n1-221021   Ready    control-plane,etcd,master   17h     v1.24.7+rke2r1
    n2-221021   Ready    control-plane,etcd,master   16h     v1.24.7+rke2r1
    n3-221021   Ready    control-plane,etcd,master   15h     v1.24.7+rke2r1
    n4-221021   Ready    <none>                      4m10s   v1.24.7+rke2r1
    
  • After deleting the third control-plane node, the 4th node is promoted to the control-plane role

    n1-221021:/etc # kubectl get nodes
    NAME        STATUS   ROLES                       AGE   VERSION
    n1-221021   Ready    control-plane,etcd,master   17h   v1.24.7+rke2r1
    n2-221021   Ready    control-plane,etcd,master   16h   v1.24.7+rke2r1
    n4-221021   Ready    control-plane,etcd,master   11m   v1.24.7+rke2r1
    
    


    n1-221021:/etc # kubectl get machines -A
    NAMESPACE     NAME                  CLUSTER   NODENAME    PROVIDERID         PHASE     AGE   VERSION
    fleet-local   custom-00c844d92e49   local     n4-221021   rke2://n4-221021   Running   12m   
    fleet-local   custom-580f66b2735d   local     n2-221021   rke2://n2-221021   Running   16h   
    fleet-local   custom-6c0eaa9d5d67   local     n1-221021   rke2://n1-221021   Running   17h
    

Test Information

  • Test Environment: 4-node Harvester cluster on bare-metal machines
  • Harvester version: master-f96827b2-head (10/21)

Verify Steps

  1. Create a 4-node Harvester cluster.
  2. Wait for three nodes to become control plane nodes (role is control-plane,etcd,master).
  3. Delete one of the control plane nodes.
  4. The remaining worker node should be promoted to a control plane node (role is control-plane,etcd,master).
