[BUG] multi-node weaviate cluster pod CrashLoopBackOff using cmpd only #727

haowen159 · 2024-06-25T07:02:59Z

Describe the bug
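A multi-node Weaviate cluster created using only a ComponentDefinition (cmpd) never becomes ready: the pod weaviate-cluster-weaviate-0 stays in CrashLoopBackOff, and it is the only pod listed even though `replicas` is 2.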

kbcli version
Kubernetes: v1.27.9
KubeBlocks: 0.9.0-beta.36
kbcli: 0.9.0-beta.27
WARNING: version difference between kbcli (0.9.0-beta.27) and kubeblocks (0.9.0-beta.36)

To Reproduce
1. Create the cluster with the following YAML:

apiVersion: apps.kubeblocks.io/v1alpha1
kind: Cluster
metadata:
  name: weaviate-cluster
  namespace: default
spec:
  # Specifies the behavior when a Cluster is deleted.
  # - `DoNotTerminate`: Prevents deletion of the Cluster. This policy ensures that all resources remain intact.
  # - `Halt`: Deletes Cluster resources like Pods and Services but retains Persistent Volume Claims (PVCs), allowing for data preservation while stopping other operations.
  # - `Delete`: Extends the `Halt` policy by also removing PVCs, leading to a thorough cleanup while removing all persistent data.
  # - `WipeOut`: An aggressive policy that deletes all Cluster resources, including volume snapshots and backups in external storage. This results in complete data removal and should be used cautiously, primarily in non-production environments to avoid irreversible data loss.
  terminationPolicy: Delete
  # Specifies a list of ClusterComponentSpec objects used to define the individual components that make up a Cluster. 
  componentSpecs:
    # Specifies the name of the Component. This name is also part of the Service DNS name and must comply with the IANA service naming rule. 
  - name: weaviate
    # References the name of a ComponentDefinition. The ComponentDefinition specifies the behavior and characteristics of the Component. If both `componentDefRef` and `componentDef` are provided, the `componentDef` will take precedence over `componentDefRef`.
    componentDef: weaviate
    # Specifies a group of affinity scheduling rules for the Component. It allows users to control how the Component's Pods are scheduled onto nodes in the cluster.
    affinity:
      podAntiAffinity: Preferred
      topologyKeys:
      - kubernetes.io/hostname
      tenancy: SharedNode
    # Allows the Component to be scheduled onto nodes with matching taints. 
    tolerations:
    - key: kb-data
      operator: Equal
      value: 'true'
      effect: NoSchedule
    # Determines whether the metrics exporter needs to be published to the service endpoint.  
    disableExporter: true
    # Specifies the name of the ServiceAccount required by the running Component.   
    serviceAccountName: kb-weaviate-cluster
    # Each component supports running multiple replicas to provide high availability and persistence. This field can be used to specify the desired number of replicas.
    replicas: 2
    # Specifies the resources required by the Component. It allows defining the CPU, memory requirements and limits for the Component's containers.
    resources:
      limits:
        cpu: '0.5'
        memory: 0.5Gi
      requests:
        cpu: '0.5'
        memory: 0.5Gi
    # Specifies a list of PersistentVolumeClaim templates that define the storage requirements for the Component. 
    volumeClaimTemplates:
    - name: data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 20Gi

2. See the error:

k get pod -l app.kubernetes.io/instance=weaviate-cluster
NAME                          READY   STATUS             RESTARTS        AGE
weaviate-cluster-weaviate-0   0/1     CrashLoopBackOff   19 (119s ago)   57m
k describe pod weaviate-cluster-weaviate-0
Name:             weaviate-cluster-weaviate-0
Namespace:        default
Priority:         0
Service Account:  kb-weaviate-cluster
Node:             aks-cicdamdpool-42454392-vmss000003/10.224.0.8
Start Time:       Tue, 25 Jun 2024 14:03:25 +0800
Labels:           app.kubernetes.io/component=weaviate
                  app.kubernetes.io/instance=weaviate-cluster
                  app.kubernetes.io/managed-by=kubeblocks
                  app.kubernetes.io/name=weaviate
                  app.kubernetes.io/version=weaviate
                  apps.kubeblocks.io/cluster-uid=25f24b02-88fd-4654-b3c3-eac8b3b328c3
                  apps.kubeblocks.io/component-name=weaviate
                  apps.kubeblocks.io/pod-name=weaviate-cluster-weaviate-0
                  componentdefinition.kubeblocks.io/name=weaviate
                  controller-revision-hash=69d75c66dd
                  workloads.kubeblocks.io/instance=weaviate-cluster-weaviate
                  workloads.kubeblocks.io/managed-by=InstanceSet
Annotations:      apps.kubeblocks.io/component-replicas: 2
Status:           Running
IP:               10.244.4.85
IPs:
  IP:           10.244.4.85
Controlled By:  InstanceSet/weaviate-cluster-weaviate
Containers:
  weaviate:
    Container ID:  containerd://5ae669a72e386d14e0f3efec57f5dba748ef58f7b05a14c7279cf1fcdfb243e1
    Image:         docker.io/semitechnologies/weaviate:1.19.6
    Image ID:      docker.io/semitechnologies/weaviate@sha256:6bd9b062b8fe9a3dd33f3c0706f83f7ff28a2b4de7e3bc43971385ca838d4034
    Ports:         8080/TCP, 2112/TCP, 7000/TCP, 7001/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/sh
      -c
      replicas=$(eval echo ${KB_POD_LIST} | tr ',' '\n')
      # Initialize count
      replca_count=0
      # Use a for loop to iterate over each space-separated word
      for item in $replicas; do
          replca_count=$((replca_count + 1))
      done
      
      while true; do
        count=$(nslookup ${CLUSTER_JOIN} | awk '/^Address: / { print $2 }' | wc -l)
        if [ "$count" -eq ${replca_count} ]; then
          break
        fi
        echo "Waiting for all nodes to be running..."
        sleep 1
      done
      export $(cat /weaviate-env/envs | xargs)
      /bin/weaviate --host 0.0.0.0 --port "8080" --scheme http --config-file /weaviate-config/conf.yaml --read-timeout=60s --write-timeout=60s
      
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Tue, 25 Jun 2024 14:57:32 +0800
      Finished:     Tue, 25 Jun 2024 14:58:31 +0800
    Ready:          False
    Restart Count:  19
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:      500m
      memory:   512Mi
    Liveness:   http-get http://:8080/v1/.well-known/live delay=900s timeout=3s period=10s #success=1 #failure=30
    Readiness:  http-get http://:8080/v1/.well-known/ready delay=3s timeout=3s period=10s #success=1 #failure=3
    Startup:    http-get http://:8080/v1/.well-known/ready delay=0s timeout=3s period=10s #success=1 #failure=3
    Environment Variables from:
      weaviate-cluster-weaviate-env      ConfigMap  Optional: false
      weaviate-cluster-weaviate-rsm-env  ConfigMap  Optional: false
    Environment:
      KB_POD_NAME:                           weaviate-cluster-weaviate-0 (v1:metadata.name)
      KB_POD_UID:                             (v1:metadata.uid)
      KB_NAMESPACE:                          default (v1:metadata.namespace)
      KB_SA_NAME:                             (v1:spec.serviceAccountName)
      KB_NODENAME:                            (v1:spec.nodeName)
      KB_HOST_IP:                             (v1:status.hostIP)
      KB_POD_IP:                              (v1:status.podIP)
      KB_POD_IPS:                             (v1:status.podIPs)
      KB_HOSTIP:                              (v1:status.hostIP)
      KB_PODIP:                               (v1:status.podIP)
      KB_PODIPS:                              (v1:status.podIPs)
      KB_POD_FQDN:                           $(KB_POD_NAME).weaviate-cluster-weaviate-headless.$(KB_NAMESPACE).svc
      CLUSTER_DATA_BIND_PORT:                7001
      CLUSTER_GOSSIP_BIND_PORT:              7000
      GOGC:                                  100
      PROMETHEUS_MONITORING_ENABLED:         true
      PROMETHEUS_MONITORING_PORT:            2112
      QUERY_MAXIMUM_RESULTS:                 100000
      REINDEX_VECTOR_DIMENSIONS_AT_STARTUP:  false
      TRACK_VECTOR_DIMENSIONS:               false
      PERSISTENCE_DATA_PATH:                 /var/lib/weaviate
      DEFAULT_VECTORIZER_MODULE:             none
      CLUSTER_HOSTNAME:                      $(KB_POD_NAME)
      CLUSTER_JOIN:                          $(KB_CLUSTER_COMP_NAME)-node-discovery.$(KB_NAMESPACE).svc.cluster.local
    Mounts:
      /var/lib/weaviate from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h6qcn (ro)
      /weaviate-config from weaviate-config (rw)
      /weaviate-env from weaviate-env (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  weaviate-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      weaviate-cluster-weaviate-weaviate-config-template
    Optional:  false
  weaviate-env:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      weaviate-cluster-weaviate-weaviate-env-template
    Optional:  false
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-weaviate-cluster-weaviate-0
    ReadOnly:   false
  kube-api-access-h6qcn:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Guaranteed
Node-Selectors:               <none>
Tolerations:                  kb-data=true:NoSchedule
                              node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                              node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  kubernetes.io/hostname:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/instance=weaviate-cluster,apps.kubeblocks.io/component-name=weaviate
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               57m                   default-scheduler        Successfully assigned default/weaviate-cluster-weaviate-0 to aks-cicdamdpool-42454392-vmss000003
  Normal   SuccessfulAttachVolume  57m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-34e28f6f-6739-49fd-81ad-8ad0e2f95db7"
  Normal   Killing                 54m (x3 over 56m)     kubelet                  Container weaviate failed startup probe, will be restarted
  Normal   Started                 54m (x4 over 57m)     kubelet                  Started container weaviate
  Normal   Created                 52m (x6 over 57m)     kubelet                  Created container weaviate
  Normal   Pulled                  31m (x12 over 57m)    kubelet                  Container image "docker.io/semitechnologies/weaviate:1.19.6" already present on machine
  Warning  Unhealthy               17m (x48 over 57m)    kubelet                  Startup probe failed: Get "http://10.244.4.85:8080/v1/.well-known/ready": dial tcp 10.244.4.85:8080: connect: connection refused
  Warning  BackOff                 2m4s (x175 over 51m)  kubelet                  Back-off restarting failed container weaviate in pod weaviate-cluster-weaviate-0_default(2518cd1c-42b6-469d-8f41-54be1e3047a2)
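
A rough sanity check of the timings above, offered as a sketch rather than a confirmed diagnosis. Exit code 137 is 128 + 9, i.e. the container was ended by SIGKILL, and the events show the kubelet restarting it after startup probe failures. The probe values below are copied from the `Startup:` line; the 30s grace period is an assumption (the Kubernetes default), since `terminationGracePeriodSeconds` does not appear in the output.

```shell
#!/bin/sh
# Values copied from the "Startup:" probe line in the describe output.
initial_delay=0
period=10
failure_threshold=3

# Assumption: terminationGracePeriodSeconds is the Kubernetes default (30s);
# the describe output does not show it.
grace_period=30

# Approximate time the container gets before the startup probe gives up.
probe_budget=$((initial_delay + period * failure_threshold))

# Approximate upper bound before SIGKILL if the process ignores SIGTERM.
worst_case=$((probe_budget + grace_period))

echo "startup probe budget: ${probe_budget}s"
echo "upper bound before SIGKILL: ${worst_case}s"

# Lifetime of the last run, from Started 14:57:32 and Finished 14:58:31.
observed=$(( (58 * 60 + 31) - (57 * 60 + 32) ))
echo "observed run: ${observed}s"
```

If the assumed grace period is right, the observed 59s run is consistent with the startup probe giving up after about 30s while the wait loop is still running, followed by a forced kill once the grace period ends.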

logs:

k logs -f weaviate-cluster-weaviate-0                   
Waiting for all nodes to be running...
Waiting for all nodes to be running...
Waiting for all nodes to be running...
Waiting for all nodes to be running...
Waiting for all nodes to be running...
Waiting for all nodes to be running...
Waiting for all nodes to be running...
Waiting for all nodes to be running...
Waiting for all nodes to be running...
Waiting for all nodes to be running...
Waiting for all nodes to be running...
Waiting for all nodes to be running...
Waiting for all nodes to be running...
Waiting for all nodes to be running...
Waiting for all nodes to be running...
Waiting for all nodes to be running...
Waiting for all nodes to be running...
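
The repeated message comes from the wait loop in the container command shown in the describe output. The following is a minimal standalone sketch of that comparison, not the chart's actual script: `count_replicas` and `count_addresses` are hypothetical helper names, and the pod list and lookup output are made-up sample values, not data captured from this cluster.

```shell
#!/bin/sh
# Count entries in a comma-separated pod list, the way the startup
# command derives its expected replica count from KB_POD_LIST.
count_replicas() {
  count=0
  for item in $(echo "$1" | tr ',' '\n'); do
    count=$((count + 1))
  done
  echo "$count"
}

# Count "Address: " lines in nslookup-style output, using the same
# awk filter as the startup command.
count_addresses() {
  printf '%s\n' "$1" | awk '/^Address: / { print $2 }' | wc -l | tr -d ' '
}

# Hypothetical sample values (assumptions for illustration only).
pod_list="weaviate-cluster-weaviate-0,weaviate-cluster-weaviate-1"
lookup_output="Name: discovery.example
Address: 10.244.4.85"

want=$(count_replicas "$pod_list")
have=$(count_addresses "$lookup_output")

if [ "$have" -eq "$want" ]; then
  echo "all nodes resolved"
else
  echo "still waiting: $have of $want addresses"
fi
```

Under these sample values the loop condition can never be met while only one address resolves, which would match the log above if the second pod is not created, or not published in DNS, before the first one passes its probes. Checking what `CLUSTER_JOIN` actually resolves to inside the pod would confirm or rule this out.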



haowen159 changed the title ~~[BUG] multi-node weaviate cluster pod CrashLoopBackOff~~ [BUG] multi-node weaviate cmpd cluster pod CrashLoopBackOff Jun 25, 2024

haowen159 changed the title ~~[BUG] multi-node weaviate cmpd cluster pod CrashLoopBackOff~~ [BUG] multi-node weaviate cluster pod CrashLoopBackOff using cmpd only Jun 25, 2024

JashBook assigned iziang Jun 25, 2024
