Litmus Chaos Tests not running on K8s v1.27 #4125

Open
amitpd opened this issue Aug 14, 2023 · 6 comments
amitpd commented Aug 14, 2023

What happened:
LitmusChaos tests not running properly on Kubernetes v1.27

What you expected to happen:
LitmusChaos tests should run properly on Kubernetes v1.27

Where can this issue be corrected? (optional)

The issue is probably in the source code of litmuschaos/go-runner:2.14.0

How to reproduce it (as minimally and precisely as possible):
Note: Followed the instructions as per https://litmuschaos.github.io/litmus/experiments/categories/pods/pod-cpu-hog/.

Deploy litmus operator v2.14.0

kubectl create -f https://litmuschaos.github.io/litmus/litmus-operator-v2.14.0.yaml

Deploy below ChaosExperiment:

apiVersion: litmuschaos.io/v1alpha1
description:
  message: |
    Injects cpu consumption on pods belonging to an app deployment
kind: ChaosExperiment
metadata:
  labels:
    app.kubernetes.io/component: chaosexperiment
    app.kubernetes.io/part-of: litmus
    app.kubernetes.io/version: 2.14.0
    name: pod-cpu-hog
  name: pod-cpu-hog
  namespace: default
spec:
  definition:
    args:
    - -c
    - ./experiments -name pod-cpu-hog
    command:
    - /bin/bash
    env:
    - name: TOTAL_CHAOS_DURATION
      value: "60"
    - name: CHAOS_INTERVAL
      value: "10"
    - name: CPU_CORES
      value: "1"
    - name: CPU_LOAD
      value: "100"
    - name: PODS_AFFECTED_PERC
      value: ""
    - name: RAMP_TIME
      value: ""
    - name: LIB
      value: litmus
    - name: LIB_IMAGE
      value: litmuschaos/go-runner:2.14.0
    - name: SOCKET_PATH
      value: /var/run/docker.sock
    - name: LIB_IMAGE_PULL_POLICY
      value: IfNotPresent
    - name: TARGET_PODS
      value: ""
    - name: NODE_LABEL
      value: ""
    - name: SEQUENCE
      value: parallel
    image: litmuschaos/go-runner:2.14.0
    imagePullPolicy: IfNotPresent
    labels:
      app.kubernetes.io/component: experiment-job
      app.kubernetes.io/part-of: litmus
      app.kubernetes.io/version: 2.14.0
      name: pod-cpu-hog
    permissions:
    - apiGroups:
      - ""
      resources:
      - pods
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - deletecollection
    - apiGroups:
      - ""
      resources:
      - events
      verbs:
      - create
      - get
      - list
      - patch
      - update
    - apiGroups:
      - ""
      resources:
      - configmaps
      verbs:
      - get
      - list
    - apiGroups:
      - ""
      resources:
      - pods/log
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - ""
      resources:
      - pods/exec
      verbs:
      - get
      - list
      - create
    - apiGroups:
      - apps
      resources:
      - deployments
      - statefulsets
      - replicasets
      - daemonsets
      verbs:
      - list
      - get
    - apiGroups:
      - apps.openshift.io
      resources:
      - deploymentconfigs
      verbs:
      - list
      - get
    - apiGroups:
      - ""
      resources:
      - replicationcontrollers
      verbs:
      - get
      - list
    - apiGroups:
      - argoproj.io
      resources:
      - rollouts
      verbs:
      - list
      - get
    - apiGroups:
      - batch
      resources:
      - jobs
      verbs:
      - create
      - list
      - get
      - delete
      - deletecollection
    - apiGroups:
      - litmuschaos.io
      resources:
      - chaosengines
      - chaosexperiments
      - chaosresults
      verbs:
      - create
      - list
      - get
      - patch
      - update
      - delete
    scope: Namespaced

Create below RBAC:

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/part-of: litmus
    name: pod-cpu-hog-sa
  name: pod-cpu-hog-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app.kubernetes.io/part-of: litmus
    name: pod-cpu-hog-sa
  name: pod-cpu-hog-sa
  namespace: default
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - deletecollection
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - get
  - list
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
- apiGroups:
  - ""
  resources:
  - pods/log
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - get
  - list
  - create
- apiGroups:
  - apps
  resources:
  - deployments
  - statefulsets
  - replicasets
  - daemonsets
  verbs:
  - list
  - get
- apiGroups:
  - apps.openshift.io
  resources:
  - deploymentconfigs
  verbs:
  - list
  - get
- apiGroups:
  - ""
  resources:
  - replicationcontrollers
  verbs:
  - get
  - list
- apiGroups:
  - argoproj.io
  resources:
  - rollouts
  verbs:
  - list
  - get
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - create
  - list
  - get
  - delete
  - deletecollection
- apiGroups:
  - litmuschaos.io
  resources:
  - chaosengines
  - chaosexperiments
  - chaosresults
  verbs:
  - create
  - list
  - get
  - patch
  - update
  - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    app.kubernetes.io/part-of: litmus
    name: pod-cpu-hog-sa
  name: pod-cpu-hog-sa
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-cpu-hog-sa
subjects:
- kind: ServiceAccount
  name: pod-cpu-hog-sa
  namespace: default

Deploy below ChaosEngine:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: chaosengine-pod-cpu-hog
  namespace: default
spec:
  annotationCheck: "true"
  appinfo:
    appkind: deployment
    applabel: app=nginx
    appns: default
  chaosServiceAccount: pod-cpu-hog-sa
  components:
    runner:
      image: litmuschaos/chaos-runner:2.14.0
      imagePullPolicy: IfNotPresent
  engineState: active
  experiments:
  - name: pod-cpu-hog
    spec:
      components:
        env:
        - name: CONTAINER_RUNTIME
          value: containerd
        - name: SOCKET_PATH
          value: /run/containerd/containerd.sock
        - name: TOTAL_CHAOS_DURATION
          value: "30"
        - name: CPU_CORES
          value: "1"
        - name: TARGET_CONTAINER
          value: nginx
  jobCleanUpPolicy: retain

Anything else we need to know?:
Log of the pod-cpu-hog-vczplk-d5fsw pod created during the experiment:

time="2023-08-14T09:45:53Z" level=info msg="Experiment Name: pod-cpu-hog"
time="2023-08-14T09:45:53Z" level=info msg="[PreReq]: Getting the ENV for the pod-cpu-hog experiment"
time="2023-08-14T09:45:55Z" level=info msg="[PreReq]: Updating the chaos result of pod-cpu-hog experiment (SOT)"
time="2023-08-14T09:45:57Z" level=info msg="The application information is as follows" Namespace=default Label="app=nginx" App Kind=deployment
time="2023-08-14T09:45:57Z" level=info msg="[Status]: Verify that the AUT (Application Under Test) is running (pre-chaos)"
time="2023-08-14T09:45:57Z" level=info msg="[Status]: The Container status are as follows" container=nginx Pod=nginx-deployment-54bcfc567b-pjddz Readiness=true
time="2023-08-14T09:45:57Z" level=info msg="[Status]: The status of Pods are as follows" Pod=nginx-deployment-54bcfc567b-pjddz Status=Running
time="2023-08-14T09:45:57Z" level=info msg="[Status]: The Container status are as follows" container=nginx Pod=nginx-deployment-54bcfc567b-sm4ql Readiness=true
time="2023-08-14T09:45:57Z" level=info msg="[Status]: The status of Pods are as follows" Pod=nginx-deployment-54bcfc567b-sm4ql Status=Running
time="2023-08-14T09:45:59Z" level=info msg="[Info]: The chaos tunables are:" Sequence=parallel PodsAffectedPerc=0 CPU Core=1 CPU Load Percentage=100
time="2023-08-14T09:45:59Z" level=info msg="[Chaos]:Number of pods targeted: 1"
time="2023-08-14T09:45:59Z" level=info msg="[Info]: Target pods list for chaos, [nginx-deployment-54bcfc567b-pjddz]"
time="2023-08-14T09:45:59Z" level=info msg="[Info]: Details of application under chaos injection" PodName=nginx-deployment-54bcfc567b-pjddz NodeName=amit-vm-2 ContainerName=nginx
time="2023-08-14T09:45:59Z" level=info msg="[Status]: Checking the status of the helper pods"
time="2023-08-14T09:46:04Z" level=info msg="[Wait]: waiting till the completion of the helper pod"
time="2023-08-14T09:49:37Z" level=error msg="[Error]: CPU hog failed, err: helper pod failed, err: Unable to find the pods with matching labels"

Events from the Job that creates pod-cpu-hog-vczplk-d5fsw pod:

Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  10s   job-controller  Created pod: pod-cpu-hog-vczplk-d5fsw
  Normal  SuccessfulDelete  2s    job-controller  Deleted pod: pod-cpu-hog-helper-xrpvbv

It seems like the helper pod is getting deleted immediately after it is created.

amitpd added the bug label Aug 14, 2023
@cooldev001

I am also facing the same issue on an Amazon EKS cluster (v1.27); it works correctly on v1.24.

@RobinSegura (Contributor)

Same here! With Litmus 3.0.0-beta8 (also reproduced on 3.0.0-beta7)
on EKS 1.27; it was working fine on 1.26.
Might this be related to the --container-runtime flag, deprecated since 1.24 and removed in 1.27?
(See the Kubernetes release notes.)

@RobinSegura (Contributor)

Able to reproduce on Minikube + containerd + litmus 3.0.0-beta8

  • case 1: control ("witness") group on Kubernetes 1.26.8 (screenshot from 2023-08-30): all chaos experiments requiring the container runtime work fine.

  • case 2: error group on Kubernetes 1.27 (screenshots from 2023-08-30): helper instantly killed.

We'll keep our clusters on Kubernetes 1.26.x for now,
but please, Harness/Litmus team, have a look at https://kubernetes.io/blog/2023/03/17/upcoming-changes-in-kubernetes-v1-27/#removal-of-container-runtime-command-line-argument

@ksatchit (Member)

This is fixed in 3.0.0-beta10 via litmuschaos/litmus-go#665

In 2.14.1 via litmuschaos/litmus-go#669


rumstead commented Dec 13, 2023

This is fixed in 3.0.0-beta10 via litmuschaos/litmus-go#665

In 2.14.1 via litmuschaos/litmus-go#669

Based on the PRs, how does deleting labels fix the issue? The release notes mention a kubelet flag, but I don't see how that would impact starting the helper pods via the k8s API.

EDIT: Or is it related to the standard labels that are added to Job-owned pods since 1.27?

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.27.md#api-change-4

Pods owned by a Job now uses the labels batch.kubernetes.io/job-name and batch.kubernetes.io/controller-uid. The legacy labels job-name and controller-uid are still added for compatibility. (#114930, @kannon92)


sebay commented Mar 2, 2024

@ksatchit can 2.14.1 be pushed to Docker Hub?
The only other solution is moving to 3.x, which is a big change (and I have yet to get it fully working).
