
Unexpected Transient Degraded Status Change During Application Rolling Deployments in Argo CD 2.13.1 #21198

Open
atilsensalduz opened this issue Dec 16, 2024 · 4 comments
Labels: bug (Something isn't working) · component:health-check · more-information-needed (Further information is requested) · regression (Bug is a regression, should be handled with high priority) · version:2.13 (Latest confirmed affected version is 2.13)


@atilsensalduz
Contributor

After upgrading Argo CD to version 2.13.1, I've started observing unusual status behavior in my applications during rolling deployments or restarts. Specifically:

- The application status briefly changes to Degraded and then immediately back to Healthy, even though I can't identify any actual health issue in the deployment's components.
- I've reviewed the application events and found no readiness or liveness probe failures, nor any other indicator of degraded health in the resources.
- This behavior triggers Degraded notifications, but since there's no option to notify on transitions from Degraded back to Healthy, I end up receiving alerts for transient states that quickly resolve themselves.

```
Normal  ResourceUpdated  69s    argocd-application-controller  Updated health status: Healthy -> Progressing
Normal  ResourceUpdated  42s    argocd-application-controller  Updated health status: Progressing -> Healthy
Normal  ResourceUpdated  37s    argocd-application-controller  Updated health status: Healthy -> Degraded
Normal  ResourceUpdated  37s    argocd-application-controller  Updated health status: Degraded -> Healthy
```
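
For reference, a Degraded alert like ours typically comes from a health trigger along these lines (a minimal sketch of an argocd-notifications-cm entry; the app-degraded template name is hypothetical). The condition matches any reconciliation in which the app reports Degraded, so even a seconds-long blip fires an alert:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  trigger.on-health-degraded: |
    # Fires whenever the app reports Degraded, even if the status
    # flips back to Healthy moments later.
    - when: app.status.health.status == 'Degraded'
      send: [app-degraded]
```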

This issue was not observed before the upgrade and only seems to happen during rolling deployments or restarts.
Has anyone encountered similar behavior? Could this be a regression or a configuration issue with the health checks in 2.13.1? Any advice on fixing or mitigating it would be greatly appreciated!
Thanks in advance for your help!

@atilsensalduz added the bug label Dec 16, 2024
@neiljain

neiljain commented Dec 18, 2024

We've observed the same issue with 2.13.2 as well:

{"application":"app***","dest-namespace":"default","dest-server":"https://kubernetes.default.svc","level":"info","msg":"Updated health status: Progressing -\u003e Healthy","reason":"ResourceUpdated","time":"2024-12-18T00:49:20Z","type":"Normal"}
{"application":"app***","dest-namespace":"default","dest-server":"https://kubernetes.default.svc","level":"info","msg":"Updated health status: Healthy -\u003e Degraded","reason":"ResourceUpdated","time":"2024-12-18T00:50:20Z","type":"Normal"}
{"application":"app***","dest-namespace":"default","dest-server":"https://kubernetes.default.svc","level":"info","msg":"Updated health status: Degraded -\u003e Healthy","reason":"ResourceUpdated","time":"2024-12-18T00:50:21Z","type":"Normal"}

@todaywasawesome added the regression label Dec 19, 2024
@crenshaw-dev added the version:2.13 label Dec 19, 2024
@todaywasawesome
Contributor

Thank you @andrii-korotkov-verkada for volunteering to investigate.

@atilsensalduz @neiljain Can you share what your apps look like and what health checks your resources are using?

@atilsensalduz
Contributor Author

Hey guys, our applications follow a structure similar to the one below. We've encountered the same issue across different applications; it's not specific to any one of them.

Argo CD Application:

```
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  labels:
    app.kubernetes.io/part-of: api
    argocd.argoproj.io/instance: aws-api
  name: dev
  namespace: argocd
spec:
  destination:
    name: aws-api
    namespace: dev
  project: api
  source:
    helm:
      releaseName: dev
      valueFiles:
      - values.yaml
    path: charts/dev
    repoURL: git@github.com
    targetRevision: main
  syncPolicy:
    automated:
      allowEmpty: true
      prune: true
      selfHeal: true
status:
  controllerNamespace: argocd
  health:
    status: Healthy
  history:
  operationState:
    message: successfully synced (no more tasks)
    operation:
      initiatedBy:
        automated: true
      retry:
        limit: 5
      sync:
        prune: true
    phase: Succeeded
    syncResult:
      resources:
      - group: policy
        hookPhase: Succeeded
        kind: PodDisruptionBudget
        message: PodDisruptionBudget has SufficientPods
        name: dev
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: ""
        hookPhase: Succeeded
        kind: ServiceAccount
        message: serviceaccount/dev-deployment-restart-sa unchanged
        name: dev-deployment-restart-sa
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: ""
        hookPhase: Succeeded
        kind: ConfigMap
        message: configmap/dora-metrics-script-dev unchanged
        name: dora-metrics-script-dev
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: ""
        hookPhase: Succeeded
        kind: ConfigMap
        message: configmap/dev configured
        name: dev
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: ""
        hookPhase: Succeeded
        kind: ConfigMap
        message: configmap/grafana-annotation-script-dev unchanged
        name: grafana-annotation-script-dev
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: rbac.authorization.k8s.io
        hookPhase: Succeeded
        kind: ClusterRoleBinding
        message: clusterrolebinding.rbac.authorization.k8s.io/dev-discovery
          reconciled. clusterrolebinding.rbac.authorization.k8s.io/dev-discovery
          unchanged
        name: dev-discovery
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: rbac.authorization.k8s.io
        hookPhase: Succeeded
        kind: Role
        message: role.rbac.authorization.k8s.io/dev-deployment-restart-role
          reconciled. role.rbac.authorization.k8s.io/dev-deployment-restart-role
          unchanged
        name: dev-deployment-restart-role
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: rbac.authorization.k8s.io
        hookPhase: Succeeded
        kind: RoleBinding
        message: rolebinding.rbac.authorization.k8s.io/dev-deployment-restart-rolebinding
          reconciled. rolebinding.rbac.authorization.k8s.io/dev-deployment-restart-rolebinding
          unchanged
        name: dev-deployment-restart-rolebinding
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: ""
        hookPhase: Succeeded
        kind: Service
        message: service/dev unchanged
        name: dev
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: apps
        hookPhase: Succeeded
        kind: Deployment
        message: deployment.apps/dev configured
        name: dev
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: autoscaling
        hookPhase: Succeeded
        kind: HorizontalPodAutoscaler
        message: recommended size matches current size
        name: dev
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v2
      - group: batch
        hookPhase: Succeeded
        kind: CronJob
        message: cronjob.batch/rolling-restart-job unchanged
        name: rolling-restart-job
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1
      - group: external-secrets.io
        hookPhase: Succeeded
        kind: ExternalSecret
        message: Secret was synced
        name: dev-external-secret
        namespace: dev
        status: Synced
        syncPhase: Sync
        version: v1beta1
      - group: batch
        hookPhase: Succeeded
        hookType: PostSync
        kind: Job
        message: job.batch/dora-dev-c589e19-postsync-1733733244 created
        name: dora-dev-c589e19-postsync-1733733244
        namespace: dev
        syncPhase: PostSync
        version: v1
      - group: batch
        hookPhase: Succeeded
        hookType: PostSync
        kind: Job
        message: job.batch/dev-c589e19-postsync-1733733244 created
        name: dev-c589e19-postsync-1733733244
        namespace: dev
        syncPhase: PostSync
        version: v1
      source:
        helm:
          valueFiles:
          - values.yaml
        path: charts/dev
        repoURL: git@github.com:argocd.git
        targetRevision: main
  resources:
  - kind: ConfigMap
    name: dev
    namespace: dev
    status: Synced
    version: v1
  - kind: ConfigMap
    name: dora-metrics-script-dev
    namespace: dev
    status: Synced
    version: v1
  - kind: ConfigMap
    name: grafana-annotation-script-dev
    namespace: dev
    status: Synced
    version: v1
  - health:
      status: Healthy
    kind: Service
    name: dev
    namespace: dev
    status: Synced
    version: v1
  - kind: ServiceAccount
    name: dev-deployment-restart-sa
    namespace: dev
    status: Synced
    version: v1
  - group: apps
    health:
      status: Healthy
    kind: Deployment
    name: dev
    namespace: dev
    status: Synced
    version: v1
  - group: autoscaling
    health:
      message: recommended size matches current size
      status: Healthy
    kind: HorizontalPodAutoscaler
    name: dev
    namespace: dev
    status: Synced
    version: v2
  - group: batch
    kind: CronJob
    name: rolling-restart-job
    namespace: dev
    status: Synced
    version: v1
  - group: external-secrets.io
    health:
      message: Secret was synced
      status: Healthy
    kind: ExternalSecret
    name: dev-external-secret
    namespace: dev
    status: Synced
    version: v1beta1
  - group: policy
    health:
      message: PodDisruptionBudget has SufficientPods
      status: Healthy
    kind: PodDisruptionBudget
    name: dev
    namespace: dev
    status: Synced
    version: v1
  - group: rbac.authorization.k8s.io
    kind: ClusterRoleBinding
    name: dev-discovery
    status: Synced
    version: v1
  - group: rbac.authorization.k8s.io
    kind: Role
    name: dev-deployment-restart-role
    namespace: dev
    status: Synced
    version: v1
  - group: rbac.authorization.k8s.io
    kind: RoleBinding
    name: dev-deployment-restart-rolebinding
    namespace: dev
    status: Synced
    version: v1
  sourceType: Helm
  summary:
    images:
    - bitnami/kubectl:latest
    - dev:5.13.0
  sync:
    comparedTo:
      destination:
        namespace: dev
      source:
        helm:
          parameters:
          valueFiles:
          - values.yaml
        path: charts/dev
        repoURL: git@github.com:argocd.git
        targetRevision: main
    status: Synced
```


Probes:

```
startupProbe:
  httpGet: 
    path: '/{{ include ".getMainSubPath" $ }}/health/liveness'
    port: http
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 2
  successThreshold: 1
  failureThreshold: 12
livenessProbe:
  httpGet:
    path: '/{{ include ".getMainSubPath" $ }}/health/liveness'
    port: http
  initialDelaySeconds: 30
  timeoutSeconds: 2
  successThreshold: 1
  periodSeconds: 30
  failureThreshold: 10
readinessProbe:
  httpGet:
    path: '/{{ include ".getMainSubPath" $ }}/health/readiness'
    port: http
  initialDelaySeconds: 30
  timeoutSeconds: 2
  successThreshold: 1
  periodSeconds: 30
  failureThreshold: 10
```
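
A possible stopgap, if the flapping comes from the built-in apps/Deployment health assessment, would be to override it with a custom Lua health check in argocd-cm that reports Progressing instead of Degraded during rollouts. A minimal, untested sketch:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.apps_Deployment: |
    -- Treat an in-flight rollout as Progressing rather than Degraded;
    -- report Healthy only once all replicas are updated and available.
    hs = {}
    hs.status = "Progressing"
    hs.message = "Waiting for rollout to complete"
    if obj.status ~= nil and obj.spec.replicas ~= nil
        and obj.status.updatedReplicas == obj.spec.replicas
        and obj.status.availableReplicas == obj.spec.replicas then
      hs.status = "Healthy"
      hs.message = "Rollout complete"
    end
    return hs
```

Note this trades the transient Degraded alerts for never seeing a genuine Degraded state on Deployments, so it's only a mitigation while the regression is tracked down.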

@andrii-korotkov-verkada
Contributor

Can you enable debug logs for the application controller and share all the logs relevant to the application, please?
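
(For reference, a minimal sketch of one way to do that on a default installation: raise the controller's log level via argocd-cmd-params-cm and restart the controller.)

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Application controller log level; takes effect after a restart, e.g.
  #   kubectl -n argocd rollout restart statefulset argocd-application-controller
  controller.log.level: "debug"
```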

@andrii-korotkov-verkada added the more-information-needed label Dec 20, 2024