Fast-tracked rollback to stable with dynamicStableScale went under maxUnavailable #3020

Closed
jessesuen opened this issue Sep 6, 2023 · 2 comments · Fixed by #3077
Labels
bug Something isn't working

jessesuen commented Sep 6, 2023

Describe the bug

Argo Rollouts v1.4.1

The following sequence of events occurred on a user's rollout, with:

  • 21 replicas
  • traffic routing
  • dynamicStableScale

  1. The rollout was in the middle of an update. Traffic was split between canary and stable.
  2. Before the update completed, the stable pod spec was reapplied. We treat this event as a fast-tracked rollback.
  3. As the rollout was updating to the new desired/stable RS, the total available pods dropped below the minimum allowed by maxUnavailable, and 100% of traffic was directed to an undersized ReplicaSet.
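
To make the availability floor concrete, here is a minimal sketch of how a minimum available pod count follows from maxUnavailable. The function name `min_available` and the 25% value are assumptions for illustration only: the issue does not state the user's actual maxUnavailable, and the controller's real rounding logic may differ.

```python
import math


def min_available(replicas, max_unavailable):
    """Return the fewest pods that must stay available.

    max_unavailable is either an absolute count or a percentage string
    such as "25%". Percentages are rounded down here, which is an
    assumption modeled on Kubernetes Deployment behavior.
    """
    if isinstance(max_unavailable, str):
        percent = int(max_unavailable.rstrip("%"))
        unavailable = math.floor(replicas * percent / 100)
    else:
        unavailable = max_unavailable
    return max(replicas - unavailable, 0)


# Hypothetical value: the issue does not say what maxUnavailable was set to.
floor = min_available(21, "25%")
print(floor)       # 16
print(2 >= floor)  # False: 2 available pods is far below the floor
```

Under these assumed numbers, a rollout of 21 replicas should never have fewer than 16 available pods, so a drop to 2 would be a clear violation.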

See the following abbreviated logs (with emphasis on the problematic behavior). The logs are filtered to show only Patch events and Kubernetes Events:

time="2023-08-29T02:22:18Z" level=info msg="Event(): reason: 'TrafficWeightUpdated' Traffic weight updated from 85 to 90"
time="2023-08-29T02:22:18Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":23,\"canary\":{\"weights\":{\"canary\":{\"weight\":90},\"stable\":{\"weight\":10}}},\"conditions\":[{\"lastTransitionTime\":\"2023-08-23T06:02:09Z\",\"lastUpdateTime\":\"2023-08-23T06:02:09Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T01:46:54Z\",\"lastUpdateTime\":\"2023-08-29T01:46:54Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2023-08-29T02:22:18Z\",\"lastUpdateTime\":\"2023-08-29T02:22:18Z\",\"message\":\"ReplicaSet \\\"guestbook-6c99db4ccf\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"phase\":\"Progressing\",\"readyReplicas\":23}}" generation=10 namespace=guestbook resourceVersion=561626179 rollout=guestbook
time="2023-08-29T02:22:19Z" level=info msg="Event(): reason: 'ScalingReplicaSet' Scaled down ReplicaSet guestbook-854db48c66 (revision 10) from 4 to 3"
time="2023-08-29T02:22:19Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":24,\"availableReplicas\":22,\"conditions\":[{\"lastTransitionTime\":\"2023-08-23T06:02:09Z\",\"lastUpdateTime\":\"2023-08-23T06:02:09Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T01:46:54Z\",\"lastUpdateTime\":\"2023-08-29T01:46:54Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2023-08-29T02:22:18Z\",\"lastUpdateTime\":\"2023-08-29T02:22:19Z\",\"message\":\"ReplicaSet \\\"guestbook-6c99db4ccf\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"readyReplicas\":22,\"replicas\":24}}" generation=10 namespace=guestbook resourceVersion=561631825 rollout=guestbook
time="2023-08-29T02:32:18Z" level=info msg="Event(): reason: 'TrafficWeightUpdated' Traffic weight updated from 90 to 95"
time="2023-08-29T02:32:18Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":23,\"canary\":{\"weights\":{\"canary\":{\"weight\":95},\"stable\":{\"weight\":5}}},\"conditions\":[{\"lastTransitionTime\":\"2023-08-23T06:02:09Z\",\"lastUpdateTime\":\"2023-08-23T06:02:09Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T01:46:54Z\",\"lastUpdateTime\":\"2023-08-29T01:46:54Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2023-08-29T02:22:18Z\",\"lastUpdateTime\":\"2023-08-29T02:32:18Z\",\"message\":\"ReplicaSet \\\"guestbook-6c99db4ccf\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"readyReplicas\":23}}" generation=10 namespace=guestbook resourceVersion=561631846 rollout=guestbook
time="2023-08-29T02:32:18Z" level=info msg="Event(): reason: 'ScalingReplicaSet' Scaled down ReplicaSet guestbook-854db48c66 (revision 10) from 3 to 2"
time="2023-08-29T02:32:18Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":23,\"availableReplicas\":22,\"readyReplicas\":22,\"replicas\":23}}" generation=10 namespace=guestbook resourceVersion=561667475 rollout=guestbook

############################################################
# Here is when fast-tracked rollback happened
############################################################
time="2023-08-29T02:34:06Z" level=info msg="Event(): reason: 'RolloutUpdated' Rollout updated to revision 12"
time="2023-08-29T02:34:06Z" level=info msg="Event(): reason: 'SkipSteps' Rollback to stable"
time="2023-08-29T02:34:06Z" level=info msg="Patched: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2023-08-23T06:02:09Z\",\"lastUpdateTime\":\"2023-08-23T06:02:09Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-29T01:46:54Z\",\"lastUpdateTime\":\"2023-08-29T01:46:54Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2023-08-29T02:22:18Z\",\"lastUpdateTime\":\"2023-08-29T02:34:06Z\",\"message\":\"ReplicaSet \\\"guestbook-854db48c66\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"},{\"lastTransitionTime\":\"2023-08-29T02:34:06Z\",\"lastUpdateTime\":\"2023-08-29T02:34:06Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"}],\"currentPodHash\":\"854db48c66\",\"message\":\"more replicas need to be updated\",\"promoteFull\":null,\"updatedReplicas\":2,\"workloadObservedGeneration\":\"12\"}}" generation=10 namespace=guestbook resourceVersion=561673533 rollout=guestbook

############################################################
# We switched Service selector back to the stable (`854db48c66` revision 10 and 12) which is now the desired. 
# We also set traffic weight to 0. So all traffic is going to desired/stable
# However, notice that ReplicaSet `854db48c66` only has 2 replicas. This overwhelmed the ReplicaSet
############################################################
time="2023-08-29T02:34:11Z" level=info msg="Event(): reason: 'SwitchService' Switched selector for service 'o312971574277-canary' from '6c99db4ccf' to '854db48c66'"
time="2023-08-29T02:34:11Z" level=info msg="Event(): reason: 'TrafficWeightUpdated' Traffic weight updated from 95 to 0"
time="2023-08-29T02:34:11Z" level=info msg="Event(): reason: 'ScalingReplicaSet' Scaled up ReplicaSet guestbook-854db48c66 (revision 12) from 2 to 21"
time="2023-08-29T02:34:11Z" level=info msg="Patched: {\"status\":{\"canary\":{\"weights\":{\"canary\":{\"podTemplateHash\":\"854db48c66\",\"weight\":0},\"stable\":{\"weight\":100}}}}}" generation=10 namespace=guestbook resourceVersion=561673535 rollout=guestbook

############################################################
# The previous canary replicaset (`6c99db4ccf` revision 11) is scaled down
# But it doesn't matter because no traffic is directed towards it
############################################################
time="2023-08-29T02:34:11Z" level=info msg="Event(): reason: 'ScalingReplicaSet' Scaled down ReplicaSet guestbook-6c99db4ccf (revision 11) from 21 to 20"
time="2023-08-29T02:34:11Z" level=info msg="Event(): reason: 'ScalingReplicaSet' Scaled down ReplicaSet guestbook-6c99db4ccf (revision 11) from 20 to 0"
time="2023-08-29T02:34:11Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":22,\"conditions\":[{\"lastTransitionTime\":\"2023-08-23T06:02:09Z\",\"lastUpdateTime\":\"2023-08-23T06:02:09Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-29T01:46:54Z\",\"lastUpdateTime\":\"2023-08-29T01:46:54Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2023-08-29T02:34:06Z\",\"lastUpdateTime\":\"2023-08-29T02:34:06Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T02:22:18Z\",\"lastUpdateTime\":\"2023-08-29T02:34:11Z\",\"message\":\"ReplicaSet \\\"guestbook-854db48c66\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"replicas\":22}}" generation=10 namespace=guestbook resourceVersion=561673878 rollout=guestbook

############################################################
# Here we can clearly see that `availableReplicas` is only 2, based on the patch
############################################################
time="2023-08-29T02:34:11Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":2,\"availableReplicas\":2,\"conditions\":[{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-29T01:46:54Z\",\"lastUpdateTime\":\"2023-08-29T01:46:54Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2023-08-29T02:34:06Z\",\"lastUpdateTime\":\"2023-08-29T02:34:06Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T02:22:18Z\",\"lastUpdateTime\":\"2023-08-29T02:34:11Z\",\"message\":\"ReplicaSet \\\"guestbook-854db48c66\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"},{\"lastTransitionTime\":\"2023-08-29T02:34:11Z\",\"lastUpdateTime\":\"2023-08-29T02:34:11Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"}],\"readyReplicas\":2,\"replicas\":2}}" generation=10 namespace=guestbook resourceVersion=561673992 rollout=guestbook
time="2023-08-29T02:34:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":21,\"conditions\":[{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-29T01:46:54Z\",\"lastUpdateTime\":\"2023-08-29T01:46:54Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2023-08-29T02:34:06Z\",\"lastUpdateTime\":\"2023-08-29T02:34:06Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T02:34:11Z\",\"lastUpdateTime\":\"2023-08-29T02:34:11Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2023-08-29T02:22:18Z\",\"lastUpdateTime\":\"2023-08-29T02:34:12Z\",\"message\":\"ReplicaSet \\\"guestbook-854db48c66\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"updated replicas are still becoming available\",\"replicas\":21,\"updatedReplicas\":21}}" generation=10 namespace=guestbook resourceVersion=561673998 rollout=guestbook
time="2023-08-29T02:44:13Z" level=info msg="Patched: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-29T01:46:54Z\",\"lastUpdateTime\":\"2023-08-29T01:46:54Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2023-08-29T02:34:06Z\",\"lastUpdateTime\":\"2023-08-29T02:34:06Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T02:34:11Z\",\"lastUpdateTime\":\"2023-08-29T02:34:11Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2023-08-29T02:44:13Z\",\"lastUpdateTime\":\"2023-08-29T02:44:13Z\",\"message\":\"ReplicaSet \\\"guestbook-854db48c66\\\" has timed out progressing.\",\"reason\":\"ProgressDeadlineExceeded\",\"status\":\"False\",\"type\":\"Progressing\"}],\"message\":\"ProgressDeadlineExceeded: ReplicaSet \\\"guestbook-854db48c66\\\" has timed out progressing.\",\"phase\":\"Degraded\"}}" generation=10 namespace=guestbook resourceVersion=561674153 rollout=guestbook
time="2023-08-29T02:46:50Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":3,\"conditions\":[{\"lastTransitionTime\":\"2023-08-28T16:01:38Z\",\"lastUpdateTime\":\"2023-08-28T16:01:38Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T02:46:50Z\",\"lastUpdateTime\":\"2023-08-29T02:46:50Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-02T10:06:30Z\",\"lastUpdateTime\":\"2023-08-29T02:46:50Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"ReplicaSetNotAvailable\",\"status\":\"True\",\"type\":\"Progressing\"},{\"lastTransitionTime\":\"2023-08-29T02:46:50Z\",\"lastUpdateTime\":\"2023-08-29T02:46:50Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"}],\"message\":\"updated replicas are still becoming available\",\"phase\":\"Progressing\",\"readyReplicas\":3}}" generation=4 namespace=ns-team-ffh-prod resourceVersion=559538368 rollout=firefly--firefly-api-prod-deploy1
time="2023-08-29T02:47:23Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":3,\"conditions\":[{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-29T01:46:54Z\",\"lastUpdateTime\":\"2023-08-29T01:46:54Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2023-08-29T02:34:06Z\",\"lastUpdateTime\":\"2023-08-29T02:34:06Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T02:34:11Z\",\"lastUpdateTime\":\"2023-08-29T02:34:11Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2023-08-29T02:47:23Z\",\"lastUpdateTime\":\"2023-08-29T02:47:23Z\",\"message\":\"ReplicaSet \\\"guestbook-854db48c66\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"updated replicas are still becoming available\",\"phase\":\"Progressing\",\"readyReplicas\":3}}" generation=10 namespace=guestbook resourceVersion=561716834 rollout=guestbook
time="2023-08-29T02:47:52Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":2,\"readyReplicas\":2}}" generation=4 namespace=ns-team-ffh-prod resourceVersion=561726222 rollout=firefly--firefly-api-prod-deploy1
time="2023-08-29T02:47:53Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":3,\"conditions\":[{\"lastTransitionTime\":\"2023-08-28T16:01:38Z\",\"lastUpdateTime\":\"2023-08-28T16:01:38Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T02:46:50Z\",\"lastUpdateTime\":\"2023-08-29T02:46:50Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-29T02:46:50Z\",\"lastUpdateTime\":\"2023-08-29T02:46:50Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2023-08-02T10:06:30Z\",\"lastUpdateTime\":\"2023-08-29T02:47:53Z\",\"message\":\"ReplicaSet \\\"firefly--firefly-api-prod-deploy1-76b7c56b\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"readyReplicas\":3}}" generation=4 namespace=ns-team-ffh-prod resourceVersion=561729974 rollout=firefly--firefly-api-prod-deploy1
time="2023-08-29T02:48:24Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":4,\"conditions\":[{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-29T01:46:54Z\",\"lastUpdateTime\":\"2023-08-29T01:46:54Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2023-08-29T02:34:06Z\",\"lastUpdateTime\":\"2023-08-29T02:34:06Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T02:34:11Z\",\"lastUpdateTime\":\"2023-08-29T02:34:11Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2023-08-29T02:47:23Z\",\"lastUpdateTime\":\"2023-08-29T02:48:24Z\",\"message\":\"ReplicaSet \\\"guestbook-854db48c66\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"readyReplicas\":4}}" generation=10 namespace=guestbook resourceVersion=561728198 rollout=guestbook
time="2023-08-29T02:49:06Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":4,\"conditions\":[{\"lastTransitionTime\":\"2023-08-28T16:01:38Z\",\"lastUpdateTime\":\"2023-08-28T16:01:38Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T02:49:06Z\",\"lastUpdateTime\":\"2023-08-29T02:49:06Z\",\"message\":\"Rollout is healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"True\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-02T10:06:30Z\",\"lastUpdateTime\":\"2023-08-29T02:49:06Z\",\"message\":\"ReplicaSet \\\"firefly--firefly-api-prod-deploy1-76b7c56b\\\" has successfully progressed.\",\"reason\":\"NewReplicaSetAvailable\",\"status\":\"True\",\"type\":\"Progressing\"},{\"lastTransitionTime\":\"2023-08-29T02:49:06Z\",\"lastUpdateTime\":\"2023-08-29T02:49:06Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"}],\"message\":null,\"phase\":\"Healthy\",\"readyReplicas\":4}}" generation=4 namespace=ns-team-ffh-prod resourceVersion=561730028 rollout=firefly--firefly-api-prod-deploy1
time="2023-08-29T02:49:14Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":5,\"conditions\":[{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-29T01:46:54Z\",\"lastUpdateTime\":\"2023-08-29T01:46:54Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2023-08-29T02:34:06Z\",\"lastUpdateTime\":\"2023-08-29T02:34:06Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T02:34:11Z\",\"lastUpdateTime\":\"2023-08-29T02:34:11Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2023-08-29T02:47:23Z\",\"lastUpdateTime\":\"2023-08-29T02:49:14Z\",\"message\":\"ReplicaSet \\\"guestbook-854db48c66\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"readyReplicas\":5}}" generation=10 namespace=guestbook resourceVersion=561731865 rollout=guestbook
time="2023-08-29T02:49:22Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":6,\"conditions\":[{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-29T01:46:54Z\",\"lastUpdateTime\":\"2023-08-29T01:46:54Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2023-08-29T02:34:06Z\",\"lastUpdateTime\":\"2023-08-29T02:34:06Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T02:34:11Z\",\"lastUpdateTime\":\"2023-08-29T02:34:11Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2023-08-29T02:47:23Z\",\"lastUpdateTime\":\"2023-08-29T02:49:22Z\",\"message\":\"ReplicaSet \\\"guestbook-854db48c66\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"readyReplicas\":6}}" generation=10 namespace=guestbook resourceVersion=561734853 rollout=guestbook
time="2023-08-29T02:49:22Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":7,\"readyReplicas\":7}}" generation=10 namespace=guestbook resourceVersion=561735289 rollout=guestbook
time="2023-08-29T02:49:22Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":10,\"readyReplicas\":10}}" generation=10 namespace=guestbook resourceVersion=561735349 rollout=guestbook
time="2023-08-29T02:49:22Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":15,\"readyReplicas\":15}}" generation=10 namespace=guestbook resourceVersion=561735374 rollout=guestbook
time="2023-08-29T02:49:23Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":16,\"conditions\":[{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-29T01:46:54Z\",\"lastUpdateTime\":\"2023-08-29T01:46:54Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2023-08-29T02:34:06Z\",\"lastUpdateTime\":\"2023-08-29T02:34:06Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T02:34:11Z\",\"lastUpdateTime\":\"2023-08-29T02:34:11Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2023-08-29T02:47:23Z\",\"lastUpdateTime\":\"2023-08-29T02:49:23Z\",\"message\":\"ReplicaSet \\\"guestbook-854db48c66\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"readyReplicas\":16}}" generation=10 namespace=guestbook resourceVersion=561735375 rollout=guestbook
time="2023-08-29T02:49:23Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":17,\"readyReplicas\":17}}" generation=10 namespace=guestbook resourceVersion=561735406 rollout=guestbook
time="2023-08-29T02:59:22Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":18,\"conditions\":[{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-29T01:46:54Z\",\"lastUpdateTime\":\"2023-08-29T01:46:54Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2023-08-29T02:34:06Z\",\"lastUpdateTime\":\"2023-08-29T02:34:06Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2023-08-29T02:34:11Z\",\"lastUpdateTime\":\"2023-08-29T02:34:11Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2023-08-29T02:47:23Z\",\"lastUpdateTime\":\"2023-08-29T02:59:22Z\",\"message\":\"ReplicaSet \\\"guestbook-854db48c66\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"readyReplicas\":18}}" generation=10 namespace=guestbook resourceVersion=561735409 rollout=guestbook
time="2023-08-29T03:00:22Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":19,\"conditions\":[{\"lastTransitionTime\":\"2023-08-28T23:56:22Z\",\"lastUpdateTime\":\"2023-08-28T23:56:22Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2023-08-29T01:46:54Z\",\"lastUpdateTime\":\"2023-08-29T01:46:54Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2023-08-29T02:34:06Z\",
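
For anyone tracing the same pattern in their own controller logs, the following is a rough sketch of pulling `availableReplicas` out of the `Patched:` lines above. The helper name `available_replicas` is hypothetical, and the decoding step assumes the `msg` field uses the same escaping as the samples in this issue.

```python
import json
import re

# Matches the logfmt-style msg="..." field, allowing escaped quotes inside.
MSG_RE = re.compile(r'msg="((?:[^"\\]|\\.)*)"')
ROLLOUT_RE = re.compile(r'\brollout=(\S+)')


def available_replicas(line):
    """Return (rollout, availableReplicas) for a 'Patched:' line, else None."""
    msg_match = MSG_RE.search(line)
    rollout_match = ROLLOUT_RE.search(line)
    if not msg_match or not rollout_match:
        return None
    # Decoding the field as a JSON string removes one level of escaping
    # (an assumption that holds for the samples in this issue).
    try:
        msg = json.loads('"' + msg_match.group(1) + '"')
    except json.JSONDecodeError:
        return None
    if not msg.startswith("Patched: "):
        return None
    try:
        patch = json.loads(msg[len("Patched: "):])
    except json.JSONDecodeError:
        return None
    available = patch.get("status", {}).get("availableReplicas")
    if available is None:
        return None
    return rollout_match.group(1), available


sample = (
    'time="2023-08-29T02:32:18Z" level=info msg="Patched: '
    '{\\"status\\":{\\"HPAReplicas\\":23,\\"availableReplicas\\":22,'
    '\\"readyReplicas\\":22,\\"replicas\\":23}}" generation=10 '
    'namespace=guestbook resourceVersion=561667475 rollout=guestbook'
)
print(available_replicas(sample))  # ('guestbook', 22)
```

Filtering the results by rollout name also separates the `guestbook` entries from the unrelated rollout that appears in the same grep output. Truncated lines, such as the last one above, simply return None.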

To Reproduce

I have not yet reproduced this.

Expected behavior

Dynamic stable scale should not have sent 100% of traffic to the new desired/stable RS, because it only had 2 pods.

I think the bug is that while we may be handling this properly in the abort case, we may not be handling it in the fast-tracked rollback to stable.
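
One way to express the expected behavior is to cap the traffic weight sent to a ReplicaSet by the share of desired capacity it can currently serve. The sketch below is hypothetical: `max_safe_weight` is not a function in Argo Rollouts, and the fix that eventually closed this issue may work differently.

```python
def max_safe_weight(available, desired):
    """Largest traffic percentage (0-100) a ReplicaSet should receive.

    Hypothetical rule: never route a larger share of traffic to a
    ReplicaSet than its share of the desired replica count.
    """
    if desired <= 0:
        return 0
    return min(100, (100 * available) // desired)


# Numbers from the incident: 2 available stable pods, 21 desired replicas.
print(max_safe_weight(2, 21))   # 9
print(max_safe_weight(21, 21))  # 100
```

By this rule, the stable ReplicaSet in the incident could have taken roughly 9% of traffic at the moment of the rollback, not 100%.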


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@jessesuen jessesuen added the bug Something isn't working label Sep 6, 2023

jessesuen commented Sep 6, 2023

I can think of a few workarounds:

  1. Instead of applying the old stable, apply the stable spec with some change (e.g. a new annotation). This will be treated like a normal update, at which point a promote-full operation could be performed.

  2. Instead of applying the old stable, perform an abort. I believe the abort handling of scaling and traffic shifting is correct. Once the rollout is aborted, it should be safe to apply the old stable spec.

  3. Avoid the dynamicStableScale feature. When it is disabled, the stable RS remains fully scaled during the update, so shifting traffic back to it is safe.

Unfortunately, if the old stable spec is applied to a Rollout that uses dynamicStableScale while it is in the middle of an update, we may fail to honor maxUnavailable and may shift too much traffic to the undersized stable/desired ReplicaSet.


Here is how to reproduce it

  1. Apply the following manifests:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: istio-canary
spec:
  replicas: 10
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: istio-canary
  template:
    metadata:
      labels:
        app: istio-canary
        sidecar.istio.io/inject: "true"
    spec:
      containers:
      - name: istio-canary
        image: argoproj/rollouts-demo:red
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        resources:
          requests:
            memory: 32Mi
            cpu: 5m
        readinessProbe:
          initialDelaySeconds: 30
          httpGet:
            path: /
            port: 8080
          periodSeconds: 30
  strategy:
    canary:
      dynamicStableScale: true
      canaryService: istio-canary-canary
      stableService: istio-canary-stable
      trafficRouting:
        istio:
          virtualService:
            name: istio-canary
      steps:
      - setWeight: 90
      - pause: {}

---
apiVersion: v1
kind: Service
metadata:
  name: istio-canary
spec:
  ports:
  - port: 80
    targetPort: http
    protocol: TCP
    name: http
  selector:
    app: istio-canary

---
apiVersion: v1
kind: Service
metadata:
  name: istio-canary-canary
spec:
  ports:
  - port: 80
    targetPort: http
    protocol: TCP
    name: http
  selector:
    app: istio-canary

---
apiVersion: v1
kind: Service
metadata:
  name: istio-canary-stable
spec:
  ports:
  - port: 80
    targetPort: http
    protocol: TCP
    name: http
  selector:
    app: istio-canary

---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: istio-canary
spec:
  gateways:
  - istio-canary
  hosts:
  - istio-canary
  - istio-canary.localhost
  - jesse-rollout-scaledown.demo.akuity.io
  http:
  - route:
    - destination:
        host: istio-canary-stable
      weight: 100
    - destination:
        host: istio-canary-canary
      weight: 0
  2. Change the image of the rollout (e.g. argoproj/rollouts-demo:orange) to trigger an update. Allow it to reach the pause step. The ReplicaSet counts should look like the following:
$ k get rs -o wide
NAME                      DESIRED   CURRENT   READY   AGE   CONTAINERS     IMAGES                          SELECTOR
istio-canary-7b8bcb8869   9         9         9       22h   istio-canary   argoproj/rollouts-demo:orange   app=istio-canary,rollouts-pod-template-hash=7b8bcb8869
istio-canary-f7d4dcd68    1         1         1       22h   istio-canary   argoproj/rollouts-demo:red      app=istio-canary,rollouts-pod-template-hash=f7d4dcd68
  3. Change the image of the rollout back to the original argoproj/rollouts-demo:red. The service immediately switches back to stable and overwhelms the single stable pod with traffic. The traffic weight likewise shifts 100% back to stable.
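
The effect on the lone stable pod in this reproduction can be estimated with some simple arithmetic. This is a sketch that assumes traffic is spread evenly across the ready pods of each ReplicaSet; `per_pod_share` is a hypothetical helper, not part of Argo Rollouts.

```python
def per_pod_share(weight_percent, ready_pods):
    """Percentage of total traffic that each pod in a ReplicaSet receives."""
    if ready_pods <= 0:
        raise ValueError("ready_pods must be positive")
    return weight_percent / ready_pods


# At the pause step: setWeight 90, so stable gets 10% on its 1 pod.
before = per_pod_share(10, 1)
# After reapplying the red image: stable gets 100%, still on 1 ready pod.
after = per_pod_share(100, 1)
print(before, after, after / before)  # 10.0 100.0 10.0
```

Under that assumption, the single stable pod sees about ten times its previous load until the other nine replicas become ready.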
