Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The handler sometimes doesn't cordon any node #97

Closed
someone-stole-my-name opened this issue Apr 12, 2023 · 9 comments
Closed

The handler sometimes doesn't cordon any node #97

someone-stole-my-name opened this issue Apr 12, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@someone-stole-my-name
Copy link
Contributor

Describe the bug

The getRollingUpdateTimestampsFromNode(node) conditions used throughout the code to check if a node should be cordoned or not have a flaw. If something or someone decides to stop the rolling update, and manually uncordon the nodes, the handler on next start will happily evict pods in that node without actually cordoning anything. It is specially problematic when using the eager cordoning feature, since it leads to an upgrade that can never end.

What do you see?

No response

What do you expect to see?

No response

List the steps that must be taken to reproduce this issue

  1. Start the handler
  2. Let it cordon a node
  3. Stop the handler
  4. Uncordon the node manually
  5. Start the handler again

Version

No response

Additional information

No response

@someone-stole-my-name someone-stole-my-name added the bug Something isn't working label Apr 12, 2023
@zaafar
Copy link

zaafar commented Jun 1, 2023

Is there any workaround for this bug?

@someone-stole-my-name
Copy link
Contributor Author

Is there any workaround for this bug?

Manually removing the annotations.

@zaafar
Copy link

zaafar commented Jun 1, 2023

Just to be clear, that wouldn't redo the EAGER_CORDONING, it will only redo the regular cordoning. Is there any hack/trick to redo the EAGER_CORDONING after this bug?

@TwiN
Copy link
Owner

TwiN commented Jun 5, 2023

Seems like this bug was introduced in #42. Versions prior to v1.7.0 didn't have the aws-eks-asg-rolling-update-handler.twin.sh/cordoned-at annotations, and I believe I had specifically avoided adding this annotation just in case something like this would happen.

I'll see if I can do something 🤔

@TwiN
Copy link
Owner

TwiN commented Jun 5, 2023

@zaafar @someone-stole-my-name Hey folks, I think #117 fixed this, but I'd appreciate if you could both give it a try.

As soon as https://github.com/TwiN/aws-eks-asg-rolling-update-handler/actions/runs/5172952729 finishes, it should be available with the latest tag (twinproduction/aws-eks-asg-rolling-update-handler:latest)

@TwiN TwiN pinned this issue Jun 5, 2023
@zaafar
Copy link

zaafar commented Jun 5, 2023

sure, i am going to test it and get back to you.

@zaafar
Copy link

zaafar commented Jun 5, 2023

looks good to me.

@zaafar
Copy link

zaafar commented Jun 5, 2023

please whenever you get some time, make a release.

@TwiN
Copy link
Owner

TwiN commented Jun 5, 2023

Just released the fix in v1.8.1

@TwiN TwiN closed this as completed Jun 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants