-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delay force refresh by DefaultInterval when OCI GetNodePool call retu… #6584
Conversation
Welcome @vbhargav875! |
07e885c
to
23ab68f
Compare
23ab68f
to
2bf403e
Compare
if err != nil { | ||
if httpStatusCode == 404 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the OCI SDK returns a 404, is that considered an error (i.e. is the err != nil
check going to be true)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in case of 404, err is not nil.
klog.Infof("rebuilding cache") | ||
var resp oke.GetNodePoolResponse | ||
var statusCode int | ||
for id := range staticNodePools { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One consideration. We're looping over all of the node pools, but we exit/error out on the first one that errors out, which means you could have 20 node pools, and if the first one is deleted/errors out, then we won't reconcile any of them.
This is how the previous behavior worked as well, so we don't need to address it now, but we should note it as a bug (perhaps as a code comment and/or an internal ticket).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
Changes look good to me |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jlamillan, vbhargav875 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind feature
What this PR does / why we need it:
In the OCI Nodepools Implementation of ClusterAutoscaler, we do not set any delay to the calls that ClusterAutoscaler makes to the OKE GetNodePool API when the API returns a 404 error (which is possible when the Nodepool is deleted or an incorrect Nodepool ocid is passed) and hence we continuously trigger forceRefresh (when CA triggers Refresh) and therefore CA will indefinitely make GetNodePool calls to CP.
Introducing a max retry count and setting a delay to the next force refresh when the retries is exhausted will help prevent this.
Does this PR introduce a user-facing change?
No
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
None