Limit refresh rate of GCE MIG instances. #5665
Conversation
/assign @jayantjain93
@@ -196,6 +196,8 @@ type AutoscalingOptions struct {
	AWSUseStaticInstanceList bool
	// ConcurrentGceRefreshes is the maximum number of concurrently refreshed instance groups or instance templates.
	ConcurrentGceRefreshes int
	// GCEMigInstancesMaxRefreshRate is the maximum rate at which GCE MIG instances from a given MIG can be refreshed.
	GCEMigInstancesMaxRefreshRate time.Duration
Can we collect the multiple GCE options we have, such as GceExpanderEphemeralStorageSupport, into a single cloud-provider struct? Having multiple cloud-provider-specific flags in the common Options isn't a good precedent.
makes sense, created a new struct
cluster-autoscaler/main.go
Outdated
@@ -192,6 +192,7 @@ var (
	balancingLabelsFlag = multiStringFlag("balancing-label", "Specifies a label to use for comparing if two node groups are similar, rather than the built in heuristics. Setting this flag disables all other comparison logic, and cannot be combined with --balancing-ignore-label.")
	awsUseStaticInstanceList = flag.Bool("aws-use-static-instance-list", false, "Should CA fetch instance types in runtime or use a static list. AWS only")
	concurrentGceRefreshes = flag.Int("gce-concurrent-refreshes", 1, "Maximum number of concurrent refreshes per cloud object type.")
	gceMigInstancesMaxRefreshRate = flag.Duration("gce-mig-instances-max-refresh-rate", 5*time.Second, "Maximum rate at which GCE MIG instances from a given MIG can be refreshed.")
I would also look to collect the cloud-provider-specific flags in a block with comments.
e.g.
randomFlag = flag.Bool()
// --- gce only flags --
gceMigInstancesMaxRefreshRate = flag.Bool()
gceExpanderEphemeralStorageSupport = flag.Bool()
// ---
done
	gceClient: gceClient,
	projectId: projectId,
	concurrentGceRefreshes: concurrentGceRefreshes,
	migInstancesMaxRefreshRate: migInstancesMaxRefreshRate,
I would rename this variable (including in other places) to something along the lines of migInstancesMinRefreshWaitTime. We want to signify that this is the amount of time you have to wait before refetching the MIG instances.
makes sense, changed
	klog.V(4).Infof("Regenerating MIG instances cache for %s", migRef.String())
	instances, err := c.gceClient.FetchMigInstances(migRef)
	if err != nil {
		c.migLister.HandleMigIssue(migRef, err)
		return err
	}
	c.migInstancesLastRefreshedInfo[migRef.String()] = time.Now()
This only sets the refresh timestamp for a successful fetch that returns instances.
Don't we also want to wait for the same timeout after unsuccessful fetches?
Maybe we don't, but can you drop a small rationale for it, though?
Added a comment. I don't think we should cache unsuccessful results: the issues there can be transient, so another call might make sense.
5f6a3c1 to 8d990e5
/lgtm
// GCESpecificOptions contain autoscaling options specific to GCE cloud provider.
type GCESpecificOptions struct {
	// ConcurrentGceRefreshes is the maximum number of concurrently refreshed instance groups or instance templates.
	ConcurrentGceRefreshes int
All these variables should not have Gce in them, as they are already defined in GCESpecificOptions.
makes sense, done.
	provider := NewCachingMigInfoProvider(cache, migLister, client, mig.GceRef().Project, 1, tc.refreshRateDuration)
	_, err = provider.GetMigForInstance(instanceRef)
	assert.NoError(t, err)
	time.Sleep(tc.delayBetweenCalls)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sleeps in the unit test are a bit fragile. Could you add a clock to the cachingMigInfoProvider and mock it in unit tests instead? That way we could avoid using sleep.
done
/assign @BigDarkClown
/lgtm
@@ -158,12 +175,20 @@ func (c *cachingMigInfoProvider) findMigWithMatchingBasename(instanceRef GceRef) {
}

func (c *cachingMigInfoProvider) fillMigInstances(migRef GceRef) error {
	if val, ok := c.migInstancesLastRefreshedInfo[migRef.String()]; ok {
		// do not regenerate MIG instances cache if last refresh happened recently.
Maybe log info that we're serving stale data from cache?
done
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: BigDarkClown, jayantjain93, olagacek, x13n. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR limits the number of GCE calls made to fetch MIG instances. After a large scale-down, when many instances have been removed, we perform many redundant calls to GCE: we try to fetch already-deleted instances, even though the same fetch was done moments ago for a different, already-deleted instance. To avoid that, this PR introduces a minimum wait time (by default 5s) before the instances of a given MIG can be refetched.
Special notes for your reviewer:
Does this PR introduce a user-facing change?
NONE
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: