
Limit refresh rate of GCE MIG instances. #5665

Merged: 1 commit merged into kubernetes:master on Apr 18, 2023

Conversation

olagacek (Contributor) commented Apr 6, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR limits the number of GCE calls made to fetch MIG instances. After a large scale-down, when many instances have been removed, we make a lot of redundant calls to GCE: we try to fetch an already-deleted instance even though the same fetch was performed moments ago for a different, already-deleted instance. To avoid that, this PR introduces a minimum interval (5s by default) before the instances of a given MIG can be refetched.
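
Conceptually, the change adds a per-MIG timestamp guard. The sketch below illustrates the idea only; the type and field names (migInstanceCache, lastRefresh, minRefreshWaitTime) are illustrative stand-ins, not the PR's actual identifiers — see the diffs further down for those.

```go
// Minimal sketch of the per-MIG refresh guard; type and field names here
// are illustrative, not the PR's actual identifiers.
package main

import (
	"fmt"
	"time"
)

type migInstanceCache struct {
	lastRefresh        map[string]time.Time // MIG ref -> time of last successful fetch
	minRefreshWaitTime time.Duration        // 5s by default in the PR
}

// shouldRefresh reports whether the instances of the given MIG may be
// refetched from GCE, or whether cached data should be served instead.
func (c *migInstanceCache) shouldRefresh(migRef string) bool {
	last, ok := c.lastRefresh[migRef]
	return !ok || time.Since(last) >= c.minRefreshWaitTime
}

func main() {
	c := &migInstanceCache{
		lastRefresh:        map[string]time.Time{"mig-a": time.Now()},
		minRefreshWaitTime: 5 * time.Second,
	}
	fmt.Println(c.shouldRefresh("mig-a")) // false: fetched moments ago
	fmt.Println(c.shouldRefresh("mig-b")) // true: never fetched before
}
```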

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 6, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 6, 2023
jayantjain93 (Contributor)

/assign @jayantjain93

@@ -196,6 +196,8 @@ type AutoscalingOptions struct {
AWSUseStaticInstanceList bool
// ConcurrentGceRefreshes is the maximum number of concurrently refreshed instance groups or instance templates.
ConcurrentGceRefreshes int
// GCEMigInstancesMaxRefreshRate is the maximum rate at which GCE MIG instances from a given MIG can be refreshed.
GCEMigInstancesMaxRefreshRate time.Duration
Contributor:

Can we collect the multiple GCE options we have, such as GceExpanderEphemeralStorageSupport, into a single cloud-provider struct? Having multiple cloud-provider-specific flags in the common options isn't a good precedent.

Contributor Author:

makes sense, created a new struct

@@ -192,6 +192,7 @@ var (
balancingLabelsFlag = multiStringFlag("balancing-label", "Specifies a label to use for comparing if two node groups are similar, rather than the built in heuristics. Setting this flag disables all other comparison logic, and cannot be combined with --balancing-ignore-label.")
awsUseStaticInstanceList = flag.Bool("aws-use-static-instance-list", false, "Should CA fetch instance types in runtime or use a static list. AWS only")
concurrentGceRefreshes = flag.Int("gce-concurrent-refreshes", 1, "Maximum number of concurrent refreshes per cloud object type.")
gceMigInstancesMaxRefreshRate = flag.Duration("gce-mig-instances-max-refresh-rate", 5*time.Second, "Maximum rate at which GCE MIG instances from a given MIG can be refreshed.")
Contributor:

I would also collect cloud-provider-specific flags in a block with comments, e.g.:

randomFlag = flag.Bool()

// --- gce only flags --
gceMigInstancesMaxRefreshRate = flag.Bool()
gceExpanderEphemeralStorageSupport = flag.Bool()
// ---

Contributor Author:

done

gceClient: gceClient,
projectId: projectId,
concurrentGceRefreshes: concurrentGceRefreshes,
migInstancesMaxRefreshRate: migInstancesMaxRefreshRate,
Contributor:

I would rename this variable (including in other places) to something along the lines of migInstancesMinRefreshWaitTime, to signify that this is the amount of time you have to wait before refetching the MIG instances.

Contributor Author:

makes sense, changed

klog.V(4).Infof("Regenerating MIG instances cache for %s", migRef.String())
instances, err := c.gceClient.FetchMigInstances(migRef)
if err != nil {
c.migLister.HandleMigIssue(migRef, err)
return err
}
c.migInstancesLastRefreshedInfo[migRef.String()] = time.Now()
Contributor:

This only sets the refresh timestamp for a successful fetch that returns instances. Don't we also want to wait for the same timeout after unsuccessful fetches?

Maybe we don't, but can you add a short rationale for it?

Contributor Author:

Added a comment. I don't think we should cache unsuccessful results, as the underlying issues can be transient, so another call might make sense.

@olagacek olagacek force-pushed the master branch 2 times, most recently from 5f6a3c1 to 8d990e5, on April 6, 2023 16:52
@olagacek olagacek requested a review from jayantjain93 April 6, 2023 16:59
jayantjain93 (Contributor)

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 7, 2023
// GCESpecificOptions contain autoscaling options specific to GCE cloud provider.
type GCESpecificOptions struct {
// ConcurrentGceRefreshes is the maximum number of concurrently refreshed instance groups or instance templates.
ConcurrentGceRefreshes int
Contributor:

None of these variables should have Gce in their names, as they are already defined inside GCESpecificOptions.

Contributor Author:

makes sense, done.
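
For context, after dropping the Gce prefixes the struct plausibly ends up in the shape sketched below. The exact final field names are an assumption inferred from this thread, not copied from the merged code.

```go
// Assumed final shape of the GCE options struct; field names are
// inferred from the review comments above, not from the merged code.
package config

import "time"

// GCESpecificOptions contain autoscaling options specific to the GCE cloud provider.
type GCESpecificOptions struct {
	// ConcurrentRefreshes is the maximum number of concurrently
	// refreshed instance groups or instance templates.
	ConcurrentRefreshes int
	// MigInstancesMinRefreshWaitTime is the minimum time to wait
	// before the instances of a given MIG can be refetched.
	MigInstancesMinRefreshWaitTime time.Duration
}
```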

provider := NewCachingMigInfoProvider(cache, migLister, client, mig.GceRef().Project, 1, tc.refreshRateDuration)
_, err = provider.GetMigForInstance(instanceRef)
assert.NoError(t, err)
time.Sleep(tc.delayBetweenCalls)
Contributor:

Sleeps in unit tests are a bit fragile. Could you add a clock to the cachingMigInfoProvider and mock it in unit tests instead? That way we could avoid using sleep.

Contributor Author:

done
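
The clock-injection pattern suggested above typically looks something like the sketch below; the interface and names here are hypothetical, not the PR's actual types (Kubernetes code often uses k8s.io/utils/clock for the same purpose).

```go
// Hypothetical clock abstraction for testability; all names here are
// illustrative, not taken from the merged change.
package gce

import "time"

// clock abstracts time.Now so unit tests can control the flow of time.
type clock interface {
	Now() time.Time
}

// realClock is what production code would be constructed with.
type realClock struct{}

func (realClock) Now() time.Time { return time.Now() }

// fakeClock lets a test advance time explicitly instead of sleeping.
type fakeClock struct {
	now time.Time
}

func (f *fakeClock) Now() time.Time { return f.now }

// Advance moves the fake clock forward by d.
func (f *fakeClock) Advance(d time.Duration) { f.now = f.now.Add(d) }
```

With this in place, a test constructs the provider with a *fakeClock and replaces time.Sleep(tc.delayBetweenCalls) with fakeClock.Advance(tc.delayBetweenCalls), making the test deterministic and instant.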

BigDarkClown (Contributor)

/assign @BigDarkClown

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 17, 2023
@olagacek olagacek requested a review from BigDarkClown April 17, 2023 13:06
BigDarkClown (Contributor)

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 18, 2023
@@ -158,12 +175,20 @@ func (c *cachingMigInfoProvider) findMigWithMatchingBasename(instanceRef GceRef)
}

func (c *cachingMigInfoProvider) fillMigInstances(migRef GceRef) error {
if val, ok := c.migInstancesLastRefreshedInfo[migRef.String()]; ok {
// do not regenerate MIG instances cache if last refresh happened recently.
Member:

Maybe log info that we're serving stale data from cache?

Contributor Author:

done
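
Putting this thread together, the guard plus the requested log message plausibly reads like the sketch below. GceRef and the provider struct are simplified stand-ins here, and the exact log wording is an assumption.

```go
// Sketch of the final guard with the suggested log line; GceRef and the
// provider struct are simplified stand-ins, and the log wording is assumed.
package main

import (
	"time"

	"k8s.io/klog/v2"
)

// GceRef is a simplified stand-in for the real Cluster Autoscaler type.
type GceRef struct{ Project, Zone, Name string }

func (r GceRef) String() string { return r.Project + "/" + r.Zone + "/" + r.Name }

type cachingMigInfoProvider struct {
	migInstancesLastRefreshedInfo  map[string]time.Time
	migInstancesMinRefreshWaitTime time.Duration
}

func (c *cachingMigInfoProvider) fillMigInstances(migRef GceRef) error {
	if last, ok := c.migInstancesLastRefreshedInfo[migRef.String()]; ok {
		// Do not regenerate the MIG instances cache if the last refresh
		// happened recently; note in the log that cached data is served.
		if time.Since(last) < c.migInstancesMinRefreshWaitTime {
			klog.V(4).Infof("Not regenerating MIG instances cache for %s, as it was refreshed recently", migRef.String())
			return nil
		}
	}
	klog.V(4).Infof("Regenerating MIG instances cache for %s", migRef.String())
	// ... fetch instances from GCE here; on success, record the refresh time:
	c.migInstancesLastRefreshedInfo[migRef.String()] = time.Now()
	return nil
}

func main() {} // stand-alone sketch; no runtime behavior needed
```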

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 18, 2023
@olagacek olagacek requested a review from x13n April 18, 2023 14:27
x13n (Member) commented Apr 18, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 18, 2023
k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BigDarkClown, jayantjain93, olagacek, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2023
@k8s-ci-robot k8s-ci-robot merged commit 5c3f810 into kubernetes:master Apr 18, 2023