
Limit refresh rate of GCE MIG instances. #5665

Merged: 1 commit merged into kubernetes:master on Apr 18, 2023

Conversation

olagacek (Contributor) commented Apr 6, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR limits the number of GCE calls made to fetch MIG instances. After a large scale-down, when many instances have been removed, we make a lot of redundant calls to GCE: we try to fetch an already-deleted instance even though the same fetch was performed moments ago for a different, already-deleted instance. To avoid that, this PR introduces a minimum interval (5s by default) before the instances of a given MIG can be refetched.
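
Conceptually, the change adds a per-MIG timestamp guard. The sketch below illustrates the idea only; the type and field names (migInstanceCache, lastRefresh, minRefreshWaitTime) are illustrative stand-ins, not the PR's actual identifiers — see the diffs further down for those.

```go
// Minimal sketch of the per-MIG refresh guard; type and field names here
// are illustrative, not the PR's actual identifiers.
package main

import (
	"fmt"
	"time"
)

type migInstanceCache struct {
	lastRefresh        map[string]time.Time // MIG ref -> time of last successful fetch
	minRefreshWaitTime time.Duration        // 5s by default in the PR
}

// shouldRefresh reports whether the instances of the given MIG may be
// refetched from GCE, or whether cached data should be served instead.
func (c *migInstanceCache) shouldRefresh(migRef string) bool {
	last, ok := c.lastRefresh[migRef]
	return !ok || time.Since(last) >= c.minRefreshWaitTime
}

func main() {
	c := &migInstanceCache{
		lastRefresh:        map[string]time.Time{"mig-a": time.Now()},
		minRefreshWaitTime: 5 * time.Second,
	}
	fmt.Println(c.shouldRefresh("mig-a")) // false: fetched moments ago
	fmt.Println(c.shouldRefresh("mig-b")) // true: never fetched before
}
```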

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 6, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 6, 2023
jayantjain93 (Contributor)

/assign @jayantjain93

@@ -196,6 +196,8 @@ type AutoscalingOptions struct {
AWSUseStaticInstanceList bool
// ConcurrentGceRefreshes is the maximum number of concurrently refreshed instance groups or instance templates.
ConcurrentGceRefreshes int
// GCEMigInstancesMaxRefreshRate is the maximum rate at which GCE MIG instances from a given MIG can be refreshed.
GCEMigInstancesMaxRefreshRate time.Duration
Contributor:

Can we collect the multiple GCE options we have, such as GceExpanderEphemeralStorageSupport, into a single cloud-provider struct? Having multiple cloud-provider-specific flags in the common options isn't a good precedent.

Contributor Author:

makes sense, created a new struct

@@ -192,6 +192,7 @@ var (
balancingLabelsFlag = multiStringFlag("balancing-label", "Specifies a label to use for comparing if two node groups are similar, rather than the built in heuristics. Setting this flag disables all other comparison logic, and cannot be combined with --balancing-ignore-label.")
awsUseStaticInstanceList = flag.Bool("aws-use-static-instance-list", false, "Should CA fetch instance types in runtime or use a static list. AWS only")
concurrentGceRefreshes = flag.Int("gce-concurrent-refreshes", 1, "Maximum number of concurrent refreshes per cloud object type.")
gceMigInstancesMaxRefreshRate = flag.Duration("gce-mig-instances-max-refresh-rate", 5*time.Second, "Maximum rate at which GCE MIG instances from a given MIG can be refreshed.")
Contributor:

I would also collect cloud-provider-specific flags in a block with comments, e.g.:

randomFlag = flag.Bool()

// --- gce only flags --
gceMigInstancesMaxRefreshRate = flag.Bool()
gceExpanderEphemeralStorageSupport = flag.Bool()
// ---

Contributor Author:

done

gceClient: gceClient,
projectId: projectId,
concurrentGceRefreshes: concurrentGceRefreshes,
migInstancesMaxRefreshRate: migInstancesMaxRefreshRate,
Contributor:

I would rename this variable (including in other places) to something along the lines of migInstancesMinRefreshWaitTime, to signify that this is the amount of time you have to wait before refetching the MIG instances.

Contributor Author:

makes sense, changed

klog.V(4).Infof("Regenerating MIG instances cache for %s", migRef.String())
instances, err := c.gceClient.FetchMigInstances(migRef)
if err != nil {
c.migLister.HandleMigIssue(migRef, err)
return err
}
c.migInstancesLastRefreshedInfo[migRef.String()] = time.Now()
Contributor:

This only sets the refresh timestamp for a successful fetch that returns instances. Don't we also want to wait for the same timeout after unsuccessful fetches?

Maybe we don't, but can you add a short rationale for it?

Contributor Author:

Added a comment. I don't think we should cache unsuccessful results, as the underlying issues can be transient, so another call might make sense.

@olagacek olagacek force-pushed the master branch 2 times, most recently from 5f6a3c1 to 8d990e5, on April 6, 2023 16:52
@olagacek olagacek requested a review from jayantjain93 April 6, 2023 16:59
jayantjain93 (Contributor)

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 7, 2023
// GCESpecificOptions contain autoscaling options specific to GCE cloud provider.
type GCESpecificOptions struct {
// ConcurrentGceRefreshes is the maximum number of concurrently refreshed instance groups or instance templates.
ConcurrentGceRefreshes int
Contributor:

None of these variables should have Gce in their names, as they are already defined inside GCESpecificOptions.

Contributor Author:

makes sense, done.
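
For context, after dropping the Gce prefixes the struct plausibly ends up in the shape sketched below. The exact final field names are an assumption inferred from this thread, not copied from the merged code.

```go
// Assumed final shape of the GCE options struct; field names are
// inferred from the review comments above, not from the merged code.
package config

import "time"

// GCESpecificOptions contain autoscaling options specific to the GCE cloud provider.
type GCESpecificOptions struct {
	// ConcurrentRefreshes is the maximum number of concurrently
	// refreshed instance groups or instance templates.
	ConcurrentRefreshes int
	// MigInstancesMinRefreshWaitTime is the minimum time to wait
	// before the instances of a given MIG can be refetched.
	MigInstancesMinRefreshWaitTime time.Duration
}
```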

provider := NewCachingMigInfoProvider(cache, migLister, client, mig.GceRef().Project, 1, tc.refreshRateDuration)
_, err = provider.GetMigForInstance(instanceRef)
assert.NoError(t, err)
time.Sleep(tc.delayBetweenCalls)
Contributor:

Sleeps in unit tests are a bit fragile. Could you add a clock to the cachingMigInfoProvider and mock it in unit tests instead? That way we could avoid using sleep.

Contributor Author:

done
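
The clock-injection pattern suggested above typically looks something like the sketch below; the interface and names here are hypothetical, not the PR's actual types (Kubernetes code often uses k8s.io/utils/clock for the same purpose).

```go
// Hypothetical clock abstraction for testability; all names here are
// illustrative, not taken from the merged change.
package gce

import "time"

// clock abstracts time.Now so unit tests can control the flow of time.
type clock interface {
	Now() time.Time
}

// realClock is what production code would be constructed with.
type realClock struct{}

func (realClock) Now() time.Time { return time.Now() }

// fakeClock lets a test advance time explicitly instead of sleeping.
type fakeClock struct {
	now time.Time
}

func (f *fakeClock) Now() time.Time { return f.now }

// Advance moves the fake clock forward by d.
func (f *fakeClock) Advance(d time.Duration) { f.now = f.now.Add(d) }
```

With this in place, a test constructs the provider with a *fakeClock and replaces time.Sleep(tc.delayBetweenCalls) with fakeClock.Advance(tc.delayBetweenCalls), making the test deterministic and instant.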

BigDarkClown (Contributor)

/assign @BigDarkClown

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 17, 2023
@olagacek olagacek requested a review from BigDarkClown April 17, 2023 13:06
BigDarkClown (Contributor)

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 18, 2023
@@ -158,12 +175,20 @@ func (c *cachingMigInfoProvider) findMigWithMatchingBasename(instanceRef GceRef)
}

func (c *cachingMigInfoProvider) fillMigInstances(migRef GceRef) error {
if val, ok := c.migInstancesLastRefreshedInfo[migRef.String()]; ok {
// do not regenerate MIG instances cache if last refresh happened recently.
Member:

Maybe log info that we're serving stale data from cache?

Contributor Author:

done
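
Putting this thread together, the guard plus the requested log message plausibly reads like the sketch below. GceRef and the provider struct are simplified stand-ins here, and the exact log wording is an assumption.

```go
// Sketch of the final guard with the suggested log line; GceRef and the
// provider struct are simplified stand-ins, and the log wording is assumed.
package main

import (
	"time"

	"k8s.io/klog/v2"
)

// GceRef is a simplified stand-in for the real Cluster Autoscaler type.
type GceRef struct{ Project, Zone, Name string }

func (r GceRef) String() string { return r.Project + "/" + r.Zone + "/" + r.Name }

type cachingMigInfoProvider struct {
	migInstancesLastRefreshedInfo  map[string]time.Time
	migInstancesMinRefreshWaitTime time.Duration
}

func (c *cachingMigInfoProvider) fillMigInstances(migRef GceRef) error {
	if last, ok := c.migInstancesLastRefreshedInfo[migRef.String()]; ok {
		// Do not regenerate the MIG instances cache if the last refresh
		// happened recently; note in the log that cached data is served.
		if time.Since(last) < c.migInstancesMinRefreshWaitTime {
			klog.V(4).Infof("Not regenerating MIG instances cache for %s, as it was refreshed recently", migRef.String())
			return nil
		}
	}
	klog.V(4).Infof("Regenerating MIG instances cache for %s", migRef.String())
	// ... fetch instances from GCE here; on success, record the refresh time:
	c.migInstancesLastRefreshedInfo[migRef.String()] = time.Now()
	return nil
}

func main() {} // stand-alone sketch; no runtime behavior needed
```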

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 18, 2023
@olagacek olagacek requested a review from x13n April 18, 2023 14:27
x13n (Member) commented Apr 18, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 18, 2023
k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BigDarkClown, jayantjain93, olagacek, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2023
@k8s-ci-robot k8s-ci-robot merged commit 5c3f810 into kubernetes:master Apr 18, 2023