
[Azure VMs pool] Introducing agentpool client #6685

Conversation

@wenxuan0923 (Contributor) commented Apr 3, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

We will introduce a new vmType: vms to support autoscaling of single-instance VM pools. This new agentpool type will rely on the AKS resource provider (RP) for scaling up and down instead of a CRP call, which requires the agentpool client plus the cluster name and cluster resource group info.

In this PR, I'm adding the required info to the config and adding the agentpool client to azClient. For now, the client-construction error is ignored so that no existing behavior is impacted by this change.
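For illustration, the new config surface might look roughly like this (a sketch only; the field and tag names are assumptions based on the description above, except ARMBaseURL, which appears later in the diff):

// Sketch: field names are assumptions from the PR description, not
// necessarily the exact names used in the change.
type Config struct {
    // ... existing fields ...

    // ClusterName is the AKS cluster name, needed for agentpool API calls.
    ClusterName string `json:"clusterName" yaml:"clusterName"`
    // ClusterResourceGroup is the resource group of the managed cluster
    // resource itself (it can differ from the node resource group).
    ClusterResourceGroup string `json:"clusterResourceGroup" yaml:"clusterResourceGroup"`
    // ARMBaseURL optionally redirects agentpool calls to a different
    // endpoint, e.g. for AKS-managed cluster autoscaler.
    ARMBaseURL string `json:"armBaseURL" yaml:"armBaseURL"`
}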

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This is the first step of a series of changes coming to support the vms pool autoscaling.

Does this PR introduce a user-facing change?

No, this change should not impact any existing behavior


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


linux-foundation-easycla bot commented Apr 3, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 3, 2024
@k8s-ci-robot (Contributor)

Welcome @wenxuan0923!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

Hi @wenxuan0923. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 3, 2024
@k8s-ci-robot k8s-ci-robot added the area/provider/azure Issues or PRs related to azure provider label Apr 3, 2024
@wenxuan0923 wenxuan0923 force-pushed the wenx/update-azure-config branch 7 times, most recently from 6ee1b55 to e44f1a1, April 3, 2024 19:45
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 3, 2024
@Bryce-Soghigian (Member) left a comment


Left some initial review comments, mostly about the ARM options and better practices with the shared track 2 SDK.

Let me know what you think

Comment on lines 211 to 233
&policy.ClientOptions{
ClientOptions: azurecore_policy.ClientOptions{
Cloud: cloud.Configuration{
Services: map[cloud.ServiceName]cloud.ServiceConfiguration{
cloud.ResourceManager: {
Endpoint: baseURL,
Audience: "UNKNOWN",
},
},
},
Telemetry: azurecore_policy.TelemetryOptions{
ApplicationID: userAgent,
},
Retry: retryOptions,
},
})
Member

I love to see usage of the track 2 SDK! I would recommend the shared middleware that Karpenter, Fleet, and some other projects consume via https://github.com/Azure/azure-sdk-for-go-extensions

There are a couple of benefits.

  1. Armbalancer load-balances requests on the client side to prevent ARM instance-level throttling.
  2. There is a bug in Go's TCP implementation for HTTP/2 (which track 2 uses) where connections are not effectively closed; we get TCP timeout errors and the connection isn't closed for 17 minutes until it finally fails. Stephane patched this here.

We could keep going, but you get the point: we want this shared middleware used for track 2 implementations so that you don't hit the same bugs, and we can share all of our throttling lessons. We maintain one core library for this ARM middleware and all share the benefits.

https://github.com/Azure/azure-sdk-for-go-extensions/blob/main/pkg/middleware/arm_client_opts.go#L25 has the ARM client options I would recommend as the base; then you can override anything that doesn't work for you.
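For illustration, consuming that shared base might look like the sketch below (this assumes the extensions module exposes DefaultArmOpts(), as Karpenter consumes it; the wrapper function name is hypothetical, the armcontainerservice module version path may differ, and userAgent is the constant defined in this PR):

import (
    armopts "github.com/Azure/azure-sdk-for-go-extensions/pkg/middleware"
    "github.com/Azure/azure-sdk-for-go/sdk/azcore"
    "github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/containerservice/armcontainerservice/v4"
)

// newAgentPoolsClientFromSharedOpts starts from the shared ARM client options
// (retry, telemetry, throttling middleware) and overrides only what CAS needs.
func newAgentPoolsClientFromSharedOpts(subscriptionID string, cred azcore.TokenCredential) (*armcontainerservice.AgentPoolsClient, error) {
    opts := armopts.DefaultArmOpts()         // shared base
    opts.Telemetry.ApplicationID = userAgent // project-specific override
    return armcontainerservice.NewAgentPoolsClient(subscriptionID, cred, opts)
}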

Comment on lines +171 to +195
func getAgentpoolClientCredentials(cfg *Config) (azcore.TokenCredential, error) {
var cred azcore.TokenCredential
var err error
if cfg.AuthMethod == authMethodCLI {
cred, err = azidentity.NewAzureCLICredential(&azidentity.AzureCLICredentialOptions{
TenantID: cfg.TenantID})
if err != nil {
klog.Errorf("NewAzureCLICredential failed: %v", err)
return nil, err
}
} else if cfg.AuthMethod == "" || cfg.AuthMethod == authMethodPrincipal {
cred, err = azidentity.NewClientSecretCredential(cfg.TenantID, cfg.AADClientID, cfg.AADClientSecret, nil)
if err != nil {
klog.Errorf("NewClientSecretCredential failed: %v", err)
return nil, err
}
} else {
return nil, fmt.Errorf("unsupported authorization method: %s", cfg.AuthMethod)
}
return cred, nil
}

Member

This kind of code seems to be copy-pasted around quite a bit. I wonder if we can start to share some of it via the azure-sdk-for-go-extensions repo.

Karpenter has a very similar implementation of the same thing: essentially, we all want to authenticate with SP, MSI, and CLI credentials, so we could share this code.
https://github.com/Azure/karpenter-provider-azure/blob/main/pkg/auth/cred.go#L28
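A minimal sketch of the shared-helper idea (illustrative only, not the actual code in either repo; the "msi" branch is an assumption, anticipating the MSI discussion further down):

import (
    "fmt"

    "github.com/Azure/azure-sdk-for-go/sdk/azcore"
    "github.com/Azure/azure-sdk-for-go/sdk/azidentity"
)

// NewCredential returns a TokenCredential for the configured auth method:
// Azure CLI, service principal (client secret), or managed identity (MSI).
func NewCredential(authMethod, tenantID, clientID, clientSecret string) (azcore.TokenCredential, error) {
    switch authMethod {
    case "cli":
        return azidentity.NewAzureCLICredential(&azidentity.AzureCLICredentialOptions{TenantID: tenantID})
    case "", "principal":
        return azidentity.NewClientSecretCredential(tenantID, clientID, clientSecret, nil)
    case "msi": // hypothetical value; MSI support is deferred to a follow-up PR below
        return azidentity.NewManagedIdentityCredential(&azidentity.ManagedIdentityCredentialOptions{
            ID: azidentity.ClientID(clientID),
        })
    default:
        return nil, fmt.Errorf("unsupported authorization method: %s", authMethod)
    }
}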

Comment on lines 199 to 211
return azurecore_policy.RetryOptions{
MaxRetries: 5,
RetryDelay: 1 * time.Second,
MaxRetryDelay: 2 * time.Second,
TryTimeout: 10 * time.Second,
}
@Bryce-Soghigian (Member) commented Apr 3, 2024

With retries it would be great to get a clearer picture as to why we want to follow this pattern and what throttling concerns this avoids.

Could you leave some comments here explaining that pattern?

I am trying to understand the following: why did we choose a RetryDelay of 1s, and why is MaxRetries 5? How does this affect the subscription-level quota limits? When would we hit too-many-requests (429) throttling errors with this retry pattern?

If you have thorough learnings about retries and want to stick to this pattern because it's better than the defaults declared in DefaultRetryOptions in the shared azure-sdk-for-go-extensions, please tell us why, and contribute there so we can all benefit.
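For example, the kind of commented rationale being asked for could look like this (the values mirror the snippet above; the explanations are illustrative assumptions, not measured behavior):

func newRetryOptions() azurecore_policy.RetryOptions {
    return azurecore_policy.RetryOptions{
        // 5 attempts bounds worst-case latency of a scaling call while still
        // riding out short ARM blips; more retries mostly adds 429 pressure.
        MaxRetries: 5,
        // Start at 1s so transient failures retry quickly...
        RetryDelay: 1 * time.Second,
        // ...but cap backoff at 2s, since long waits only delay scaling decisions.
        MaxRetryDelay: 2 * time.Second,
        // Each individual attempt gets 10s before it is considered hung.
        TryTimeout: 10 * time.Second,
    }
}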

Comment on lines 148 to 169
// AgentPoolsClient interface defines the methods needed for scaling vms pool.
// it is implemented by track2 sdk armcontainerservice.AgentPoolsClient
type AgentPoolsClient interface {
Get(ctx context.Context,
resourceGroupName, resourceName, agentPoolName string,
options *armcontainerservice.AgentPoolsClientGetOptions) (
armcontainerservice.AgentPoolsClientGetResponse, error)
BeginCreateOrUpdate(
ctx context.Context,
resourceGroupName, resourceName, agentPoolName string,
parameters armcontainerservice.AgentPool,
options *armcontainerservice.AgentPoolsClientBeginCreateOrUpdateOptions) (
*runtime.Poller[armcontainerservice.AgentPoolsClientCreateOrUpdateResponse], error)
BeginDeleteMachines(
ctx context.Context,
resourceGroupName, resourceName, agentPoolName string,
machines armcontainerservice.AgentPoolDeleteMachinesParameter,
options *armcontainerservice.AgentPoolsClientBeginDeleteMachinesOptions) (
*runtime.Poller[armcontainerservice.AgentPoolsClientDeleteMachinesResponse], error)
}

const userAgent = "AKS-autoscaler"
Member

@gandhipr can maybe tell you where, but I think we already have a userAgent defined somewhere, and we should share the same user agent across all of CAS's clients.

I would have to go digging to find it, but for writing Kusto queries against CAS, or detecting when we are hitting throttling or quota limits, it would be good to have all CAS calls under a single user agent.

Comment on lines +148 to +173
// AgentPoolsClient interface defines the methods needed for scaling vms pool.
// it is implemented by track2 sdk armcontainerservice.AgentPoolsClient
type AgentPoolsClient interface {
Get(ctx context.Context,
resourceGroupName, resourceName, agentPoolName string,
options *armcontainerservice.AgentPoolsClientGetOptions) (
armcontainerservice.AgentPoolsClientGetResponse, error)
BeginCreateOrUpdate(
ctx context.Context,
resourceGroupName, resourceName, agentPoolName string,
parameters armcontainerservice.AgentPool,
options *armcontainerservice.AgentPoolsClientBeginCreateOrUpdateOptions) (
*runtime.Poller[armcontainerservice.AgentPoolsClientCreateOrUpdateResponse], error)
BeginDeleteMachines(
ctx context.Context,
resourceGroupName, resourceName, agentPoolName string,
machines armcontainerservice.AgentPoolDeleteMachinesParameter,
options *armcontainerservice.AgentPoolsClientBeginDeleteMachinesOptions) (
*runtime.Poller[armcontainerservice.AgentPoolsClientDeleteMachinesResponse], error)
}

Member

Testing against this interface is something I would like us to start thinking about; it is important for our current work to improve the testing scenarios for cluster autoscaler.

For Karpenter, we use envtest to simulate the Kubernetes environment, and we have fakes for all of the Azure APIs: https://github.com/Azure/karpenter-provider-azure/tree/main/pkg/fake

I want us to start thinking about the best way to fake the agentpools client. It doesn't have to be solved in this PR, but it's good to start thinking about.
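A minimal sketch of such a fake, assuming it only needs to satisfy the AgentPoolsClient interface quoted above (the Begin* methods are stubbed out, since faking runtime.Poller values is the harder design question; the module version path may differ):

import (
    "context"
    "errors"
    "fmt"

    "github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime"
    "github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/containerservice/armcontainerservice/v4"
)

// fakeAgentPoolsClient is a hand-rolled test double for the AgentPoolsClient
// interface. Only Get is backed by state; the polling methods error out until
// a fuller fake is designed.
type fakeAgentPoolsClient struct {
    pools map[string]armcontainerservice.AgentPool // keyed by agent pool name
}

func (f *fakeAgentPoolsClient) Get(ctx context.Context,
    resourceGroupName, resourceName, agentPoolName string,
    options *armcontainerservice.AgentPoolsClientGetOptions) (
    armcontainerservice.AgentPoolsClientGetResponse, error) {
    ap, ok := f.pools[agentPoolName]
    if !ok {
        return armcontainerservice.AgentPoolsClientGetResponse{},
            fmt.Errorf("agent pool %q not found", agentPoolName)
    }
    return armcontainerservice.AgentPoolsClientGetResponse{AgentPool: ap}, nil
}

func (f *fakeAgentPoolsClient) BeginCreateOrUpdate(ctx context.Context,
    resourceGroupName, resourceName, agentPoolName string,
    parameters armcontainerservice.AgentPool,
    options *armcontainerservice.AgentPoolsClientBeginCreateOrUpdateOptions) (
    *runtime.Poller[armcontainerservice.AgentPoolsClientCreateOrUpdateResponse], error) {
    return nil, errors.New("BeginCreateOrUpdate not implemented in fake")
}

func (f *fakeAgentPoolsClient) BeginDeleteMachines(ctx context.Context,
    resourceGroupName, resourceName, agentPoolName string,
    machines armcontainerservice.AgentPoolDeleteMachinesParameter,
    options *armcontainerservice.AgentPoolsClientBeginDeleteMachinesOptions) (
    *runtime.Poller[armcontainerservice.AgentPoolsClientDeleteMachinesResponse], error) {
    return nil, errors.New("BeginDeleteMachines not implemented in fake")
}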


func newAgentpoolClientWithEndpoint(baseURL, subscriptionID string, retryOptions azurecore_policy.RetryOptions) (AgentPoolsClient, error) {
// for AKS managed CAS, we will use the ARMBaseURL to create a fake agent pool client
// so that the request will be sent to node provioner endpoint intead of the public ARM endpoint
Member

Suggested change
// so that the request will be sent to node provioner endpoint intead of the public ARM endpoint
// so that the request will be sent to node provisioner endpoint instead of the public ARM endpoint


func newAgentpoolClientWithEndpoint(baseURL, subscriptionID string, retryOptions azurecore_policy.RetryOptions) (AgentPoolsClient, error) {
// for AKS managed CAS, we will use the ARMBaseURL to create a fake agent pool client
// so that the request will be sent to node provioner endpoint intead of the public ARM endpoint
Member

Is it OK to mention the node provisioner publicly?

Comment on lines 259 to 275
agentPoolsClient, err := armcontainerservice.NewAgentPoolsClient(cfg.SubscriptionID, cred,
&policy.ClientOptions{
ClientOptions: azurecore_policy.ClientOptions{
Cloud: cloud.Configuration{
Services: map[cloud.ServiceName]cloud.ServiceConfiguration{
cloud.ResourceManager: {
Endpoint: env.ResourceManagerEndpoint,
Audience: env.TokenAudience,
},
},
},
Telemetry: azurecore_policy.TelemetryOptions{
ApplicationID: userAgent,
},
Retry: retryOptions,
},
})
Member

We are initializing the client twice here; can we share one function for initialization and just pass in the right policy.ClientOptions? Also, again, it would be good to use the shared client options as the base so that you don't encounter all the TCP bugs, throttling, etc. we encountered.
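For illustration, a single shared constructor could take the endpoint and audience as parameters. The final revision does add a helper along these lines (newAgentpoolClientWithConfig, called further down); this body is a sketch reconstructed from the two quoted call sites, not the exact merged code:

func newAgentpoolClientWithConfig(subscriptionID string, cred azcore.TokenCredential,
    endpoint, audience string, retryOptions azurecore_policy.RetryOptions) (AgentPoolsClient, error) {
    return armcontainerservice.NewAgentPoolsClient(subscriptionID, cred,
        &policy.ClientOptions{
            ClientOptions: azurecore_policy.ClientOptions{
                Cloud: cloud.Configuration{
                    Services: map[cloud.ServiceName]cloud.ServiceConfiguration{
                        cloud.ResourceManager: {
                            Endpoint: endpoint,
                            Audience: audience,
                        },
                    },
                },
                Telemetry: azurecore_policy.TelemetryOptions{ApplicationID: userAgent},
                Retry:     retryOptions,
            },
        })
}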

@@ -5,6 +5,10 @@ go 1.21
require (
cloud.google.com/go/compute/metadata v0.2.3
github.com/Azure/azure-sdk-for-go v68.0.0+incompatible
github.com/Azure/azure-sdk-for-go-extensions v0.1.3
Member

github.com/Azure/azure-sdk-for-go-extensions v0.1.6

Try v0.1.6; it's the one we use in Karpenter, and we don't see that telemetry error you are getting.

}
}

return newAgentpoolClientWithConfig(cfg.SubscriptionID, cred, env.ResourceManagerEndpoint, env.TokenAudience, retryOptions)
Member

Wonderful!

@Bryce-Soghigian (Member)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 4, 2024
@wenxuan0923 (Contributor, Author)

/retest


if cfg.ARMBaseURL != "" {
klog.V(10).Infof("Using ARMBaseURL to create agent pool client")
return newAgentpoolClientWithConfig(cfg.SubscriptionID, nil, cfg.ARMBaseURL, "UNKNOWN", retryOptions)


Do we need to check cfg.SubscriptionID? Is it possible that it is empty by mistake?


Why is tokenCredential nil here?

@@ -287,5 +417,6 @@ func newAzClient(cfg *Config, env *azure.Environment) (*azClient, error) {
virtualMachinesClient: virtualMachinesClient,
storageAccountsClient: storageAccountsClient,
skuClient: skuClient,
agentPoolClient: agentPoolClient,


Based on L408, it is possible that agentPoolClient is nil. Be sure to check for nil when referring to it in later code.
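For instance, a later call site would need a guard along these lines (illustrative only):

// agentPoolClient may be nil when construction failed and the error was
// deliberately ignored, so vms-pool code paths must check before use.
if az.agentPoolClient == nil {
    return fmt.Errorf("agent pool client is not initialized; vms pool operations are unavailable")
}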

@gandhipr (Contributor) commented Apr 12, 2024

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 12, 2024
Comment on lines +153 to +155
// AgentPoolsClient interface defines the methods needed for scaling vms pool.
// it is implemented by track2 sdk armcontainerservice.AgentPoolsClient
type AgentPoolsClient interface {
Contributor

There is an implicit design choice here: implementing a new client rather than using (and maybe extending) the clients in cloud-provider-azure, which are used throughout the rest of the CAS Azure provider (it looks like there is an MC client, but not an AP one). What are the tradeoffs, and which ones push us in this direction?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 9, 2024
@tallaxes (Contributor)

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 28, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2024
@wenxuan0923 (Contributor, Author)

/retest

@tallaxes (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 13, 2024
@tallaxes (Contributor)

/approve

@tallaxes (Contributor)

/assign @feiskyer

klog.Errorf("NewClientSecretCredential failed: %v", err)
return nil, err
}
} else {
Member

No MSI support here?

Member

I think we could get this PR in first and add MSI support in a follow-up PR.

@feiskyer (Member) left a comment

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Bryce-Soghigian, feiskyer, gandhipr, tallaxes, wenxuan0923

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 14, 2024
@k8s-ci-robot k8s-ci-robot merged commit b36d4fb into kubernetes:cluster-autoscaler-release-1.29 Jun 14, 2024
7 checks passed