Disable Statefulsets provisioning from CL2 Load Tests #16172
Conversation
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
1 similar comment
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2
2 similar comments
/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2
/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2
Force-pushed from 16b65b7 to 1973f63
/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
/approve
/assign @hakman
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
3 similar comments
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
Some test failures are attributed to this issue - kubernetes/test-infra#31459
/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
Force-pushed from 3e55466 to 52ec8b9
Force-pushed from 52ec8b9 to a82f9b3
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dims, hakman
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/hold cancel
As expected, disabling StatefulSets and restart checks on EBS pods helped the recent tests succeed; they were able to get through the kOps validation step.
@hakuna-matatah There is no flag to ignore EBS pod health during the validation phase. The check is generic, based on the existence of
Thanks @hakman for your response. Have you added these changes to see if they help with the EBS failures we are seeing? Do you think it's actually bottlenecked by API server interaction due to the scale? Currently, I don't think we are snapshotting Prometheus metrics in CL2; if we had that, we could see if the API server is throwing
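If CL2 did snapshot apiserver Prometheus metrics, queries along these lines could show whether the API server is rejecting requests under load (a sketch only; the exact label sets depend on the Kubernetes version, and the 5m window is an arbitrary choice):

```promql
# Rate of requests the apiserver answered with HTTP 429 (throttled), by resource and verb
sum(rate(apiserver_request_total{code="429"}[5m])) by (resource, verb)

# Requests rejected by API Priority and Fairness, by flow schema and reason
sum(rate(apiserver_flowcontrol_rejected_requests_total[5m])) by (flow_schema, reason)
```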
@torredil is there any env var that you can think of?
The root cause of the kOps validation timeout failures is primarily Auto Scaling Group (ASG) throttling. As an example, let's take a look at
TPS and ASG throttling from this account in us-east-2: I've requested the quota increases; let's follow up internally. For context, this issue is related to kubernetes/k8s.io#6165.
The DescribeAutoScalingInstances call is used to determine whether an instance is in its ASG's warm pool and whether or not we should enable the kubelet service. I noticed this is available in instance metadata, so I'm migrating to use that in #16213. This should eliminate the DescribeAutoScalingInstances call volume in the scale tests. We'll see how much it helps with the time it takes for nodes to join the cluster.
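For illustration, a minimal sketch of what reading the lifecycle state from instance metadata (rather than calling DescribeAutoScalingInstances) could look like. This is not the actual code from #16213; the IMDSv2 token handling and the `autoscaling/target-lifecycle-state` path are based on the EC2 instance metadata service, and the state strings shown are assumptions:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

const imdsBase = "http://169.254.169.254/latest"

// imdsGet fetches a metadata path using an IMDSv2 session token.
func imdsGet(client *http.Client, path string) (string, error) {
	// Request a short-lived IMDSv2 token.
	req, _ := http.NewRequest(http.MethodPut, imdsBase+"/api/token", nil)
	req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", "60")
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	token, _ := io.ReadAll(resp.Body)
	resp.Body.Close()

	// Use the token to read the requested metadata path.
	req, _ = http.NewRequest(http.MethodGet, imdsBase+"/meta-data/"+path, nil)
	req.Header.Set("X-aws-ec2-metadata-token", string(token))
	resp, err = client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return strings.TrimSpace(string(body)), err
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	// "InService" means the instance is not sitting in a warm pool;
	// warm-pool instances report states such as "Warmed:Stopped".
	state, err := imdsGet(client, "autoscaling/target-lifecycle-state")
	if err != nil {
		fmt.Println("could not read lifecycle state:", err)
		return
	}
	if state == "InService" {
		fmt.Println("instance is in service; enable the kubelet service")
	} else {
		fmt.Printf("instance lifecycle state %q; skip enabling kubelet\n", state)
	}
}
```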
@hakman @rifelpet Thanks for optimizing the ASG calls ^^^. That will definitely help with throttling.
@torredil In the example you posted above, the EBS pod and AWS node had already come up at
Do you think it would be good to have an example timeline of operations like this for `aws-node/ebs-pod/node` that didn't come up ready until the kOps validation time ran out, so we can understand whether this is actually the underlying root of the issue? Am I missing something here?
One more piece of the puzzle: #16216.
FYI, kops has experimental OpenTelemetry support, though it only supports tracing in the kops CLI, not the k8s control plane or any node components. With this we can visualize tracing from prow jobs by downloading the otel files from the job artifacts and running a local jaeger-query server. See #16220 for more details. I tried visualizing the scale test's otel files in it, but it has crashed every browser I've tried :/ Maybe someone else will have better luck. Eventually we can add support for dumping traces from other components too.
@hakuna-matatah I think the next error we need to look at here is the following, which is causing a significant delay in node registration:
As an example, let's look at
Liveness probes succeed shortly after this process. cc: @hakman
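For context, the liveness checks being discussed are the ones on the EBS CSI node pods. A typical DaemonSet wires them up roughly as below; this is an illustrative excerpt with assumed image versions and port numbers, not the exact manifest used in these tests, but it shows what the "livenessprobe check failures" refer to mechanically:

```yaml
# Illustrative excerpt of an ebs-csi-node DaemonSet pod spec (versions/ports assumed)
containers:
  - name: ebs-plugin
    image: public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.26.0
    ports:
      - name: healthz
        containerPort: 9808
    livenessProbe:
      httpGet:
        path: /healthz
        port: healthz
      initialDelaySeconds: 10
      failureThreshold: 5
  - name: liveness-probe
    # Sidecar that calls the CSI Probe RPC on the plugin socket and serves /healthz
    image: registry.k8s.io/sig-storage/livenessprobe:v2.10.0
    args:
      - --csi-address=/csi/csi.sock
```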
I added the
Observations:
@torredil any ideas?
@hakman Excellent, thanks for swiftly taking care of that! I think we're on the right track with the CCM observation. For every instance of CSI node container liveness probe check failures I've observed, the checks start succeeding after CCM successfully initializes the node with the cloud provider and adds node labels, such that KCM is able to retrieve the zone information.
As an example, let's take a look at i-002f08e03b0d03709, where the CSI node pod comes up shortly after
Unfortunately, the aws-cloud-controller-manager.log starts at
Notice the absence of
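To make the CCM dependency concrete: until the cloud controller manager initializes a node, the node carries the cloud-provider "uninitialized" taint and has no topology labels, so zone lookups fail. A rough way to see this on a live cluster (illustrative commands; the node name and zone values are placeholders):

```sh
# Before CCM processes the node: the uninitialized taint is present and no zone label exists
kubectl get node <node-name> -o jsonpath='{.spec.taints}'
# [{"effect":"NoSchedule","key":"node.cloudprovider.kubernetes.io/uninitialized","value":"true"}]

# After CCM initializes the node: the taint is removed and topology labels appear
kubectl get node <node-name> --show-labels | grep topology.kubernetes.io/zone
# ...,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a,...
```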
@hakman We can leverage this flag from CCM to improve the node syncs on the kops side for CCM. This should make it faster, as it will spin up more workers instead of one; we can set it to
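For reference, upstream cloud-controller-manager exposes a concurrency setting for the node controller. A sketch of what raising it could look like; the flag name and the value 10 are assumptions on my part, since the comment above doesn't spell them out, and how kops wires it through is up to #16228:

```sh
# aws-cloud-controller-manager flags (illustrative value; the default is a single worker)
--concurrent-node-syncs=10
```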
Let's see if #16228 will help. 😀
Looks like it helped, but we still have ways to go: aws-cloud-controller-manager.log
how about setting