Disable Statefulsets provisioning from CL2 Load Tests #16172
Conversation
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
1 similar comment
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2
2 similar comments
/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2
/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2
Force-pushed from 16b65b7 to 1973f63
/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
/approve
/assign @hakman
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
3 similar comments
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
Some test failures are attributed to this issue - kubernetes/test-infra#31459
/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2
/test presubmit-kops-aws-scale-amazonvpc-using-cl2
Force-pushed from 3e55466 to 52ec8b9
Force-pushed from 52ec8b9 to a82f9b3
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dims, hakman
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/hold cancel
As expected, disabling StatefulSets and restart checks on EBS pods helped the recent tests succeed; they were able to get through the kOps validation step.
@hakuna-matatah There is no flag to ignore EBS pod health during the validation phase. The check is generic, based on the existence of
Thanks @hakman for your response. Have you added these changes to see if they help with the EBS failures we are seeing? Do you think it's actually bottlenecked by API server interaction due to the scale? Currently, I don't think we are snapshotting Prometheus metrics in CL2; if we had that, we could see if the API server is throwing
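If CL2 did snapshot apiserver Prometheus metrics, queries along these lines could show whether the API server is rejecting requests under load (a sketch only; the exact label sets depend on the Kubernetes version, and the 5m window is an arbitrary choice):

```promql
# Rate of requests the apiserver answered with HTTP 429 (throttled), by resource and verb
sum(rate(apiserver_request_total{code="429"}[5m])) by (resource, verb)

# Requests rejected by API Priority and Fairness, by flow schema and reason
sum(rate(apiserver_flowcontrol_rejected_requests_total[5m])) by (flow_schema, reason)
```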
@torredil is there any env var that you can think of?
The root cause of the kOps validation timeout failures is primarily Auto Scaling Group (ASG) throttling. As an example, let's take a look at
TPS and ASG throttling from this account in us-east-2: I've requested the quota increases; let's follow up internally. For context, this issue is related to kubernetes/k8s.io#6165.
The DescribeAutoScalingInstances call is used to determine whether an instance is in its ASG's warm pool and whether or not we should enable the kubelet service. I noticed this is available in instance metadata, so I'm migrating to use that in #16213. This should eliminate the DescribeAutoScalingInstances call volume in the scale tests. We'll see how much it helps with the time it takes for nodes to join the cluster.
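For illustration, a minimal sketch of what reading the lifecycle state from instance metadata (rather than calling DescribeAutoScalingInstances) could look like. This is not the actual code from #16213; the IMDSv2 token handling and the `autoscaling/target-lifecycle-state` path are based on the EC2 instance metadata service, and the state strings shown are assumptions:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

const imdsBase = "http://169.254.169.254/latest"

// imdsGet fetches a metadata path using an IMDSv2 session token.
func imdsGet(client *http.Client, path string) (string, error) {
	// Request a short-lived IMDSv2 token.
	req, _ := http.NewRequest(http.MethodPut, imdsBase+"/api/token", nil)
	req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", "60")
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	token, _ := io.ReadAll(resp.Body)
	resp.Body.Close()

	// Use the token to read the requested metadata path.
	req, _ = http.NewRequest(http.MethodGet, imdsBase+"/meta-data/"+path, nil)
	req.Header.Set("X-aws-ec2-metadata-token", string(token))
	resp, err = client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return strings.TrimSpace(string(body)), err
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	// "InService" means the instance is not sitting in a warm pool;
	// warm-pool instances report states such as "Warmed:Stopped".
	state, err := imdsGet(client, "autoscaling/target-lifecycle-state")
	if err != nil {
		fmt.Println("could not read lifecycle state:", err)
		return
	}
	if state == "InService" {
		fmt.Println("instance is in service; enable the kubelet service")
	} else {
		fmt.Printf("instance lifecycle state %q; skip enabling kubelet\n", state)
	}
}
```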
@hakman @rifelpet Thanks for optimizing the ASG calls ^^^. That will definitely help with throttling.
@torredil In the example you posted above, the EBS pod and AWS node had already come up at
Do you think it would be good to have an example timeline of operations like this for `aws-node/ebs-pod/node` that didn't come up ready until the kOps validation time ran out, so we can understand whether this is actually the underlying root of the issue? Am I missing something here?
One more piece of the puzzle: #16216.
FYI, kops has experimental OpenTelemetry support, though it only supports tracing in the kops CLI, not the k8s control plane or any node components. With this we can visualize tracing from prow jobs by downloading the otel files from the job artifacts and running a local jaeger-query server. See #16220 for more details. I tried visualizing the scale test's otel files in it, but it has crashed every browser I've tried :/ Maybe someone else will have better luck. Eventually we can add support for dumping traces from other components too.
@hakuna-matatah I think the next error we need to look at here is the following, which is causing a significant delay in node registration:
As an example, let's look at
Liveness probes succeed shortly after this process. cc: @hakman
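For context, the liveness checks being discussed are the ones on the EBS CSI node pods. A typical DaemonSet wires them up roughly as below; this is an illustrative excerpt with assumed image versions and port numbers, not the exact manifest used in these tests, but it shows what the "livenessprobe check failures" refer to mechanically:

```yaml
# Illustrative excerpt of an ebs-csi-node DaemonSet pod spec (versions/ports assumed)
containers:
  - name: ebs-plugin
    image: public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.26.0
    ports:
      - name: healthz
        containerPort: 9808
    livenessProbe:
      httpGet:
        path: /healthz
        port: healthz
      initialDelaySeconds: 10
      failureThreshold: 5
  - name: liveness-probe
    # Sidecar that calls the CSI Probe RPC on the plugin socket and serves /healthz
    image: registry.k8s.io/sig-storage/livenessprobe:v2.10.0
    args:
      - --csi-address=/csi/csi.sock
```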
I added the
Observations:
@torredil any ideas?
@hakman Excellent, thanks for swiftly taking care of that! I think we're on the right track with the CCM observation. For every instance of CSI node container liveness probe check failures I've observed, the checks start succeeding after CCM successfully initializes the node with the cloud provider and adds node labels, such that KCM is able to retrieve the zone information.
As an example, let's take a look at i-002f08e03b0d03709, where the CSI node pod comes up shortly after
Unfortunately, the aws-cloud-controller-manager.log starts at
Notice the absence of
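To make the CCM dependency concrete: until the cloud controller manager initializes a node, the node carries the cloud-provider "uninitialized" taint and has no topology labels, so zone lookups fail. A rough way to see this on a live cluster (illustrative commands; the node name and zone values are placeholders):

```sh
# Before CCM processes the node: the uninitialized taint is present and no zone label exists
kubectl get node <node-name> -o jsonpath='{.spec.taints}'
# [{"effect":"NoSchedule","key":"node.cloudprovider.kubernetes.io/uninitialized","value":"true"}]

# After CCM initializes the node: the taint is removed and topology labels appear
kubectl get node <node-name> --show-labels | grep topology.kubernetes.io/zone
# ...,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a,...
```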
@hakman We can leverage this flag from CCM to improve the node syncs on the kops side for CCM. This should make it faster, as it will spin up more workers instead of one; we can set it to
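For reference, upstream cloud-controller-manager exposes a concurrency setting for the node controller. A sketch of what raising it could look like; the flag name and the value 10 are assumptions on my part, since the comment above doesn't spell them out, and how kops wires it through is up to #16228:

```sh
# aws-cloud-controller-manager flags (illustrative value; the default is a single worker)
--concurrent-node-syncs=10
```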
Let's see if #16228 will help. 😀
Looks like it helped, but we still have ways to go: aws-cloud-controller-manager.log
how about setting