arm64 instances not having enough power to run robustness test #17246
Comments
I've seen this historically as well. It happens every Monday when the bulk of dependabot PRs are raised. Basically it seems to be CPU contention on the shared build infrastructure. Under normal conditions everything is fine, but when the entire build queue is consumed by running jobs, the QPS of the robustness tests suffers. @alexellis do you have any suggestions on how we address this? The jobs have enough CPU allocated; we just hit contention or some other overall slowdown on the actuated side when lots of jobs run at the same time.
I haven't observed any time correlation, could you provide some examples? If it really is the case, could we consider spreading our jobs throughout the day?
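If we did want to spread the load, one option would be to stagger dependabot's own schedule so the weekly batch of PRs doesn't all open on Monday morning. Below is a minimal sketch of a `.github/dependabot.yml`; the ecosystems, days, and times are purely illustrative and not etcd's actual configuration.

```yaml
# Illustrative sketch only -- not etcd's actual dependabot configuration.
# Staggering the schedules spreads the resulting PRs (and their robustness
# test runs) across the week instead of raising them all at once.
version: 2
updates:
  - package-ecosystem: "gomod"
    directory: "/"
    schedule:
      interval: "weekly"
      day: "monday"
      time: "06:00"
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
      day: "wednesday"
      time: "06:00"
```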
Example from yesterday, when the bulk of dependabot PRs opened at the same time (each one runs its own mini robustness test on an actuated arm64 runner). If we check the logs for some of these we see:
Example from last week, when all dependabot PRs opened at the same time. If we check the logs for some of these we see:
Given these jobs perform nicely when the runner boxes are not under a lot of load, my theory is that we simply have CPU contention. It would be good to get some input from folks at actuated.dev on what we should be doing as a managed-service consumer to avoid ending up in this situation. We can't control when flurries of PRs open at the same time, and our build infra should be able to scale to cope with that within reason, in my opinion. cc @alexellis
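One mitigation we could apply on our side (just a sketch, not something that is in place) is a GitHub Actions concurrency group on the robustness workflow, so a flurry of dependabot PRs queues up rather than hammering the shared arm64 runners simultaneously. The workflow name, runner label, and make target below are hypothetical.

```yaml
# Hypothetical workflow snippet -- the workflow name, runner label, and
# make target are illustrative, not taken from etcd's actual workflows.
name: robustness-arm64
on: pull_request

# A single repo-wide concurrency group means at most one robustness run
# executes at a time. GitHub keeps at most one additional run pending per
# group, so older pending runs are superseded rather than queued forever.
concurrency:
  group: robustness-arm64
  cancel-in-progress: false

jobs:
  robustness:
    runs-on: [self-hosted, actuated-arm64]  # assumed runner label
    steps:
      - uses: actions/checkout@v4
      - name: Run robustness tests
        run: make test-robustness  # assumed target name
```

The tradeoff is that superseded runs would need a manual re-run; scoping the group by `${{ github.ref }}` instead would only deduplicate runs within a single PR and would not help with cross-PR contention.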
After #17323, arm64 tests have been stable.
Hi folks, glad you have this resolved. We provide support via Slack and I am unable to monitor all my GitHub notifications actively due to the abundance of them. If you run into issues please have someone reach out and we can look at the server and the load during that time. We've not noticed anything out of the ordinary, but can look more closely if/when needed.
Which GitHub workflows are flaking?
Robustness test
Which tests are flaking?
TestRobustness/Etcd/LowTraffic/ClusterOfSize1/LazyFS
GitHub Action link
https://github.com/etcd-io/etcd/actions/runs/7526852823/job/20485816102
Reason for failure (if possible)
Failures in 2 recent tests:
Anything else we need to know?
Also https://github.com/etcd-io/etcd/actions/runs/7500305622/job/20418722131
The two flakes happened within 3 days.