arm64 instances not having enough power to run robustness test #17246

Closed
serathius opened this issue Jan 15, 2024 · 5 comments

@serathius
Member

serathius commented Jan 15, 2024

Which github workflows are flaking?

Robustness test

Which tests are flaking?

TestRobustness/Etcd/LowTraffic/ClusterOfSize1/LazyFS

Github Action link

https://github.com/etcd-io/etcd/actions/runs/7526852823/job/20485816102

Reason for failure (if possible)

Failures in 2 recent tests:

2024-01-15T10:55:41.9635115Z     traffic.go:105: Requiring minimal 100.000000 qps for test results to be reliable, got 76.322871 qps

Anything else we need to know?

Also https://github.com/etcd-io/etcd/actions/runs/7500305622/job/20418722131

The two flakes happened within 3 days.
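
For reference, the message above comes from a minimum-QPS guard in the robustness traffic checks (traffic.go:105): if the observed request rate falls below a floor, the run is rejected as unreliable. Below is a minimal sketch of that kind of guard, using illustrative names rather than the actual framework code:

```go
// Illustrative sketch of a minimum-QPS guard similar to the one that produces
// the "Requiring minimal ... qps" failure. Names are hypothetical, not the
// real etcd robustness framework API.
package main

import (
	"fmt"
	"time"
)

// checkMinimalQPS fails the run when the observed request rate is too low for
// the test results to be considered reliable.
func checkMinimalQPS(minimalQPS float64, requestCount int, duration time.Duration) error {
	qps := float64(requestCount) / duration.Seconds()
	if qps < minimalQPS {
		return fmt.Errorf("Requiring minimal %f qps for test results to be reliable, got %f qps", minimalQPS, qps)
	}
	return nil
}

func main() {
	// Example: 7632 requests over 100 seconds -> about 76.3 qps, below the 100 qps floor.
	if err := checkMinimalQPS(100, 7632, 100*time.Second); err != nil {
		fmt.Println("traffic check failed:", err)
	}
}
```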

@jmhbnz
Member

jmhbnz commented Jan 15, 2024

I've seen this historically as well. It happens every Monday when the bulk of the dependabot PRs are raised.

Basically it seems to be CPU contention on the shared build infrastructure. Under normal conditions everything is fine, but when the entire build queue is consumed with running jobs, the QPS of the robustness tests suffers.

@alexellis do you have any suggestions on how we can address this? The jobs have enough CPU allocated; we just see contention or some other overall slowdown on the actuated side when lots of jobs are running at the same time.

@serathius
Member Author

I haven't observed any time correlation; could you provide some examples?

If that is really the case, could we consider spreading our jobs throughout the day? A rough sketch of one way to do that follows below.
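
Purely illustrative (not an existing etcd or actuated mechanism): a deterministic per-PR delay would spread simultaneous dependabot runs across a window instead of starting them all at once:

```go
// Illustrative only: derive a stable start delay from the PR number so that a
// burst of dependabot PRs does not launch all robustness jobs simultaneously.
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// staggerDelay maps an identifier (e.g. the PR number) to a stable offset
// within the given window, at one-second granularity.
func staggerDelay(id string, window time.Duration) time.Duration {
	h := fnv.New32a()
	h.Write([]byte(id))
	seconds := h.Sum32() % uint32(window/time.Second)
	return time.Duration(seconds) * time.Second
}

func main() {
	// Hypothetical PR numbers; each would wait a different amount of time
	// within a 30-minute window before starting its job.
	for _, pr := range []string{"17240", "17241", "17242"} {
		fmt.Printf("PR %s would wait %s before starting\n", pr, staggerDelay(pr, 30*time.Minute))
	}
}
```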

@jmhbnz
Member

jmhbnz commented Jan 16, 2024

Example from yesterday, when the bulk of the dependabot PRs opened at the same time (each one runs its own mini robustness tests on an actuated arm64 runner):

[Screenshot: queue of dependabot PRs opened at the same time, each running robustness jobs]

If we check the logs for some of these we see:

https://github.com/etcd-io/etcd/actions/runs/7532121712/job/20502126721
traffic.go:105: Requiring minimal 100.000000 qps for test results to be reliable, got 66.367248 qps

https://github.com/etcd-io/etcd/actions/runs/7526852823/job/20485816102
traffic.go:105: Requiring minimal 100.000000 qps for test results to be reliable, got 76.322871 qps

https://github.com/etcd-io/etcd/actions/runs/7532088744/job/20502027732
traffic.go:105: Requiring minimal 100.000000 qps for test results to be reliable, got 57.351455 qps

Example from last week, when all the dependabot PRs opened at the same time:

[Screenshot: previous week's batch of dependabot PRs opened at the same time]

If we check the logs for some of these we see:

https://github.com/etcd-io/etcd/actions/runs/7451288068/job/20272077094
traffic.go:105: Requiring minimal 100.000000 qps for test results to be reliable, got 90.190679 qps

https://github.com/etcd-io/etcd/actions/runs/7451286461/job/20272067521
traffic.go:105: Requiring minimal 100.000000 qps for test results to be reliable, got 79.069443 qps

Given that these jobs perform nicely when the runner boxes are not under a lot of load, my theory is that we simply have CPU contention.

It would be good to get some input from folks at actuated.dev on what we should be doing as a managed-service consumer to avoid ending up in this situation. We can't control when flurries of PRs will open at the same time, and our build infra should be able to scale to cope with that within reason, IMO.
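
One way to confirm (or rule out) the contention theory would be to sample the kernel's CPU pressure (PSI) on the runner while a job executes. A rough diagnostic sketch, assuming a Linux runner with PSI enabled (/proc/pressure/cpu, kernel 4.20+); this is not part of the etcd test suite:

```go
// Diagnostic sketch: periodically sample Linux PSI CPU pressure to see
// whether tasks are stalling on CPU during a robustness run.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// cpuPressureAvg10 returns the "some avg10" value from /proc/pressure/cpu,
// i.e. the share of the last 10 seconds in which at least one task was
// stalled waiting for CPU.
func cpuPressureAvg10() (float64, error) {
	data, err := os.ReadFile("/proc/pressure/cpu")
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		if !strings.HasPrefix(line, "some") {
			continue
		}
		for _, field := range strings.Fields(line) {
			if strings.HasPrefix(field, "avg10=") {
				return strconv.ParseFloat(strings.TrimPrefix(field, "avg10="), 64)
			}
		}
	}
	return 0, fmt.Errorf("avg10 not found in /proc/pressure/cpu")
}

func main() {
	for i := 0; i < 5; i++ {
		if v, err := cpuPressureAvg10(); err == nil {
			fmt.Printf("cpu pressure (some, avg10): %.2f%%\n", v)
		} else {
			fmt.Println("psi unavailable:", err)
		}
		time.Sleep(10 * time.Second)
	}
}
```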

cc @alexellis

@serathius
Member Author

After #17323, the arm64 tests have been stable.

[Screenshot: recent arm64 robustness workflow runs passing]

@alexellis
Contributor

Hi folks, glad you have this resolved.

We provide support via Slack, and I am unable to actively monitor all of my GitHub notifications due to the volume of them.

If you run into issues, please have someone reach out and we can look at the server and the load during that time.

We've not noticed anything out of the ordinary, but can look more closely if / when needed.
