arm64 instances not having enough power to run robustness test #17246

Closed
serathius opened this issue Jan 15, 2024 · 5 comments

@serathius
Member

serathius commented Jan 15, 2024

Which github workflows are flaking?

Robustness test

Which tests are flaking?

TestRobustness/Etcd/LowTraffic/ClusterOfSize1/LazyFS

Github Action link

https://github.com/etcd-io/etcd/actions/runs/7526852823/job/20485816102

Reason for failure (if possible)

Failures in 2 recent tests:

2024-01-15T10:55:41.9635115Z     traffic.go:105: Requiring minimal 100.000000 qps for test results to be reliable, got 76.322871 qps

Anything else we need to know?

Also https://github.com/etcd-io/etcd/actions/runs/7500305622/job/20418722131

The two flakes happened within 3 days.
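
For reference, the message above comes from a minimum-QPS guard in the robustness traffic checks (traffic.go:105): if the observed request rate falls below a floor, the run is rejected as unreliable. Below is a minimal sketch of that kind of guard, using illustrative names rather than the actual framework code:

```go
// Illustrative sketch of a minimum-QPS guard similar to the one that produces
// the "Requiring minimal ... qps" failure. Names are hypothetical, not the
// real etcd robustness framework API.
package main

import (
	"fmt"
	"time"
)

// checkMinimalQPS fails the run when the observed request rate is too low for
// the test results to be considered reliable.
func checkMinimalQPS(minimalQPS float64, requestCount int, duration time.Duration) error {
	qps := float64(requestCount) / duration.Seconds()
	if qps < minimalQPS {
		return fmt.Errorf("Requiring minimal %f qps for test results to be reliable, got %f qps", minimalQPS, qps)
	}
	return nil
}

func main() {
	// Example: 7632 requests over 100 seconds -> about 76.3 qps, below the 100 qps floor.
	if err := checkMinimalQPS(100, 7632, 100*time.Second); err != nil {
		fmt.Println("traffic check failed:", err)
	}
}
```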

@jmhbnz
Member

jmhbnz commented Jan 15, 2024

I've seen this historically as well. It happens every Monday when the bulk of the dependabot PRs are raised.

Basically it seems to be CPU contention on the shared build infrastructure. Under normal conditions everything is fine, but when the entire build queue is consumed with running jobs, the QPS of the robustness tests suffers.

@alexellis do you have any suggestions on how we can address this? The jobs have enough CPU allocated; we just see contention or some other overall slowdown on the actuated side when lots of jobs are running at the same time.

@serathius
Member Author

I haven't observed any time correlation; could you provide some examples?

If that is really the case, could we consider spreading our jobs throughout the day? A rough sketch of one way to do that follows below.
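
Purely illustrative (not an existing etcd or actuated mechanism): a deterministic per-PR delay would spread simultaneous dependabot runs across a window instead of starting them all at once:

```go
// Illustrative only: derive a stable start delay from the PR number so that a
// burst of dependabot PRs does not launch all robustness jobs simultaneously.
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// staggerDelay maps an identifier (e.g. the PR number) to a stable offset
// within the given window, at one-second granularity.
func staggerDelay(id string, window time.Duration) time.Duration {
	h := fnv.New32a()
	h.Write([]byte(id))
	seconds := h.Sum32() % uint32(window/time.Second)
	return time.Duration(seconds) * time.Second
}

func main() {
	// Hypothetical PR numbers; each would wait a different amount of time
	// within a 30-minute window before starting its job.
	for _, pr := range []string{"17240", "17241", "17242"} {
		fmt.Printf("PR %s would wait %s before starting\n", pr, staggerDelay(pr, 30*time.Minute))
	}
}
```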

@jmhbnz
Member

jmhbnz commented Jan 16, 2024

Example from yesterday, when the bulk of the dependabot PRs opened at the same time (each one runs its own mini robustness tests on an actuated arm64 runner):

[Screenshot: queue of dependabot PRs opened at the same time, each running robustness jobs]

If we check the logs for some of these we see:

https://github.com/etcd-io/etcd/actions/runs/7532121712/job/20502126721
traffic.go:105: Requiring minimal 100.000000 qps for test results to be reliable, got 66.367248 qps

https://github.com/etcd-io/etcd/actions/runs/7526852823/job/20485816102
traffic.go:105: Requiring minimal 100.000000 qps for test results to be reliable, got 76.322871 qps

https://github.com/etcd-io/etcd/actions/runs/7532088744/job/20502027732
traffic.go:105: Requiring minimal 100.000000 qps for test results to be reliable, got 57.351455 qps

Example from last week, when all the dependabot PRs opened at the same time:

[Screenshot: previous week's batch of dependabot PRs opened at the same time]

If we check the logs for some of these we see:

https://github.com/etcd-io/etcd/actions/runs/7451288068/job/20272077094
traffic.go:105: Requiring minimal 100.000000 qps for test results to be reliable, got 90.190679 qps

https://github.com/etcd-io/etcd/actions/runs/7451286461/job/20272067521
traffic.go:105: Requiring minimal 100.000000 qps for test results to be reliable, got 79.069443 qps

Given that these jobs perform nicely when the runner boxes are not under a lot of load, my theory is that we simply have CPU contention.

It would be good to get some input from folks at actuated.dev on what we should be doing as a managed-service consumer to avoid ending up in this situation. We can't control when flurries of PRs will open at the same time, and our build infra should be able to scale to cope with that within reason, IMO.
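
One way to confirm (or rule out) the contention theory would be to sample the kernel's CPU pressure (PSI) on the runner while a job executes. A rough diagnostic sketch, assuming a Linux runner with PSI enabled (/proc/pressure/cpu, kernel 4.20+); this is not part of the etcd test suite:

```go
// Diagnostic sketch: periodically sample Linux PSI CPU pressure to see
// whether tasks are stalling on CPU during a robustness run.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// cpuPressureAvg10 returns the "some avg10" value from /proc/pressure/cpu,
// i.e. the share of the last 10 seconds in which at least one task was
// stalled waiting for CPU.
func cpuPressureAvg10() (float64, error) {
	data, err := os.ReadFile("/proc/pressure/cpu")
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		if !strings.HasPrefix(line, "some") {
			continue
		}
		for _, field := range strings.Fields(line) {
			if strings.HasPrefix(field, "avg10=") {
				return strconv.ParseFloat(strings.TrimPrefix(field, "avg10="), 64)
			}
		}
	}
	return 0, fmt.Errorf("avg10 not found in /proc/pressure/cpu")
}

func main() {
	for i := 0; i < 5; i++ {
		if v, err := cpuPressureAvg10(); err == nil {
			fmt.Printf("cpu pressure (some, avg10): %.2f%%\n", v)
		} else {
			fmt.Println("psi unavailable:", err)
		}
		time.Sleep(10 * time.Second)
	}
}
```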

cc @alexellis

@serathius
Member Author

After #17323, the arm64 tests have been stable.

[Screenshot: recent arm64 robustness workflow runs passing]

@alexellis
Contributor

Hi folks, glad you have this resolved.

We provide support via Slack, and I am unable to actively monitor all of my GitHub notifications due to the volume of them.

If you run into issues, please have someone reach out and we can look at the server and the load during that time.

We've not noticed anything out of the ordinary, but can look more closely if / when needed.
