CI: Reduce our monthly minutes usage #1366

Closed

jsdw opened this issue Jan 15, 2024 · 4 comments

jsdw (Collaborator) commented Jan 15, 2024

We need to reduce our monthly CI usage so that we don't exceed an org-imposed (I think) cap on monthly CI minutes. Currently, Subxt is the worst offender.

The 45-minute timeout added to our tests should already help with this a lot (and perhaps we can reduce it further to, say, 30 minutes; we'd have to see! We don't want potentially successful jobs timing out).
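For reference, this is just the job-level `timeout-minutes` setting in GitHub Actions; a minimal sketch (the job name, step contents and exact value are illustrative, not the actual workflow config):

```yaml
jobs:
  tests:
    runs-on: ubuntu-latest
    # Hard cap on the job's runtime: a hung test run gets cancelled after 30
    # minutes rather than burning minutes up to the 6-hour default job limit.
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - run: cargo test --workspace
```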

Further, once we fix the unstable backend test timeouts (@lexnv is working on this; my hope is that we can merge a fix in the next couple of days), we should reduce our usage some more.

Next, let's see whether the faster ubuntu-8-core etc. runners (which we can assume bill a multiple of minutes based on how many multiples of 4 cores they have, e.g. 8 cores would be 2x the minutes) are actually worth it where they're used, and revert to slower machines if the total time isn't decreased significantly. We need to find a good balance here between test speed and minute usage.
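As an illustrative sketch (the job name is a placeholder, and the exact large-runner labels depend on what the org has configured), swapping runner sizes is just the `runs-on` line, so it should be cheap to experiment job by job:

```yaml
jobs:
  integration-tests:
    # Default GitHub-hosted Linux runner:
    # runs-on: ubuntu-latest
    #
    # Larger runner: bills more per minute (roughly in proportion to core count,
    # per the rates quoted below), so it only reduces total spend if it cuts
    # wall-clock time by at least the same factor.
    runs-on: ubuntu-8-core
    steps:
      - uses: actions/checkout@v4
      - run: cargo test --workspace
```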

Perhaps we could also optimise our CI pipeline so that we run the fast checks (fmt and clippy, for instance, but check how long these take to decide) in one block first, and then the rest in another block (thanks for the idea @tadeohepperle). This would catch the fairly frequent, basic fmt/clippy failures and prevent excess minutes from being used until they're fixed up, only running the long tests once the basics are all good.
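A sketch of that gating using job-level `needs:` (job names and commands are illustrative):

```yaml
jobs:
  fmt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo fmt --all -- --check

  clippy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo clippy --all-targets -- -D warnings

  # The expensive test jobs only start once the cheap checks have passed, so a
  # formatting or lint failure costs a couple of minutes rather than a full run.
  tests:
    needs: [fmt, clippy]
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - run: cargo test --workspace
```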

Related issues:

jsdw (Collaborator, Author) commented Jan 15, 2024

For reference, per-minute rates for runners are as follows (https://docs.github.com/en/billing/managing-billing-for-github-actions/about-billing-for-github-actions#per-minute-rates):

Operating system    vCPUs     Per-minute rate (USD)
Linux               2         $0.008
Linux               4         $0.016
Linux               8         $0.032
Linux               16        $0.064
Linux               32        $0.128
Linux               64        $0.256
Windows             2         $0.016
Windows             8         $0.064
Windows             16        $0.128
Windows             32        $0.256
Windows             64        $0.512
macOS               3 or 4    $0.08
macOS               12        $0.12
macOS               6 (M1)    $0.16

I believe that if you don't specify a runner, you get the 2-core Linux one (I thought until now that the 4-core one was the default). So everything else is essentially a multiple of this cost, based on the core count.
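To make the multiple concrete: a job that takes 60 minutes on the default 2-core Linux runner bills 60 × $0.008 = $0.48, while the same 60 minutes on an 8-core Linux runner bills 60 × $0.032 = $1.92, so the bigger machine only works out cheaper if it finishes the same work in under a quarter of the time.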

Addendum: when we use the basic runner, it is included in the free monthly minutes that we get as an org (which we may exceed in many months anyway, but I'd need to check again).

lexnv (Collaborator) commented Jan 17, 2024

We could further reduce CI time by sharing a single substrate-node across tests, although we should consider this at a later time since the other proposed ideas are easier to implement.

Currently, each test spawns a substrate node and submits a few transactions and RPC calls (some with side effects, others not). Further, some tests wait for the local substrate node to produce a few blocks.

By sharing the substrate node, we eliminate the overhead of spawning the node and waiting for blocks.
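If we did go this way, one way to approximate it at the workflow level would be something like the rough sketch below (the node binary path, the readiness sleep and the `SHARED_NODE_URL` variable are all hypothetical placeholders, and the test suite would also need changes to connect to an existing node rather than spawning its own):

```yaml
jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Start one dev node for the whole job and point every test at it, instead
      # of each test spawning (and waiting for) its own node.
      - run: |
          ./substrate-node --dev &
          sleep 10   # crude stand-in for "wait until the node is producing blocks"
          SHARED_NODE_URL=ws://127.0.0.1:9944 cargo test --workspace
```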

We could still have individually valid tests that interfere with each other through side effects, for example:

  • test 1: alice sends 1 DOT to bob; check that alice's account nonce == 2
  • test 2: alice submits a TX, which increments the account nonce; expect alice's account nonce == 2

The interference happens if test 1 runs to completion before test 2: test 2 assumed that alice's account started with nonce == 1.

To mitigate this, we could initialize the substrate node with multiple test accounts, and each test would use a different account provided by our testing backend. We may need to add a custom testing RPC call to populate accounts since, last time I checked, the initial accounts are seeded from polkadot-sdk code.

Would love to get some feedback on this 🙏 What do you guys think?

That said, I believe most of the CI time is related to my PRs investigating the timeout issues (lightclient and unstable backend), so we should indeed expect CI consumption to return to normal once that debugging stops.

jsdw (Collaborator, Author) commented Jan 17, 2024

Just because of the complexity and the fear of weird interactions between tests, I'd treat sharing a substrate-node across tests as a bit of a last resort really! I am hopeful that we can get our CI down to a good level with smaller timeouts, fixing the issues where we actually hit them, and running clippy+fmt+whatever first to fail early in the common case :)

jsdw (Collaborator, Author) commented Jan 19, 2024

I think we've addressed these for now, so I'll close this and we can revisit it when we get feedback on our usage going forward.

jsdw closed this as completed Jan 19, 2024