
CI: Speed up CircleCI by folding build workflow downstream #4426

Merged: 14 commits, Aug 19, 2022

Conversation

michaeldiamant
Contributor

Proposes to speed up CircleCI builds by consolidating build workflow steps into downstream workflows.

By changing only the CI architecture, without modifying resource class or parallelism, we see the following change:

|                | Before  | After   | Change % |
| -------------- | ------- | ------- | -------- |
| Duration (min) | 35      | 27.4    | -22%     |
| $ annually     | $17,971 | $10,998 | -39%     |

Feels like a win-win: the change saves both time and money.
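For illustration only, here is a minimal before/after sketch of the topology change. The job names, images, commands, and paths are placeholders rather than the repository's actual `.circleci/config.yml`:

```yaml
# Before (sketch): build once, persist a workspace, attach it in every downstream job.
version: 2.1
jobs:
  build:
    docker:
      - image: cimg/go:1.17        # placeholder image
    resource_class: medium
    steps:
      - checkout
      - run: make build
      - persist_to_workspace:      # large workspace handed to all downstream jobs
          root: .
          paths:
            - installed-binaries   # placeholder for whatever the build produces
  test:
    docker:
      - image: cimg/go:1.17
    steps:
      - attach_workspace:          # every downstream job pays the re-attach cost
          at: .
      - run: make test

workflows:
  build_and_test:
    jobs:
      - build
      - test:
          requires: [build]
```

```yaml
# After (sketch): each downstream job checks out and builds for itself; no workspace hand-off.
version: 2.1
jobs:
  test:
    docker:
      - image: cimg/go:1.17
    resource_class: medium
    steps:
      - checkout
      - run: make build            # build step folded into the downstream job
      - run: make test

workflows:
  test_all:
    jobs:
      - test
```

The saving comes from removing the persist/attach hand-off: each downstream job repeats a build that is cheap relative to moving the workspace around, and the extra upstream job disappears from the critical path.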

Details:

@codecov

codecov bot commented Aug 18, 2022

Codecov Report

Merging #4426 (eca634d) into master (e53605d) will increase coverage by 0.09%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #4426      +/-   ##
==========================================
+ Coverage   55.11%   55.20%   +0.09%     
==========================================
  Files         397      397              
  Lines       50073    50073              
==========================================
+ Hits        27598    27645      +47     
+ Misses      20177    20142      -35     
+ Partials     2298     2286      -12     
Impacted Files Coverage Δ
ledger/tracker.go 73.93% <0.00%> (-0.86%) ⬇️
network/wsNetwork.go 64.89% <0.00%> (+0.19%) ⬆️
cmd/tealdbg/debugger.go 73.49% <0.00%> (+0.80%) ⬆️
data/transactions/verify/txn.go 44.73% <0.00%> (+0.87%) ⬆️
catchup/peerSelector.go 100.00% <0.00%> (+1.04%) ⬆️
crypto/merkletrie/node.go 93.48% <0.00%> (+1.86%) ⬆️
agreement/proposalManager.go 98.03% <0.00%> (+1.96%) ⬆️
catchup/service.go 70.12% <0.00%> (+1.97%) ⬆️
agreement/cryptoVerifier.go 69.71% <0.00%> (+2.11%) ⬆️
crypto/merkletrie/trie.go 68.61% <0.00%> (+2.18%) ⬆️
... and 3 more


@michaeldiamant michaeldiamant changed the title Speed up CircleCI by folding build workflow downstream CI: Speed up CircleCI by folding build workflow downstream Aug 18, 2022
@michaeldiamant michaeldiamant marked this pull request as ready for review August 18, 2022 16:58
Contributor

@jannotti jannotti left a comment


Certainly not my final call, but I'm convinced this is better until/unless we drastically shrink the persisted workspace after an independent build.

And building is so fast that even if we do do that, this might be only barely worse.

@algorandskiy
Contributor

Verified:
The old flow: amd64 build (8m) + 4 downstream tasks (72m total, 27m max) => 27 + 8 = 35m run time
The new flow: 4 downstream tasks (68m total, 21m max) => 21m run time

Looks good, but for whatever reason all downstream tasks are faster in the new configuration than in the old, even though they do more work. That's questionable. Shouldn't we run the new configuration more times to collect more data and ensure the deviation from the 21m we see is low?

@algojack
Contributor

The problem with this approach is that we are getting away from "build once, test/release many times". Now we will be building many times (which is not ideal), but OK. I'd rather look at what we are persisting and minimize that to what we actually need to persist.
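For what it's worth, a sketch of that alternative (paths are illustrative placeholders; the real list of artifacts the downstream jobs consume would need auditing) would keep the single build job but narrow its `persist_to_workspace` step to just those artifacts:

```yaml
# Sketch of the "persist less" alternative; paths below are placeholders, not the actual layout.
- persist_to_workspace:
    root: .
    paths:
      - bin          # built binaries the downstream jobs actually run
      - gen          # generated files the tests depend on
# ...instead of persisting the entire checkout/build tree.
```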

@michaeldiamant
Contributor Author

for whatever reason all downstream tasks are faster in the new configuration than in the old, even though they do more work.

@algorandskiy Your observation is accurate, though I think it can be explained. Here's a terse note; let me know if you'd like to discuss live.

Shouldn't we run the new configuration more times to collect more data and ensure the deviation from the 21m we see is low?

@algorandskiy A few thoughts:

I'd rather look at what we are persisting and minimize that to what we actually need to persist.

@algojack I think you're asking a question that the group needs to fundamentally weigh. A few thoughts from my perspective:

  • Are you suggesting holding the PR until such an investigation happens? If so, let's speak live. I prefer not to hold the PR, though I am open to being convinced to work on the problem after merging.
  • My somewhat loosely held opinion: since the Go build completes quickly (< 3m) without vertically scaling up (resource class = medium), I think there's limited benefit to keeping the as-is topology.
    • Put another way, persisting to and reading from the workspace must take appreciably less time than the Go build for the as-is topology to be a net win (see the illustration after this list).
    • Secondarily, keeping the as-is topology implies cache maintenance: it's more friction to need to be aware of what's cached when, where, and why.
  • Obviously, at some point, the assumption may no longer hold, though my feeling is that it will hold for the foreseeable future.
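As a purely illustrative back-of-the-envelope (the per-step numbers here are assumptions, not measurements from this pipeline): if attaching the workspace costs roughly a minute per downstream job and the folded-in build costs under 3 minutes, the hand-off saves at most ~2 minutes per job, and only while the persisted workspace stays small enough to attach that quickly; once attach time approaches the build time, the build-once topology stops paying for itself.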

@jannotti
Contributor

jannotti commented Aug 19, 2022

The problem with this approach is that we are getting away from "build once, test/release many times".

I think it's a mistake to think of CI as "release". It's just testing and the things that matter are speed, cost, and maintainability.

I think this PR improves the situation in all three dimensions.

It is demonstrably faster and cheaper, and it means fewer lines and less complexity in the .yml. The complexity of building in a separate job, tied to the others by persisting specific items between jobs, was something we put up with because of presumed speed/cost advantages, not something that was good for its own sake.

@jannotti
Contributor

Looks good, but for whatever reason all downstream tasks are faster in the new configuration than in the old, even though they do more work. That's questionable. Shouldn't we run the new configuration more times to collect more data and ensure the deviation from the 21m we see is low?

That's actually why we should do it. The downstream jobs were wasting time re-attaching a huge workspace created by the build, even though they didn't need it (especially, for example, the test verification jobs that take less than a second but were spending a minute re-attaching the workspace).

@algorandskiy
Contributor

This time the slowest amd64 job is again 20m, so I guess I'm convinced.

@algojack
Contributor

The problem with this approach is that we are getting away from "build once, test/release many times".

I think it's a mistake to think of CI as "release". It's just testing and the things that matter are speed, cost, and maintainability.

We want to have reproducible builds, and I believe in "build once, test many times" as well. So considering this the build we compare our releases to is not a mistake, in my opinion.

@algojack
Contributor

I'm not against this PR. I just wanted to discuss it so that we are all aware of where we are headed with this. And we did discuss it on Slack, so everyone seems to be OK with it.
