
CI: Speed up CircleCI by folding build workflow downstream #4426

Merged: 14 commits, Aug 19, 2022

Conversation

michaeldiamant
Contributor

Proposes to speed up CircleCI builds by consolidating build workflow steps into downstream workflows.

By changing only the CI architecture, without modifying resource class or parallelism, we see the following change:

|                | Before  | After   | Change % |
| -------------- | ------- | ------- | -------- |
| Duration (min) | 35      | 27.4    | -22%     |
| $ annually     | $17,971 | $10,998 | -39%     |

Feels like a win-win: the change saves both time and money.
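For illustration only, here is a minimal before/after sketch of the topology change. The job names, images, commands, and paths are placeholders rather than the repository's actual `.circleci/config.yml`:

```yaml
# Before (sketch): build once, persist a workspace, attach it in every downstream job.
version: 2.1
jobs:
  build:
    docker:
      - image: cimg/go:1.17        # placeholder image
    resource_class: medium
    steps:
      - checkout
      - run: make build
      - persist_to_workspace:      # large workspace handed to all downstream jobs
          root: .
          paths:
            - installed-binaries   # placeholder for whatever the build produces
  test:
    docker:
      - image: cimg/go:1.17
    steps:
      - attach_workspace:          # every downstream job pays the re-attach cost
          at: .
      - run: make test

workflows:
  build_and_test:
    jobs:
      - build
      - test:
          requires: [build]
```

```yaml
# After (sketch): each downstream job checks out and builds for itself; no workspace hand-off.
version: 2.1
jobs:
  test:
    docker:
      - image: cimg/go:1.17
    resource_class: medium
    steps:
      - checkout
      - run: make build            # build step folded into the downstream job
      - run: make test

workflows:
  test_all:
    jobs:
      - test
```

The saving comes from removing the persist/attach hand-off: each downstream job repeats a build that is cheap relative to moving the workspace around, and the extra upstream job disappears from the critical path.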

Details:

@codecov

codecov bot commented Aug 18, 2022

Codecov Report

Merging #4426 (eca634d) into master (e53605d) will increase coverage by 0.09%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #4426      +/-   ##
==========================================
+ Coverage   55.11%   55.20%   +0.09%     
==========================================
  Files         397      397              
  Lines       50073    50073              
==========================================
+ Hits        27598    27645      +47     
+ Misses      20177    20142      -35     
+ Partials     2298     2286      -12     
Impacted Files Coverage Δ
ledger/tracker.go 73.93% <0.00%> (-0.86%) ⬇️
network/wsNetwork.go 64.89% <0.00%> (+0.19%) ⬆️
cmd/tealdbg/debugger.go 73.49% <0.00%> (+0.80%) ⬆️
data/transactions/verify/txn.go 44.73% <0.00%> (+0.87%) ⬆️
catchup/peerSelector.go 100.00% <0.00%> (+1.04%) ⬆️
crypto/merkletrie/node.go 93.48% <0.00%> (+1.86%) ⬆️
agreement/proposalManager.go 98.03% <0.00%> (+1.96%) ⬆️
catchup/service.go 70.12% <0.00%> (+1.97%) ⬆️
agreement/cryptoVerifier.go 69.71% <0.00%> (+2.11%) ⬆️
crypto/merkletrie/trie.go 68.61% <0.00%> (+2.18%) ⬆️
... and 3 more


@michaeldiamant michaeldiamant changed the title Speed up CircleCI by folding build workflow downstream CI: Speed up CircleCI by folding build workflow downstream Aug 18, 2022
@michaeldiamant michaeldiamant marked this pull request as ready for review August 18, 2022 16:58
Contributor

@jannotti jannotti left a comment


Certainly not my final call, but I'm convinced this is better until/unless we drastically shrink the persisted workspace after an independent build.

And building is so fast that even if we do do that, this might be only barely worse.

@algorandskiy
Contributor

Verified:
The old flow: amd64 build (8m) + 4 downstream tasks (72m total, 27m max) => 27 + 8 = 35m run time
The new flow: 4 downstream tasks (68m total, 21m max) => 21m run time

Looks good, but for whatever reason all downstream tasks are faster in the new configuration than in the old, even though they do more work. That's questionable. Shouldn't we run the new configuration more times to collect more data and ensure the deviation from the 21m we see is low?

@algojack
Contributor

The problem with this approach is that we are getting away from "build once, test/release many times". Now we will be building many times (which is not ideal), but OK. I'd rather look at what we are persisting and minimize that to what we actually need to persist.
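For what it's worth, a sketch of that alternative (paths are illustrative placeholders; the real list of artifacts the downstream jobs consume would need auditing) would keep the single build job but narrow its `persist_to_workspace` step to just those artifacts:

```yaml
# Sketch of the "persist less" alternative; paths below are placeholders, not the actual layout.
- persist_to_workspace:
    root: .
    paths:
      - bin          # built binaries the downstream jobs actually run
      - gen          # generated files the tests depend on
# ...instead of persisting the entire checkout/build tree.
```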

@michaeldiamant
Contributor Author

for whatever reason all downstream tasks are faster in the new configuration than in the old, even though they do more work.

@algorandskiy Your observation is accurate, though I think it can be explained. Here's a terse note; let me know if you'd like to discuss live.

Shouldn't we run the new configuration more times to collect more data and ensure the deviation from the 21m we see is low?

@algorandskiy A few thoughts:

I'd rather look at what we are persisting and minimize that to what we actually need to persist.

@algojack I think you're asking a question that the group needs to fundamentally weigh. A few thoughts from my perspective:

  • Are you suggesting holding the PR until such an investigation happens? If so, let's speak live. I prefer not to hold the PR, though I am open to being convinced to work on the problem after merging.
  • My somewhat loosely held opinion: since the Go build completes quickly (< 3m) without vertically scaling up (resource class = medium), I think there's limited benefit to keeping the as-is topology.
    • Put another way, persisting to and reading from the workspace must take appreciably less time than the Go build for the as-is topology to be a net win (see the illustration after this list).
    • Secondarily, keeping the as-is topology implies cache maintenance: it's more friction to need to be aware of what's cached when, where, and why.
  • Obviously, at some point, the assumption may no longer hold, though my feeling is that it will hold for the foreseeable future.
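As a purely illustrative back-of-the-envelope (the per-step numbers here are assumptions, not measurements from this pipeline): if attaching the workspace costs roughly a minute per downstream job and the folded-in build costs under 3 minutes, the hand-off saves at most ~2 minutes per job, and only while the persisted workspace stays small enough to attach that quickly; once attach time approaches the build time, the build-once topology stops paying for itself.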

@jannotti
Contributor

jannotti commented Aug 19, 2022

The problem with this approach is that we are getting away from "build once, test/release many times".

I think it's a mistake to think of CI as "release". It's just testing and the things that matter are speed, cost, and maintainability.

I think this PR improves the situation in all three dimensions.

It is demonstrably faster and cheaper, and it means fewer lines and less complexity in the .yml. The complexity of building in a separate job, tied to the others by persisting specific items between jobs, was something we put up with because of presumed speed/cost advantages, not something that was good for its own sake.

@jannotti
Contributor

Looks good, but for whatever reason all downstream tasks are faster in the new configuration than in the old, even though they do more work. That's questionable. Shouldn't we run the new configuration more times to collect more data and ensure the deviation from the 21m we see is low?

That's actually why we should do it. The downstream jobs were wasting time re-attaching a huge workspace created by the build, even though they didn't need it (especially, for example, the test verification jobs that take less than a second but were spending a minute re-attaching the workspace).

@algorandskiy
Contributor

This time the slowest amd64 job is again 20m, so I guess I'm convinced.

@algojack
Contributor

The problem with this approach is that we are getting away from "build once, test/release many times".

I think it's a mistake to think of CI as "release". It's just testing and the things that matter are speed, cost, and maintainability.

We want to have reproducible builds, and I believe in "build once, test many times" as well. So considering this the build we compare our releases to is not a mistake, in my opinion.

@algojack
Contributor

I'm not against this PR. I just wanted to discuss it so that we are all aware of where we are headed with this. And we did discuss it on Slack, so everyone seems to be OK with it.
