ci: Improve caching #3211

wolfgangwalther · 2024-02-06T18:16:30Z

Plenty of text in the commit messages. Hopefully this also fixes the currently failing loadtest, but I'm not sure about that.

TODO:

Save nix cache only on main
Split stack/cabal caches into dependency and working directory caches
Reduce cache size
~~Add PR caching for postgrest-build~~
~~Garbage collect nix/stack/cabal caches?~~
~~Purge old caches proactively when saving new cache?~~

wolfgangwalther · 2024-02-06T19:33:22Z

Indeed, loadtest is fixed that way. Not sure whether that's a one-time thing or not, but still a good sign.

A few more observations, looking at https://github.com/PostgREST/postgrest/actions/caches. Since I'm not sure whether that's publicly available, some key facts:

There's a big warning on top, that reads: "Approaching total cache storage limit (53.35 GB of 10 GB Used). Least recently used caches will be automatically evicted to limit the total cache storage to 10 GB."
Those 53 GB have been created since yesterday, so that's not going to be sustainable.
macOS-all: 3.1 GB
Linux-test: 2.7 GB
Linux-style: 1.4 GB
Linux-common: 2.7 GB
Stack/cabal caches between 300MB and 1GB each

test and style are the new split in this MR. common is the old cache from before. Clearly, splitting test and style does not make any difference in terms of cache size - everything in style is also in test. On the flip-side, merging the loadtest into it is great, the cache is not really bigger.

Overall, we only have 10 GB, so we need to use the cache a lot less. I think we should do the following:

Only store caches on the main branch.
Restore caches on main and in PRs.
The caches will have fixed names and will be replaced by a new commit on main. No prefix matching.
Merge style and test again.

This way we can use the 10 GB we have for:

3.1 GB for macos
2.7 GB for tools
2.3 GB for stack
1 GB for Cabal
Sum: 9.1 GB

By only storing caches when on main we avoid having the main cache be evicted when a PR changes dependencies. This will make CI fast for those PRs which only touch code, those will use the cache. PRs which touch dependencies or nix will be slower, because they only use cachix. This seems like the most useful thing to do here.

Edit: The above calculation does not take the "static" cache into account, which is another 1 GB. So the cache would already be at it's maximum with those...

To avoid having to rebuild the dynamic postgrest build in every nix job, before we run all the tests on it, we can additionally cache just the dist-newstyle folder between the "prepopulate job" and all the test jobs. This comes in at around 200 MB, so we should still have room for those, even when pushing them in PRs. We'd just need to make sure that those caches get some kind of priority over the main caches in terms of eviction. I.e. caches from main should always be evicted last, PR cached first. Will need to figure out whether that's possible.

develop7

Everything seems to be taken care of, LGTM!

develop7 · 2024-02-07T11:36:22Z

we can additionally cache just the dist-newstyle folder between the "prepopulate job" and all the test jobs

Ooh, this one's good. Should've thought of this myself. Looks like something action/cache can easily handle.

We'd just need to make sure that those caches get some kind of priority over the main caches in terms of eviction.

The current eviction policy is that the oldest caches are deleted first, so all we could do is to create the big Nix cache last somehow. Let's see if we do hit that limit first.

Garbage collect nix/stack/cabal caches?

On the one hand, it makes sense for Nix at lease, on the other — there were full Nix cache rebuilds and from what I've noticed, the cache size doesn't seem to increase significantly, if ever; it was 3-something GB all the time.

Stack/cabal caches between 300MB and 1GB each

Speaking of Stack caches, somehow Linux cache is 380M, MacOS is 680M and Windows is 1GB, I wonder why?

wolfgangwalther · 2024-02-07T11:43:47Z

Speaking of Stack caches, somehow Linux cache is 380M, MacOS is 680M and Windows is 1GB, I wonder why?

Yeah, that's something I wondered about, too. I tried stack locally, but I'm getting sizes of more than 2 GB for my ~/.stack folder. This might be what we're seeing on Windows as a slightly compressed cache, though...

I think the difference is that stack would download GHC in those cases, and we'd cache that, too. Which, I think, doesn't make much sense - downloading and installing GHC via stack is by far the fastest piece. Caching the dependencies is the important bit, though.

Maybe we can do better with some combination of ghcup and --system-ghc or so.

wolfgangwalther · 2024-02-07T11:45:41Z

The current eviction policy is that the oldest caches are deleted first, so all we could do is to create the big Nix cache last somehow. Let's see if we do hit that limit first.

The problem is that the caches from main will by definition always be "the oldest", so any PR cache that blows past the limit would cause some cache from main to be evicted.

wolfgangwalther · 2024-02-07T11:47:13Z

The problem is that the caches from main will by definition always be "the oldest", so any PR cache that blows past the limit would cause some cache from main to be evicted.

Interestingly, all PR caches from yesterday have already been evicted while main caches are still there. So maybe there is already some built-in prio between those.

wolfgangwalther · 2024-02-07T11:56:55Z

Since all caches are evicted after 7 days anyway, we could just run a daily workflow to restore (and save?) the current caches on main. This way they'd always be "used last" and hopefully not be evicted instead of PR caches.

develop7 · 2024-02-07T12:35:23Z

Interestingly, all PR caches from yesterday have already been evicted while main caches are still there.

Oh, were they? That'd be really useful. Were the corresponding branches deleted or related PRs merged?

develop7 · 2024-02-07T12:58:30Z

Maybe we can do better with some combination of ghcup and --system-ghc or so.

We could even set stack up with ghcup, and configure stack in a way it'll use ghcup to install GHC (actually ghcup bootstrap script does it unless you ask it to), and if we cache .ghcup separately, we could share cached GHC between cabal and stack builds. With GHC tarballs are 200 to 300M big, times 3 platforms, this would result in another ~1GB saved cache.

wolfgangwalther · 2024-02-07T13:07:38Z

we could share cached GHC between cabal and stack builds.

Not really, they use different GHC versions. Stack is on GHC 9.4.5, Cabal on 9.4.8, 9.6.4 and 9.8.1.

develop7 · 2024-02-07T13:08:56Z

Since all caches are evicted after 7 days anyway, we could just run a daily workflow to restore (and save?) the current caches on main. This way they'd always be "used last" and hopefully not be evicted instead of PR caches.

Yeah, I think I've suggested it somewhere in comments to nix-cache-related PRs. The eviction policy doesn't mention access time, though, but that seems like the best effort.

wolfgangwalther · 2024-02-07T13:11:17Z

Oh, were they? That'd be really useful. Were the corresponding branches deleted or related PRs merged?

Ah, I thought we'd have the case with this PR. But this PR only re-uses caches from main, so that just kept those and deleted everything else...

develop7 · 2024-02-07T13:18:40Z

Not really, they use different GHC versions

Why do we stick to 9.4.5 for Stack, BTW? 9.4.8 is in Stackage for a month or so

develop7 · 2024-02-07T13:20:10Z

Excellent point on caching GHC and dependencies in non-nix builds separately

wolfgangwalther · 2024-02-07T13:20:24Z

Why do we stick to 9.4.5 for Stack, BTW? 9.4.8 is in Stackage for a month or so

But not available on FreebBSD, yet:

#3155 (comment)

wolfgangwalther · 2024-02-07T13:22:13Z

So the idea would be to consistently:

Install GHC / Stack / Cabal via ghcup. Every time, no cache.
Cache dependencies for Stack/Cabal.
Cache the actual build for Stack/Cabal separately. This avoids invalidating the dependencies cache for regular code changes.

wolfgangwalther · 2024-02-07T13:24:27Z

We could probably also just throw out the Linux x64 cabal build with GHC 9.4.8. We are building with 9.4.8 in nix, with 9.4.5 via stack on various platforms and with 9.4.8 on arm via Cabal. That should be enough for 9.4.x series...

wolfgangwalther · 2024-02-07T13:34:01Z

@develop7 do you see any way of "not storing the cache" with cache-nix-action? I think this is possible in v5, but we have reverted to v4. Maybe we should try v5 again - might not have been buggy at all, maybe we just had our caches invalidated for other reasons or so?

develop7 · 2024-02-07T14:15:27Z

do you see any way of "not storing the cache" with cache-nix-action?

@wolfgangwalther perhaps, use restore action only? That's what the cache-nix-action is using internally anyway.

wolfgangwalther · 2024-02-07T17:05:09Z

@develop7

perhaps, use restore action only? That's what the cache-nix-action is using internally anyway.

Ah, right I remember why I looked at this and then didn't do it: This would require us to put the save part manually in each place where we currently use the setup-nix action. I didn't find a way to add a "post job hook" or something like that in a composite action, yet.

However, v5 of cache-nix-action has the save input which can control this. Using this would allow to keep things simple - just using setup-nix would be enough.

wolfgangwalther · 2024-02-07T20:54:04Z

I looked at the "Build MacOS (Nix)" job a little bit, because it by far takes the most time to run (compared to other nix jobs). Both on main (with cache) and in this PR (without cache, because the key was changed). The numbers are:

The last run on main took 14m 2s. 13m 37s were spent in the "Setup Nix Action" - most of that to download the actions cache of around 3.8 GB.
The last run in this PR took 15m 26s, of which 12m 26s were spent downloading from cachix. Nothing was built. "Setup Nix Action" took 2m 46s.

Assuming the 2m 46s are the "base" time for nix setup without cache download, downloading the actions cache took around 11m here. Compare that to 12m 30s for cachix... and I start to wonder why do we bother with caching this stuff via github actions at all? Cachix seems fast enough to me for that purpose.

I can see the github actions cache be useful to:

cache stack and cabal
cache postgrest-build for later stages of the same workflow

But I seriously question how much we are gaining from caching nix in the actions cache again.

As an experiment, I added a commit to replace the actions cache for this job with something else: Basically a lookup of cachix, which doesn't always involve a full download. This should be very fast when nothing changes - and if it does, it would only download those dependencies that are required for the rebuild. Compare this to downloading everything every time..

develop7 · 2024-02-08T15:22:03Z

I start to wonder why do we bother with caching this stuff via github actions at all?

Nix caching does shave off a minute or two per Nix-involving job, which shortens the feedback loop accordingly; usually to these two minutes, due to CI jobs running in parallel. For the macOS case, where there is a single cache consumer job and low-performing runner, it certainly does make sense to drop cache altogether.

This is my first approach to GitHub Actions and their cache in particular, and, being honest, it is performing worse than I've expected, especially Windows/Mac runners (download speed is barely 100Mb/sec (down to abysmal 20MB/sec even), while in Linux runners it's 380Mb/s sometimes). With the dedicated infrastructure, I expect the improvement to be more significant, but currently it might not be worth the effort.

wolfgangwalther · 2024-02-08T21:09:46Z

As an experiment, I added a commit to replace the actions cache for this job with something else: Basically a lookup of cachix, which doesn't always involve a full download. This should be very fast when nothing changes - and if it does, it would only download those dependencies that are required for the rebuild. Compare this to downloading everything every time..

So that didn't work at all in CI, while locally it was fine. I have no idea, yet, what's happening there.

I didn't find a way to add a "post job hook" or something like that in a composite action, yet.

This is done now - I was able to keep it all in the setup-nix action.

This restores caches on all branches and pull requests, but only stores them on the main branch and release branches. This prevents those caches from being evicted early when we hit the 10 GB limit quickly.

This is a first step to split up the cabal and stack caches in separate pieces. Here we split the work folder, which just contains the postgrest-specific build artifacts, into a separate cache. More fine-grained caching should give us better cache hits and much fewer upload size in the regular case, improving CI performance. Since the work file caches are very small (about 30-40 MB) they are cached for PRs, too. This will allow the majority of PRs, which only change source code files, but no dependencies, to still have cached their build files for additional commits.

This removes the GHC install from stack caches to reduce size.

This action makes sure to always have the correct GHC and/or stack version installed in all environments. This solves problem where ghc or stack might not be available on newer macos images anymore or where ghcup is not available by default on our new custom github runner on arm.

wolfgangwalther · 2024-05-20T11:22:43Z

I focused on improvements to the stack and cabal caching here and decided against trying to play with more nix-related github actions caching. Instead I will try #3364 (comment) to speed up the macos nix job.

wolfgangwalther requested a review from develop7 February 6, 2024 18:16

wolfgangwalther force-pushed the loadtest-caches branch from 67da5b7 to 742268e Compare February 6, 2024 18:18

wolfgangwalther marked this pull request as draft February 6, 2024 21:10

wolfgangwalther force-pushed the loadtest-caches branch from 742268e to ffb7e24 Compare February 6, 2024 21:10

wolfgangwalther mentioned this pull request Feb 7, 2024

ci: improve previous caches reuse #3187

Merged

develop7 approved these changes Feb 7, 2024

View reviewed changes

wolfgangwalther mentioned this pull request Feb 7, 2024

ci: delete cache-id parameter of setup-nix action #3180

Closed

wolfgangwalther force-pushed the loadtest-caches branch from ffb7e24 to 8ff19f4 Compare February 7, 2024 19:08

wolfgangwalther force-pushed the loadtest-caches branch 10 times, most recently from 59fe0f7 to e736401 Compare February 26, 2024 19:05

wolfgangwalther mentioned this pull request Feb 26, 2024

ci: Nix cache miss in "Run doctests" step of "Test (Nix)" CI job #3183

Closed

wolfgangwalther force-pushed the loadtest-caches branch 8 times, most recently from 112574c to 98b1cf3 Compare March 4, 2024 08:07

wolfgangwalther force-pushed the loadtest-caches branch 2 times, most recently from 4900cc6 to 43a3187 Compare May 12, 2024 19:20

wolfgangwalther added 4 commits May 20, 2024 13:15

ci: Only save caches on main and release branches

04ac95b

This restores caches on all branches and pull requests, but only stores them on the main branch and release branches. This prevents those caches from being evicted early when we hit the 10 GB limit quickly.

ci: Reduce stack cache size

dc9a27d

This removes the GHC install from stack caches to reduce size.

wolfgangwalther force-pushed the loadtest-caches branch from 0139c84 to a510b8b Compare May 20, 2024 11:20

wolfgangwalther marked this pull request as ready for review May 20, 2024 11:20

wolfgangwalther merged commit 9e6a89f into PostgREST:main May 20, 2024
11 checks passed

wolfgangwalther deleted the loadtest-caches branch June 2, 2024 15:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: Improve caching #3211

ci: Improve caching #3211

wolfgangwalther commented Feb 6, 2024 •

edited

Loading

wolfgangwalther commented Feb 6, 2024 •

edited

Loading

develop7 left a comment

develop7 commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

develop7 commented Feb 7, 2024

develop7 commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

develop7 commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

develop7 commented Feb 7, 2024

develop7 commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

develop7 commented Feb 7, 2024 •

edited

Loading

wolfgangwalther commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

develop7 commented Feb 8, 2024

wolfgangwalther commented Feb 8, 2024

wolfgangwalther commented May 20, 2024

ci: Improve caching #3211

ci: Improve caching #3211

Conversation

wolfgangwalther commented Feb 6, 2024 • edited Loading

wolfgangwalther commented Feb 6, 2024 • edited Loading

develop7 left a comment

Choose a reason for hiding this comment

develop7 commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

develop7 commented Feb 7, 2024

develop7 commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

develop7 commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

develop7 commented Feb 7, 2024

develop7 commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

develop7 commented Feb 7, 2024 • edited Loading

wolfgangwalther commented Feb 7, 2024

wolfgangwalther commented Feb 7, 2024

develop7 commented Feb 8, 2024

wolfgangwalther commented Feb 8, 2024

wolfgangwalther commented May 20, 2024

wolfgangwalther commented Feb 6, 2024 •

edited

Loading

wolfgangwalther commented Feb 6, 2024 •

edited

Loading

develop7 commented Feb 7, 2024 •

edited

Loading