Helm release manifest too large #1783

Closed
Tracked by #2983
conorsch opened this issue Jan 3, 2023 · 8 comments

@conorsch

conorsch commented Jan 3, 2023

Describe the bug
The helm logic for deploying to kubernetes creates a bundle greater than 1MB, which is the largest permissible Secret size in k8s, causing deploys to fail.

To Reproduce
See example CI run here: https://github.com/penumbra-zone/penumbra/actions/runs/3832692505/jobs/6523427954
The exact error message is:

Error: UPGRADE FAILED: create: failed to create: Secret "sh.helm.release.v1.penumbra-testnet.v3" is invalid: data: Too long: must have at most 1048576 bytes

Expected behavior
The deploy succeeds.

Additional context
This happens because the CI deployment logic first generates a testnet config and then interpolates those flat files into the Helm chart. Unfortunately, this means the filesize of each generated config file is added to the manifest bundle, which is base64-encoded and stored as a Secret on the cluster. Once that bag of bytes crossed the 1MB threshold, deploys started breaking. The threshold was crossed when the Discord allocations were updated as part of #1781
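
For a rough sense of how close a release is to that cap, the stored release secret can be inspected directly. A sketch (Helm 3 keeps the payload under the release data key; the secret name matches the one in the error above, substituting whichever revision actually exists on the cluster):

# Measure the stored Helm release payload; the 1MiB cap applies to the Secret as a whole.
kubectl get secret sh.helm.release.v1.penumbra-testnet.v3 \
    -o jsonpath='{.data.release}' | wc -c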

What do?

We have two options:

  1. Long term, we should stop interpolating config files and instead use a separate script to generate YAML vars, which can be passed in at runtime.
  2. Short term, we can selectively omit certain non-critical config files via helmignore.

To unblock the deploy of 039-praxidike today, I opted for 2 locally. I'll push that up onto main to unbreak auto-deploys of preview. The clock is ticking, though: option 2 stops being viable once the allocation files alone grow large enough. For now, let's pause manual updates of the Discord allocations until we revise the deploy logic to use option 1.
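
As a sketch of what 2 looks like in practice (the chart path and globs here are illustrative, not the actual repo layout):

# Exclude the heaviest generated config files from the packaged chart, and therefore
# from the release secret, by appending them to the chart's .helmignore.
cat >> deployments/charts/penumbra/.helmignore <<'EOF'
# Non-critical generated testnet config; omitted to keep the release under 1MB (#1783)
*allocations*.csv
EOF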

conorsch added a commit that referenced this issue Jan 3, 2023
Unbreaking deploys by using helmignore to omit
certain flat files from inclusion in the automatically generated bundle.
Prior to this change, all helm deploys fail with a "too large" error,
since the entirety of the manifest base64-encoded is now greater than
1MB, too large for a k8s secret, thanks to the recent update of the
Discord allocations.

Towards #1783.
@conorsch

conorsch commented Jan 4, 2023

I'll push that up onto main to unbreak auto-deploys of preview.

The fix in ea41cca indeed unbroke preview deploys, as shown here: https://github.com/penumbra-zone/penumbra/actions/runs/3833889668/jobs/6525811171

@conorsch conorsch self-assigned this Jan 6, 2023
@redshiftzero redshiftzero moved this to Testnet 40: Themisto in Testnets Jan 6, 2023
@conorsch conorsch moved this from Testnet 40: Themisto to Future in Testnets Jan 12, 2023
@conorsch conorsch moved this from Future to Testnet 41: Callirrhoe in Testnets Jan 13, 2023
@conorsch conorsch moved this from Testnet 41: Callirrhoe to Future in Testnets Jan 19, 2023
conorsch added a commit that referenced this issue Mar 1, 2023
This is a naive import of the history from 2023-01 and 2023-02.
Thought I'd need to deduplicate it manually, but the `new-testnet.sh`
already dedupes by addr on import. Let's see how CI likes it.

Closes #2063. Refs #1783.
@conorsch

conorsch commented May 5, 2023

This bit us again, see failed deploy in https://github.com/penumbra-zone/penumbra/actions/runs/4897494168/jobs/8745987867. It was #2491 that updated Discord allocations. Will take a look at deduping those manually or otherwise reverting them for now, to unblock release on Monday.

conorsch added a commit that referenced this issue May 5, 2023
Regenerated the 052 chain info after reverting recent Discord allocation
additions, to avoid #1783. Was careful to preserve the paper allocations
from #2491, which we want in 52 as part of the DEX feature set.
@conorsch

conorsch commented May 5, 2023

Short-term, I've reverted the update to the Discord allocations in ce63890. I was careful to preserve the paper asset additions introduced in #2491. That change gets us back under the limit, unbreaking preview immediately and unblocking release on Monday. We're redlining on the filesize limit as-is, though, so we effectively can't update Discord allocations until this problem is resolved.

As an aside, we'd do well to deduplicate the Discord allocations by Discord user ID. I tried that, but it wasn't a big enough change to unbreak things, so I'm not including it right now.

Here's the patch for future reference
diff --git a/testnets/new-testnet.sh b/testnets/new-testnet.sh
index f5205e34..bd9e5ae1 100755
--- a/testnets/new-testnet.sh
+++ b/testnets/new-testnet.sh
@@ -14,7 +14,13 @@ cp "$PREVIOUS_TESTNET_DIRECTORY/validators.json" "$NEW_TESTNET_DIRECTORY/validat
 
 echo "Setting up allocations for new testnet..."
 
-cut -d , -f 6 < discord_history.csv \
+# We deduplicate addresses by keying on Discord user ID. The goals are two-fold:
+#
+#   * don't reward spamming the bot with ephemeral addresses for more test tokens
+#   * limit the filesize of the discord allocations file, pending resolution of GH1783.
+#
+awk -F, '{key = $4 FS } !seen[key]++' < discord_history.csv \
+    | cut -d , -f 6 \
     | tail -n +2 | sort | uniq \
     | grep ^penumbrav \
     | sed -e 's/^/"1_000__000_000","upenumbra","/; s/$/"/' \

Long-term, the solution is as described above: we need to change the way we pass genesis information into the cluster deployment. Reading a flat file directly into a ConfigMap won't cut it, due to the 1MB limit we're constrained by. We'd probably do better to split the cluster deployment logic into two logical halves: testnet generation with validators, and bootstrapping fullnodes onto an existing testnet. Rather than perform the testnet generation outside the cluster and feed the data in, we can extend the initContainers logic to create the chain data in a PV and pass that around during the config-munging stage. Will ping some SL folks for input, given that they've likely encountered this problem elsewhere while supporting other projects.
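
A minimal sketch of the in-cluster half, assuming a PV mounted at /penumbra-config (the mount path is illustrative, and the real chain parameters would come from chart values):

# initContainer entrypoint sketch: generate chain data onto the PersistentVolume
# instead of sideloading it through the k8s API from CI.
config_dir="/penumbra-config/testnet_data"
if [ ! -d "$config_dir" ]; then
    pd testnet generate   # output directory, chain id, allocations, etc. supplied via chart values
fi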

Here's some quantitative data I gathered while testing locally:

Will raise priority on this, so we can return to updating the Discord allocations ad-hoc, rather than avoiding it, as we must now.

conorsch added a commit that referenced this issue May 8, 2023
During ramp-up to Testnet 52 (#2415), we opportunistically updated the
Discord allocations, and encountered #1783 again. While we need a
durable solution to that problem, we also want a quick fix, so we turn
to tried-and-true file compression: we gzip the generated json file,
base64 encode it to sideload it into the cluster, then decompress it as
part of the initContainer bootstrapping logic during node creation.
This is a "good enough" solution to buy time for the near future,
ensuring that we can continue to update Discord allocations when we
want. The compression ratio is approximately 3x, which gives us some
headroom.
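
The round trip is roughly the following (filenames and paths are illustrative):

# CI side: compress and encode the generated JSON before loading it into the cluster.
gzip -9 --keep allocations.json
base64 -w0 allocations.json.gz > allocations.json.gz.b64

# Node side, during initContainer bootstrapping: reverse the encoding before pd reads it.
base64 -d allocations.json.gz.b64 | gunzip > /penumbra-config/allocations.json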
@conorsch

The Helm limit issue was resolved as of #3033: there's no longer a 1MB limitation on the size of genesis data we must upload to the cluster nodes. Instead, we continue to bake the default allocations into the pd binary, and we use pd testnet generate inside the cluster to create a new network. Since no sideloading of genesis data via the API occurs, the 1MB ConfigMap limit is irrelevant.

However, updating the Discord allocations (2023-08-03 -> 2023-09-18) caused the smoke test to fail in #3047, due to a message-too-large error:

Sep 18 16:49:18.666 INFO penumbra_view::worker: view worker error e=status: OutOfRange, message: "Error, message length too large: found 5347639 bytes, the limit is: 4194304 bytes", details: [], metadata: MetadataMap { headers: {} }

That's the 4MiB (4194304-byte) default message size limit for gRPC, and the new candidate genesis file is >5MB, so it no longer fits. We're still effectively blocked on making the genesis file larger, although no longer by the CI tooling. The tonic docs state that max_decoding_message_size can be overridden on a per-client basis, but it's not obvious to me how we'd customize that config attribute in our generated proto files.
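
For reference, a quick sanity check of a candidate genesis against that default, before pushing (the genesis path is illustrative):

# tonic/gRPC default message cap is 4MiB (4194304 bytes), matching the error above.
limit=$((4 * 1024 * 1024))
size=$(wc -c < genesis.json)
if [ "$size" -gt "$limit" ]; then
    echo "genesis is $size bytes, over the ${limit}-byte gRPC default" >&2
fi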

conorsch added a commit that referenced this issue Sep 29, 2023
Refs #3123. Notably does not include any Discord allocation updates,
as we're still blocked by [0].

[0] #1783 (comment)
@hdevalence

Can we split that Tonic issue out into a new issue? It seems worth fixing separately.

@conorsch

conorsch commented Oct 4, 2023

Can we split that Tonic issue out into a new issue? It seems worth fixing separately.

Done: #3147

@aubrika aubrika added this to Penumbra Oct 30, 2023
@github-project-automation github-project-automation bot moved this to 🗄️ Backlog in Penumbra Oct 30, 2023
@conorsch

This is no longer an issue, given the changes in #3033. We may want to return to the gRPC message size limitation at some point, but we've got that tracked separately already.

@github-project-automation github-project-automation bot moved this from Future to Testnet 63: Rhea (Web Wallet) in Testnets Nov 28, 2023
@github-project-automation github-project-automation bot moved this from 🗄️ Backlog to ✅ Done in Penumbra Nov 28, 2023