Helm release manifest too large #1783
Comments
Unbreaking deploys by using helmignore to omit certain flat files from inclusion in the automatically generated bundle. Prior to this change, all Helm deploys fail with a "too large" error, since the base64-encoded manifest is now greater than 1MB, too large for a k8s Secret, thanks to the recent update of the Discord allocations. Towards #1783.
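For reference, a minimal sketch of what such a .helmignore could look like, with an assumed chart path and file patterns rather than the repo's actual layout; Helm reads .gitignore-style patterns relative to the chart root and leaves matching files out of the packaged chart, and therefore out of the release bundle:

# Chart path below is a placeholder for wherever Chart.yaml lives in this repo.
cat > deployments/charts/penumbra/.helmignore <<'EOF'
# gitignore-style patterns, relative to the chart root:
# keep large generated flat files out of the packaged chart,
# so the release manifest stays under the 1MB Secret cap.
*.csv
generated-configs/
EOF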
The fix in ea41cca indeed unbroke preview deploys, as shown here: https://github.com/penumbra-zone/penumbra/actions/runs/3833889668/jobs/6525811171
This bit us again; see the failed deploy in https://github.com/penumbra-zone/penumbra/actions/runs/4897494168/jobs/8745987867. It was #2491 that updated the Discord allocations. I'll take a look at deduping those manually, or otherwise reverting them for now, to unblock release on Monday.
Short-term, I've reverted the update to the Discord allocations in ce63890. I was careful to preserve the paper asset additions introduced in #2491. That change gets us back under the limit, unbreaking preview immediately and unblocking release on Monday. We're redlining on the filesize limitation as-is, though, so we effectively can't update the Discord allocations again until this problem is resolved. As an aside, we'd do well to deduplicate the Discord allocations by Discord user ID. I tried that, but it wasn't a big enough change to unbreak things, so I'm not including it right now. Here's the patch for future reference:

diff --git a/testnets/new-testnet.sh b/testnets/new-testnet.sh
index f5205e34..bd9e5ae1 100755
--- a/testnets/new-testnet.sh
+++ b/testnets/new-testnet.sh
@@ -14,7 +14,13 @@ cp "$PREVIOUS_TESTNET_DIRECTORY/validators.json" "$NEW_TESTNET_DIRECTORY/validat
echo "Setting up allocations for new testnet..."
-cut -d , -f 6 < discord_history.csv \
+# We deduplicate addresses by keying on Discord user ID. The goals are two-fold:
+#
+# * don't reward spamming the bot with ephemeral addresses for more test tokens
+# * limit the filesize of the discord allocations file, pending resolution of GH1783.
+#
+awk -F, '{key = $4 FS } !seen[key]++' < discord_history.csv \
+ | cut -d , -f 6 \
| tail -n +2 | sort | uniq \
| grep ^penumbrav \
| sed -e 's/^/"1_000__000_000","upenumbra","/; s/$/"/' \

Long-term, the solution is as described above: we need to change the way we pass genesis information into the cluster deployment. Reading a flat file directly into a ConfigMap won't cut it, due to the 1MB limit we're constrained by. We'd probably do better to split the cluster deployment logic into two logical halves: testnet generation with validators, and bootstrapping fullnodes onto an existing testnet. Rather than perform the testnet generation outside the cluster and feed the data in, we can extend the initContainers logic to create the chain data in a PV and pass it around during the config-munging stage. I'll ping some SL folks for input, given that they've likely encountered this problem elsewhere while supporting other projects. Here's some quantitative data I gathered while testing locally:
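As a rough sketch of that in-cluster generation idea (the binary invocation, flags, and paths below are illustrative assumptions, not our actual deployment scripts), an initContainer could write the chain data straight to a mounted PersistentVolume rather than receiving it from the chart:

#!/usr/bin/env bash
# Illustrative initContainer entrypoint: generate testnet data inside the cluster,
# writing to a PersistentVolume mounted at /penumbra-config, so no large flat
# files need to be smuggled into the cluster via the Helm release.
set -euo pipefail

CONFIG_DIR="/penumbra-config"   # assumed PV mount path

# Only generate on first boot; later restarts reuse the data already on the PV.
if [ ! -d "$CONFIG_DIR/node0" ]; then
  # Flag names are assumptions; the real CLI invocation would need checking.
  pd testnet generate --testnet-dir "$CONFIG_DIR"
fi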
I'll raise the priority on this, so we can return to updating the Discord allocations ad hoc, rather than avoiding it as we must now.
Preview is back up, as of https://github.com/penumbra-zone/penumbra/actions/runs/4898387278/jobs/8747515808.
During ramp-up to Testnet 52 (#2415), we opportunistically updated the Discord allocations, and encountered #1783 again. While we need a durable solution to that problem, we also want a quick fix, so we turn to tried-and-true file compression: we gzip the generated json file, base64 encode it to sideload it into the cluster, then decompress it as part of the initContainer bootstrapping logic during node creation. This is a "good enough" solution to buy time for the near future, ensuring that we can continue to update Discord allocations when we want. The compression ratio is approximately 3x, which gives us some headroom.
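For reference, a minimal sketch of that compress/encode/decode pipeline, with illustrative file names and paths:

# CI side: compress the generated genesis, then base64-encode it for sideloading.
gzip --keep --best validator-genesis.json    # produces validator-genesis.json.gz
base64 --wrap=0 validator-genesis.json.gz > genesis.json.gz.b64

# initContainer side: reverse the process before the node reads its config.
base64 --decode /tmp/genesis.json.gz.b64 | gunzip > /penumbra-config/genesis.json

JSON with many repeated keys compresses well, which is where the roughly 3x ratio comes from.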
The helm limit issue was resolved as of #3033: there's no longer a 1MB limitation on the size of genesis data we must upload to the cluster nodes. Instead, we continue to bake the default allocations in, as before. However, updating the Discord allocations (2023-08-03 -> 2023-09-18) caused the smoke test to fail in #3047, due to a message-too-large error.
That's the 4MB message size limit for gRPC, which is smaller than the new candidate genesis file, now >5MB. So we're still effectively blocked on making the genesis file larger, although no longer by the CI tooling. The tonic docs state that these message size limits are configurable.
Refs #3123. Notably, this does not include any Discord allocation updates, as we're still blocked by [0].

[0] #1783 (comment)
Can we split that Tonic issue out into a new issue? It seems worth fixing separately.
Done: #3147
This is no longer an issue, given the changes in #3033. We may want to return to the gRPC message size limitation, but that's already tracked separately.
Describe the bug
The Helm logic for deploying to Kubernetes creates a bundle greater than 1MB, the largest permissible Secret size in k8s, causing deploys to fail.
To Reproduce
See the example CI run here: https://github.com/penumbra-zone/penumbra/actions/runs/3832692505/jobs/6523427954. The exact error message, reporting that the release manifest is too large for a Secret, is visible in that run's logs.
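To see the failure mode outside CI, one can inspect the Helm release Secrets directly on a cluster; the release name below is a placeholder:

# Helm 3 stores each release revision as a Secret named sh.helm.release.v1.<release>.v<rev>.
kubectl get secrets -l owner=helm
# Byte count of the encoded release payload for one revision; anything close to
# the 1MB Secret cap is at risk of being rejected.
kubectl get secret sh.helm.release.v1.penumbra-testnet.v1 -o jsonpath='{.data.release}' | wc -c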
Expected behavior
The deploy succeeds.
Additional context
This happens because the CI deployment logic first generates a testnet config and then interpolates those flat files inside the Helm chart. Unfortunately, this means that the filesize of each of those generated config files is added to the manifest bundle, which is base64-encoded and stored as a Secret on the cluster. Once this bag of bytes crossed the 1MB threshold, deploys started breaking. That threshold was crossed when updating the Discord allocations as part of #1781.
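A quick way to gauge how close a change pushes us to that threshold before anything hits CI (chart path and release name are placeholders):

# Render the chart locally and measure the resulting manifest size; the release
# Secret stores an encoded copy of this bundle.
helm template penumbra-preview ./deployments/charts/penumbra | wc -c

# Raw size of the generated flat files that get pulled into the chart.
du -ch ./deployments/charts/penumbra/files/*.json | tail -n 1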
What do?
We have two options:
To unblock deploy of 039-praxidike today, I opted for option 2 locally. I'll push that up onto main to unbreak auto-deploys of preview. The clock is ticking on option 2, though: once the allocation files grow large enough, it will stop working. For now, let's pause manual updates of the Discord allocations until we revise the deploy logic to use option 1.