Helm release manifest too large #1783

Closed
Tracked by #2983
conorsch opened this issue Jan 3, 2023 · 8 comments

@conorsch

conorsch commented Jan 3, 2023

Describe the bug
The helm logic for deploying to kubernetes creates a bundle greater than 1MB, which is the largest permissible Secret size in k8s, causing deploys to fail.

To Reproduce
See example CI run here: https://github.com/penumbra-zone/penumbra/actions/runs/3832692505/jobs/6523427954
The exact error message is:

Error: UPGRADE FAILED: create: failed to create: Secret "sh.helm.release.v1.penumbra-testnet.v3" is invalid: data: Too long: must have at most 1048576 bytes

Expected behavior
The deploy succeeds.

Additional context
This happens because the CI deployment logic first generates a testnet config and then interpolates those flat files into the Helm chart. Unfortunately, this means the filesize of each generated config file is added to the manifest bundle, which is base64-encoded and stored as a Secret on the cluster. Once that bag of bytes crossed the 1MB threshold, deploys started breaking. The threshold was crossed when the Discord allocations were updated as part of #1781
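
For a rough sense of how close a release is to that cap, the stored release secret can be inspected directly. A sketch (Helm 3 keeps the payload under the release data key; the secret name matches the one in the error above, substituting whichever revision actually exists on the cluster):

# Measure the stored Helm release payload; the 1MiB cap applies to the Secret as a whole.
kubectl get secret sh.helm.release.v1.penumbra-testnet.v3 \
    -o jsonpath='{.data.release}' | wc -c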

What do?

We have two options:

  1. Long term, we should stop interpolating config files and instead use a separate script to generate YAML vars, which can be passed in at runtime.
  2. Short term, we can selectively omit certain non-critical config files via helmignore.

To unblock the deploy of 039-praxidike today, I opted for 2 locally. I'll push that up onto main to unbreak auto-deploys of preview. The clock is ticking, though: option 2 stops being viable once the allocation files alone grow large enough. For now, let's pause manual updates of the Discord allocations until we revise the deploy logic to use option 1.
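
As a sketch of what 2 looks like in practice (the chart path and globs here are illustrative, not the actual repo layout):

# Exclude the heaviest generated config files from the packaged chart, and therefore
# from the release secret, by appending them to the chart's .helmignore.
cat >> deployments/charts/penumbra/.helmignore <<'EOF'
# Non-critical generated testnet config; omitted to keep the release under 1MB (#1783)
*allocations*.csv
EOF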

conorsch added a commit that referenced this issue Jan 3, 2023
Unbreaking deploys by using helmignore to omit
certain flat files from inclusion in the automatically generated bundle.
Prior to this change, all helm deploys fail with a "too large" error,
since the entirety of the manifest base64-encoded is now greater than
1MB, too large for a k8s secret, thanks to the recent update of the
Discord allocations.

Towards #1783.
@conorsch

conorsch commented Jan 4, 2023

I'll push that up onto main to unbreak auto-deploys of preview.

The fix in ea41cca indeed unbroke preview deploys, as shown here: https://github.com/penumbra-zone/penumbra/actions/runs/3833889668/jobs/6525811171

@conorsch conorsch self-assigned this Jan 6, 2023
@redshiftzero redshiftzero moved this to Testnet 40: Themisto in Testnets Jan 6, 2023
@conorsch conorsch moved this from Testnet 40: Themisto to Future in Testnets Jan 12, 2023
@conorsch conorsch moved this from Future to Testnet 41: Callirrhoe in Testnets Jan 13, 2023
@conorsch conorsch moved this from Testnet 41: Callirrhoe to Future in Testnets Jan 19, 2023
conorsch added a commit that referenced this issue Mar 1, 2023
This is a naive import of the history from 2023-01 and 2023-02.
Thought I'd need to deduplicate it manually, but the `new-testnet.sh`
already dedupes by addr on import. Let's see how CI likes it.

Closes #2063. Refs #1783.
@conorsch

conorsch commented May 5, 2023

This bit us again, see failed deploy in https://github.com/penumbra-zone/penumbra/actions/runs/4897494168/jobs/8745987867. It was #2491 that updated Discord allocations. Will take a look at deduping those manually or otherwise reverting them for now, to unblock release on Monday.

conorsch added a commit that referenced this issue May 5, 2023
Regenerated the 052 chain info after reverting recent Discord allocation
additions, to avoid #1783. Was careful to preserve the paper allocations
from #2491, which we want in 52 as part of the DEX feature set.
@conorsch

conorsch commented May 5, 2023

Short-term, I've reverted the update to the Discord allocations in ce63890. I was careful to preserve the paper asset additions introduced in #2491. That change gets us back under the limit, unbreaking preview immediately and unblocking release on Monday. We're redlining on the filesize limit as-is, though, so we effectively can't update Discord allocations until this problem is resolved.

As an aside, we'd do well to deduplicate the Discord allocations by Discord user ID. I tried that, but it wasn't a big enough change to unbreak things, so I'm not including it right now.

Here's the patch for future reference
diff --git a/testnets/new-testnet.sh b/testnets/new-testnet.sh
index f5205e34..bd9e5ae1 100755
--- a/testnets/new-testnet.sh
+++ b/testnets/new-testnet.sh
@@ -14,7 +14,13 @@ cp "$PREVIOUS_TESTNET_DIRECTORY/validators.json" "$NEW_TESTNET_DIRECTORY/validat
 
 echo "Setting up allocations for new testnet..."
 
-cut -d , -f 6 < discord_history.csv \
+# We deduplicate addresses by keying on Discord user ID. The goals are two-fold:
+#
+#   * don't reward spamming the bot with ephemeral addresses for more test tokens
+#   * limit the filesize of the discord allocations file, pending resolution of GH1783.
+#
+awk -F, '{key = $4 FS } !seen[key]++' < discord_history.csv \
+    | cut -d , -f 6 \
     | tail -n +2 | sort | uniq \
     | grep ^penumbrav \
     | sed -e 's/^/"1_000__000_000","upenumbra","/; s/$/"/' \

Long-term, the solution is as described above: we need to change the way we pass genesis information into the cluster deployment. Reading a flat file directly into a ConfigMap won't cut it, due to the 1MB limit we're constrained by. We'd probably do better to split the cluster deployment logic into two logical halves: testnet generation with validators, and bootstrapping fullnodes onto an existing testnet. Rather than perform the testnet generation outside the cluster and feed the data in, we can extend the initContainers logic to create the chain data in a PV and pass that around during the config-munging stage. Will ping some SL folks for input, given that they've likely encountered this problem elsewhere while supporting other projects.
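
A minimal sketch of the in-cluster half, assuming a PV mounted at /penumbra-config (the mount path is illustrative, and the real chain parameters would come from chart values):

# initContainer entrypoint sketch: generate chain data onto the PersistentVolume
# instead of sideloading it through the k8s API from CI.
config_dir="/penumbra-config/testnet_data"
if [ ! -d "$config_dir" ]; then
    pd testnet generate   # output directory, chain id, allocations, etc. supplied via chart values
fi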

Here's some quantitative data I gathered while testing locally:

Will raise priority on this, so we can return to updating the Discord allocations ad-hoc, rather than avoiding it, as we must now.

conorsch added a commit that referenced this issue May 8, 2023
During ramp-up to Testnet 52 (#2415), we opportunistically updated the
Discord allocations, and encountered #1783 again. While we need a
durable solution to that problem, we also want a quick fix, so we turn
to tried-and-true file compression: we gzip the generated json file,
base64 encode it to sideload it into the cluster, then decompress it as
part of the initContainer bootstrapping logic during node creation.
This is a "good enough" solution to buy time for the near future,
ensuring that we can continue to update Discord allocations when we
want. The compression ratio is approximately 3x, which gives us some
headroom.
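
The round trip is roughly the following (filenames and paths are illustrative):

# CI side: compress and encode the generated JSON before loading it into the cluster.
gzip -9 --keep allocations.json
base64 -w0 allocations.json.gz > allocations.json.gz.b64

# Node side, during initContainer bootstrapping: reverse the encoding before pd reads it.
base64 -d allocations.json.gz.b64 | gunzip > /penumbra-config/allocations.json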
@conorsch

The Helm limit issue was resolved as of #3033: there's no longer a 1MB limitation on the size of genesis data we must upload to the cluster nodes. Instead, we continue to bake the default allocations into the pd binary, and we use pd testnet generate inside the cluster to create a new network. Since no sideloading of genesis data via the API occurs, the 1MB ConfigMap limit is irrelevant.

However, updating the Discord allocations (2023-08-03 -> 2023-09-18) caused the smoke test to fail in #3047, due to a message-too-large error:

Sep 18 16:49:18.666 INFO penumbra_view::worker: view worker error e=status: OutOfRange, message: "Error, message length too large: found 5347639 bytes, the limit is: 4194304 bytes", details: [], metadata: MetadataMap { headers: {} }

That's the 4MiB (4194304-byte) default message size limit for gRPC, and the new candidate genesis file is >5MB, so it no longer fits. We're still effectively blocked on making the genesis file larger, although no longer by the CI tooling. The tonic docs state that max_decoding_message_size can be overridden on a per-client basis, but it's not obvious to me how we'd customize that config attribute in our generated proto files.
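
For reference, a quick sanity check of a candidate genesis against that default, before pushing (the genesis path is illustrative):

# tonic/gRPC default message cap is 4MiB (4194304 bytes), matching the error above.
limit=$((4 * 1024 * 1024))
size=$(wc -c < genesis.json)
if [ "$size" -gt "$limit" ]; then
    echo "genesis is $size bytes, over the ${limit}-byte gRPC default" >&2
fi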

conorsch added a commit that referenced this issue Sep 29, 2023
Refs #3123. Notably does not include any Discord allocation updates,
as we're still blocked by [0].

[0] #1783 (comment)
@hdevalence

Can we split that Tonic issue out into a new issue? It seems worth fixing separately.

@conorsch

conorsch commented Oct 4, 2023

Can we split that Tonic issue out into a new issue? It seems worth fixing separately.

Done: #3147

@aubrika aubrika added this to Penumbra Oct 30, 2023
@github-project-automation github-project-automation bot moved this to 🗄️ Backlog in Penumbra Oct 30, 2023
@conorsch

This is no longer an issue, given the changes in #3033. We may want to return to the gRPC message size limitation at some point, but we've got that tracked separately already.

@github-project-automation github-project-automation bot moved this from Future to Testnet 63: Rhea (Web Wallet) in Testnets Nov 28, 2023
@github-project-automation github-project-automation bot moved this from 🗄️ Backlog to ✅ Done in Penumbra Nov 28, 2023