Replace current `testnet-preview` deployment with new k8s deployment #1659

hdevalence · 2022-11-23T02:23:51Z

Is your feature request related to a problem? Please describe.

We should try to move over to the new k8s deployment system built by Strangelove, and start with replacing testnet-preview. The goal of testnet-preview is that it should be an exact preview of what would be deployed if the current state of the main branch were tagged as a release. This ensures that there are no deployment surprises when tagging a release, and allows testing client protocols against the current state of the main branch.

The only difference between the testnet-preview and testnet deployments should be that when deploying testnet, we pass the --preserve-chain-id parameter to pd testnet generate to avoid randomizing the chain ID (since there should only be one deployment per tag).
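
A minimal sketch of how a deploy script might express that single difference. The pd testnet generate command and the --preserve-chain-id flag come from the description above; the helper name and the environment names passed to it are hypothetical, and any other arguments the real command needs are left out.

```shell
# Print the generate command for a given environment (hypothetical helper).
# Only the tagged "testnet" deploy keeps its chain ID; every other
# environment lets pd randomize it.
generate_cmd() {
  env_name="$1"
  if [ "$env_name" = "testnet" ]; then
    echo "pd testnet generate --preserve-chain-id"
  else
    echo "pd testnet generate"
  fi
}
```

Keeping the branch in one place makes it easy to confirm that the two deployments differ in nothing else.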

Describe the solution you'd like

Update k8s helm chart logic to support side-by-side deploys - ci: refactor helm logic for unique resources #1707
Write new "deploy-preview-new" workflow for testnet-preview on k8s - ci: adds testnet-preview k8s deploy workflow #1719
Ensure the action is triggered on pushes to main and uses the latest container images (does it need to wait for them to be built?) - ci: set explicit workflow dependencies #1730
Merge new workflows into main, use GHA interface to trigger them
Compare /status endpoints between e.g. http://testnet-preview.penumbra.zone:26657/status & http://fullnode.testnet-preview.penumbra.zone:26657/status
Ask for team testing against fullnode.testnet-preview.penumbra.zone
Update DNS for cut-over on "testnet-preview" (can happen whenever, mostly internal use only)
Update DNS for cut-over (will need to be coordinated with next testnet deploy, otherwise we lose state)
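
The /status comparison in the list above could be scripted roughly as follows. This is a sketch, assuming curl and jq are available and that both endpoints return the usual Tendermint RPC response shape (result.node_info); the function names and the choice of fields to compare are assumptions, not part of the existing tooling.

```shell
# Reduce a Tendermint /status response (on stdin) to the fields worth
# comparing, as one compact line with sorted keys.
status_summary() {
  jq -cS '.result.node_info | {network, version, tx_index: .other.tx_index}'
}

# Fetch /status from an RPC base URL and summarize it.
# Fails if the request itself fails.
fetch_summary() {
  body=$(curl -fsS "$1/status") || return 1
  printf '%s\n' "$body" | status_summary
}

# Compare two RPC base URLs. Returns 0 on a match, 1 on a mismatch,
# and 2 if either endpoint could not be fetched.
compare_status() {
  left=$(fetch_summary "$1") || return 2
  right=$(fetch_summary "$2") || return 2
  if [ "$left" = "$right" ]; then
    echo "match"
  else
    printf 'left:  %s\nright: %s\n' "$left" "$right"
    return 1
  fi
}
```

Usage would look like: compare_status http://testnet-preview.penumbra.zone:26657 http://fullnode.testnet-preview.penumbra.zone:26657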

hdevalence · 2022-11-26T18:52:14Z

The current k8s deployment provides TLS access to the Tendermint RPC endpoint (load-balancing over fullnodes). We should provide an additional endpoint that gives TLS access to the pd GRPC endpoint.

We are not in a position to use a TLS endpoint from pcli for kind of boring reasons (we hardcode "http" in a bunch of places, and assume that we have one host for both tendermint + pd with endpoints on different ports), but exposing a TLS pd endpoint is important to do now because we're trying to use grpc-web to access it, and without TLS, this is not really possible because of mixed content rules.

conorsch · 2022-11-29T17:05:20Z

Took a look at what's required here for cut-over. We currently run two discrete testnets:

testnet.penumbra.zone (built from approximately-weekly tags)
testnet-preview.penumbra.zone (built very frequently from latest HEAD of main branch)

Right now, the k8s deployment logic assumes there's only one testnet, and it destructively resets on updates. That's already a great match for how we manage testnet-preview, but we want to do both on k8s. I'll work on adding a few more knobs to the new deployment logic, so we can set HELM_RELEASE or similar and touch only the proper set of testnet resources during CI runs.
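
One thing to watch when scoping by release name: a plain prefix match on "penumbra-testnet" would also catch "penumbra-testnet-preview" resources. The sketch below filters pod listings per release. The release names are inferred from the pod names that appear later in this thread, and the helper itself is hypothetical rather than part of the deployment scripts.

```shell
# Default to the preview release unless the CI job sets HELM_RELEASE.
HELM_RELEASE="${HELM_RELEASE:-penumbra-testnet-preview}"

# Given `kubectl get pods` output on stdin, print only the pod names that
# belong to one release. Matching on "<release>-fn-" and "<release>-val-"
# keeps "penumbra-testnet" from also selecting "penumbra-testnet-preview-*".
pods_for_release() {
  awk -v r="$1" 'index($1, r "-fn-") == 1 || index($1, r "-val-") == 1 { print $1 }'
}
```

Usage would look like: kubectl get pods | pods_for_release "$HELM_RELEASE"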

conorsch · 2022-12-02T17:08:37Z

WIP branch coming together at https://github.com/penumbra-zone/penumbra/tree/1659-testnet-preview-via-k8s. Mostly that diff is adding comments, docs, and some refactoring of the test scripts to make more space for multiple environments. I haven't created a separate cluster, but the Terraform logic is already present to do so. Currently working on:

documenting the fullnode.* A record (DNS records are managed out of band) that relates to the NodePort added in k8s testnet - public peering #1660
debugging unhealthy service backends (this is related to the refactoring work I've done; I broke it); for reference:

kubectl get ingress penumbra-testnet-ingress -o json | jq '.metadata.annotations["ingress.kubernetes.io/backends"]' -r -C
{"k8s-be-30563--0a5eab405c618cec":"HEALTHY","k8s1-0a5eab40-default-penumbra-testnet-26657-ed3f3817":"UNHEALTHY","k8s1-0a5eab40-default-penumbra-testnet-8080-5680cfee":"UNHEALTHY"}

Once those problems are resolved, I'll move on to creating side-by-side environments, and touch up the script as necessary to make sure that subsequent deployments don't clobber unwanted resources.
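
To make the backend check above easier to repeat, the annotation can be reduced to just the failing entries. This is a sketch that assumes jq is available; the function name is hypothetical. The annotation value is itself a JSON string, so it has to be parsed a second time with fromjson.

```shell
# Given `kubectl get ingress <name> -o json` on stdin, print the names of
# backends whose reported state is anything other than HEALTHY.
unhealthy_backends() {
  jq -r '.metadata.annotations["ingress.kubernetes.io/backends"]
         | fromjson
         | to_entries[]
         | select(.value != "HEALTHY")
         | .key'
}
```

Usage would look like: kubectl get ingress penumbra-testnet-ingress -o json | unhealthy_backends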

conorsch · 2022-12-06T23:15:53Z

Cluster config for "testnet" setup is solid, will PR in some housekeeping changes with more docs, comments, and labels throughout. Encountered a problem when I tried to deploy "testnet-preview":

Error syncing to GCP: error running load balancer syncing routine: loadbalancer 86iwvh2x-default-penumbra-testnet-preview-ingr-tbg5eif4 does not exist: googleapi: Error 403: QUOTA_EXCEEDED - Quota 'IN_USE_ADDRESSES' exceeded. Limit: 8.0 globally.

So it appears we've exhausted our account limit on global reserved IPs. I'll see if we can raise that limit, but more likely we'll need to switch to a lower tier of reserved IP to sidestep that limit.

conorsch · 2022-12-07T22:06:45Z

Looked into the IP quota issue. There are actually two limits in play: STATIC_ADDRESSES (limit of 8) and IN_USE_ADDRESSES (also limit of 8). We're already careful about reserving only one static address per testnet, so it's not the STATIC_ADDRESSES limit we're hitting. Rather, it's IN_USE_ADDRESSES, since all external IPs in use by Services count toward that total, regardless of whether they're static or not. I've requested a quota bump (which we've had to do similarly for other resource types, such as persistent storage, back in 2022-09), which should unblock us. They promise response in <2d, but I expect more like <2h. 🤞

For posterity: gcloud compute project-info describe | grep -A1 -B1 ADDR was useful for getting a picture of the limits in play.
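
A slightly more structured variant of that grep, as a sketch: it assumes jq is available and that the JSON form of the project description exposes a quotas list with metric, usage, and limit fields. The function name is hypothetical.

```shell
# Given `gcloud compute project-info describe --format=json` on stdin,
# print "<metric> <usage>/<limit>" for every address-related quota.
address_quotas() {
  jq -r '.quotas[]
         | select(.metric | test("ADDRESSES"))
         | "\(.metric) \(.usage)/\(.limit)"'
}
```

Usage would look like: gcloud compute project-info describe --format=json | address_quotas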

conorsch · 2022-12-07T22:13:41Z

They promise response in <2d, but I expect more like <2h.

OK, it was actually <2m. 🙃

conorsch · 2022-12-08T00:09:41Z

Comparing the two testnet deployments for disparities, it looks like node_info.other.tx_index is set to "null" in k8s but "kv" in the testnet config template. Looks like maybe we want to set that to "kv".

conorsch · 2022-12-08T01:20:12Z

uses the latest container images (does it need to wait for them to be built?)

Not yet implemented on #1719; relevant docs are here https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#workflow_run

conorsch · 2022-12-08T23:11:10Z

Currently working on sorting out the implicit workflow dependencies, and making them explicit. For instance, do we want to build container images if the tests fail? We do not! However, that's currently how things work: the container images get built regardless of the state of other workflows.

Similarly, we must strictly order the workflows so that 1) tests pass; then 2) container image is built; then 3) a deploy is made to the relevant environment. GitHub Actions will allow us to chain up to a maximum of three (3) workflows:

You can't use workflow_run to chain together more than three levels of workflows.

Other potential footguns include the need to manually check whether a previous workflow run failed: by default, a failed dependency workflow will still trigger execution of the dependent workflow, which I still find surprising. Additionally, it may not be possible to inspect whether a dependency workflow was triggered by a tag or a branch change (which is important for us because it's how we gate testnet vs testnet-preview deploys). In the short term, I may opt to copy/paste several workflows and embed them as jobs, to take advantage of more fine-grained control of trigger events.
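
The gating described above amounts to two checks: did the upstream run succeed, and was it triggered by a tag or by main. A minimal sketch of that decision as a shell helper follows. The function name and the exact ref patterns are assumptions. In a workflow_run-triggered workflow the conclusion is available as github.event.workflow_run.conclusion, but how to obtain the triggering ref reliably is exactly the open question raised here.

```shell
# Decide where a finished upstream workflow run should deploy (hypothetical).
#   $1: the upstream run's conclusion, e.g. "success" or "failure"
#   $2: the git ref that triggered it, e.g. "refs/heads/main" or "refs/tags/<tag>"
# Prints the target environment, or returns non-zero when nothing should deploy.
deploy_target() {
  [ "$1" = "success" ] || return 1
  case "$2" in
    refs/tags/*)     echo "testnet" ;;
    refs/heads/main) echo "testnet-preview" ;;
    *)               return 1 ;;
  esac
}
```

A deploy step could then run only when deploy_target prints an environment, which makes the "failed dependency still triggers the dependent workflow" default harmless.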

Next testnet is due Monday, 2022-12-12, and I'd very much like to use the new setup. Today, 2022-12-08, I plan to cut over testnet-preview.penumbra.zone as, ahem, a "preview" of what's to come.

conorsch · 2022-12-09T01:47:26Z

Today, 2022-12-08, I plan to cut over testnet-preview.penumbra.zone

This is done: testnet-preview.penumbra.zone now points to the new k8s deployment. Post-merge it was automatically updated.

terminal output monitoring rollout, for those interested

NAME                                   READY   STATUS        RESTARTS   AGE
penumbra-testnet-fn-0-fvkfk            3/3     Running       0          25h
penumbra-testnet-preview-fn-0-j2hsp    3/3     Terminating   0          24h
penumbra-testnet-preview-val-0-7kghg   2/2     Terminating   0          24h
penumbra-testnet-preview-val-1-pjll9   2/2     Terminating   0          24h
penumbra-testnet-val-0-7fsq2           2/2     Running       0          25h
penumbra-testnet-val-1-lkvcj           2/2     Running       0          25h

❯ kubectl get pods
NAME                                   READY   STATUS    RESTARTS   AGE
penumbra-testnet-fn-0-fvkfk            3/3     Running   0          25h
penumbra-testnet-preview-fn-0-q7s8f    0/3     Pending   0          3s
penumbra-testnet-preview-val-0-qn5gc   0/2     Pending   0          3s
penumbra-testnet-preview-val-1-48mbn   0/2     Pending   0          3s
penumbra-testnet-val-0-7fsq2           2/2     Running   0          25h
penumbra-testnet-val-1-lkvcj           2/2     Running   0          25h

❯ kubectl get pods
NAME                                   READY   STATUS    RESTARTS   AGE
penumbra-testnet-fn-0-fvkfk            3/3     Running   0          25h
penumbra-testnet-preview-fn-0-q7s8f    3/3     Running   0          11m
penumbra-testnet-preview-val-0-qn5gc   2/2     Running   0          11m
penumbra-testnet-preview-val-1-48mbn   2/2     Running   0          11m
penumbra-testnet-val-0-7fsq2           2/2     Running   0          25h
penumbra-testnet-val-1-lkvcj           2/2     Running   0          25h
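
For anyone who would rather script that check than re-run kubectl get pods by hand, here is a sketch that parses the same table. The function name is hypothetical, and it relies only on the NAME, READY, and STATUS columns shown above.

```shell
# Given `kubectl get pods` output on stdin, print the pods whose names start
# with the given prefix and that are not fully ready. Exits non-zero if any
# such pod is found, zero otherwise.
not_ready() {
  awk -v p="$1" '
    $1 != "NAME" && index($1, p) == 1 {
      split($2, r, "/")                       # READY column, e.g. "2/3"
      if (r[1] != r[2] || $3 != "Running") { print $1; bad = 1 }
    }
    END { exit bad }'
}
```

Usage would look like: kubectl get pods | not_ready penumbra-testnet-preview-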

Still more work to do on the workflow dependencies for Monday's deployment; I'll pick that back up tomorrow.

conorsch · 2022-12-09T23:46:21Z

Calling this done for now. Here's a recent automatic deploy of testnet-preview to the k8s cluster: https://github.com/penumbra-zone/penumbra/actions/runs/3661278131

Come Monday, we'll need to update the A record for testnet.penumbra.zone to point at the relevant IP:

❯ terraform output
testnet_preview_reserved_ip = "34.117.153.161" # already done
testnet_reserved_ip = "34.111.241.130" # this one still needs to be updated

We'll do that as part of the testnet deploy. Already lowered the TTL 30m -> 5m in prep for the cut-over.
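
After the record is changed, the cut-over can be verified by checking that DNS now answers with the reserved IP. A small sketch, assuming dig is installed and a Terraform version that supports output -raw; the helper name is hypothetical.

```shell
# Succeeds when the expected IP appears among DNS answers read from stdin
# (one address per line, as printed by `dig +short <host> A`).
# -x matches whole lines only and -F treats the dots literally.
answers_include() {
  grep -qxF "$1"
}
```

Usage would look like: dig +short testnet.penumbra.zone A | answers_include "$(terraform output -raw testnet_reserved_ip)"

With the TTL at 5m, resolvers may keep returning the old address for a few minutes, so the check may need a retry.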

conorsch · 2022-12-13T01:44:04Z

Was not able to use the new cluster setup for testnet 038 today (#1743). Instead, I fell back to reusing the legacy droplet, and configured pd and tendermint manually based on the 038 code. It was necessary to stand up the services "manually," because the deprecated workflows were removed in #1730; a bit prematurely, in retrospect.

The root cause of the botched cluster deployment lies in my oversight last week of mistakenly deploying the testnet tag to the preview environment (#1744); this was fixed this morning in 17a3267, but the late discovery of the misconfiguration means we did not have an adequate "preview" environment to observe the most recent cluster config. As such, I suspect we missed identifying some breaking changes recently.

As a result, the current state of our deployments is a bit brittle. To wit:

🚨 testnet-preview.penumbra.zone is down; there's currently a CrashLoopBackOff on the relevant pods, due to an unhappy tendermint container
⚠️ testnet.penumbra.zone is running on the legacy infra
⚠️ the deployment workflows in GitHub Actions are manually disabled, to prevent automatic changes to state (🌈 per-PR CI runs remain unaffected)
⚠️ galileo is unhappy; this has happened a few times after testnet deploys, but this time I suspect code changes may be necessary; see Testnet #38: Kalyke #1743 (comment)

Starting tomorrow, I'll focus on unbreaking testnet-preview, since that's our canary in the coal mine. Once preview is happy again, I'll resume deploys of testnet-on-k8s, and provide updates here.

conorsch · 2023-01-06T19:53:50Z

This is done: testnet-preview is now served via k8s, and has been since 2022-12-12, via 5b42c45. I'll open another issue tracking the transition of testnet (cf. testnet-preview) to k8s.

hdevalence assigned conorsch Nov 23, 2022

hdevalence added this to Testnets Dec 2, 2022

conorsch moved this to Todo in Testnets Dec 2, 2022

aubrika moved this from Todo to In Progress in Testnets Dec 2, 2022

conorsch mentioned this issue Dec 6, 2022

ci: refactor helm logic for unique resources #1707

Merged

redshiftzero moved this from In Progress to Testnet 38: Kalyke in Testnets Dec 9, 2022

conorsch mentioned this issue Dec 12, 2022

Testnet #38: Kalyke #1743

Closed

13 tasks

conorsch mentioned this issue Dec 23, 2022

Spike on HTTPS reverse proxy for pd #1766

Closed

conorsch mentioned this issue Jan 6, 2023

Provisioning: reconcile tendermint configs #1816

Closed

conorsch closed this as completed Jan 6, 2023

github-project-automation bot moved this from Testnet 38: Kalyke to Testnet 40: Themisto in Testnets Jan 6, 2023

This was referenced Jan 6, 2023

Replace current testnet deployment with new k8s deployment #1818

Closed

Kubernetes Deployment for penumbra fullnode #1288

Closed
