Add a standalone node to testnet deployments #3336

hdevalence · 2023-11-15T00:51:48Z

Is your feature request related to a problem? Please describe.

We need pd to work out of the box. But we don't test this anywhere, because we wrap it up in a bunch of infrastructure that hides broken behavior: #3281

Describe the solution you'd like

Add a node to the deployment that runs a full node the way we expect users to be able to: `pd start --grpc-auto-https mydomain.com` and `cometbft start`, with no load-balancing, no reverse proxies, etc.


conorsch · 2024-01-24T17:58:12Z

This is a good idea. As background, I've actually been running a discrete fullnode on separate infra, but failed to do so for 64. Aside from testnets, it's important that we replicate this functionality for preview, so that we can catch bugs like #3650 before release. However, it's complicated in that we must preserve the contents of the ACME cache, currently defined as `<pd_home>/rustls_acme_cache` (https://github.com/penumbra-zone/penumbra/blob/v0.64.2/crates/bin/pd/src/main.rs#L487-L488). Most of our preview logic is a full wipe and reset, and we'll need to be a bit more careful to avoid getting a domain banned by the ACME ratelimits.

We want to exercise the pd https logic, but we can't naively run it from scratch on every deploy, because that'd be far too many API requests to reissue certs from ACME. Instead, let's preserve the ACME directory before wiping state, and reuse it before bouncing the service. This setup requires always-on boxes provisioned out of band. Still TK:

* use dedicated `ci` shell account
* add GHA secrets for key material
* use --acme-staging arg for first few runs
* add dedicated workflow for ad-hoc runs

Refs #3336.

We want to exercise the pd https logic, but we can't naively run it from scratch on every deploy, because that'd be far too many API requests to reissue certs from ACME. Instead, let's preserve the ACME directory before wiping state, and reuse it before bouncing the service. This setup requires always-on boxes provisioned out of band. So far, this adds the base logic via a workflow. In order to get it running, I'll need to iterate on the workflow, but workflows must land on main prior to being available for ad-hoc execution. Refs #3336.
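The preserve-then-restore step described above could look roughly like the following. This is a minimal sketch, not the actual workflow: the function name `wipe_state_preserving_acme` and the use of a temporary stash directory are assumptions made for illustration. Only the `rustls_acme_cache` directory name comes from the thread.

```python
import shutil
import tempfile
from pathlib import Path

# Directory name taken from the comment above; everything else is hypothetical.
ACME_CACHE_DIRNAME = "rustls_acme_cache"


def wipe_state_preserving_acme(pd_home: Path) -> bool:
    """Wipe pd_home but carry the ACME cache across the reset.

    Returns True if a cache was found and restored, False otherwise.
    """
    cache = pd_home / ACME_CACHE_DIRNAME
    with tempfile.TemporaryDirectory() as stash:
        stashed = Path(stash) / ACME_CACHE_DIRNAME
        had_cache = cache.is_dir()
        if had_cache:
            # Move the cache out of the way before the wipe.
            shutil.move(str(cache), str(stashed))
        # Full wipe and reset of node state.
        shutil.rmtree(pd_home, ignore_errors=True)
        pd_home.mkdir(parents=True)
        if had_cache:
            # Put the cache back before the service is restarted.
            shutil.move(str(stashed), str(cache))
    return had_cache
```

Restarting the service is left out on purpose; the point is only that the cert material survives the wipe, so a redeploy does not trigger a fresh issuance request.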

These changes build on #3709, specifically:

* consuming ssh privkey & hostkey material from GHA secrets
* creating a dedicated workflow

So far this only targets preview. Will run the job ad-hoc a few times and make changes as necessary before porting to testnet env and hooking up to the automatically-triggered release workflows. Refs #3336.

conorsch · 2024-03-18T23:44:23Z

Added a workflow for this on preview. I'm going to run it ad-hoc a few times, and if there are no problems, such as ratelimit triggers, I'll move it to the prod ACME API and make it part of the automatic deployments.

The ratelimiting on the HTTPS RPC frontend was getting dropped on chain resets, due to duplicated vars. I've been keeping an eye on performance and re-adding post-deploy, but only just identified the root cause, via manual lints. This oversight caused problems during a deploy of v0.68.0, during which an ad-hoc solo node was set up to sidestep the load. See #3336 for more work towards automatic solo nodes.

Promotes the ad-hoc "deploy-standalone" workflow to automatic, called as a dependent job in the preview deploy. Also adds a corresponding job to the testnet deploy. These nodes are live now:

* https://solo-pd.testnet-preview.plinfra.net
* https://solo-pd.testnet.plinfra.net

We use a separate domain from other deployed services, to contain side-effects from failure while exercising the auto-https logic. Closes #3336.

conorsch · 2024-03-21T18:44:12Z

The automatic preview deploy is triggering too early, before the newly created network's RPC endpoints are responding. For most of our deploys, that's not a problem, because they'll automatically retry until successful. The "standalone" config, however, doesn't leverage the same orchestration, so we need to be more explicit.
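Being "more explicit" here amounts to polling until the RPC answers before starting the standalone deploy. A minimal sketch of that retry loop, assuming a caller-supplied `check` callable; the function name, default timeout, and interval are all hypothetical and not taken from the real workflow:

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if check():
                return True
        except OSError:
            # Connection errors are expected while the endpoint is coming up.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In practice `check` would wrap a request against the network's RPC endpoint; `sleep` and `clock` are injectable only so the loop can be exercised without real waiting.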

Two improvements come to mind. First, ensure that the RPC endpoints honor the readiness state of the fullnodes behind them: right now, any fullnode in the deployment is instantly added to the RPC backend pool, but we should gate admission into the pool on formal readiness, meaning the internal RPC endpoint is returning OK. Second, we could instruct the deploy flow to block until all pods are ready, which would resolve the problem of the standalone deploy firing too early, but would not address the intermittent RPC downtime during chain resets on preview.

Slightly smarter CI logic, which will block until all pods are marked Ready post-deployment. Due to an oversight, the "part-of" label wasn't applied to the fullnode pods, so the deploy script exited after the validators were running, but before the fullnodes were finished setting up. That was fine, until #3336, which tacked on a subsequent deploy step that assumes the RPC is ready to rock. Also updates the statefulsets to deploy the child pods in parallel, rather than serially, which shaves a few minutes off setup/teardown. Only really affects preview env, which has frequent deploy churn.
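To illustrate why the missing label mattered: a label-based readiness check can only wait on pods the selector matches. The sketch below models that with plain dicts. The label value, pod names, and data shapes are made up for illustration and do not reflect the real deploy script or cluster state.

```python
from typing import Dict, List


def select_pods(pods: List[dict], selector: Dict[str, str]) -> List[dict]:
    """Return pods whose labels contain every key/value pair in `selector`."""
    return [
        p for p in pods
        if all(p.get("labels", {}).get(k) == v for k, v in selector.items())
    ]


def all_ready(pods: List[dict], selector: Dict[str, str]) -> bool:
    """True only if at least one pod matches and every match is Ready."""
    matched = select_pods(pods, selector)
    return bool(matched) and all(p.get("ready", False) for p in matched)


# Hypothetical data: label value and pod names are illustrative only.
selector = {"part-of": "example-network"}
pods = [
    {"name": "val-0", "labels": {"part-of": "example-network"}, "ready": True},
    # Missing "part-of" label: invisible to the selector, so not waited on.
    {"name": "fn-0", "labels": {}, "ready": False},
]
print(all_ready(pods, selector))  # True, although fn-0 is still starting

# Once the label is applied, the same check waits for the fullnode too.
pods[1]["labels"]["part-of"] = "example-network"
print(all_ready(pods, selector))  # False until fn-0 reports ready
```

The first check passes while a fullnode is still setting up, which matches the early exit described above.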

conorsch · 2024-03-25T20:46:42Z

This work is basically complete, although I haven't documented the new endpoints anywhere. I'll stick those in the wiki before closing.

One major omission is that we don't have automatic handling of point-releases for these standalone nodes. That's fine: we're more focused on upgrades right now (#4087), which requires a lot of manual maintenance. Will circle back with a more automated setup for point releases when there's time, otherwise I'll handle that manually for the next few point releases.

conorsch · 2024-04-11T20:03:21Z

This work is done. We now have a standalone node, serving pd directly, exercising its auto-https logic, for both testnet and preview:

We're using a separate domain as a precaution, to avoid banning cert issuance on the for-now more commonly used domain, penumbra.zone. There are some shortcuts here: we don't ingest metrics from these hosts, and point-releases don't roll out to them automatically. They're SSH-accessible to the PL team, so they also serve as "always-on" boxes. One change I haven't yet made, but would very much like to, is an optional flag to store the ACME cert info in a separate directory, which would vastly simplify using the https logic for pd in a lot more cases.

The broad strokes of the work described here are accomplished, so I'm closing the ticket.

github-project-automation bot added this to Testnets Nov 15, 2023

hdevalence mentioned this issue Jan 25, 2024

pd: 🔨 rework RootCommand::start auto-https logic #3652

Merged

cratelyn mentioned this issue Jan 26, 2024

pd: 😄 add a CLI flag to use the LetsEncrypt staging environment #3681

Closed

conorsch mentioned this issue Jan 31, 2024

ci: workflow for standalone pd #3709

Merged

conorsch added this to Penumbra Feb 2, 2024

github-project-automation bot moved this to 🗄️ Backlog in Penumbra Feb 2, 2024

conorsch moved this from 🗄️ Backlog to In progress in Penumbra Feb 2, 2024

conorsch self-assigned this Feb 2, 2024

conorsch added the A-CI/CD Relates to continuous integration & deployment of Penumbra label Feb 2, 2024

aubrika moved this from In progress to 🗄️ Backlog in Penumbra Feb 13, 2024

conorsch moved this from 🗄️ Backlog to In progress in Penumbra Mar 18, 2024

conorsch added this to the Sprint 2 milestone Mar 18, 2024

conorsch mentioned this issue Mar 18, 2024

ci: standalone pd node workflow #4045

Merged

conorsch mentioned this issue Mar 19, 2024

ci: fix middleware vars #4054

Merged

conorsch mentioned this issue Mar 21, 2024

ci: automatic deployments of standalone nodes #4069

Merged

conorsch closed this as completed in #4069 Mar 21, 2024

github-project-automation bot moved this from In progress to Done in Penumbra Mar 21, 2024

conorsch reopened this Mar 21, 2024

github-project-automation bot moved this from Done to In progress in Penumbra Mar 21, 2024

conorsch mentioned this issue Mar 22, 2024

ci: wait for all pods ready post-deploy #4091

Merged

aubrika modified the milestones: Sprint 2, Sprint 3 Mar 25, 2024

cratelyn modified the milestones: Sprint 3, Sprint 4 Apr 8, 2024

conorsch closed this as completed Apr 11, 2024

github-project-automation bot moved this from In progress to Done in Penumbra Apr 11, 2024

This was referenced May 23, 2024

Release v0.75.1, via point-release #4448

Closed

ci: solo pd node reset on point release #4459

Closed
