Add a standalone node to testnet deployments #3336
Comments
This is a good idea. As background, I've actually been running a discrete fullnode on separate infra, but failed to do so for 64. Aside from testnets, it's important that we replicate this functionality for preview, so that we can catch bugs like #3650 before release. However, it's complicated by the fact that we must preserve the contents of the ACME cache, currently defined as …
We want to exercise the pd https logic, but we can't naively run it from scratch on every deploy, because that'd be far too many API requests to reissue certs from ACME. Instead, let's preserve the ACME directory before wiping state, and reuse it before bouncing the service. This setup requires always-on boxes provisioned out of band. Still TK:

* use dedicated `ci` shell account
* add GHA secrets for key material
* use `--acme-staging` arg for first few runs
* add a dedicated workflow for ad-hoc runs

Refs #3336.
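A rough sketch of the preserve-and-restore step, assuming a hypothetical ACME cache path and systemd unit name (the real cache location is whatever pd's auto-https logic writes to on these boxes):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder paths and unit name, for illustration only.
ACME_DIR="${HOME}/.penumbra/acme"   # assumed location of issued certs + account keys
BACKUP_DIR="${HOME}/acme-backup"
PD_SERVICE="penumbra-pd"            # assumed systemd unit name

# Stash the ACME cache before wiping node state for the chain reset.
sudo systemctl stop "${PD_SERVICE}"
mkdir -p "${BACKUP_DIR}"
rsync -a --delete "${ACME_DIR}/" "${BACKUP_DIR}/"

# ... wipe state and re-join the new chain here ...

# Restore the cache before bouncing the service, so no new cert is requested.
mkdir -p "${ACME_DIR}"
rsync -a --delete "${BACKUP_DIR}/" "${ACME_DIR}/"
sudo systemctl start "${PD_SERVICE}"
```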
We want to exercise the pd https logic, but we can't naively run it from scratch on every deploy, because that'd be far too many API requests to reissue certs from ACME. Instead, let's preserve the ACME directory before wiping state, and reuse it before bouncing the service. This setup requires always-on boxes provisioned out of band. So far, this adds the base logic via a workflow. In order to get it running, I'll need to iterate on the workflow, but workflows must land on main prior to being available for ad-hoc execution. Refs #3336.
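Once the workflow file is on main, ad-hoc runs can be kicked off with the GitHub CLI; the `deploy-standalone.yml` filename below is an assumption based on the workflow name mentioned later in this thread:

```bash
# Trigger an ad-hoc run of the standalone deploy workflow against main.
gh workflow run deploy-standalone.yml --ref main

# Check on the most recent run of that workflow.
gh run list --workflow=deploy-standalone.yml --limit 1
```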
These changes build on #3709, specifically:

* consuming ssh privkey & hostkey material from GHA secrets
* creating a dedicated workflow

So far this only targets preview. I'll run the job ad-hoc a few times and make changes as necessary before porting to the testnet env and hooking it up to the automatically-triggered release workflows. Refs #3336.
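As a sketch of the key-handling step inside the workflow (the secret names and the connectivity check are placeholders, not the actual values wired up in #3709):

```bash
# Runs inside a workflow step, with secrets exposed as env vars via `env:`.
# SSH_PRIVKEY and SSH_HOSTKEY are placeholder names for the GHA secrets.
mkdir -p ~/.ssh
printf '%s\n' "${SSH_PRIVKEY}" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519

# Pinning the host key avoids an interactive trust-on-first-use prompt in CI.
printf '%s\n' "${SSH_HOSTKEY}" >> ~/.ssh/known_hosts

# Non-interactive connectivity check against the always-on box, as the `ci` user.
ssh -o BatchMode=yes ci@solo-pd.testnet-preview.plinfra.net true
```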
Added a workflow for this on preview. I'm going to run it ad-hoc a few times, and if there are no problems (like ratelimit triggers), I'll move it to the prod ACME API and make it part of the automatic deployments.
The ratelimiting on the HTTPS RPC frontend was getting dropped on chain resets, due to duplicated vars. I've been keeping an eye on performance and re-adding the ratelimits post-deploy, but only just identified the root cause, via manual lints. This oversight caused problems during a deploy of v0.68.0, during which an ad-hoc solo node was set up to sidestep the load. See #3336 for more work towards automatic solo nodes.
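For reference, if the duplicated vars live in YAML config (an assumption; the path below is a placeholder), yamllint's key-duplicates rule catches this class of bug, where a later key silently overrides an earlier one:

```bash
# Flag duplicate keys in deployment config so a repeated var can't silently
# clobber the ratelimit settings on the next chain reset.
yamllint -d '{extends: default, rules: {key-duplicates: enable}}' deployments/
```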
Promotes the ad-hoc "deploy-standalone" workflow to automatic, called as a dependent job in the preview deploy. Also adds a corresponding job to the testnet deploy. These nodes are live now:

* https://solo-pd.testnet-preview.plinfra.net
* https://solo-pd.testnet.plinfra.net

We use a separate domain from other deployed services, to contain side-effects from failure while exercising the auto-https logic. Closes #3336.
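A quick way to spot-check the auto-https behavior on the new domains, using only stock tooling:

```bash
# Print the issuer and expiry of the cert served by the standalone node,
# to confirm ACME issuance actually succeeded on the new domain.
openssl s_client -connect solo-pd.testnet.plinfra.net:443 \
  -servername solo-pd.testnet.plinfra.net </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -enddate
```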
The automatic preview deploy is triggering too early, before the newly created network's RPC endpoints are returning. For most of our deploys, that's not a problem, because they'll automatically retry until successful. The "standalone" config, however, doesn't leverage the same orchestration, so we need to be more explicit. Two improvements come to mind:

1. Ensure that the RPC endpoints honor the readiness state of the fullnodes behind them. Right now, any fullnode in the deployment is instantly added to the RPC backend pool, but we should gate admission into the pool on formal readiness, meaning the internal RPC endpoint is returning OK.
2. Instruct the deploy flow to block until all pods are ready. That would resolve the problem of the standalone deploy firing too early, but not address the intermittent RPC downtime during chain resets on preview.

A crude version of the explicit wait is sketched below.
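Something along these lines, with the endpoint URL as a placeholder:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Poll the RPC endpoint until it answers, instead of assuming it is up the
# moment the previous deploy job returns. URL is illustrative only.
RPC_URL="https://rpc.testnet-preview.plinfra.net"

for _ in $(seq 1 60); do
  if curl -fsS --max-time 5 "${RPC_URL}/status" >/dev/null; then
    echo "RPC is up"
    exit 0
  fi
  sleep 10
done

echo "RPC never came up" >&2
exit 1
```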
Slightly smarter CI logic, which will block until all pods are marked Ready post-deployment. Due to an oversight, the "part-of" label wasn't applied to the fullnode pods, so the deploy script exited after the validators were running, but before the fullnodes had finished setting up. That was fine, until #3336, which tacked on a subsequent deploy step that assumes the RPC is ready to rock. Also updates the statefulsets to deploy the child pods in parallel, rather than serially, which shaves a few minutes off setup/teardown. This only really affects the preview env, which has frequent deploy churn.
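The blocking step presumably reduces to a `kubectl wait` keyed on the shared label; the label value, namespace, and timeout here are guesses:

```bash
# Block until every pod carrying the part-of label reports Ready, so the
# follow-on standalone deploy doesn't start against a half-initialized RPC.
kubectl wait pod \
  --for=condition=Ready \
  --selector=app.kubernetes.io/part-of=penumbra \
  --namespace=testnet-preview \
  --timeout=15m
```

The parallel rollout is presumably the StatefulSet `podManagementPolicy: Parallel` setting, which launches replicas concurrently rather than one at a time.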
This work is basically complete, although I haven't documented the new endpoints anywhere. I'll stick those in the wiki before closing. One major omission is that we don't have automatic handling of point-releases for these standalone nodes. That's fine: we're more focused on upgrades right now (#4087), which require a lot of manual maintenance. Will circle back with a more automated setup for point releases when there's time; otherwise I'll handle that manually for the next few point releases.
This work is done. We now have a standalone node serving on the endpoints listed above. We're using a separate domain as a precaution, to avoid risking a cert-issuance ban on the for-now more commonly used domain. The broad strokes of the work described here are accomplished, so I'm closing the ticket.
Is your feature request related to a problem? Please describe.

We need `pd` to work out of the box. But we don't test this anywhere, because we wrap it up in a bunch of infrastructure that hides broken behavior: #3281

Describe the solution you'd like

Add a node to the deployment that runs a full node the way we expect users to be able to: `pd start --grpc-auto-https mydomain.com` and `cometbft start`, no load-balancing, no reverse proxies, etc.
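For reference, a minimal sketch of that "out of the box" flow, assuming DNS for `mydomain.com` points at the box and the HTTPS port is reachable from the internet for ACME validation:

```bash
# Run pd with automatic HTTPS for its gRPC frontend; certs are obtained via ACME.
pd start --grpc-auto-https mydomain.com &

# Run the consensus engine alongside it, with no load balancer or reverse proxy.
cometbft start
```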