From b66e3acb0dba21edb77b9564499fb6557dadac42 Mon Sep 17 00:00:00 2001 From: Steven Platt Date: Wed, 20 Nov 2024 17:40:33 -0500 Subject: [PATCH 1/5] initial draft of testnet-runbook --- spartan/testnet-runbook.md | 116 +++++++++++++++++++++++++++++++++++++ 1 file changed, 116 insertions(+) create mode 100644 spartan/testnet-runbook.md diff --git a/spartan/testnet-runbook.md b/spartan/testnet-runbook.md new file mode 100644 index 000000000000..d70518ebf9d0 --- /dev/null +++ b/spartan/testnet-runbook.md @@ -0,0 +1,116 @@ +# Aztec Protocol Testnet Engineering Runbook + +## Overview + +This runbook outlines the engineering team's responsibilities for managing Aztec Protocol testnets. The engineering team coordinates the building, testing, and deployment of testnet(s) for each release while providing technical support for protocol and product queries from the community. This document describes the team's responsibilities during a release cycle and outlines actions for various testnet scenarios. The process spans from code-freeze to deployment completion, including both the QA phase (internal testing) and the public release phase. + +## Releases + +The engineering team's testnet responsibilities begin after code-freeze. Here are the primary tasks: + +1. Confirm with engineering and product teams that all required PRs are merged +2. Create a release branch (eg: `-v..`, e.g., `aztec-packages-v0.62.0`) +3. Cherry-pick bug-fixes into the release branch for bugs discovered during release testing. +4. Initiate a final build by pushing an empty commit into the release branch to trigger the `release-please` CI workflow. + +### Release Notes and Artifact Builds + +Verify the `release-please` CI workflow completed successfully and that release notes have been published. 
+A successful CI run publishes the following Barretenberg artifacts with the release notes: + +- Barretenberg for Mac (x86 64-bit) +- Barretenberg for Mac (Arm 64-bit) +- Barretenberg for Linux (x86 64-bit) +- Barretenberg for WASM + +Additionally, the following NPM packages are published: + +- BB.js +- l1-contracts +- yarn-project (see [publish_npm.sh](https://github.com/AztecProtocol/aztec-packages/blob/aztec-packages-v0.63.0/yarn-project/publish_npm.sh)) + +The following Docker containers are also published: + +- aztecprotocol/aztec:latest +- aztecprotocol/aztec-nargo:latest +- aztecprotocol/cli-wallet:latest + +Lastly, any changes made to developer documentation are published to + +## Deployment + +After cutting a release, deploy a testnet (typically with 48 validators) using the new Docker containers. Verbose logging on Aztec nodes should be enabled by default using the following `ENV VARS`: + +- `LOG_JSON=1` +- `LOG_LEVEL=debug` +- `DEBUG=discv5*,aztec:*,-aztec:avm_simulator*,-aztec:circuits:artifact_hash,-json-rpc*,-aztec:world-state:database,-aztec:l2_block_stream*` + +Deployments are initiated from CI by manually running the (_name pending_) workflow. + +### Sanity Check + +After testnet deployment, perform these sanity checks (these items can also be script automated): + +1. Monitor for crashes and network-level health: + - Review testnet dashboard at `https://grafana.aztec.network/` to confirm node uptime and block production + - Verify overall TPS performance + - Create Github issues for new crash scenarios + +2. Spot check pod logs for component health: + - Tx gossiping (Bot: `Generated IVC proof`) + - Peer discovery (Validator (failure case): `Failed FINDNODE request`) + - Block proposal (Validator: `Can propose block`) + - Block processing (Validator: `l2BlockSourceHash`) + - Block proving (Prover: `Processed 1 new L2 blocks`) + - Epoch proving (Prover: `Submitted proof for epoch`) + +3. 
Test external node connection and sync + +### Network Connection Info + +After a successful sanity check, share the following network connection information in the `#team-alpha` slack channel and with the wider Aztec community: + +1. AZTEC_IMAGE (`aztecprotocol/aztec:latest`) +2. ETHEREUM_HOST (Kubernetes: `kubectl get services -n | (head -1; grep ethereum)`) + - ethereum-lb: `:8545` +3. BOOT_NODE_URL (Kubernetes: `kubectl get services -n | (head -3; grep boot)`) + - boot-node-lb-tcp: `:40400` + - boot-node-lb-udp: `:40400` + +This latest node connection information must also be updated in any existing node connection guides and where referenced at . + +## Support + +The following items are a shortlist of support items that may be required either during deployment or after a successful launch. + +### Issue Resolution Matrix + +| Event | Action | Criticality | Owner(s) | +|-------|---------|------------|-----------| +| Build failure | Rerun CI or revert problematic changes | Blocker | | +| Deployment issues | Reference deployment `README` or escalate to Delta Team | Blocker | Delta Team | +| Network instability* | Create detailed issue report for Alpha team | Blocker | Alpha Team | +| Challenge completion errors | Document issue and assess challenge viability | Major | Product Team | +| Minor operational issues | Create tracking issue | Minor | Delta Team | +| Hotfix deployment | Update testnet and verify fix | Major | Delta Team | + +_*Defining Network Instability:_ + +A testnet is considered unstable if experiencing any of the following: + +1. Block production stalls +2. Proof generation failures +3. Transaction inclusion issues +4. Node synchronization problems +5. Persistent crashes affecting network operation +6. Persistent chain reorgs affecting network operation +7. 
Bridge contract failures + +### Release Support Matrix + +| Event | Action | Criticality | Owner(s) | +|-------|---------|------------|-----------| +| Challenge completion issues | Provide guidance or create issue | Minor | DevRel Team | +| Node stability issues | Collect logs and create issue | Major | Delta Team | +| Network-wide problems | Escalate to Delta team | Critical | Alpha/Delta Teams | +| Bridge/Contract issues | Investigate and escalate if needed | Critical | Alpha Team | From d8c3d30befe27a700983efa66c0b5668b44284f3 Mon Sep 17 00:00:00 2001 From: Steven Platt Date: Wed, 20 Nov 2024 17:46:00 -0500 Subject: [PATCH 2/5] formatting nit --- spartan/testnet-runbook.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/spartan/testnet-runbook.md b/spartan/testnet-runbook.md index d70518ebf9d0..645bc5627de9 100644 --- a/spartan/testnet-runbook.md +++ b/spartan/testnet-runbook.md @@ -1,4 +1,4 @@ -# Aztec Protocol Testnet Engineering Runbook +# Aztec Protocol: Testnet Engineering Runbook ## Overview From 54b776633a6899d5a24df25bf4d6d31feecaa920 Mon Sep 17 00:00:00 2001 From: Steven Platt Date: Sun, 24 Nov 2024 14:22:34 -0500 Subject: [PATCH 3/5] updated connection info section --- spartan/testnet-runbook.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/spartan/testnet-runbook.md b/spartan/testnet-runbook.md index 645bc5627de9..3ab8013f3517 100644 --- a/spartan/testnet-runbook.md +++ b/spartan/testnet-runbook.md @@ -68,7 +68,7 @@ After testnet deployment, perform these sanity checks (these items can also be s ### Network Connection Info -After a successful sanity check, share the following network connection information in the `#team-alpha` slack channel and with the wider Aztec community: +After a successful sanity check, share the following network connection information in the `#team-alpha` slack channel. The Product / DevRel team then shares these connection details with the sequencer & prover discord channel. 1. 
AZTEC_IMAGE (`aztecprotocol/aztec:latest`) 2. ETHEREUM_HOST (Kubernetes: `kubectl get services -n | (head -1; grep ethereum)`) From 1dd0cb3e4a7d6f9c43e433c46c14abe7e98476b1 Mon Sep 17 00:00:00 2001 From: Steven Platt Date: Sun, 24 Nov 2024 14:28:04 -0500 Subject: [PATCH 4/5] Updated devrel validator note --- spartan/testnet-runbook.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/spartan/testnet-runbook.md b/spartan/testnet-runbook.md index 3ab8013f3517..240b4091fb71 100644 --- a/spartan/testnet-runbook.md +++ b/spartan/testnet-runbook.md @@ -68,7 +68,7 @@ After testnet deployment, perform these sanity checks (these items can also be s ### Network Connection Info -After a successful sanity check, share the following network connection information in the `#team-alpha` slack channel. The Product / DevRel team then shares these connection details with the sequencer & prover discord channel. +After a successful sanity check, share the following network connection information in the `#team-alpha` Slack channel: 1. AZTEC_IMAGE (`aztecprotocol/aztec:latest`) 2. ETHEREUM_HOST (Kubernetes: `kubectl get services -n | (head -1; grep ethereum)`) @@ -79,6 +79,8 @@ After a successful sanity check, share the following network connection informat This latest node connection information must also be updated in any existing node connection guides and where referenced at . +The Product/DevRel team then shares these connection details in the sequencer & prover Discord channel. Starting at epoch 5, Product/DevRel coordinates with node operators who have already connected to the network using the information above. Product/DevRel verifies that each operator is seeing the expected logs, then passes the validator addresses of operators who are ready to engineering, who add them to the validator set. This repeats until all 48 validators have been added. 
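The log verification step described above could be partly automated. The following is a minimal sketch, assuming the marker strings from the Sanity Check list in this runbook. The function name `check_logs`, the role keys, and the choice to pass logs as a plain string are illustrative assumptions, not part of the Aztec tooling.

```python
# Hypothetical helper for spot-checking a node's logs.
# The marker strings are taken from the runbook's Sanity Check list;
# the role keys and function name are illustrative only.
EXPECTED_MARKERS = {
    "bot": ["Generated IVC proof"],
    "validator": ["Can propose block", "l2BlockSourceHash"],
    "prover": ["Processed 1 new L2 blocks", "Submitted proof for epoch"],
}

# Strings whose presence indicates a problem rather than health.
FAILURE_MARKERS = {
    "validator": ["Failed FINDNODE request"],
}


def check_logs(role, log_text):
    """Return (missing, failures) for one node's log text.

    missing  -- healthy markers for this role that were not found
    failures -- failure markers for this role that were found
    """
    missing = [m for m in EXPECTED_MARKERS.get(role, []) if m not in log_text]
    failures = [m for m in FAILURE_MARKERS.get(role, []) if m in log_text]
    return missing, failures
```

An operator's logs would be considered ready under this sketch when both returned lists are empty. How the log text is collected (for example from `kubectl logs` or from the operator directly) is left open here.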
+ ## Support The following items are a shortlist of support items that may be required either during deployment or after a successful launch. From 6ee4efdd617f3c5aed3270e396cb3150b91e128d Mon Sep 17 00:00:00 2001 From: Steven Platt Date: Sun, 24 Nov 2024 17:14:16 -0500 Subject: [PATCH 5/5] updated details to address review feedback. --- spartan/testnet-runbook.md | 30 ++++++++++++++++-------------- 1 file changed, 16 insertions(+), 14 deletions(-) diff --git a/spartan/testnet-runbook.md b/spartan/testnet-runbook.md index 240b4091fb71..30a224a33cf2 100644 --- a/spartan/testnet-runbook.md +++ b/spartan/testnet-runbook.md @@ -1,21 +1,22 @@ -# Aztec Protocol: Testnet Engineering Runbook +# Public Testnet Engineering Runbook ## Overview -This runbook outlines the engineering team's responsibilities for managing Aztec Protocol testnets. The engineering team coordinates the building, testing, and deployment of testnet(s) for each release while providing technical support for protocol and product queries from the community. This document describes the team's responsibilities during a release cycle and outlines actions for various testnet scenarios. The process spans from code-freeze to deployment completion, including both the QA phase (internal testing) and the public release phase. +This runbook outlines the engineering team's responsibilities for managing Aztec Protocol public testnets. The engineering team coordinates the building, testing, and deployment of public testnet(s) for each release while providing technical support for protocol and product queries from the community. This document describes the team's responsibilities during a release cycle and outlines actions for various public testnet scenarios. The process spans from code-freeze to deployment completion. -## Releases +## QA and Releases -The engineering team's testnet responsibilities begin after code-freeze. 
Here are the primary tasks: +The engineering team's public testnet responsibilities begin after code-freeze. Code-freeze is initiated by cutting a release branch from a `master` release and proceeds in the following sequence: -1. Confirm with engineering and product teams that all required PRs are merged -2. Create a release branch (eg: `-v..`, e.g., `aztec-packages-v0.62.0`) -3. Cherry-pick bug-fixes into the release branch for bugs discovered during release testing. -4. Initiate a final build by pushing an empty commit into the release branch to trigger the `release-please` CI workflow. +1. Confirm with engineering and product teams that all required PRs are merged. +2. Create a named release branch (e.g., `release/sassy-salamander`) from the desired `master` release (e.g., `v0.64.0`). +3. Complete all QA testing against `release/sassy-salamander`. +4. For any test that fails, merge a hotfix into the `release/sassy-salamander` release branch. +5. After testing is complete, initiate a `release-please` CI workflow from `release/sassy-salamander` to publish release artifacts. ### Release Notes and Artifact Builds -Verify the `release-please` CI workflow completed successfully and that release notes have been published. +Verify the `release-please` CI workflow completed successfully and that release notes have been published. If there were no hotfixes, this simply moves the tags forward to `v0.64.0`; otherwise, it releases `v0.64.X` (and moves the tags). A successful CI run publishes the following Barretenberg artifacts with the release notes: - Barretenberg for Mac (x86 64-bit) @@ -39,7 +40,8 @@ Lastly, any changes made to developer documentation are published to