Commit 2591577

updating run book
1 parent d9da44e commit 2591577

2 files changed

+149
-53
lines changed

pages/builders/chain-operators/tools/op-conductor.mdx

Lines changed: 147 additions & 52 deletions
Original file line number | Diff line number | Diff line change
@@ -12,7 +12,7 @@ This page will teach you what the `op-conductor` service is and how it works on
1212
a high level. It will also get you started on setting it up in your own
1313
environment.
1414

15-
## op-conductor: Enhancing Sequencer Reliability and Availability
15+
## Enhancing Sequencer Reliability and Availability
1616

1717
The [op-conductor](https://github.com/ethereum-optimism/optimism/tree/develop/op-conductor)
1818
is an auxiliary service designed to enhance the reliability and availability of
@@ -81,82 +81,177 @@ state transitions.
8181
## Setup
8282

8383
At OP Labs, op-conductor is deployed as a kubernetes statefulset because it
84-
requires a persistent volume to store the raft log.
84+
requires a persistent volume to store the raft log. This guide describes
85+
setting up conductor on an existing network without incurring downtime.
86+
87+
### Assumptions
88+
89+
This setup guide has the following assumptions:
90+
91+
* 3 deployed sequencers (sequencer-0, sequencer-1, sequencer-2) that are all
92+
in sync and in the same vpc network
93+
* sequencer-0 is currently the active sequencer
94+
* You can execute a blue/green style sequencer deployment workflow that
95+
involves no downtime (described below)
96+
* conductor and sequencers are running in k8s or some other container
97+
orchestrator (vm-based deployment may be slightly different and not covered
98+
here)
8599

86100
### Spin up op-conductor
87101

88102
<Steps>
89-
{<h3>Setup initial state</h3>}
103+
{<h3>Deploy conductor</h3>}
104+
105+
Deploy a conductor instance per sequencer with sequencer-1 as the raft cluster
106+
bootstrap node:
107+
108+
* suggested conductor configs:
109+
110+
```yaml
111+
OP_CONDUCTOR_CONSENSUS_ADDR: '<raft url or ip>'
112+
OP_CONDUCTOR_CONSENSUS_PORT: '50050'
113+
OP_CONDUCTOR_EXECUTION_RPC: '<op-geth url or ip>:8545'
114+
OP_CONDUCTOR_HEALTHCHECK_INTERVAL: '1'
115+
OP_CONDUCTOR_HEALTHCHECK_MIN_PEER_COUNT: '2' # set based on your internal p2p network peer count
116+
OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL: '5' # recommend a 2-3x multiple of your network block time to account for temporary performance issues
117+
OP_CONDUCTOR_LOG_FORMAT: logfmt
118+
OP_CONDUCTOR_LOG_LEVEL: info
119+
OP_CONDUCTOR_METRICS_ADDR: 0.0.0.0
120+
OP_CONDUCTOR_METRICS_ENABLED: 'true'
121+
OP_CONDUCTOR_METRICS_PORT: '7300'
122+
OP_CONDUCTOR_NETWORK: '<network>'
123+
OP_CONDUCTOR_NODE_RPC: '<op-node url or ip>:8545'
124+
OP_CONDUCTOR_RAFT_SERVER_ID: 'unique raft server id'
125+
OP_CONDUCTOR_RAFT_STORAGE_DIR: /conductor/raft
126+
OP_CONDUCTOR_RPC_ADDR: 0.0.0.0
127+
OP_CONDUCTOR_RPC_ENABLE_ADMIN: 'true'
128+
OP_CONDUCTOR_RPC_ENABLE_PROXY: 'true'
129+
OP_CONDUCTOR_RPC_PORT: '8547'
130+
```
131+
132+
* sequencer-1 op-conductor extra config:
133+
134+
```yaml
135+
OP_CONDUCTOR_PAUSED: "true"
136+
OP_CONDUCTOR_RAFT_BOOTSTRAP: "true"
137+
```
90138
91-
* Sequencer: the sequencer identifier.
92-
* Conductor Role: the op-conductor initial role.
93-
* [OP\_CONDUCTOR\_PAUSED](#conductor_paused)
94-
* [OP\_NODE\_SEQUENCER\_ENABLED](/builders/node-operators/configuration/consensus-config#sequencerenabled)
95-
* [OP\_NODE\_CONDUCTOR\_ENABLED](/builders/node-operators/configuration/consensus-config#conductorenabled)
139+
{<h3>Pause two conductors</h3>}
96140
97-
| Sequencer | OP\_CONDUCTOR\_PAUSED | OP\_NODE\_SEQUENCER\_ENABLED | OP\_NODE\_CONDUCTOR\_ENABLED |
98-
| ----------- | --------------------- | ---------------------------- | ---------------------------- |
99-
| sequencer-0 | true | true | false |
100-
| sequencer-1 | true | false | false |
101-
| sequencer-2 | true | false | false |
141+
Pause the `sequencer-0` and `sequencer-1` conductors with a [conductor\_pause](#conductor_pause)
142+
RPC request.
102143

103-
{<h3>Enable conductor on sequencers</h3>}
144+
{<h3>Update op-node configuration and switch the active sequencer</h3>}
104145

105-
* Set `OP_NODE_CONDUCTOR_ENABLED=true` on the sequencers' `op-node` instances
106-
* Ensure configuration persistence is turned off.
146+
Deploy an `op-node` config update to all sequencers that enables conductor. Use
147+
a blue/green style deployment workflow that switches the active sequencer to
148+
`sequencer-1`:
107149

108-
{<h3>Switch the active sequencer and enable conductor</h3>}
150+
* all sequencer op-node configs:
109151

110-
* Rollout `sequencer-2` as the active sequencer. You can use a blue/green
111-
style deployment to switch from `sequencer-0` to activate `sequencer-2` without
112-
any downtime.
113-
* The `sequencer-2` will now begin to commit unsafe payloads to the raft log
114-
* Confirm `sequencer-2` is active and successfully producing unsafe blocks.
152+
```yaml
153+
OP_NODE_CONDUCTOR_ENABLED: "true"
154+
OP_NODE_RPC_ADMIN_STATE: "" # this flag can't be used with conductor
155+
```
115156

116-
{<h3>Add voting nodes to the cluster</h3>}
157+
{<h3>Confirm sequencer switch was successful</h3>}
117158

118-
* Use [AddServerAsVoter](#conductor_addServerAsVoter) to add followers as
119-
voters
159+
Confirm `sequencer-1` is active and successfully producing unsafe blocks.
160+
Because `sequencer-1` was the raft cluster bootstrap node, it is now committing
161+
unsafe payloads to the raft log.
162+
163+
{<h3>Add voting nodes</h3>}
164+
165+
Add voting nodes to the cluster using a [conductor\_addServerAsVoter](#conductor_addServerAsVoter)
166+
RPC request to the leader conductor (`sequencer-1`).
120167

121168
{<h3>Confirm state</h3>}
122169

123-
| Sequencer | Conductor Role | OP\_CONDUCTOR\_PAUSED | OP\_NODE\_SEQUENCER\_ENABLED | OP\_NODE\_CONDUCTOR\_ENABLED |
124-
| ----------- | -------------- | --------------------- | ---------------------------- | ---------------------------- |
125-
| sequencer-0 | follower | true | **false** | **true** |
126-
| sequencer-1 | follower | true | false | **true** |
127-
| sequencer-2 | leader | true | **true** | **true** |
170+
Confirm cluster membership and sequencer state:
171+
172+
* `sequencer-0` and `sequencer-2`:
173+
1. raft cluster follower
174+
2. sequencer is stopped
175+
3. conductor is paused
176+
4. conductor enabled in op-node config
177+
178+
* `sequencer-1`
179+
1. raft cluster leader
180+
2. sequencer is active
181+
3. conductor is paused
182+
4. conductor enabled in op-node config
128183

129-
{<h3>Resume conductor</h3>}
184+
{<h3>Resume conductors</h3>}
130185

131-
* Use [resume](#conductor_resume) on all nodes
186+
Resume all conductors with [conductor\_resume](#conductor_resume) RPC request to
187+
each conductor instance.
132188

133-
{<h3>Confirm conductor has resumed</h3>}
189+
{<h3>Confirm state</h3>}
190+
191+
Confirm all conductors successfully resumed with [conductor\_paused](#conductor_paused).
134192

135-
* Use [paused](#conductor_paused) to confirm all conductors have been
136-
successfully resumed
193+
{<h3>Transfer leadership</h3>}
137194

138-
{<h3>Remove paused configuration</h3>}
195+
Trigger leadership transfer to `sequencer-0` using [conductor\_transferLeaderToServer](#conductor_transferLeaderToServer).
139196

140-
* Remove `OP_CONDUCTOR_PAUSED=true`.
197+
{<h3>Confirm state</h3>}
141198

142-
| Sequencer | Conductor Role | OP\_CONDUCTOR\_PAUSED | OP\_NODE\_SEQUENCER\_ENABLED | OP\_NODE\_CONDUCTOR\_ENABLED |
143-
| ----------- | -------------- | --------------------- | ---------------------------- | ---------------------------- |
144-
| sequencer-0 | follower | **false** | false | true |
145-
| sequencer-1 | follower | **false** | false | true |
146-
| sequencer-2 | leader | **false** | true | true |
199+
* `sequencer-1` and `sequencer-2`:
200+
1. raft cluster follower
201+
2. sequencer is stopped
202+
3. conductor is active
203+
4. conductor enabled in op-node config
147204

148-
{<h3>Set sequencer-0 to leader</h3>}
205+
* `sequencer-0`
206+
1. raft cluster leader
207+
2. sequencer is active
208+
3. conductor is active
209+
4. conductor enabled in op-node config
149210

150-
* Set sequencer-0 to be the leader
151-
* Confirm the state
211+
{<h3>Update configuration</h3>}
152212

153-
| Sequencer | Conductor Role | OP\_CONDUCTOR\_PAUSED | OP\_NODE\_SEQUENCER\_ENABLED | OP\_NODE\_CONDUCTOR\_ENABLED |
154-
| ----------- | -------------- | --------------------- | ---------------------------- | ---------------------------- |
155-
| sequencer-0 | leader | false | false | true |
156-
| sequencer-1 | follower | false | false | true |
157-
| sequencer-2 | follower | false | true | true |
213+
Deploy a config change to `sequencer-1` conductor to remove the
214+
`OP_CONDUCTOR_PAUSED: true` flag and `OP_CONDUCTOR_RAFT_BOOTSTRAP` flag.
158215
</Steps>
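
The RPC requests named in the steps above all share the JSON-RPC 2.0 request shape used by the `curl` examples on this page. As a minimal sketch, the helper below builds those request bodies without sending them. The function name `conductor_request` and the server id, address, and version values are hypothetical placeholders, not values from a real deployment.

```python
import json
from itertools import count

# Monotonically increasing JSON-RPC request ids.
_request_ids = count(1)


def conductor_request(method: str, *params) -> str:
    """Serialize a JSON-RPC 2.0 request body for a conductor RPC method."""
    body = {
        "jsonrpc": "2.0",
        "method": method,
        "params": list(params),
        "id": next(_request_ids),
    }
    return json.dumps(body)


# Request bodies in the order the steps use them. The id, address, and
# version passed to conductor_addServerAsVoter are placeholder values.
pause = conductor_request("conductor_pause")
add_voter = conductor_request(
    "conductor_addServerAsVoter", "sequencer-0", "sequencer-0-conductor:50050", 0
)
resume = conductor_request("conductor_resume")
```

Each string can be posted to a conductor's RPC endpoint with any HTTP client, using the `Content-Type: application/json` header as in the `curl` examples.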
159216

217+
#### Blue/Green Deployment
218+
219+
In order to ensure there is no downtime when setting up conductor, you need to
220+
have a deployment script that can update sequencers without network downtime.
221+
222+
An example of this workflow might look like:
223+
224+
1. Query current state of the network and determine which sequencer is
225+
currently active (referred to as "original" sequencer below).
226+
From the other available sequencers, choose a candidate sequencer.
227+
2. Deploy the change to the candidate sequencer and then wait for it to sync
228+
up to the original sequencer's unsafe head. You may want to check peer counts
229+
and other important health metrics.
230+
3. Stop the original sequencer using `admin_stopSequencer` which returns the
231+
last inserted unsafe block hash. Wait for candidate sequencer to sync with
232+
this returned hash in case there is a delta.
233+
4. Start the candidate sequencer at the original's last inserted unsafe block
234+
hash.
235+
    1. Here you can also execute an additional check for unsafe head progression
236+
and decide to roll back the change (stop the candidate sequencer, start the
237+
original, rollback deployment of candidate, etc)
238+
5. Deploy the change to the original sequencer, wait for it to sync to the
239+
chain head. Execute health checks.
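
Steps 3 and 4 of the workflow above (the handover itself) can be sketched as follows. This is an illustrative outline, not the deployment script used at OP Labs. The `rpc` and `unsafe_head` callables are hypothetical helpers you would supply: `rpc(node, method, params)` performs an admin RPC call against a node, and `unsafe_head(node)` returns that node's current unsafe head hash. The sketch assumes `admin_startSequencer` accepts the block hash to start from.

```python
import time
from typing import Callable


def switch_active_sequencer(
    rpc: Callable[[str, str, list], object],
    unsafe_head: Callable[[str], str],
    original: str,
    candidate: str,
    poll_seconds: float = 0.5,
    max_polls: int = 60,
) -> object:
    """Hand sequencing from `original` to `candidate`.

    Returns the unsafe block hash at which the handover happened.
    """
    # Stop the original sequencer; the call returns the last inserted
    # unsafe block hash.
    last_hash = rpc(original, "admin_stopSequencer", [])

    # Wait for the candidate to reach that hash in case there is a delta.
    for _ in range(max_polls):
        if unsafe_head(candidate) == last_hash:
            break
        time.sleep(poll_seconds)
    else:
        # The candidate never caught up: restart the original so block
        # production continues, then surface the failure.
        rpc(original, "admin_startSequencer", [last_hash])
        raise TimeoutError(f"{candidate} did not reach {last_hash}")

    # Start the candidate at the original's last inserted unsafe block hash.
    rpc(candidate, "admin_startSequencer", [last_hash])
    return last_hash
```

Injecting `rpc` and `unsafe_head` keeps the control flow testable without a live network. Checks for unsafe head progression and rollback of the candidate's deployment would wrap around this function.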
240+
241+
#### Post-Conductor Launch Deployments
242+
243+
After conductor is live, a similar canary style workflow is used to ensure
244+
minimal downtime in case there is an issue with deployment:
245+
246+
1. Choose a candidate sequencer from the raft-cluster followers
247+
2. Deploy to the candidate sequencer. Run health checks on the candidate.
248+
3. Transfer leadership to the candidate sequencer using
249+
`conductor_transferLeaderToServer`. Run health checks on the candidate.
250+
4. Test if candidate is still the leader using `conductor_leader` after some
251+
grace period (e.g. 30 seconds)
252+
1. If not, then there is likely an issue with the deployment. Roll back.
253+
5. Upgrade the remaining sequencers, run healthchecks.
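
The canary workflow above can be sketched as a small orchestration function. All callables here (`deploy`, `healthy`, `transfer_leader`, `is_leader`) are hypothetical hooks that you would implement on top of your deployment tooling and the `conductor_transferLeaderToServer` and `conductor_leader` RPC methods. The 30 second default mirrors the example grace period above.

```python
import time
from typing import Callable, Sequence


def canary_deploy(
    sequencers: Sequence[str],
    leader: str,
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    transfer_leader: Callable[[str], None],
    is_leader: Callable[[str], bool],
    grace_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Roll a change out candidate-first.

    Returns True when the rollout completed and False when it should be
    rolled back.
    """
    # 1. Choose a candidate from the raft cluster followers.
    candidate = next(s for s in sequencers if s != leader)

    # 2. Deploy to the candidate and check its health.
    deploy(candidate)
    if not healthy(candidate):
        return False

    # 3. Transfer leadership to the candidate and check its health again.
    transfer_leader(candidate)
    if not healthy(candidate):
        return False

    # 4. After a grace period, the candidate should still be the leader.
    sleep(grace_seconds)
    if not is_leader(candidate):
        return False

    # 5. Upgrade the remaining sequencers and run health checks.
    remaining = [s for s in sequencers if s != candidate]
    for node in remaining:
        deploy(node)
    return all(healthy(node) for node in remaining)
```

A `False` return signals that the caller should roll back whatever has been deployed so far. The function does not perform the rollback itself.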
254+
160255
### Configuration Options
161256

162257
It is configured via its [flags / environment variables](https://github.com/ethereum-optimism/optimism/blob/develop/op-conductor/flags/flags.go)
@@ -495,14 +590,14 @@ AddServerAsVoter adds a server as a voter to the cluster.
495590
<Tabs.Tab>
496591
```sh
497592
curl -X POST -H "Content-Type: application/json" --data \
498-
'{"jsonrpc":"2.0","method":"conductor_addServerAsVoter","params":[],"id":1}' \
593+
'{"jsonrpc":"2.0","method":"conductor_addServerAsVoter","params":[<id>, <addr>, <version>],"id":1}' \
499594
http://127.0.0.1:50050
500595
```
501596
</Tabs.Tab>
502597

503598
<Tabs.Tab>
504599
```sh
505-
cast rpc conductor_addServerAsVoter --rpc-url http://127.0.0.1:50050
600+
cast rpc conductor_addServerAsVoter --rpc-url http://127.0.0.1:50050 <id> <addr> <version>
506601
```
507602
</Tabs.Tab>
508603
</Tabs>

words.txt

Lines changed: 2 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -126,6 +126,7 @@ hardfork
126126
hardforks
127127
HEALTHCHECK
128128
healthcheck
129+
healthchecks
129130
heartbeating
130131
HISTORICALRPC
131132
historicalrpc
@@ -239,7 +240,6 @@ Permissionless
239240
permissionless
240241
permissionlessly
241242
Perps
242-
persistence
243243
personhood
244244
Pimlico
245245
POAP
@@ -341,6 +341,7 @@ therealbytes
341341
threadcreate
342342
tility
343343
timeseries
344+
Tranfer
344345
trustlessly
345346
trustrpc
346347
txfeecap
