172 changes: 172 additions & 0 deletions NODE_MAINTENANCE_CHECKLIST.md
@@ -0,0 +1,172 @@
# Base node maintenance checklist

This document provides a practical, operator-focused checklist for running and maintaining a Base node.

It is intended as a companion to the official documentation and the main README of this repository, not a replacement for them.

---

## 1. Daily checks

These checks are quick and help you catch issues early.

1. **Sync and health**

- Confirm that the node is syncing and up to date with the network tip.
- Use your preferred JSON-RPC client to query:
  - `eth_blockNumber` on the execution client
  - the sync status method on the op-node (for example, `optimism_syncStatus`, where the op-node RPC is exposed)
- Compare the reported block height or timestamps with a trusted public explorer or reference RPC.
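The height comparison above can be sketched with the Python standard library. This is a minimal sketch, not a definitive health check: the helper names (`rpc_call`, `block_lag`, `is_in_sync`) and the default tolerance of 10 blocks are assumptions chosen for illustration.

```python
import json
import urllib.request


def rpc_call(url, method, params=None, timeout=10):
    """POST a JSON-RPC 2.0 request and return its "result" field."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    ).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    if "error" in body:
        raise RuntimeError(f"{method} failed: {body['error']}")
    return body["result"]


def block_lag(local_hex, reference_hex):
    """Blocks the local node is behind the reference (0 if it is ahead)."""
    return max(0, int(reference_hex, 16) - int(local_hex, 16))


def is_in_sync(local_hex, reference_hex, max_lag=10):
    """True when the lag is within max_lag blocks (an assumed tolerance)."""
    return block_lag(local_hex, reference_hex) <= max_lag
```

In use, call `rpc_call(local_url, "eth_blockNumber")` against your own node and again against a reference RPC you trust, then pass both hex strings to `is_in_sync`. Both URLs depend on your deployment and are not given in this checklist.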

2. **Logs**

- Inspect logs for the execution client and op-node:
  - look for repeated errors
  - check for connection issues to L1 RPCs
  - watch for resource exhaustion warnings (disk, memory, file descriptors)
- If you use Docker, `docker compose logs` (with appropriate service names) is usually sufficient for a quick view.
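A quick log pass like the one described above can be partly automated by counting lines that match a few patterns. The patterns below are illustrative assumptions, not the actual log format of any execution client or of the op-node, so adjust them to the messages you really see.

```python
import re
from collections import Counter

# Illustrative patterns only; real client log messages may differ.
PATTERNS = {
    "error": re.compile(r"\b(error|crit|fatal)\b", re.IGNORECASE),
    "l1_connection": re.compile(
        r"connection refused|dial tcp|timeout|timed out", re.IGNORECASE
    ),
    "resources": re.compile(
        r"no space left on device|out of memory|too many open files",
        re.IGNORECASE,
    ),
}


def summarize_logs(lines):
    """Count how many lines match each category (one count per line per category)."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

The function accepts any iterable of strings, so it can read a saved log file or lines piped from `docker compose logs`. A count that keeps growing from one day to the next is a reasonable cue to read the matching lines in full.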

3. **Disk usage**

- Check that the data volume still has enough free space to accommodate chain growth and snapshots.
- If possible, monitor both:
  - filesystem-level free space
  - size of the actual node data directory.
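Both measurements can be taken with the standard library. This is a sketch under the assumption that the node data lives in a single directory tree; the function names are hypothetical.

```python
import os
import shutil


def free_space_ratio(path):
    """Fraction of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def directory_size_bytes(path):
    """Sum of regular file sizes under `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
            except OSError:
                # Files can disappear while a live node is being walked.
                continue
    return total
```

Recording both numbers daily makes it possible to tell chain growth (the directory grows) apart from other consumers of the same volume (free space shrinks faster than the directory grows).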

4. **Basic RPC functionality**

- Run a simple RPC call (for example, `eth_chainId` and a basic `eth_call` or `eth_getBalance`) to confirm the node is responding correctly.
- If the node is behind a load balancer or reverse proxy, test through the same path your applications use.
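A basic RPC probe might look like the following. It is a minimal sketch: the helper names are hypothetical, and the expected chain ID defaults to Base mainnet (8453), which you would change for a testnet node.

```python
import json
import urllib.request

BASE_MAINNET_CHAIN_ID = 8453  # 0x2105; Base Sepolia uses 84532


def build_request(method, params=None, request_id=1):
    """Build a JSON-RPC 2.0 request body as a dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params or [],
    }


def post(url, request, timeout=10):
    """Send the request to `url` and return the decoded JSON response."""
    data = json.dumps(request).encode()
    http_request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(http_request, timeout=timeout) as response:
        return json.load(response)


def check_chain_id(response, expected=BASE_MAINNET_CHAIN_ID):
    """Return None when the response is healthy, else a short problem description."""
    if "error" in response:
        return f"RPC error: {response['error']}"
    result = response.get("result")
    if not isinstance(result, str) or not result.startswith("0x"):
        return f"unexpected result: {result!r}"
    actual = int(result, 16)
    if actual != expected:
        return f"chain ID {actual} does not match expected {expected}"
    return None
```

To test the application path, pass the load balancer or proxy URL to `post` rather than the node's local address, for example `check_chain_id(post(url, build_request("eth_chainId")))`.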

---

## 2. Weekly checks

Once a week, it is useful to do a deeper pass:

1. **Resource usage and trends**

- Review CPU, memory and network usage over the past week.
- Confirm that usage patterns match your expectations (for example, no steady upward drift in memory usage).
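One way to make "steady upward drift" concrete is to fit a straight line to evenly spaced memory samples and compare the fitted growth with the mean. The sketch below uses ordinary least squares; the 10% threshold is an arbitrary assumption, and the samples are assumed to come from your own monitoring system.

```python
def slope_per_sample(values):
    """Least-squares slope of `values` against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def has_upward_drift(values, min_relative_growth=0.10):
    """True if the fitted line rises by more than the given fraction of the mean."""
    growth = slope_per_sample(values) * (len(values) - 1)
    mean = sum(values) / len(values)
    return mean > 0 and growth / mean > min_relative_growth
```

A fitted line is less sensitive to a single spike than comparing the first and last sample, which matters for processes that briefly allocate memory during bursts of activity.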

2. **Snapshots and backups**

- Verify that any snapshot or backup processes you rely on are completing successfully.
- Confirm that you can restore from a recent snapshot in a non-production environment.

3. **Configuration drift**

- Compare your current configuration (`.env`, Docker settings, orchestration configs) against your desired baseline.
- Make sure ad-hoc changes made during debugging are either reverted or documented.
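For `.env`-style files, drift against a baseline can be reported as added, removed and changed keys. This sketch assumes simple `KEY=value` lines and does not handle quoting, `export` prefixes or multi-line values.

```python
def parse_env(text):
    """Parse KEY=value lines, ignoring blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def env_drift(baseline_text, current_text):
    """Compare two env files and list the keys that differ."""
    baseline = parse_env(baseline_text)
    current = parse_env(current_text)
    shared = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "changed": sorted(k for k in shared if baseline[k] != current[k]),
    }
```

The report lists key names only, never values, so it can be pasted into a ticket without leaking secrets. Keeping the baseline in version control gives each intentional change a reviewable history.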

4. **Software versions**

- Check for new releases of:
  - this repository (Base node images)
  - your execution client (reth, geth, nethermind)
  - any additional tooling you rely on.
- Read release notes and decide whether an upgrade is appropriate.

---

## 3. Before performing an upgrade

When planning to upgrade your Base node:

1. **Read release notes carefully**

- Look for breaking changes, configuration migrations or new environment variables.
- Pay particular attention to sections aimed at node operators and infrastructure.

2. **Plan a maintenance window**

- If this node is part of a production setup or behind a load balancer, schedule a time when it can be taken out of rotation.
- Ensure that other nodes or fallback infrastructure can handle traffic while this instance is being upgraded.

3. **Take a snapshot or backup**

- Create a snapshot or backup of the node data and configuration before the upgrade.
- Verify that the snapshot is complete and accessible.
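If the snapshot is stored as a single archive file, one way to check completeness is to record a SHA-256 digest when it is created and compare it before relying on the file. This is a sketch under that assumption; snapshot tooling that provides its own integrity check should take precedence.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Stream the file in chunks so large snapshots do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_matches(path, expected_hex):
    """True when the file's digest equals the digest recorded at creation time."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A matching digest shows only that the file is unchanged since the digest was recorded. It does not show that the node can start from it, which is why a periodic test restore remains useful.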

4. **Prepare rollback steps**

- Decide in advance how you will roll back if the upgrade exposes unexpected issues:
  - previous container images
  - previous configuration
  - snapshot restore plan.

---

## 4. After an upgrade

Once the node has been upgraded:

1. **Verify sync and health**

- Confirm that the node starts successfully and resumes syncing.
- Re-run your health and sync checks (block height, RPC probes, logs).
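To confirm that the node has resumed syncing rather than merely started, it can help to sample the head block a few times and check that it advances. In this sketch `get_block_number` is a hypothetical callable that you supply (for example, a wrapper around `eth_blockNumber`), and the sample count and interval are arbitrary defaults.

```python
import time


def sample_heads(get_block_number, count=3, interval_seconds=5.0, sleep=time.sleep):
    """Call get_block_number `count` times, sleeping between calls."""
    heads = []
    for index in range(count):
        if index:
            sleep(interval_seconds)
        heads.append(get_block_number())
    return heads


def head_is_advancing(heads, min_blocks=1):
    """True if the head never moves backwards and gains at least min_blocks overall."""
    if len(heads) < 2:
        raise ValueError("need at least two samples")
    no_regression = all(
        later >= earlier for earlier, later in zip(heads, heads[1:])
    )
    return no_regression and heads[-1] - heads[0] >= min_blocks
```

The `sleep` parameter exists so the sampling loop can be exercised in tests without waiting. In normal use the default `time.sleep` is fine.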

2. **Validate configuration**

- Make sure new configuration options are set correctly.
- Confirm that deprecated options are either removed or updated.

3. **Run a small functional test**

- If possible, perform a small end-to-end test using your normal application workflow:
  - connect via RPC
  - send a simple transaction
  - confirm it is processed as expected.

4. **Monitor closely**

- Monitor logs and metrics more closely than usual during the first hours after the upgrade.
- Watch for regressions in performance, memory usage, or error rates.

---

## 5. Incident response quick reference

If your Base node appears unhealthy or out of sync:

1. **Take it out of rotation**

- If the node is serving production traffic behind a load balancer, remove it from the pool to avoid impacting users.

2. **Collect information**

- Capture:
  - recent logs
  - resource utilisation snapshots
  - configuration files or environment (with secrets removed)
  - observations from your monitoring system.
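Removing secrets from an environment file before sharing it can be partly automated. The sketch below is heuristic: the key-name pattern and the variable names in the example are assumptions, and it will not catch secrets embedded in a URL path (for example, a provider API key), so review the output by hand before sending it anywhere.

```python
import re

# Heuristic: treat keys containing these words as secret (an assumption).
SECRET_KEY_PATTERN = re.compile(r"(KEY|SECRET|TOKEN|PASSWORD|JWT)", re.IGNORECASE)
# Matches "user:password@" style credentials inside a URL.
URL_CREDENTIALS_PATTERN = re.compile(r"(://)[^/@\s]+@")


def redact_env(text):
    """Return the env file text with likely secrets replaced by <redacted>."""
    redacted = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key, _, value = stripped.partition("=")
            if SECRET_KEY_PATTERN.search(key):
                value = "<redacted>"
            else:
                value = URL_CREDENTIALS_PATTERN.sub(r"\1<redacted>@", value)
            redacted.append(f"{key}={value}")
        else:
            # Keep comments and blank lines as they are.
            redacted.append(line)
    return "\n".join(redacted)
```

For example, a hypothetical line `API_KEY=abc123` becomes `API_KEY=<redacted>`, while `PORT=8545` is left unchanged.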

3. **Check for known issues**

- Review:
  - recent releases
  - open issues or discussions related to the symptoms you are seeing.

4. **Decide between repair and rebuild**

- In some cases, it is faster and safer to stop the node and either restore from a known-good snapshot or resync from scratch than to attempt to repair potentially corrupted state.

5. **Document and share learnings**

- Keep a short incident log so that future maintenance and automation can prevent similar issues.
- When appropriate, share anonymized details with maintainers as part of a bug report.

---

## 6. Notes

This checklist is intentionally generic and should be adapted to your environment:

- the exact commands you use may differ depending on how you deploy the node
- you may have additional compliance, security, or observability requirements to follow.

Treat this document as a starting point and refine it as you gain operational experience with your Base nodes.