Add a Snapshot Repository page to the Operate section #518

Merged 3 commits on Feb 5, 2025

30 changes: 30 additions & 0 deletions docs/deploy/snapshotting.mdx
@@ -0,0 +1,30 @@
---
sidebar_position: 5
description: "Configure Restate snapshotting."
---

import Admonition from '@theme/Admonition';

# Snapshotting

Restate workers can be configured to periodically publish snapshots of their partition state to a shared destination. Snapshots act as a form of backup and allow nodes that had not previously served a partition to bootstrap a copy of its state. Without snapshots, rebuilding a partition processor would require the full replay of the partition's log. Replaying the log might take a long time or even be impossible if the log was trimmed.

Restate clusters should always be configured with a snapshot repository to allow nodes to efficiently share partition state. Restate currently supports using Amazon S3 (or another API-compatible object store) as a shared snapshot repository. To set up a snapshot destination, update your server configuration as follows:

```toml
[worker.snapshots]
destination = "s3://snapshots-bucket/cluster-prefix"
snapshot-interval-num-records = 10000
```

This enables automated periodic snapshots to be written to the specified bucket. You can also trigger snapshot creation manually using the [`restatectl`](/operate/control) `create-snapshot` command. We recommend testing the snapshot configuration by requesting a snapshot and examining the contents of the bucket. You should see a new prefix with each partition's id, and a `latest.json` file pointing to the most recent snapshot.
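
As a rough illustration, a repository for a cluster with two partitions might end up with a layout like the following; the per-partition prefixes and `latest.json` markers are what to look for, while the snapshot object names shown here are just placeholders:

```
s3://snapshots-bucket/cluster-prefix/
  0/
    latest.json
    <snapshot objects>
  1/
    latest.json
    <snapshot objects>
```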

No additional configuration is required to enable restoring snapshots. When a partition processor starts up and finds no local partition state, it will attempt to restore the latest snapshot from the repository. This allows for efficient bootstrapping of additional partition workers.

<Admonition type="tip" title="Experimenting with snapshots without an object store">
For testing purposes, you can also use the `file://` protocol to publish snapshots to a local directory. This is mostly useful when experimenting with multi-node configurations on a single machine. The `file` provider does not support conditional updates, which makes it unsuitable for potentially contended operation.
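
A minimal local-testing configuration might look like the following (the directory path is illustrative):

```toml
[worker.snapshots]
destination = "file:///tmp/restate-snapshots"
snapshot-interval-num-records = 10000
```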
</Admonition>

For S3 bucket destinations, Restate will use the AWS credentials available from the environment, or the configuration profile specified by the `AWS_PROFILE` environment variable, falling back to the default AWS profile.

To learn more about how Restate uses snapshots, see [Logs and Snapshots](/operate/logs-snapshots).
2 changes: 2 additions & 0 deletions docs/operate/architecture.mdx
@@ -41,3 +41,5 @@ Restate uses a distributed log to durably record all events in the system before
**Log server** nodes running the `log-server` role are responsible for durably persisting the log. If the log is the equivalent of a WAL, then partition stores are the materializations that enable efficient reads of the events (invocation journals, key-value data) that have been recorded. Depending on the configured **log replication** requirements, Restate will allocate multiple log servers to persist a given log, and this will change over time to support maintenance and resizing of the cluster.

The **partition processor** is the Restate component responsible for maintaining the partition store. This runs on nodes assigned the `worker` role. Partition processors can operate in either leader or follower mode. Only a single leader for a given partition can be active at a time, and this is the sole processor that handles invocations to deployed services. Followers keep up with the log without taking action, and are ready to take over in the event that the partition's leader becomes unavailable. The overall number of processors per partition is configurable via the **partition replication** configuration option.

Partition processors replicate their state by following and applying the log for their partition. If a processor needs to stop, for example for scheduled maintenance, it will typically catch up on the records it missed by reading them from the cluster's log servers once it comes back online. Occasionally, a worker node might lose a disk - or you might need to grow your cluster by adding fresh nodes to it. In these cases, it's far more efficient to obtain a **snapshot** of the partition state from a recent point in time than to replay all the missing log events. Restate clusters can be configured to use an external **object store** as the snapshot repository, allowing partition processors to skip ahead in the log. This also enables us to **trim logs** which might otherwise grow unboundedly.
8 changes: 4 additions & 4 deletions docs/operate/data-backup.mdx
@@ -1,14 +1,14 @@
---
sidebar_position: 10
description: "Strategies for backing up and restoring the Restate data store"
sidebar_position: 11
description: "Backing up and restoring the Restate data store on single nodes"
---

import Admonition from '@theme/Admonition';

# Data Backup

<Admonition type="note">
Future versions of Restate will support distributed deployment with spanning multiple machnes enhancing the availability you can achieve with your Restate cluster. This document only covers single-node Restate deployments.
This page covers backing up individual Restate server instances. For sharing snapshots in a Restate cluster environment, see [Logs and Snapshots](/operate/logs-snapshots).
</Admonition>

The Restate server persists both metadata (such as the details of deployed services, in-flight invocations) and data (e.g., virtual object and workflow state keys) in its data store, which is located in its base directory (by default, the `restate-data` path relative to the startup working directory). Restate is configured to perform write-ahead logging with fsync enabled to ensure that effects are fully persisted before being acknowledged to participating services.
@@ -19,7 +19,7 @@ In addition to the data store, you should also make sure you have a back up of t

## Restoring Backups

<Admonition type={"caution"} title={"Avoid multiple instances of Restate"}>
<Admonition type={"caution"} title={"Prevent multiple instances of the same node"}>
Restate cannot guarantee that it is the only instance of the given node. You must ensure that only one instance of any given Restate node is running when restoring the data store from a backup. Running multiple instances could lead to a "split-brain" scenario where different servers process invocations for the same set of services, causing state divergence.
</Admonition>

39 changes: 39 additions & 0 deletions docs/operate/logs-snapshots.mdx
@@ -0,0 +1,39 @@
---
sidebar_position: 10
description: "How Restate uses logs and snapshots"
---

import Admonition from '@theme/Admonition';

# Logs and Snapshots

In a distributed environment, the shared log is the mechanism for replicating partition state among nodes. It is therefore critical that all cluster members can obtain all the relevant events recorded in the log, including newly built nodes that join the cluster in the future. This requirement is at odds with an immutable log growing unboundedly. Snapshots enable log trimming: the process of removing older segments of the log.

When partition processors successfully publish a snapshot, they update their "archived" log sequence number (LSN). This reflects the position in the log at which the snapshot was taken and allows the cluster to safely trim its logs.

## Log trimming

By default, Restate will attempt to trim logs once an hour, which you can override or disable in the server configuration:

```toml
[admin]
log-trim-interval = "1h"
```

This interval only determines how often trimming is evaluated; it is not a guarantee that logs will be trimmed. Restate will automatically determine the appropriate safe trim point for each partition's log.

If replicated logs are in use in a clustered environment, the safe trim point for a log will be determined based on the archived LSN. If a snapshot repository is not configured, then archived LSNs are not reported. Instead, the safe trim point will be determined by the smallest persisted LSN reported across all known processors for the given partition. Single-node local-only logs are also trimmed based on the partitions' persisted LSNs.

The presence of any dead nodes in a cluster will cause trimming to be suspended for all partitions, unless a snapshot repository is configured. This is because Restate cannot know which partitions may reside on the unreachable nodes; if their logs were trimmed, those processors would become stuck when the node comes back.

When a node starts up with pre-existing partition state and finds that the partition's log has been trimmed to a point beyond the most recent locally-applied LSN, the node will attempt to download the latest snapshot from the configured repository. If a suitable snapshot is available, the processor will re-bootstrap its local state and resume applying the log.

<Admonition type="note" title="Handling log trim gap errors">
If you observe repeated `Shutting partition processor down because it encountered a trim gap in the log.` errors in the Restate server log, it is an indication that a processor is failing to start up due to missing log records. To recover, you must ensure that a snapshot repository is correctly configured and accessible from the node reporting errors. You can still recover even if no snapshots were taken previously, as long as there is at least one healthy node with a copy of the partition data. In that case, you must first configure the existing node(s) to publish snapshots for the affected partition(s) to a shared destination.
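
As a sketch, each healthy node would need a snapshot destination in its server configuration, along these lines (the bucket name and prefix are illustrative):

```toml
[worker.snapshots]
destination = "s3://snapshots-bucket/cluster-prefix"
```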
</Admonition>

## Pruning the snapshot repository

<Admonition type="warning" title="Pruning">
Restate does not currently support pruning older snapshots from the snapshot repository. We recommend implementing an object lifecycle policy directly in the object store to manage retention.
</Admonition>
10 changes: 10 additions & 0 deletions docs/operate/operate.mdx
@@ -8,6 +8,11 @@ import ExampleWidget from "../../src/components/ExampleWidget";


<ExampleWidget boxStyling={"community-box"} features={[
{
title: 'Restate Architecture',
description: "Learn more about Restate's internal architecture.",
singleLink: "/operate/architecture",
},
{
title: 'Registration',
description: "Tell Restate where to reach your services.",
@@ -48,6 +53,11 @@ import ExampleWidget from "../../src/components/ExampleWidget";
description: "Upgrade your Restate Server.",
singleLink: "/operate/upgrading"
},
{
title: 'Logs and Snapshots',
description: "Enable log trimming in clusters using snapshots.",
singleLink: "/operate/logs-snapshots"
},
{
title: 'Data Backup',
description: "Backup the Restate Server data.",