Add syncer error recovery troubleshooting documentation #1624

Merged
merged 9 commits into from
May 29, 2025
49 changes: 45 additions & 4 deletions content/operate/rs/databases/active-active/syncer.md
@@ -28,19 +28,19 @@ When a new primary is appointed, the replication ID changes, but a partial sync


In a partial sync, the backlog of operations since the offset is transferred as raw operations.
In a full sync, the data from the primary is transferred to the replica as an RDB file, followed by a partial sync.

Partial synchronization requires a backlog large enough to store the data operations until the connection is restored. See [replication backlog]({{< relref "/operate/rs/databases/active-active/manage#replication-backlog" >}}) for more information about changing the replication backlog size.
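
As a rough illustration of that sizing requirement, the backlog has to hold at least the bytes written while the link is down. The sketch below multiplies an assumed write rate by an assumed outage length. The helper name `min_backlog_bytes` and both inputs are hypothetical, and this is a back-of-the-envelope estimate rather than a formula taken from the product documentation.

```sh
# Hypothetical helper: estimate the smallest backlog (in bytes) that can
# absorb writes during an outage. Both arguments are assumptions you supply.
#   $1 = average write traffic in bytes per second
#   $2 = longest outage to survive, in seconds
min_backlog_bytes() {
  echo $(( $1 * $2 ))
}

# Example: 1 MiB/s of writes and a 60-second disconnection.
min_backlog_bytes 1048576 60
```

With these example inputs the estimate is 62914560 bytes (60 MiB). A real deployment would likely add headroom on top of this figure.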

### Syncer in Active-Active replication

In the case of an Active-Active database:

- Multiple past replication IDs and offsets are stored to allow for multiple syncs
- The [Active-Active replication backlog]({{< relref "/operate/rs/databases/active-active/manage#replication-backlog" >}}) is also sent to the replica during a full sync.

{{< warning >}}
Full sync triggers heavy data transfers between geo-replicated instances of an Active-Active database.
{{< /warning >}}

An Active-Active database uses partial synchronization in the following situations:
@@ -53,4 +53,45 @@ An Active-Active database uses partial synchronization in the following situatio

{{< note >}}
Synchronization of data from the primary shard to the replica shard is always a full synchronization.
{{< /note >}}

## Troubleshooting syncer errors

### Unrecoverable syncer errors

Some syncer errors are unrecoverable and cause the syncer to exit with exit code 4. When this occurs, the Data Management Controller (DMC) automatically sets the `crdt_sync` or `replica_sync` value to `stopped`.
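
One way to confirm that a syncer has stopped is to read the database object from the REST API and look for a `stopped` value in either field. The following is a minimal sketch: the `syncer_stopped` helper is hypothetical, and it assumes the response is compact JSON in which each key and its value appear on the same line.

```sh
# Hypothetical helper: reads a database object (JSON) on stdin and succeeds
# (exit status 0) if crdt_sync or replica_sync is set to "stopped".
# Uses a plain-text match, so it assumes key and value share one line.
syncer_stopped() {
  grep -Eq '"(crdt_sync|replica_sync)"[[:space:]]*:[[:space:]]*"stopped"'
}

# Usage (placeholders as in the rest of this page):
#   curl -s -k -u <username>:<password> \
#     https://<host>:<port>/v1/bdbs/<database-id> | syncer_stopped \
#     && echo "syncer is stopped"
```

A JSON-aware tool gives a more robust check than a text match if one is available in your environment.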

#### Restart syncer for regular databases

To restart a regular database's syncer after an unrecoverable error, [update the database configuration]({{<relref "/operate/rs/references/rest-api/requests/bdbs#put-bdbs">}}) with the REST API to enable `sync`:


```sh
curl -v -k -u <username>:<password> -X PUT \
-H "Content-Type: application/json" \
-d '{"sync": "enabled"}' \
https://<host>:<port>/v1/bdbs/<database-id>
```

#### Restart syncer for Active-Active databases

To restart an Active-Active database's syncer after an unrecoverable error, use one of the following methods.

- For each participating cluster, [update the database configuration]({{<relref "/operate/rs/references/rest-api/requests/bdbs#put-bdbs">}}) with the REST API to enable `sync`:

```sh
curl -v -k -u <username>:<password> -X PUT \
-H "Content-Type: application/json" \
-d '{"sync": "enabled"}' \
https://<host>:<port>/v1/bdbs/<database-id>
```

- Run [`crdb-cli crdb update`]({{<relref "/operate/rs/references/cli-utilities/crdb-cli/crdb/update">}}):

```sh
crdb-cli crdb update --crdb-guid <crdb-guid> --force
```

{{< note >}}
Replace `<username>`, `<password>`, `<host>`, `<port>`, `<database-id>`, and `<crdb-guid>` with your actual values.
{{< /note >}}
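
Because the REST API method for Active-Active databases has to be repeated on each participating cluster, a small loop can save some typing. This sketch wraps the same `PUT` request shown above. The function name `enable_sync_all`, the `DRY_RUN` variable, and the argument layout are hypothetical, and it assumes the same credentials work on every cluster.

```sh
# Hypothetical helper: enable sync on several clusters in turn.
#   $1        = credentials as username:password (assumed shared by all clusters)
#   remaining = pairs of "host:port" and database ID, one pair per cluster
# Set DRY_RUN to a non-empty value to print the commands instead of running them.
enable_sync_all() {
  creds=$1
  shift
  while [ "$#" -ge 2 ]; do
    ${DRY_RUN:+echo} curl -s -k -u "$creds" -X PUT \
      -H "Content-Type: application/json" \
      -d '{"sync": "enabled"}' \
      "https://$1/v1/bdbs/$2"
    shift 2
  done
}
```

For example, `DRY_RUN=1 enable_sync_all <username>:<password> <host1>:<port> <database-id-1> <host2>:<port> <database-id-2>` prints one `curl` command per cluster so you can review them before running without `DRY_RUN`. The database ID is passed per cluster because it is not guaranteed to be the same on every participating cluster.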