troubleshoot excessive snapshot rebalance/recovery rates #14061

Merged · 2 commits · Jun 22, 2022
_includes/v22.1/prod-deployment/resolution-inverted-lsm.md (2 changes: 1 addition & 1 deletion)
@@ -1 +1 @@
If LSM compaction falls behind, throttle your workload concurrency to allow compaction to catch up and restore a healthy LSM shape. {% include {{ page.version.version }}/prod-deployment/prod-guidance-connection-pooling.md %} If a node is severely impacted, you can [start a new node](cockroach-start.html) and then [decommission the problematic node](node-shutdown.html?filters=decommission#remove-nodes).
If compaction has fallen behind and caused an [inverted LSM](architecture/storage-layer.html#inverted-lsms), throttle your workload concurrency to allow compaction to catch up and restore a healthy LSM shape. {% include {{ page.version.version }}/prod-deployment/prod-guidance-connection-pooling.md %} If a node is severely impacted, you can [start a new node](cockroach-start.html) and then [decommission the problematic node](node-shutdown.html?filters=decommission#remove-nodes).
v22.1/architecture/replication-layer.md (2 changes: 1 addition & 1 deletion)
@@ -52,7 +52,7 @@ Because this log is treated as serializable, it can be replayed to bring a node

In versions prior to v21.1, CockroachDB only supported _voting_ replicas: that is, [replicas](overview.html#architecture-replica) that participate as voters in the [Raft consensus protocol](#raft). However, the need for all replicas to participate in the consensus algorithm meant that increasing the [replication factor](../configure-replication-zones.html#num_replicas) came at a cost of increased write latency, since the additional replicas needed to participate in Raft [quorum](overview.html#architecture-overview-consensus).

In order to provide [better support for multi-region clusters](../multiregion-overview.html), (including the features that make [fast multi-region reads](../multiregion-overview.html#global-tables) and [surviving region failures](../multiregion-overview.html#surviving-region-failures) possible), a new type of replica is introduced: the _non-voting_ replica.
In order to provide [better support for multi-region clusters](../multiregion-overview.html) (including the features that make [fast multi-region reads](../multiregion-overview.html#global-tables) and [surviving region failures](../multiregion-overview.html#surviving-region-failures) possible), a new type of replica is introduced: the _non-voting_ replica.

Non-voting replicas follow the [Raft log](#raft-logs) (and are thus able to serve [follower reads](../follower-reads.html)), but do not participate in quorum. They have almost no impact on write latencies.
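As a sketch of how this can surface in configuration (the database name here is hypothetical, and `num_voters` is available in v21.1 and later), a zone configuration can set the total replica count and the voter count separately; the difference is made up of non-voting replicas:

{% include copy-clipboard.html %}
~~~ sql
-- Keep 5 replicas of the database, but only 3 voting replicas;
-- the remaining 2 replicas are non-voting.
ALTER DATABASE mydb CONFIGURE ZONE USING num_replicas = 5, num_voters = 3;
~~~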

v22.1/cluster-setup-troubleshooting.md (46 changes: 37 additions & 9 deletions)
@@ -9,7 +9,7 @@ If you're having trouble starting or scaling your cluster, this page will help y

To use this guide, it's important to understand some of CockroachDB's terminology:

- A **Cluster** acts as a single logical database, but is actually made up of many cooperating nodes.
- A **cluster** acts as a single logical database, but is actually made up of many cooperating nodes.
- **Nodes** are single instances of the `cockroach` binary running on a machine. It's possible (though atypical) to have multiple nodes running on a single machine.

## Cannot run a single-node CockroachDB cluster
@@ -18,7 +18,7 @@ Try running:

{% include copy-clipboard.html %}
~~~ shell
$ cockroach start-single-node --insecure --logtostderr
$ cockroach start-single-node --insecure
~~~

If the process exits prematurely, check for the following:
@@ -41,7 +41,7 @@ When starting a node, the directory you choose to store the data in also contain
~~~
{% include copy-clipboard.html %}
~~~ shell
$ cockroach start-single-node --insecure --logtostderr
$ cockroach start-single-node --insecure
~~~

### Toolchain incompatibility
@@ -90,13 +90,13 @@ You should see a list of the built-in databases:

If you’re not seeing the output above, check for the following:

- `connection refused` error, which indicates you have not included some flag that you used to start the node. We have additional troubleshooting steps for this error [here](common-errors.html#connection-refused).
- The node crashed. To ascertain if the node crashed, run `ps | grep cockroach` to look for the `cockroach` process. If you cannot locate the `cockroach` process (i.e., it crashed), [file an issue](file-an-issue.html), including the logs from your node and any errors you received.
- `connection refused` error, which indicates you have not included some flag that you used to start the node. We have additional troubleshooting steps for this error [here](common-errors.html#connection-refused).
- The node crashed. To ascertain if the node crashed, run `ps | grep cockroach` (see the sketch below) to look for the `cockroach` process. If you cannot locate the `cockroach` process (i.e., it crashed), [file an issue](file-an-issue.html), including the [logs from your node](configure-logs.html#logging-directory) and any errors you received.
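A slightly more robust form of that process check, as a sketch (standard Unix `ps` and `grep` only):

{% include copy-clipboard.html %}
~~~ shell
# List all processes, drop the grep command itself from the results,
# and match any running cockroach process.
$ ps aux | grep -v grep | grep cockroach
~~~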

## Cannot run a multi-node CockroachDB cluster on the same machine

{{site.data.alerts.callout_info}}
Running multiple nodes on a single host is useful for testing out CockroachDB, but it's not recommended for production deployments. To run a physically distributed cluster in production, see [Manual Deployment](manual-deployment.html) or [Orchestrated Deployment](orchestration.html). Also be sure to review the [Production Checklist](recommended-production-settings.html).
Running multiple nodes on a single host is useful for testing CockroachDB, but it's not recommended for production deployments. To run a physically distributed cluster in production, see [Manual Deployment](manual-deployment.html) or [Kubernetes Overview](kubernetes-overview.html). Also be sure to review the [Production Checklist](recommended-production-settings.html).
{{site.data.alerts.end}}

If you are trying to run all nodes on the same machine, you might get the following errors:
@@ -119,9 +119,11 @@ ERROR: cockroach server exited with error: consider changing the port via --list

**Solution:** Change the `--port`, `--http-port` flags for each new node that you want to run on the same machine.
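For example, a second node on the same machine can be given its own store directory and ports using the equivalent `--listen-addr`/`--http-addr` flags (all values below are illustrative):

{% include copy-clipboard.html %}
~~~ shell
$ cockroach start --insecure --store=node2 \
--listen-addr=localhost:26258 --http-addr=localhost:8081 \
--join=localhost:26257
~~~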

## Cannot join a node to an existing CockroachDB cluster
## Scaling issues

### Store directory already exists
### Cannot join a node to an existing CockroachDB cluster

#### Store directory already exists

When joining a node to a cluster, you might receive one of the following errors:

@@ -151,7 +153,7 @@ node belongs to cluster {"cluster hash"} but is attempting to connect to a gossi
$ cockroach start --join=<cluster host>:26257 <other flags>
~~~

### Incorrect `--join` address
#### Incorrect `--join` address

If you try to add another node to the cluster, but the `--join` address is not pointing at any of the existing nodes, then the process will never complete, and you'll see a continuous stream of warnings like this:

@@ -164,6 +166,32 @@ W180817 17:01:56.510430 914 vendor/google.golang.org/grpc/clientconn.go:1293 grp

**Solution:** To successfully join the node to the cluster, start the node again, but this time include a correct `--join` address.
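As a sketch (the host names are illustrative), listing several existing nodes in `--join` makes the join resilient to any single node being down:

{% include copy-clipboard.html %}
~~~ shell
$ cockroach start --join=node1:26257,node2:26257,node3:26257 <other flags>
~~~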

### Performance is degraded when adding nodes

#### Excessive snapshot rebalance and recovery rates

The `kv.snapshot_rebalance.max_rate` and `kv.snapshot_recovery.max_rate` [cluster settings](cluster-settings.html) set the rate limits at which [snapshots](architecture/replication-layer.html#snapshots) are sent to nodes. These settings can be temporarily increased to expedite replication during an outage or when scaling a cluster up or down.
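For example, during a planned scaling event you might temporarily raise both settings (the `64 MiB` value is illustrative, not a recommendation; see the guidance below):

{% include_cached copy-clipboard.html %}
~~~ sql
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '64 MiB';
~~~

{% include_cached copy-clipboard.html %}
~~~ sql
SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '64 MiB';
~~~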

However, if the settings are too high when nodes are added to the cluster, this can cause degraded performance and node crashes. We recommend **not** increasing these values by more than 2 times their [default values](cluster-settings.html) without explicit approval from Cockroach Labs.
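To check the current value before changing anything:

{% include_cached copy-clipboard.html %}
~~~ sql
SHOW CLUSTER SETTING kv.snapshot_rebalance.max_rate;
~~~

Run the same statement for `kv.snapshot_recovery.max_rate`.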

**Explanation:** If `kv.snapshot_rebalance.max_rate` and `kv.snapshot_recovery.max_rate` are set too high for the cluster during scaling, nodes can ingest data faster than compaction can keep up, resulting in an [inverted LSM](architecture/storage-layer.html#inverted-lsms).

**Solution:** [Check LSM health](common-issues-to-monitor.html#lsm-health). {% include {{ page.version.version }}/prod-deployment/resolution-inverted-lsm.md %}

After compaction has completed, lower `kv.snapshot_rebalance.max_rate` and `kv.snapshot_recovery.max_rate` to their [default values](cluster-settings.html). As you add nodes to the cluster, slowly increase both cluster settings, if desired. This will control the rate of new ingestions for newly added nodes. Meanwhile, monitor the cluster for unhealthy increases in [IOPS](common-issues-to-monitor.html#disk-iops) and [CPU](common-issues-to-monitor.html#cpu).
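One lightweight way to watch compaction health from the command line, as a sketch (this assumes a local insecure node with the default HTTP port, and the Prometheus-format `/_status/vars` endpoint):

{% include_cached copy-clipboard.html %}
~~~ shell
# A sustained rise in read amplification suggests compaction is falling behind.
$ curl -s http://localhost:8080/_status/vars | grep rocksdb_read_amplification
~~~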

Outside of performing cluster maintenance, return `kv.snapshot_rebalance.max_rate` and `kv.snapshot_recovery.max_rate` to their [default values](cluster-settings.html).

{% include_cached copy-clipboard.html %}
~~~ sql
RESET CLUSTER SETTING kv.snapshot_rebalance.max_rate;
~~~

{% include_cached copy-clipboard.html %}
~~~ sql
RESET CLUSTER SETTING kv.snapshot_recovery.max_rate;
~~~

## Client connection issues

If a client cannot connect to the cluster, check basic network connectivity (`ping`), port connectivity (`telnet`), and certificate validity.
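As a sketch (the host is illustrative; 26257 is the default SQL port):

{% include copy-clipboard.html %}
~~~ shell
$ ping <node address>
$ telnet <node address> 26257
~~~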
v22.1/common-issues-to-monitor.md (2 changes: 1 addition & 1 deletion)
@@ -78,7 +78,7 @@ If workload concurrency exceeds CPU resources, you will observe:

#### LSM health

Issues at the [storage layer](architecture/storage-layer.html), including a misshapen LSM and high [read amplification](architecture/storage-layer.html#read-amplification), can be observed when compaction falls behind due to insufficient CPU.
Issues at the storage layer, including an [inverted LSM](architecture/storage-layer.html#inverted-lsms) and high [read amplification](architecture/storage-layer.html#read-amplification), can be observed when compaction falls behind due to insufficient CPU or excessively high [recovery and rebalance rates](cluster-setup-troubleshooting.html#excessive-snapshot-rebalance-and-recovery-rates).

- The [**LSM L0 Health**](ui-overload-dashboard.html#lsm-l0-health) graph on the Overload dashboard shows the health of the [persistent stores](architecture/storage-layer.html), which are implemented as log-structured merge (LSM) trees. Level 0 is the highest level of the LSM tree and consists of files containing the latest data written to the [Pebble storage engine](cockroach-start.html#storage-engine). For more information about LSM levels and how LSMs work, see [Log-structured Merge-trees](architecture/storage-layer.html#log-structured-merge-trees).

v22.1/reset-cluster-setting.md (2 changes: 1 addition & 1 deletion)
@@ -5,7 +5,7 @@ toc: true
docs_area: reference.sql
---

The `RESET` [statement](sql-statements.html) resets a [cluster setting](set-cluster-setting.html) to its default value for the client session..
The `RESET` [statement](sql-statements.html) resets a [cluster setting](set-cluster-setting.html) to its default value for the client session.
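For example (the setting shown is arbitrary):

{% include_cached copy-clipboard.html %}
~~~ sql
RESET CLUSTER SETTING sql.metrics.statement_details.enabled;
~~~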


## Required privileges