From 2a8d00d5beefac5cb9f0051cc2cb2e4e9b733691 Mon Sep 17 00:00:00 2001 From: taroface Date: Tue, 7 Jun 2022 18:56:49 -0400 Subject: [PATCH 1/2] troubleshoot excessive snapshot rebalance/recovery rates --- .../resolution-inverted-lsm.md | 2 +- v22.1/architecture/replication-layer.md | 2 +- v22.1/cluster-setup-troubleshooting.md | 46 +++++++++++++++---- v22.1/common-issues-to-monitor.md | 2 +- v22.1/reset-cluster-setting.md | 2 +- 5 files changed, 41 insertions(+), 13 deletions(-) diff --git a/_includes/v22.1/prod-deployment/resolution-inverted-lsm.md b/_includes/v22.1/prod-deployment/resolution-inverted-lsm.md index 3ae9fb03626..ac505cc6b68 100644 --- a/_includes/v22.1/prod-deployment/resolution-inverted-lsm.md +++ b/_includes/v22.1/prod-deployment/resolution-inverted-lsm.md @@ -1 +1 @@ -If LSM compaction falls behind, throttle your workload concurrency to allow compaction to catch up and restore a healthy LSM shape. {% include {{ page.version.version }}/prod-deployment/prod-guidance-connection-pooling.md %} If a node is severely impacted, you can [start a new node](cockroach-start.html) and then [decommission the problematic node](node-shutdown.html?filters=decommission#remove-nodes). \ No newline at end of file +If compaction has fallen behind and caused an [inverted LSM](architecture/storage-layer.html#inverted-lsms), throttle your workload concurrency to allow compaction to catch up and restore a healthy LSM shape. {% include {{ page.version.version }}/prod-deployment/prod-guidance-connection-pooling.md %} If a node is severely impacted, you can [start a new node](cockroach-start.html) and then [decommission the problematic node](node-shutdown.html?filters=decommission#remove-nodes). \ No newline at end of file diff --git a/v22.1/architecture/replication-layer.md b/v22.1/architecture/replication-layer.md index b55655b2bb0..dc68b6a0db1 100644 --- a/v22.1/architecture/replication-layer.md +++ b/v22.1/architecture/replication-layer.md @@ -52,7 +52,7 @@ Because this log is treated as serializable, it can be replayed to bring a node In versions prior to v21.1, CockroachDB only supported _voting_ replicas: that is, [replicas](overview.html#architecture-replica) that participate as voters in the [Raft consensus protocol](#raft). However, the need for all replicas to participate in the consensus algorithm meant that increasing the [replication factor](../configure-replication-zones.html#num_replicas) came at a cost of increased write latency, since the additional replicas needed to participate in Raft [quorum](overview.html#architecture-overview-consensus). - In order to provide [better support for multi-region clusters](../multiregion-overview.html), (including the features that make [fast multi-region reads](../multiregion-overview.html#global-tables) and [surviving region failures](../multiregion-overview.html#surviving-region-failures) possible), a new type of replica is introduced: the _non-voting_ replica. + In order to provide [better support for multi-region clusters](../multiregion-overview.html) (including the features that make [fast multi-region reads](../multiregion-overview.html#global-tables) and [surviving region failures](../multiregion-overview.html#surviving-region-failures) possible), a new type of replica is introduced: the _non-voting_ replica. Non-voting replicas follow the [Raft log](#raft-logs) (and are thus able to serve [follower reads](../follower-reads.html)), but do not participate in quorum. They have almost no impact on write latencies. 
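For illustration only (not part of this patch), the split between voting and non-voting replicas can be tuned through replication zone variables: `num_voters` caps how many of the `num_replicas` participate in Raft quorum, and the remainder act as non-voting replicas. The database name below is hypothetical.

~~~ sql
-- Hypothetical example: 5 replicas in total, 3 of which vote in Raft quorum;
-- the other 2 are non-voting replicas that can still serve follower reads.
ALTER DATABASE movr CONFIGURE ZONE USING num_replicas = 5, num_voters = 3;

-- Inspect the resulting zone configuration.
SHOW ZONE CONFIGURATION FROM DATABASE movr;
~~~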
diff --git a/v22.1/cluster-setup-troubleshooting.md b/v22.1/cluster-setup-troubleshooting.md index 8f30aa0e654..80c76844257 100644 --- a/v22.1/cluster-setup-troubleshooting.md +++ b/v22.1/cluster-setup-troubleshooting.md @@ -9,7 +9,7 @@ If you're having trouble starting or scaling your cluster, this page will help y To use this guide, it's important to understand some of CockroachDB's terminology: - - A **Cluster** acts as a single logical database, but is actually made up of many cooperating nodes. + - A **cluster** acts as a single logical database, but is actually made up of many cooperating nodes. - **Nodes** are single instances of the `cockroach` binary running on a machine. It's possible (though atypical) to have multiple nodes running on a single machine. ## Cannot run a single-node CockroachDB cluster @@ -18,7 +18,7 @@ Try running: {% include copy-clipboard.html %} ~~~ shell -$ cockroach start-single-node --insecure --logtostderr +$ cockroach start-single-node --insecure ~~~ If the process exits prematurely, check for the following: @@ -41,7 +41,7 @@ When starting a node, the directory you choose to store the data in also contain ~~~ {% include copy-clipboard.html %} ~~~ shell - $ cockroach start-single-node --insecure --logtostderr + $ cockroach start-single-node --insecure ~~~ ### Toolchain incompatibility @@ -90,13 +90,13 @@ You should see a list of the built-in databases: If you’re not seeing the output above, check for the following: -- `connection refused` error, which indicates you have not included some flag that you used to start the node. We have additional troubleshooting steps for this error [here](common-errors.html#connection-refused). -- The node crashed. To ascertain if the node crashed, run `ps | grep cockroach` to look for the `cockroach` process. If you cannot locate the `cockroach` process (i.e., it crashed), [file an issue](file-an-issue.html), including the logs from your node and any errors you received. +- `connection refused` error, which indicates you have not included some flag that you used to start the node. We have additional troubleshooting steps for this error [here](common-errors.html#connection-refused). +- The node crashed. To ascertain if the node crashed, run `ps | grep cockroach` to look for the `cockroach` process. If you cannot locate the `cockroach` process (i.e., it crashed), [file an issue](file-an-issue.html), including the [logs from your node](configure-logs.html#logging-directory) and any errors you received. ## Cannot run a multi-node CockroachDB cluster on the same machine {{site.data.alerts.callout_info}} -Running multiple nodes on a single host is useful for testing out CockroachDB, but it's not recommended for production deployments. To run a physically distributed cluster in production, see [Manual Deployment](manual-deployment.html) or [Orchestrated Deployment](orchestration.html). Also be sure to review the [Production Checklist](recommended-production-settings.html). +Running multiple nodes on a single host is useful for testing CockroachDB, but it's not recommended for production deployments. To run a physically distributed cluster in production, see [Manual Deployment](manual-deployment.html) or [Kubernetes Overview](kubernetes-overview.html). Also be sure to review the [Production Checklist](recommended-production-settings.html). 
{{site.data.alerts.end}} If you are trying to run all nodes on the same machine, you might get the following errors: @@ -119,9 +119,11 @@ ERROR: cockroach server exited with error: consider changing the port via --list **Solution:** Change the `--port`, `--http-port` flags for each new node that you want to run on the same machine. -## Cannot join a node to an existing CockroachDB cluster +## Scaling issues -### Store directory already exists +### Cannot join a node to an existing CockroachDB cluster + +#### Store directory already exists When joining a node to a cluster, you might receive one of the following errors: @@ -151,7 +153,7 @@ node belongs to cluster {"cluster hash"} but is attempting to connect to a gossi $ cockroach start --join=:26257 ~~~ -### Incorrect `--join` address +#### Incorrect `--join` address If you try to add another node to the cluster, but the `--join` address is not pointing at any of the existing nodes, then the process will never complete, and you'll see a continuous stream of warnings like this: @@ -164,6 +166,32 @@ W180817 17:01:56.510430 914 vendor/google.golang.org/grpc/clientconn.go:1293 grp **Solution:** To successfully join the node to the cluster, start the node again, but this time include a correct `--join` address. +### Performance is degraded when adding nodes + +#### Excessive snapshot rebalance and recovery rates + +The `kv.snapshot_rebalance.max_rate` and `kv.snapshot_recovery.max_rate` [cluster settings](cluster-settings.html) set the rate limits at which [snapshots](architecture/replication-layer.html#snapshots) are sent to nodes. These settings can be temporarily increased to expedite replication during an outage or when scaling a cluster up or down. + +However, if the settings are too high when nodes are added to the cluster, this can cause degraded performance and node crashes. + +**Explanation:** If `kv.snapshot_rebalance.max_rate` and `kv.snapshot_recovery.max_rate` are set too high for the cluster during scaling, this can cause nodes to experience ingestions faster than compactions can keep up, and result in an [inverted LSM](architecture/storage-layer.html#inverted-lsms). + +**Solution:** [Check LSM health](common-issues-to-monitor.html#lsm-health). {% include {{ page.version.version }}/prod-deployment/resolution-inverted-lsm.md %} + +After compaction has completed, lower `kv.snapshot_rebalance.max_rate` and `kv.snapshot_recovery.max_rate` to their [default values](cluster-settings.html). As you add nodes to the cluster, slowly increase both cluster settings, if desired. This will control the rate of new ingestions for newly added nodes. Meanwhile, monitor the cluster for unhealthy increases in [IOPS](common-issues-to-monitor.html#disk-iops) and [CPU](common-issues-to-monitor.html#cpu). + +Outside of performing cluster maintenance, return `kv.snapshot_rebalance.max_rate` and `kv.snapshot_recovery.max_rate` to their [default values](cluster-settings.html). + +{% include_cached copy-clipboard.html %} +~~~ sql +RESET CLUSTER SETTING kv.snapshot_rebalance.max_rate; +~~~ + +{% include_cached copy-clipboard.html %} +~~~ sql +RESET CLUSTER SETTING kv.snapshot_recovery.max_rate; +~~~ + ## Client connection issues If a client cannot connect to the cluster, check basic network connectivity (`ping`), port connectivity (`telnet`), and certificate validity. 
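As a rough sketch of those three checks run from a client machine (the hostname, port, and certificate directory are placeholders; adjust them for your deployment):

~~~ shell
# Basic network reachability.
ping node1.example.com

# Port connectivity to the SQL/RPC port (26257 by default).
telnet node1.example.com 26257

# Certificate validity: list the certificates in the local certs directory.
cockroach cert list --certs-dir=certs
~~~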
diff --git a/v22.1/common-issues-to-monitor.md b/v22.1/common-issues-to-monitor.md index 98dc72026f4..9655024216a 100644 --- a/v22.1/common-issues-to-monitor.md +++ b/v22.1/common-issues-to-monitor.md @@ -78,7 +78,7 @@ If workload concurrency exceeds CPU resources, you will observe: #### LSM health -Issues at the [storage layer](architecture/storage-layer.html), including a misshapen LSM and high [read amplification](architecture/storage-layer.html#read-amplification), can be observed when compaction falls behind due to insufficient CPU. +Issues at the storage layer, including an [inverted LSM](architecture/storage-layer.html#inverted-lsms) and high [read amplification](architecture/storage-layer.html#read-amplification), can be observed when compaction falls behind due to insufficient CPU or excessively high [recovery and rebalance rates](cluster-setup-troubleshooting.html#excessive-snapshot-rebalance-and-recovery-rates). - The [**LSM L0 Health**](ui-overload-dashboard.html#lsm-l0-health) graph on the Overload dashboard shows the health of the [persistent stores](architecture/storage-layer.html), which are implemented as log-structured merge (LSM) trees. Level 0 is the highest level of the LSM tree and consists of files containing the latest data written to the [Pebble storage engine](cockroach-start.html#storage-engine). For more information about LSM levels and how LSMs work, see [Log-structured Merge-trees](architecture/storage-layer.html#log-structured-merge-trees). diff --git a/v22.1/reset-cluster-setting.md b/v22.1/reset-cluster-setting.md index b62c0b99f26..a25ff162781 100644 --- a/v22.1/reset-cluster-setting.md +++ b/v22.1/reset-cluster-setting.md @@ -5,7 +5,7 @@ toc: true docs_area: reference.sql --- -The `RESET` [statement](sql-statements.html) resets a [cluster setting](set-cluster-setting.html) to its default value for the client session.. +The `RESET` [statement](sql-statements.html) resets a [cluster setting](set-cluster-setting.html) to its default value for the client session. ## Required privileges From 755094454d5b717eba048f19b0f9928554dd7ffb Mon Sep 17 00:00:00 2001 From: taroface Date: Wed, 15 Jun 2022 17:47:28 -0400 Subject: [PATCH 2/2] add suggested upper limit --- v22.1/cluster-setup-troubleshooting.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/v22.1/cluster-setup-troubleshooting.md b/v22.1/cluster-setup-troubleshooting.md index 80c76844257..b4f1b0c4164 100644 --- a/v22.1/cluster-setup-troubleshooting.md +++ b/v22.1/cluster-setup-troubleshooting.md @@ -172,7 +172,7 @@ W180817 17:01:56.510430 914 vendor/google.golang.org/grpc/clientconn.go:1293 grp The `kv.snapshot_rebalance.max_rate` and `kv.snapshot_recovery.max_rate` [cluster settings](cluster-settings.html) set the rate limits at which [snapshots](architecture/replication-layer.html#snapshots) are sent to nodes. These settings can be temporarily increased to expedite replication during an outage or when scaling a cluster up or down. -However, if the settings are too high when nodes are added to the cluster, this can cause degraded performance and node crashes. +However, if the settings are too high when nodes are added to the cluster, this can cause degraded performance and node crashes. We recommend **not** increasing these values by more than 2 times their [default values](cluster-settings.html) without explicit approval from Cockroach Labs. 
**Explanation:** If `kv.snapshot_rebalance.max_rate` and `kv.snapshot_recovery.max_rate` are set too high for the cluster during scaling, this can cause nodes to experience ingestions faster than compactions can keep up, and result in an [inverted LSM](architecture/storage-layer.html#inverted-lsms).
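To make the guidance above concrete, a temporary adjustment while adding nodes might look like the following sketch. The `64 MiB` value assumes a `32 MiB` default; confirm the default for your version before raising it, and return both settings to their defaults with the `RESET CLUSTER SETTING` statements shown earlier once maintenance is complete.

~~~ sql
-- Hypothetical temporary bump while scaling; stay within ~2x the default.
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '64 MiB';
SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '64 MiB';

-- Verify the values currently in effect.
SHOW CLUSTER SETTING kv.snapshot_rebalance.max_rate;
SHOW CLUSTER SETTING kv.snapshot_recovery.max_rate;
~~~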