
Expand production recommendations #2850

Merged
jseldess merged 3 commits into master from prod-improvements on Apr 3, 2018

Conversation

@jseldess (Contributor) commented Apr 2, 2018

  • Update cluster topology to mention locality and cover common
    cluster patterns. Addresses part of Basic topology patterns #2411.
  • Add a security section.
  • Add locality to manual deployment tutorials.

@cockroach-teamcity (Member) commented

This change is Reviewable

@jseldess (Contributor, Author) commented Apr 2, 2018

@bdarnell and @a-robinson, can either or both of you review this and help me get it to a decent state prior to the 2.0 release? Sorry for the short notice.

@jseldess requested review from bdarnell and a-robinson on April 2, 2018 21:55
@jseldess force-pushed the prod-improvements branch from d743989 to 706e277 on April 2, 2018 22:00
@jseldess force-pushed the prod-improvements branch from 706e277 to d1d6a55 on April 2, 2018 22:03
@bdarnell (Contributor) commented Apr 2, 2018

Reviewed 2 of 6 files at r1, 3 of 3 files at r2.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks pending.


v2.0/recommended-production-settings.md, line 34 at r1 (raw file):

- Use at least 3 nodes to ensure that the cluster can tolerate the failure of any one node.

- Although network latency between nodes should be minimal within a single datacenter (~1ms), it's recommended to set the `--locality` flag on each node to have the flexibility to customize [replication zones](configure-replication-zones.html) based on locality as you scale.

I wouldn't recommend setting the `--locality` flag in a single DC. You can always restart your nodes to add it later when you need it.
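
As a rough illustration of that restart path (not part of the PR; the host names, store path, join list, and locality values here are hypothetical), a node originally started without the flag can later be restarted with `--locality` added:

```shell
# Hypothetical node in a single datacenter, started without --locality:
cockroach start --insecure \
  --host=node1.example.com \
  --store=/mnt/cockroach-data \
  --join=node1.example.com,node2.example.com,node3.example.com

# Later, when locality becomes useful (e.g., before adding a second DC),
# stop the node and restart it against the same store with --locality added:
cockroach start --insecure \
  --host=node1.example.com \
  --store=/mnt/cockroach-data \
  --join=node1.example.com,node2.example.com,node3.example.com \
  --locality=region=us-east,datacenter=us-east-1
```

Done as a rolling restart across the nodes, this adds locality without taking the cluster offline.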


v2.0/recommended-production-settings.md, line 40 at r1 (raw file):

When deploying across multiple datacenters in one or more regions:

- Use at least 3 datacenters to ensure that the cluster can tolerate the failure of 1 entire datacenter.

This should probably be bulked up with explanations of how you need three datacenters to be able to survive the failure of one (a common misconception, as in this forum thread today).
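
The quorum arithmetic behind that point, sketched out (this illustration is not part of the PR text):

```
With the default replication factor of 3, a range needs 2 of its 3 replicas
available to reach quorum. Across only 2 datacenters, some datacenter must
hold 2 of the 3 replicas:

  DC-A: replica 1, replica 2      DC-B: replica 3

Losing DC-A leaves 1 of 3 replicas, below quorum, so those ranges become
unavailable. With 3 datacenters (and --locality set so replicas spread one
per datacenter), losing any single datacenter still leaves 2 of 3 replicas.
```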


v2.0/recommended-production-settings.md, line 46 at r1 (raw file):

    - To optimize read and write latency, consider using the enterprise [table partitioning](partitioning.html) feature.

- If you expect region-specific concurrency and load characteristics, consider using different numbers and types of nodes per region. For example, in the following scenario, the Central region is closer to the West region than to the East region, which means that write latency would be optimized for workloads in the West and Central regions.

What does this example have to do with "using different numbers and types of nodes per region"? I'm not seeing anything actionable here and would probably just remove this paragraph and diagram.


Comments from Reviewable

@a-robinson (Contributor) commented

Reviewed 2 of 6 files at r1, 3 of 3 files at r2.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


v2.0/recommended-production-settings.md, line 24 at r2 (raw file):

- When starting each node, use the [`--locality`](start-a-node.html#locality) flag to describe the node's location, for example, `--locality=region=west,datacenter=us-west-1`. The key-value pairs should be ordered from most to least inclusive, and the keys and order of key-value pairs must be the same on all nodes.
    - When there is high latency between nodes, CockroachDB uses locality to move range leases closer to the current workload, reducing network round trips and improving read performance, also known as ["follow-the-workload"](demo-follow-the-workload.html). Locality is also a prerequisite for using the [table partitioning](partitioning.html) and [**Node Map**](enable-node-map.html) enterprise features.

We probably also want to mention here that CockroachDB spreads the replicas of each piece of data across as diverse a set of localities as possible (unless configured to do otherwise via replication zones)
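
A hypothetical sketch of consistent `--locality` tiers across three datacenters (addresses and tier values invented for illustration; the keys and their order are identical on every node):

```shell
# Datacenter us-east-1
cockroach start --insecure --host=<node1 address> --join=<node1>,<node4>,<node7> \
  --locality=region=us-east,datacenter=us-east-1

# Datacenter us-west-1
cockroach start --insecure --host=<node4 address> --join=<node1>,<node4>,<node7> \
  --locality=region=us-west,datacenter=us-west-1

# Datacenter us-west-2
cockroach start --insecure --host=<node7 address> --join=<node1>,<node4>,<node7> \
  --locality=region=us-west,datacenter=us-west-2
```

With tiers like these, each range's replicas are spread across as diverse a set of localities as possible unless a replication zone says otherwise.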


v2.0/recommended-production-settings.md, line 43 at r2 (raw file):

- The round-trip latency between datacenters will have a direct effect on your cluster's performance, with cross-continent clusters performing noticeably worse than clusters in which all nodes are geographically close together.
    - To optimize read latency for the location from which most of the workload is originating, also known as ["follow-the-workload"](demo-follow-the-workload.html), set the `--locality` flag when starting each node. When deploying across more than 3 datacenters, to ensure that all data benefits from "follow-the-workload", you must also increase your replication factor to match the total number of datacenters.

Follow the workload is great and all, but the biggest reason to use localities when you have multiple datacenters is to ensure each piece of data ends up with one copy in each datacenter. If you don't set localities, we don't differentiate between the nodes and will almost certainly end up with 2 out of 3 copies for some important ranges in just one datacenter, meaning that losing the datacenter would cause an outage.
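
For instance, across 5 datacenters, pairing `--locality` with a replication factor of 5 gives each datacenter one replica of every range. A rough sketch assuming the v2.0-era `cockroach zone set` command (check configure-replication-zones.html for the exact invocation):

```shell
# Hypothetical: raise the default replication factor from 3 to 5 so that,
# with --locality set on all nodes, each of 5 datacenters holds one replica.
echo 'num_replicas: 5' | cockroach zone set .default --insecure -f -
```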


v2.0/recommended-production-settings.md, line 44 at r2 (raw file):

- The round-trip latency between datacenters will have a direct effect on your cluster's performance, with cross-continent clusters performing noticeably worse than clusters in which all nodes are geographically close together.
    - To optimize read latency for the location from which most of the workload is originating, also known as ["follow-the-workload"](demo-follow-the-workload.html), set the `--locality` flag when starting each node. When deploying across more than 3 datacenters, to ensure that all data benefits from "follow-the-workload", you must also increase your replication factor to match the total number of datacenters.
    - To optimize read and write latency, consider using the enterprise [table partitioning](partitioning.html) feature.

Minor suggestion, feel free to ignore: I'd reword this to "To optimize read and write latency for specific subsets of rows within a table, ..."
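
A hypothetical sketch of that row-subset optimization with enterprise partitioning (table, column, and constraint values invented; syntax should be checked against partitioning.html):

```shell
# Partition a table by a region column (the partition column must be a
# prefix of the primary key), so each region's rows form separate ranges:
cockroach sql --insecure -e "
CREATE TABLE test.users (
    region STRING NOT NULL,
    id INT NOT NULL,
    name STRING,
    PRIMARY KEY (region, id)
) PARTITION BY LIST (region) (
    PARTITION west VALUES IN ('us-west'),
    PARTITION east VALUES IN ('us-east')
);"

# Then pin a partition's replicas near its users via a replication zone
# (the constraint assumes nodes started with --locality tiers such as
# datacenter=us-west-1):
echo 'constraints: [+datacenter=us-west-1]' | \
  cockroach zone set test.users.west --insecure -f -
```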


Comments from Reviewable

Jesse Seldess added 2 commits April 3, 2018 11:12
- Update cluster topology to mention locality and cover common
  cluster patterns. Addresses part of #2411.
- Add a security section.
- Add locality to manual deployment tutorials.
@jseldess (Contributor, Author) commented Apr 3, 2018

I simplified the cluster topology changes, trying to take into account the feedback provided. PTAL, Ben and Alex.


Review status: all files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


v2.0/recommended-production-settings.md, line 40 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This should probably be bulked up with explanations of how you need three datacenters to be able to survive the failure of one (a common misconception, as in this forum thread today).

Tried to address this. PTAL.


v2.0/recommended-production-settings.md, line 46 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What does this example have to do with "using different numbers and types of nodes per region"? I'm not seeing anything actionable here and would probably just remove this paragraph and diagram.

OK, removed. I was trying to follow @rober-s-lee's guidance in #2411, but it seems best to review that with you or another engineer before attempting to document it.


v2.0/recommended-production-settings.md, line 24 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

We probably also want to mention here that CockroachDB spreads the replicas of each piece of data across as diverse a set of localities as possible (unless configured to do otherwise via replication zones)

Done. PTAL.


v2.0/recommended-production-settings.md, line 43 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Follow the workload is great and all, but the biggest reason to use localities when you have multiple datacenters is to ensure each piece of data ends up with one copy in each datacenter. If you don't set localities, we don't differentiate between the nodes and will almost certainly end up with 2 out of 3 copies for some important ranges in just one datacenter, meaning that losing the datacenter would cause an outage.

Tried to address this. PTAL.


v2.0/recommended-production-settings.md, line 44 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Minor suggestion, feel free to ignore: I'd reword this to "To optimize read and write latency for specific subsets of rows within a table, ..."

Done.


Comments from Reviewable

@jseldess force-pushed the prod-improvements branch from d1d6a55 to 35a7586 on April 3, 2018 17:59
@bdarnell (Contributor) commented Apr 3, 2018

:lgtm:


Review status: 3 of 4 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


Comments from Reviewable

@a-robinson (Contributor) commented

:lgtm:


Review status: 3 of 4 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


Comments from Reviewable

@jseldess (Contributor, Author) commented Apr 3, 2018

TFTRs, @bdarnell and @a-robinson. Did some final tweaking, but I don't think it's necessary to review again.

@jseldess merged commit 36f891e into master on Apr 3, 2018
@jseldess deleted the prod-improvements branch on April 3, 2018 19:46