
Expand production recommendations #2850

Merged
jseldess merged 3 commits into master from prod-improvements on Apr 3, 2018

Conversation

@jseldess (Contributor) commented Apr 2, 2018

  • Update cluster topology to mention locality and cover common
    cluster patterns. Addresses part of Basic topology patterns #2411.
  • Add a security section.
  • Add locality to manual deployment tutorials.

@cockroach-teamcity (Member) commented

This change is Reviewable

@jseldess (Contributor, Author) commented Apr 2, 2018

@bdarnell and @a-robinson, can either or both of you review this and help me get it to a decent state prior to the 2.0 release? Sorry for the short notice.

@jseldess requested review from bdarnell and a-robinson on April 2, 2018 21:55
@jseldess force-pushed the prod-improvements branch from d743989 to 706e277 on April 2, 2018 22:00
@jseldess force-pushed the prod-improvements branch from 706e277 to d1d6a55 on April 2, 2018 22:03
@bdarnell (Contributor) commented Apr 2, 2018

Reviewed 2 of 6 files at r1, 3 of 3 files at r2.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks pending.


v2.0/recommended-production-settings.md, line 34 at r1 (raw file):

- Use at least 3 nodes to ensure that the cluster can tolerate the failure of any one node.

- Although network latency between nodes should be minimal within a single datacenter (~1ms), it's recommended to set the `--locality` flag on each node to have the flexibility to customize [replication zones](configure-replication-zones.html) based on locality as you scale.

I wouldn't recommend setting the `--locality` flag in a single DC. You can always restart your nodes to add it later when you need it.
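
As a rough illustration of that restart path (not part of the PR; the host names, store path, join list, and locality values here are hypothetical), a node originally started without the flag can later be restarted with `--locality` added:

```shell
# Hypothetical node in a single datacenter, started without --locality:
cockroach start --insecure \
  --host=node1.example.com \
  --store=/mnt/cockroach-data \
  --join=node1.example.com,node2.example.com,node3.example.com

# Later, when locality becomes useful (e.g., before adding a second DC),
# stop the node and restart it against the same store with --locality added:
cockroach start --insecure \
  --host=node1.example.com \
  --store=/mnt/cockroach-data \
  --join=node1.example.com,node2.example.com,node3.example.com \
  --locality=region=us-east,datacenter=us-east-1
```

Done as a rolling restart across the nodes, this adds locality without taking the cluster offline.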


v2.0/recommended-production-settings.md, line 40 at r1 (raw file):

When deploying across multiple datacenters in one or more regions:

- Use at least 3 datacenters to ensure that the cluster can tolerate the failure of 1 entire datacenter.

This should probably be bulked up with explanations of how you need three datacenters to be able to survive the failure of one (a common misconception, as in this forum thread today).
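
The quorum arithmetic behind that point, sketched out (this illustration is not part of the PR text):

```
With the default replication factor of 3, a range needs 2 of its 3 replicas
available to reach quorum. Across only 2 datacenters, some datacenter must
hold 2 of the 3 replicas:

  DC-A: replica 1, replica 2      DC-B: replica 3

Losing DC-A leaves 1 of 3 replicas, below quorum, so those ranges become
unavailable. With 3 datacenters (and --locality set so replicas spread one
per datacenter), losing any single datacenter still leaves 2 of 3 replicas.
```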


v2.0/recommended-production-settings.md, line 46 at r1 (raw file):

    - To optimize read and write latency, consider using the enterprise [table partitioning](partitioning.html) feature.

- If you expect region-specific concurrency and load characteristics, consider using different numbers and types of nodes per region. For example, in the following scenario, the Central region is closer to the West region than to the East region, which means that write latency would be optimized for workloads in the West and Central regions.

What does this example have to do with "using different numbers and types of nodes per region"? I'm not seeing anything actionable here and would probably just remove this paragraph and diagram.


Comments from Reviewable

@a-robinson (Contributor) commented

Reviewed 2 of 6 files at r1, 3 of 3 files at r2.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


v2.0/recommended-production-settings.md, line 24 at r2 (raw file):

- When starting each node, use the [`--locality`](start-a-node.html#locality) flag to describe the node's location, for example, `--locality=region=west,datacenter=us-west-1`. The key-value pairs should be ordered from most to least inclusive, and the keys and order of key-value pairs must be the same on all nodes.
    - When there is high latency between nodes, CockroachDB uses locality to move range leases closer to the current workload, reducing network round trips and improving read performance, also known as ["follow-the-workload"](demo-follow-the-workload.html). Locality is also a prerequisite for using the [table partitioning](partitioning.html) and [**Node Map**](enable-node-map.html) enterprise features.

We probably also want to mention here that CockroachDB spreads the replicas of each piece of data across as diverse a set of localities as possible (unless configured to do otherwise via replication zones)
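
A hypothetical sketch of consistent `--locality` tiers across three datacenters (addresses and tier values invented for illustration; the keys and their order are identical on every node):

```shell
# Datacenter us-east-1
cockroach start --insecure --host=<node1 address> --join=<node1>,<node4>,<node7> \
  --locality=region=us-east,datacenter=us-east-1

# Datacenter us-west-1
cockroach start --insecure --host=<node4 address> --join=<node1>,<node4>,<node7> \
  --locality=region=us-west,datacenter=us-west-1

# Datacenter us-west-2
cockroach start --insecure --host=<node7 address> --join=<node1>,<node4>,<node7> \
  --locality=region=us-west,datacenter=us-west-2
```

With tiers like these, each range's replicas are spread across as diverse a set of localities as possible unless a replication zone says otherwise.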


v2.0/recommended-production-settings.md, line 43 at r2 (raw file):

- The round-trip latency between datacenters will have a direct effect on your cluster's performance, with cross-continent clusters performing noticeably worse than clusters in which all nodes are geographically close together.
    - To optimize read latency for the location from which most of the workload is originating, also known as ["follow-the-workload"](demo-follow-the-workload.html), set the `--locality` flag when starting each node. When deploying across more than 3 datacenters, to ensure that all data benefits from "follow-the-workload", you must also increase your replication factor to match the total number of datacenters.

Follow the workload is great and all, but the biggest reason to use localities when you have multiple datacenters is to ensure each piece of data ends up with one copy in each datacenter. If you don't set localities, we don't differentiate between the nodes and will almost certainly end up with 2 out of 3 copies for some important ranges in just one datacenter, meaning that losing the datacenter would cause an outage.
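
For instance, across 5 datacenters, pairing `--locality` with a replication factor of 5 gives each datacenter one replica of every range. A rough sketch assuming the v2.0-era `cockroach zone set` command (check configure-replication-zones.html for the exact invocation):

```shell
# Hypothetical: raise the default replication factor from 3 to 5 so that,
# with --locality set on all nodes, each of 5 datacenters holds one replica.
echo 'num_replicas: 5' | cockroach zone set .default --insecure -f -
```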


v2.0/recommended-production-settings.md, line 44 at r2 (raw file):

- The round-trip latency between datacenters will have a direct effect on your cluster's performance, with cross-continent clusters performing noticeably worse than clusters in which all nodes are geographically close together.
    - To optimize read latency for the location from which most of the workload is originating, also known as ["follow-the-workload"](demo-follow-the-workload.html), set the `--locality` flag when starting each node. When deploying across more than 3 datacenters, to ensure that all data benefits from "follow-the-workload", you must also increase your replication factor to match the total number of datacenters.
    - To optimize read and write latency, consider using the enterprise [table partitioning](partitioning.html) feature.

Minor suggestion, feel free to ignore: I'd reword this to "To optimize read and write latency for specific subsets of rows within a table, ..."
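
A hypothetical sketch of that row-subset optimization with enterprise partitioning (table, column, and constraint values invented; syntax should be checked against partitioning.html):

```shell
# Partition a table by a region column (the partition column must be a
# prefix of the primary key), so each region's rows form separate ranges:
cockroach sql --insecure -e "
CREATE TABLE test.users (
    region STRING NOT NULL,
    id INT NOT NULL,
    name STRING,
    PRIMARY KEY (region, id)
) PARTITION BY LIST (region) (
    PARTITION west VALUES IN ('us-west'),
    PARTITION east VALUES IN ('us-east')
);"

# Then pin a partition's replicas near its users via a replication zone
# (the constraint assumes nodes started with --locality tiers such as
# datacenter=us-west-1):
echo 'constraints: [+datacenter=us-west-1]' | \
  cockroach zone set test.users.west --insecure -f -
```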


Comments from Reviewable

Jesse Seldess added 2 commits April 3, 2018 11:12
- Update cluster topology to mention locality and cover common
  cluster patterns. Addresses part of #2411.
- Add a security section.
- Add locality to manual deployment tutorials.
@jseldess (Contributor, Author) commented Apr 3, 2018

I simplified the cluster topology changes, trying to take into account the feedback provided. PTAL, Ben and Alex.


Review status: all files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


v2.0/recommended-production-settings.md, line 40 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This should probably be bulked up with explanations of how you need three datacenters to be able to survive the failure of one (a common misconception, as in this forum thread today).

Tried to address this. PTAL.


v2.0/recommended-production-settings.md, line 46 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What does this example have to do with "using different numbers and types of nodes per region"? I'm not seeing anything actionable here and would probably just remove this paragraph and diagram.

OK, removed. I was trying to follow @rober-s-lee's guidance in #2411, but it seems best to review that with you or another engineer before attempting to document it.


v2.0/recommended-production-settings.md, line 24 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

We probably also want to mention here that CockroachDB spreads the replicas of each piece of data across as diverse a set of localities as possible (unless configured to do otherwise via replication zones)

Done. PTAL.


v2.0/recommended-production-settings.md, line 43 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Follow the workload is great and all, but the biggest reason to use localities when you have multiple datacenters is to ensure each piece of data ends up with one copy in each datacenter. If you don't set localities, we don't differentiate between the nodes and will almost certainly end up with 2 out of 3 copies for some important ranges in just one datacenter, meaning that losing the datacenter would cause an outage.

Tried to address this. PTAL.


v2.0/recommended-production-settings.md, line 44 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Minor suggestion, feel free to ignore: I'd reword this to "To optimize read and write latency for specific subsets of rows within a table, ..."

Done.


Comments from Reviewable

@jseldess force-pushed the prod-improvements branch from d1d6a55 to 35a7586 on April 3, 2018 17:59
@bdarnell (Contributor) commented Apr 3, 2018

:lgtm:


Review status: 3 of 4 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


Comments from Reviewable

@a-robinson (Contributor) commented

:lgtm:


Review status: 3 of 4 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


Comments from Reviewable

@jseldess (Contributor, Author) commented Apr 3, 2018

TFTRs, @bdarnell and @a-robinson. Did some final tweaking, but I don't think it's necessary to review again.

@jseldess merged commit 36f891e into master on Apr 3, 2018
@jseldess deleted the prod-improvements branch on April 3, 2018 19:46