Document how to check constraint conformance etc. #5679
Conversation
<span class="version-tag">New in v19.2:</span> Several new and updated tables (listed below) are available to help you verify your cluster's constraint conformance. For example, you can check the following types of constraints:

- See what data is under-replicated or unavailable
- Show localities that are at risk in the event of node failures
it's not about "node failure" (in fact, unfortunately, data that would become unavailable if a single node fails is not currently being reported at all).
I like the term "critical localities". The report tells you which localities you cannot afford to lose. It's important to note in this doc somewhere that having critical localities is generally not a bad thing. In particular, when you partition your data, you get them by definition. This table is supposed to be used to see if your expectations about what's critical match the reality.
Fixed by updating to use your explanation of critical localities (or, I hope - PTAL).
(I have another question about this critical localities stuff below.)
| zone_id | [`INT8`](int.html) | The ID of the [replication zone](configure-zone.html). |
| subzone_id | [`INT8`](int.html) | The ID of the subzone (i.e., [partition](partition-by.html)). |
| target | [`STRING`](string.html) | The "object" that the constraint is being applied to, e.g., `PARTITION us_west OF INDEX movr.public.users@primary`. |
| range_name | [`STRING`](string.html) | The human-readable printed representation of the range. Unimplemented. |
unimplemented? I think this is filled in for zones like the default one that don't refer to a database/table/...
Ah, that was my guess based on the fact that I was only seeing `NULL` for that column. I've removed the word "unimplemented" but did not try to add any more explanation. Do you think anything else is needed beyond calling it "the human-readable printed representation"?
(or just "printed representation"? since I removed all uses of "human-readable")
### Find out which databases and tables have at-risk ranges

By default, this geo-distributed demo cluster will not have any at-risk ranges. A range is considered "at-risk" if its data could be lost in case of a node failure.
s/node failure/
So just "if its data could be lost"? Or should it instead be "A range is considered at-risk if its data could be lost in the event of its entire locality failing", or something like that?
ALTER TABLE rides CONFIGURE ZONE USING num_replicas=9;
~~~

Once the statement above runs, the cluster will rebalance so that it's storing 9 replicas of each range underlying the `rides` table. This puts that data at risk since you are storing 9 copies of the table's data on only 9 nodes. If a node fails, you will be left with 8 replicas, and there will be no additional nodes to upreplicate the range to, to fulfill the `num_replicas=9` requirement.
If a node fails, you will be left with 8 replicas, and there will be no additional nodes to upreplicate the range to, to fulfill the `num_replicas=9` requirement.
I don't think this is right. There might be a misunderstanding about what makes a locality be critical. It's not about whether the replication factor could still be met. In a setup with 3 localities, 3 nodes per locality, and everything replicated 9x, no locality is critical. You can lose any one locality without losing quorum for any ranges.
I think there is a misunderstanding :-)
I don't think I understand yet what makes a locality be critical beyond what you said earlier re: "you can't afford to lose it" (?)
To use a concrete example:
Given the 9-node cluster you get from `cockroach demo movr --geo-partitioned-replicas`, running `SELECT * FROM system.replication_critical_localities ORDER BY locality, at_risk_ranges DESC;` returns
zone_id | subzone_id | locality | report_id | at_risk_ranges
+---------+------------+---------------------+-----------+----------------+
55 | 9 | region=europe-west1 | 2 | 9
54 | 6 | region=europe-west1 | 2 | 6
56 | 3 | region=europe-west1 | 2 | 3
53 | 3 | region=europe-west1 | 2 | 3
58 | 3 | region=europe-west1 | 2 | 3
55 | 6 | region=us-east1 | 2 | 9
54 | 4 | region=us-east1 | 2 | 6
56 | 2 | region=us-east1 | 2 | 3
53 | 2 | region=us-east1 | 2 | 3
58 | 2 | region=us-east1 | 2 | 3
55 | 3 | region=us-west1 | 2 | 9
54 | 2 | region=us-west1 | 2 | 6
56 | 1 | region=us-west1 | 2 | 3
58 | 1 | region=us-west1 | 2 | 3
53 | 1 | region=us-west1 | 2 | 3
(15 rows)
Does this mean that, for this cluster, we cannot afford to lose any one of these 3 localities without losing the data on the ranges in that locality? Hence each of the 3 localities in this demo cluster is "critical" by our definition?
Is there an example topology/configuration that would result in only a subset of a cluster's localities being critical?
Maybe a better way to ask the above is: is there an example of me running this report where I discover a "surprising" output, where localities are critical that I did not expect to be?
(sorry if dumb questions, but I am not that familiar with this area)
Reviewed 7 of 7 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rmloveland)
v19.2/verify-constraint-conformance.md, line 7 at r1 (raw file):
toc: true ---
This is an experimental feature and should be marked as such.
Additionally, the user should be made aware that direct access to `system` tables is not a supported way to inspect CockroachDB.
We should also inform them that we're committed to add stable ways to inspect these reports in the future, notably via SHOW statements and/or views and built-in functions in the `crdb_internal` schema.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)
v19.2/verify-constraint-conformance.md, line 7 at r1 (raw file):
Previously, knz (kena) wrote…
This is an experimental feature and should be marked as such.
Additionally, the user should be made aware that direct access to `system` tables is not a supported way to inspect CockroachDB.
We should also inform them that we're committed to add stable ways to inspect these reports in the future, notably via SHOW statements and/or views and built-in functions in the `crdb_internal` schema.
Thanks Raphael - I have added this information to the page.
Thanks, no further comment from me.
Reviewed 3 of 3 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rmloveland)
v19.2/run-replication-reports.md, line 21 at r2 (raw file):
**This is an experimental feature.** The interface and output is subject to change. In particular, the direct access to `system` tables shown here will not be a supported way to inspect CockroachDB in future versions. We're committed to add stable ways to inspect these replication reports in the future, likely via `SHOW` statements and/or [views](views.html) and [built-in functions](functions-and-operators.html) in the `crdb_internal` schema.
The thing is I'm not sure who the "we" in the "we're committed" sentence is :p
@knz you've opposed having views in the `system` database, but can you remind me why? It seems to me that we could transparently migrate these tables to views if we need to change their schema.
The thing is I'm not entirely sure about the way forward here, since I don't think that `crdb_internal` is a very good namespace for these tables.
v19.2/run-replication-reports.md, line 38 at r2 (raw file):
{% include copy-clipboard.html %} ~~~ sql SET CLUSTER setting kv.replication_reports.INTERVAL = '00:05:00';
why capitalize INTERVAL?
v19.2/run-replication-reports.md, line 48 at r2 (raw file):
The replication reports are only generated for zones that meet the following criteria:
- They override some replication attributes compared to their parent zone
I'd try to explain somehow that ranges falling into zones that don't get their own report entries count towards the first ancestor that does
v19.2/run-replication-reports.md, line 49 at r2 (raw file):
- They override some replication attributes compared to their parent zone
- They don't have a parent zone
well, everybody has a parent zone except for the "default" one, so I'd just refer to that one in particular. I should have phrased my original comment like that too.
v19.2/run-replication-reports.md, line 76 at r2 (raw file):
| subzone_id | [`INT8`](int.html) | The ID of the subzone (i.e., [partition](partition-by.html)). |
| report_id | [`INT8`](int.html) | The ID of the [report](#system-reports_meta) that generated this data. |
| total_ranges | [`INT8`](int.html) | Total [ranges](architecture/overview.html#architecture-range) in this zone and any of its children that have not overwritten the [necessary attributes](#necessary-attributes). |
language seems to be inconsistent here - you describe this field one way and the following two in a different way, although they're analogous. I prefer the latter, FWIW.
v19.2/run-replication-reports.md, line 143 at r2 (raw file):
### crdb_internal.zones
The `crdb_internal.zones` table is useful for:
perhaps for later, but I think each table should have its own page (or one huge page for all tables). In any case, this is a fairly random place to document the zones table.
v19.2/verify-constraint-conformance.md, line 8 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Fixed by renaming the page as "Run replication reports", and changing everything else to use that name.
I don't know about the "run" part. Depending on what you mean by "run", it's not the user that's running these reports. The cluster is "running" them automatically. If anything, maybe "inspect" or "query".
v19.2/verify-constraint-conformance.md, line 11 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Fixed by updating to use your explanation of critical localities (or, I hope - PTAL).
(I have another question about this critical localities stuff below.)
s/data loss/data unavailability
v19.2/verify-constraint-conformance.md, line 26 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Hm, I don't understand. From my user POV, I don't really know what "zone IDs" mean unless I can get at the table names etc. where the data "is located" (from my POV) that I can access from SQL. I only introduced the use of `crdb_internal.zones` because it allowed me to map from zone IDs to databases and table names. Are you saying any or all of:
- the support for this mapping is not correct?
- we should not try to do the mapping in this doc? (i.e., remove use of `crdb_internal.zones`?)
- something completely else (?)
nvm, I think the text is fine
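For reference, a minimal sketch of the kind of zone-ID-to-table-name mapping being discussed here, assuming the `zone_id`/`subzone_id` columns shown elsewhere in this thread line up across the two tables:

~~~ sql
-- Sketch only: map critical-locality report rows back to database/table names
-- via crdb_internal.zones; the join keys are an assumption based on this thread.
SELECT z.database_name, z.table_name, c.locality, c.at_risk_ranges
FROM system.replication_critical_localities AS c
JOIN crdb_internal.zones AS z
  ON c.zone_id = z.zone_id AND c.subzone_id = z.subzone_id
ORDER BY c.at_risk_ranges DESC;
~~~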
v19.2/verify-constraint-conformance.md, line 28 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Fixed, thanks. Added the example of tweaking the cluster setting as well.
thanks for adding the example. I still have a nit - I think that syntax is not the most common. Just use `'5m'`.
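A sketch of the suggested form, assuming the `kv.replication_reports.interval` cluster setting discussed above:

~~~ sql
-- Sketch: regenerate the replication reports every five minutes ('5m' is the
-- duration shorthand suggested above; the exact value is illustrative).
SET CLUSTER SETTING kv.replication_reports.interval = '5m';
~~~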
v19.2/verify-constraint-conformance.md, line 59 at r1 (raw file):
Where can I learn more about what is a "critical locality"? (other than your previous comment)
Nowhere. It's a new concept introduced with the occasion of this table.
(I did find lots of "locality tier" comments in the code so I'll read them and learn something about that in the meantime.)
Localities are comma-separated strings, like `region=eu,dc=paris`. The tiers are the comma-separated parts. The docs for the locality flag should probably use this term I think.
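A small illustration of the tier idea, assuming you are connected to a node started with a two-tier locality such as `region=eu,dc=paris`:

~~~ sql
-- Sketch: SHOW LOCALITY reports the locality of the node you are connected to,
-- e.g. region=eu,dc=paris, where "region" and "dc" are the tiers.
SHOW LOCALITY;
~~~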
v19.2/verify-constraint-conformance.md, line 130 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
(or just "printed representation"? since I removed all uses of "human-readable")
I dunno, it's a pretty funky field. I'd say "The zone's name".
v19.2/verify-constraint-conformance.md, line 137 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
I'll remove the `raw_*` columns but FWIW they do show up in the default output of `SELECT * FROM crdb_internal.zones;`
It returns the columns
- zone_id
- subzone_id
- target
- range_name
- database_name
- table_name
- index_name
- partition_name
- raw_config_yaml
- raw_config_sql
- raw_config_protobuf
- full_config_yaml
- full_config_sql
(This is in version `CockroachDB CCL v19.2.0-rc.1 (x86_64-apple-darwin14, built 2019/10/14 18:28:24, go1.12.5)`)
if they're not hidden, then I think they should be indeed documented.
v19.2/verify-constraint-conformance.md, line 394 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
So just "if its data could be lost"? Or should it instead be "A range is considered at-risk if its data could be lost in the event of its entire locality failing", or something like that?
First, it's not about "data being lost" but about "data becoming unavailable" (i.e. quorum being lost). If the locality comes back, your data is back (it hasn't been "lost"). I'd change this in the sentence below.
More below.
v19.2/verify-constraint-conformance.md, line 403 at r1 (raw file):
A locality is "critical" for a range if all the nodes in that locality becoming unreachable would cause the range to become unavailable. In other words, the locality contains a majority of the range's replicas.
In the example above, yes, we cannot afford to lose any localities while maintaining availability to all the data. But in this example, there's nothing bad about that; that's what you've asked for; you've asked for some data to be stored exclusively in `us-east1`, so naturally if `us-east1` goes away then that data becomes unavailable.
However, if you had configured tiered localities - for example if you had split `us-east1` into {`region=us-east1,dc=dc1`, `region=us-east1,dc=dc2`, `region=us-east1,dc=dc3`}, then you wouldn't expect any dc to be critical (because the cluster "diversifies" each range as much as possible). So, if you were to see a dc as being critical, you'd be surprised and you'd take some action. For example, perhaps the diversification process is failing because some localities are filled to capacity - there's no disk space to move replicas there.
Is there an example topology/configuration that would result in only a subset of a cluster's localities being critical?
For example, if you pin a table to a locality `a`, but everything else is more diversified, then only `a` is critical.
I'm happy to explain more offline.
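To illustrate the pinning scenario, a hypothetical sketch (the table `t` and the `region=a` constraint are illustrative, not from the draft):

~~~ sql
-- Sketch: pinning all of t's replicas to region "a" makes that locality
-- critical for t's ranges by design, while more diversified data is unaffected.
ALTER TABLE t CONFIGURE ZONE USING constraints = '[+region=a]';
~~~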
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @rmloveland)
v19.2/run-replication-reports.md, line 21 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
The thing is I'm not sure who the "we" in the "we're committed" sentence is :p
@knz you've opposed having views in the `system` database, but can you remind me why? It seems to me that we could transparently migrate these tables to views if we need to change their schema. The thing is I'm not entirely sure about the way forward here, since I don't think that `crdb_internal` is a very good namespace for these tables.
- "we" = CRL
- the sentence "add stable ways [..] likely via `SHOW` statements and/or views and built-in functions" remains accurate regardless of what we end up doing. Notice the word "likely" (which means we can do something else entirely still) and "and/or" which means SHOW may be the best way.
That's the followup and it will happen in 20.1 likely.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @rmloveland)
v19.2/run-replication-reports.md, line 21 at r2 (raw file):
Previously, knz (kena) wrote…
"we" = CRL
the sentence add stable ways [..] likely via
SHOW
statements and/or views and built-in functions remains accurate regardless of what we end up doing. Notice the word "likely" (which means we can do something else entirely still) and "and/or" which means SHOW may be the best way.That's the followup and it will happen in 20.1 likely.
I dunno, this warning kinda bothers me. I'd feel better if we agreed on what we want presently. The thing is, if indeed we don't want people using the `system.replication_foo` names, then I don't know if we should have pages and pages of documentation about them. People will use it despite this box over here, and in fact we ourselves will tell people to use it because Andy and others were on my case about these tables. We did them in a hurry because "we've promised them", etc. I don't want to say that there's no room for "experimental" features, but those I guess tend to have `experimental` in the name.
So I'd be inclined to do something to make you happy sooner rather than later, but the question is what. That's why I was asking your opinion about future views (or perhaps even future virtual tables) in the `system` database. Cause if that's a possibility, then it does give us a migration path and so people can probably just start using these names and we don't have to warn them.
The thing is, the UX design for this feature was simply not done. It's not that it was incorrect or that we were lacking the proper directions etc. I'd argue that it was simply not done, as in, nobody really cared about our API stability guidelines, the experimental-beta release process, etc. There's a process failure right there, and I am hard at work to figure out what to do about this so it does not happen in the future. In the meantime the proper thing to do is to acknowledge that the work was rushed and our current standard way to acknowledge this is to mark the thing as "experimental". But I don't have a quick answer for you as to what's going to happen afterwards i.e. how to make it non-experimental. There's something to be said about publicizing. There are multiple things to be said against publicizing. So in the meantime, let's keep it at the warning and the disclaimer.
That's actually a legitimate concern. If PMs weren't so keen on pushing docs on this new feature, so it can be marketed/advertised as a flagship new feature, I'd also recommend not documenting it — because it was rushed and it's not polished enough yet. But I do also acknowledge that there is so much push from sales and support to have something like this feature that it'd be silly to be completely silent. I think @rmloveland did fantastic work here and all the content in these pages will naturally adapt (with a simple substitution) when we have the adequate …
I think @rmloveland did fantastic work here and all the content in these pages will naturally adapt (with a simple substitution) ...
Thanks Raphael, I appreciate that!
Andrei, I've made more updates based on your feedback, PTAL at your convenience.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)
v19.2/run-replication-reports.md, line 38 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
why capitalize INTERVAL?
An accidental artifact of my editor's SQL mode. Fixed.
v19.2/run-replication-reports.md, line 48 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I'd try to explain somehow that ranges falling into zones that don't get their own report entries are count towards the first ancestor that does
Added some language to that effect.
v19.2/run-replication-reports.md, line 49 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
well, everybody has a parent zone except for the "default" one, so I'd just refer to that one in particular. I should have phrased my original comment like that too.
Thanks, updated.
v19.2/run-replication-reports.md, line 76 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
language seems to be inconsistent here - you describe this field one way and the following two in a different way, although they're analogous. I prefer the latter, FWIW.
Updated to use matching language, i.e., "Total ranges in the zone this report entry is referring to."
v19.2/run-replication-reports.md, line 143 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
perhaps for later, but I think each table should have its own page (or one huge page for all tables). In any case, this is fairly random place to document the zones table.
Agreed, it's pretty random! We have another docs issue #2957 tracking the work to document `crdb_internal` in general.
v19.2/verify-constraint-conformance.md, line 8 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I don't know about the "run" part. Depending on what you mean by "run", it's not the user that's running these reports. The cluster is "running" them automatically. If anything, maybe "inspect" or "query".
Updated to say "query the status of your cluster's data replication ..."
Prompted by your comment I also audited the other uses of "run" in this doc to make sure they only refer to the system running the reports automagically.
v19.2/verify-constraint-conformance.md, line 11 at r1 (raw file):
I have updated this and the other places where we describe critical localities with your very clear description
A locality is "critical" for a range if all of the nodes in that locality becoming unreachable would cause the range to become unavailable. In other words, the locality contains a majority of the range's replicas.
v19.2/verify-constraint-conformance.md, line 26 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nvm, I think the text is fine
Ah ok
v19.2/verify-constraint-conformance.md, line 28 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
thanks for adding the example. I still have a nit - I think that syntax is not the most common. Just use `'5m'`.
Updated.
v19.2/verify-constraint-conformance.md, line 59 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Where can I learn more about what is a "critical locality"? (other than your previous comment)
Nowhere. It's a new concept introduced with the occasion of this table.
(I did find lots of "locality tier" comments in the code so I'll read them and learn something about that in the meantime.)
Localities are comma-separated strings, like `region=eu,dc=paris`. The tiers are the comma-separated parts. The docs for the locality flag should probably use this term I think.
Thanks for the clarifications. I've updated:
- This section with a discussion of a critical locality, that it's not necessarily bad, and added a pointer to the big list of 'Topology Patterns' with a note to the effect of "make sure you configure a topology that will give you the resiliency you want"
- The discussions of localities here and on the 'Configure Replication Zones' and 'Start a Node' pages to mention locality tiers and add a general description of them
v19.2/verify-constraint-conformance.md, line 130 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I dunno, it's a pretty funky field. I'd say "The zone's name".
I've updated to say "the zone's name", but yeah that does seem pretty funky. I think we'll see some user confusion (including from this user :-} ) on the mismatch between the column being called `range_name` but it meaning "the zone's name".
v19.2/verify-constraint-conformance.md, line 137 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
if they're not hidden, then I think they should be indeed documented.
OK, added them back in.
v19.2/verify-constraint-conformance.md, line 394 at r1 (raw file):
Changed here and elsewhere to use your description
A locality is "critical" for a range if all of the nodes in that locality becoming unreachable would cause the range to become unavailable. In other words, the locality contains a majority of the range's replicas.
v19.2/verify-constraint-conformance.md, line 403 at r1 (raw file):
A locality is "critical" for a range if all the nodes in that locality becoming unreachable would cause the range to become unavailable. In other words, the locality contains a majority of the range's replicas.
Ah OK, that makes much more sense. Thanks for explaining it.
But in this example, there's nothing bad about that; that's what you've asked for; you've asked for some data to be stored exclusively in us-east1 so naturally if us-east1 goes away then that data becomes unavailable.
That makes sense now. I guess that's basically what we're saying (or implying, really) using different words on the "geo-partitioned replicas" topology pattern page (https://www.cockroachlabs.com/docs/v19.2/topology-geo-partitioned-replicas.html)
if you had configured tiered localities - for example if you had split us-east1 into {region=us-east1,dc=dc1, region=us-east1,dc=dc2, region=us-east1,dc=dc3}, then you wouldn't expect any dc to be critical (because the cluster "diversifies" each range as much as possible). So, if you were to see a dc as being critical, you'd be surprised and you'd take some action.
I've adapted this paragraph a bit and added it to the end of this section as a "for instance" to follow the main example, which I've also updated to say (basically) "by default, the movr geo-partitioned cluster has critical localities as is, but actually that's fine because it's expected in this topology pattern etc. etc."
PTAL though, I hope it's closer now.
@andreimatei are you ok with this doc after the latest round of changes? If so I'll pass along to Docs team for review so we can have it merged by next week's release. If not, I'm happy to make any necessary updates.
LGTM
looks really good
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @rmloveland)
v19.2/run-replication-reports.md, line 29 at r2 (raw file):
- [`system.replication_stats`](#system-replication_stats) shows information about whether data is under-replicated, over-replicated, or unavailable.
- [`system.replication_constraint_stats`](#system-replication_constraint_stats) shows a list of violations to any data placement requirements you've configured.
it's not a "list of violations". It's the status of each constraint - which hopefully is a-OK.
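A minimal sketch of checking that status, assuming direct access to the table as described elsewhere in this thread:

~~~ sql
-- Sketch: each row reports the status of one configured constraint;
-- ideally every constraint shows no violating ranges.
SELECT * FROM system.replication_constraint_stats;
~~~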
v19.2/run-replication-reports.md, line 31 at r2 (raw file):
- [`system.replication_constraint_stats`](#system-replication_constraint_stats) shows a list of violations to any data placement requirements you've configured.
- [`system.replication_critical_localities`](#system-replication_critical_localities) shows which localities in your cluster have data that is at risk in the event of node failures.
- [`system.reports_meta`](#system-reports_meta) lists the IDs and times when the replication reports were produced that generated the data in the `system.replication_*` tables.
The IDs are actually static. I wouldn't speak about them. I'd list this table first and simply say that the table "Contains metadata about the different reports. Currently the cluster periodically generates three reports - contained in `replication_stats`, `replication_constraint_stats` and `replication_critical_localities`. Currently the only metadata is the report's generation time."
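A sketch of what inspecting that metadata might look like, assuming the table can be queried directly as shown elsewhere in this thread:

~~~ sql
-- Sketch: reports_meta holds one row per report; the generation time is
-- currently the only metadata, per the comment above.
SELECT * FROM system.reports_meta;
~~~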
v19.2/run-replication-reports.md, line 49 at r2 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Thanks, updated.
Well, it's not that zones that are children of the `default` one get a report. It's that the `default` zone itself gets a report. So I'd remove this second bullet, and state separately that the `default` zone gets a report.
v19.2/run-replication-reports.md, line 62 at r3 (raw file):
### system.replication_stats
The `system.replication_stats` table shows information about whether data is under-replicated, over-replicated, or unavailable.
nit: s/table shows/report contains
Similar for the other tables. I think talking more abstractly about these "reports" is useful. The fact that they correspond to tables is obvious enough around here.
Do with this what you will.
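For example, a sketch of reading that report directly, under the same assumptions as the rest of this thread:

~~~ sql
-- Sketch: the replication_stats report contains per-zone counts of
-- under-replicated, over-replicated, and unavailable ranges.
SELECT * FROM system.replication_stats;
~~~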
v19.2/run-replication-reports.md, line 66 at r3 (raw file):
For an example using this table, see [Find out which databases and tables have under-replicated ranges](#find-out-which-databases-and-tables-have-under-replicated-ranges). {% include copy-clipboard.html %}
nit: I find the transition to the lists of columns (here and elsewhere) to be a bit abrupt. You'd think that this snippet is more related to "an example" linked above than to what comes next. Consider introducing a header for Columns, or something.
Ignore if this is just how we do things.
v19.2/run-replication-reports.md, line 75 at r3 (raw file):
| zone_id | [`INT8`](int.html) | The ID of the [replication zone](configure-zone.html). |
| subzone_id | [`INT8`](int.html) | The ID of the subzone (i.e., [partition](partition-by.html)). |
| report_id | [`INT8`](int.html) | The ID of the [report](#system-reports_meta) that generated this data. |
here and elsewhere I'd suggest that the `report_id` is the same for all the rows in the table (not across tables, though). I.e. this data is "part of that report". I.e. the totality of rows in the table / simply "the table" is the report.
v19.2/verify-constraint-conformance.md, line 8 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Updated to say "query the status of your cluster's data replication ..."
Prompted by your comment I also audited the other uses of "run" in this doc to make sure they only refer to the system running the reports automagically.
thanks! But what about the doc's title? :P
looks really good
Thanks Andrei! And thanks for the thorough review.
I've made all the updates, please take a look and let me know if I've misunderstood anything you asked to be done.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)
v19.2/run-replication-reports.md, line 29 at r2 (raw file):
Updated to
shows the status of any data placement constraints you've configured
v19.2/run-replication-reports.md, line 31 at r2 (raw file):
Makes sense. Can you live with the following?
contains metadata about the replication report data listed in the `system.replication_*` tables. Currently, the only metadata available is the report generation time.
v19.2/run-replication-reports.md, line 49 at r2 (raw file):
Deleted that bullet, and added the following to the bottom of the note, let me know if it works:
In addition to the above, the system's `default` zone always gets a report.
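For context, a sketch of inspecting that always-reported `default` zone (syntax as I understand it; worth verifying against the zone-configuration docs):

~~~ sql
-- Sketch: the cluster-wide default replication zone, which always gets
-- its own entry in the replication reports per the note above.
SHOW ZONE CONFIGURATION FOR RANGE default;
~~~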
v19.2/run-replication-reports.md, line 62 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: s/table shows/report contains
Similar for the other tables. I think talking more abstractly about these "reports" is useful. The fact that they correspond to tables is obvious enough around here.
Do with this what you will.
Thanks, I think it's still sinking in, slowly. :-)
I have executed s/table shows/report contains/g
I've also audited the other mentions of "`system.replication*` table" and changed them to say "report".
v19.2/run-replication-reports.md, line 66 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: I find the transition to the lists of columns (here and elsewhere) to be a bit abrupt. You'd think that this snipped is more related to "an example" linked above than to what comes next. Consider introducing a header for Columns, or something.
Ignore if this is just how we do things.
I agree with you. I think I put them there when I was just starting the draft as placeholders.
I've removed the `SHOW COLUMNS` lines since we have examples now, and added a small 'Columns' header for each report.
v19.2/run-replication-reports.md, line 75 at r3 (raw file):
OK. Here and elsewhere I've changed it to say the following - hopefully it's ~what you said
The ID of the report that generated all of the rows in this table
v19.2/verify-constraint-conformance.md, line 8 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
thanks! But what about the doc's title? :P
Aha! :-) Renamed to "Query Replication Reports", which feels a bit awkward but I'm open to suggestions.
Online preview: http://cockroach-docs-review.s3-website-us-east-1.amazonaws.com/372ecb174b3fca438abb368df1552de69f0a73aa/ Edited pages: …
@ericharmeling I think the engineering review is winding down, this should be ready for a look sometime Monday AM. 🤞
Great work. A lot of doc here! LGTM with some very very minor nits.
Reviewed 3 of 6 files at r3, 3 of 3 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @rmloveland)
v19.2/query-replication-reports.md, line 19 at r4 (raw file):
is subject to change
"are subject to change"?
v19.2/query-replication-reports.md, line 21 at r4 (raw file):
to add
"to adding"?
v19.2/query-replication-reports.md, line 164 at r4 (raw file):
Quoted 6 lines of code…
The examples shown below are each using a [geo-partitioned demo cluster](cockroach-demo.html#run-cockroach-demo-with-geo-partitioned-replicas) started with the following command:
{% include copy-clipboard.html %}
~~~ shell
cockroach demo movr --geo-partitioned-replicas
~~~
There is an include file for this, BTW. That file includes a `### Setup` header as well.
{% include {{page.version.version}}/sql/movr-statements-geo-partitioned-replicas.md %}
Thanks for the review, Eric!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @ericharmeling, and @knz)
v19.2/query-replication-reports.md, line 19 at r4 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
is subject to change
"are subject to change"?
Done.
v19.2/query-replication-reports.md, line 21 at r4 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
to add
"to adding"?
Done.
v19.2/query-replication-reports.md, line 164 at r4 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
The examples shown below are each using a [geo-partitioned demo cluster](cockroach-demo.html#run-cockroach-demo-with-geo-partitioned-replicas) started with the following command:
{% include copy-clipboard.html %}
~~~ shell
cockroach demo movr --geo-partitioned-replicas
~~~
There is an include file for this, BTW. That file includes a `### Setup` header as well.
`{% include {{page.version.version}}/sql/movr-statements-geo-partitioned-replicas.md %}`
👍
Online preview: http://cockroach-docs-review.s3-website-us-east-1.amazonaws.com/167a87fabdce7e6239dddf24586d3837916d0021/ Edited pages: …
Let's create a follow-up issue to link to this new doc from relevant geo-partitioning docs, etc.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @ericharmeling, @jseldess, and @knz)
Thanks Jesse! Created #5831 to track adding linkage, etc.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @ericharmeling, @knz, and @rmloveland)
v19.2/verify-constraint-conformance.md, line 8 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Aha! :-) Renamed to "Query Replication Reports", which feels a bit awkward but I'm open to suggestions.
I think we ended up on an unfortunate title here :). I just saw "Query Replication Reports" used in a document as the name of the feature (which is "replication reports"). The fact that "query" here was supposed to be a verb and not a noun got lost.
What do you think about renaming this to just Replication Reports? It looks like not all pages in these guides have actions in their titles.
Btw, do you think there's another place besides Guides in which we should list these tables?
Bummer! Understandable error, though.
I'm fine with it. @jseldess do you have any objection as long as we put in a redirect from the current name?
As far as the tables specifically, I'm not sure - I don't think we document system tables anywhere. However, I just took a look through some other docs, and I think we should write something about these reports on the main Configure Replication Zones page. Maybe a new section on that page that talks about the replication reports and links to the longer doc. What do you think of that approach? |
Created #6857 to work on the things mentioned in my previous note. Please comment over there if you have thoughts/opinions. |
Specifically, add docs for the new system tables that provide insight into:
- Whether data is under-replicated or unavailable
- Violations of data placement requirements
- Which localities have data that is at risk in the event of node failure
Fixes #4730.
Addresses #2957, #5636.