kvserver: rebalancing between stores on the same node fails #60545

lunevalex · 2021-02-12T21:59:02Z

Describe the problem

This problem was reported by @dankinder on the forum (https://forum.cockroachlabs.com/t/under-replicated-ranges-after-decommission/4239/3). A range has the descriptor (n1,s5):1, (n18,s51):2, (n7,s20):3, (n1,s2):4LEARNER. The allocator attempts to remove the replica on (n1,s2), but the removal fails. This is a valid operation and should be allowed.

To Reproduce

This has been reproduced in TestValidateReplicationChanges, by reversing the order of removal operations in Test Case 14.

Expected behavior

The removal of the replica on (n1, s2) should be allowed, as it returns the cluster to a healthy state.
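To make the expectation concrete, here is a minimal sketch in Go. It is not the kvserver validation code: `ReplicaDesc`, `canRemove`, and `canRemoveFirstMatchOnly` are hypothetical names invented for this illustration. It only shows how a check that stops at the first replica found on a node can reject the removal of a second replica on that node, while a check that scans the whole descriptor accepts it regardless of order.

```go
package main

import "fmt"

// ReplicaDesc is a hypothetical, simplified stand-in for a replica
// descriptor entry such as (n1,s2):4LEARNER.
type ReplicaDesc struct {
	NodeID  int
	StoreID int
	Learner bool
}

// canRemoveFirstMatchOnly illustrates an order-dependent check (an
// assumption about how such a bug could look, not the real code): it
// inspects only the first replica it finds on the node.
func canRemoveFirstMatchOnly(replicas []ReplicaDesc, nodeID, storeID int) bool {
	for _, r := range replicas {
		if r.NodeID == nodeID {
			return r.StoreID == storeID
		}
	}
	return false
}

// canRemove scans every replica, so the answer does not depend on where
// the target replica appears in the descriptor.
func canRemove(replicas []ReplicaDesc, nodeID, storeID int) bool {
	for _, r := range replicas {
		if r.NodeID == nodeID && r.StoreID == storeID {
			return true
		}
	}
	return false
}

func main() {
	// The descriptor from this issue.
	desc := []ReplicaDesc{
		{NodeID: 1, StoreID: 5},
		{NodeID: 18, StoreID: 51},
		{NodeID: 7, StoreID: 20},
		{NodeID: 1, StoreID: 2, Learner: true},
	}
	fmt.Println(canRemoveFirstMatchOnly(desc, 1, 2)) // false: stops at (n1,s5)
	fmt.Println(canRemove(desc, 1, 2))               // true
}
```

With the descriptor above, the order-dependent variant can still remove (n1,s5) but not (n1,s2), which mirrors the "first replica but not the second" symptom.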

…plicas already exist on the same node

Fixes cockroachdb#60545

The allocator in some cases allows for a range to have a replica on multiple stores of the same node. If that happens, it should allow itself to fix the situation by removing one of the offending replicas. This was only half working due to an ordering problem in how the replicas appeared in the descriptor. It could remove the first replica, but not the second one.

Release note: None

59865: sql: add schema_name,table_id to crdb_internal.ranges r=rafiss a=jordanlewis

... and crdb_internal.ranges_no_leases

Closes #59601.

This commit adds schema_name to crdb_internal.ranges and crdb_internal.ranges_no_leases to ensure that it's possible to disambiguate between ranges that are contained by a table with the same name in two different user-defined schemas.

In addition, it also adds the table_id column which allows unambiguous lookups of ranges for a given table id. This will also enable making a virtual index on the table_id column later, which should be a nice win for some introspection commands.

Release note (sql change): add the schema_name and table_id columns to the crdb_internal.ranges and crdb_internal.ranges_no_leases virtual tables.

60546: kvserver: improve handling for removal of a replica, when multiple replicas already exist on the same node r=lunevalex a=lunevalex

Fixes #60545

The allocator in some cases allows for a range to have a replica on multiple stores of the same node. If that happens, it should allow itself to fix the situation by removing one of the offending replicas. This was only half working due to an ordering problem in how the replicas appeared in the descriptor. It could remove the first replica, but not the second one.

Release note: None

60561: geo/wkt: simplify parser grammar and improve error messages r=otan a=andyyang890

This patch simplifies the yacc grammar for the WKT parser and also improves the error messages for mixed dimensionality problems.

Refs: #53091

Release note: None

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Alex Lunev <alexl@cockroachlabs.com>
Co-authored-by: Andy Yang <ayang@cockroachlabs.com>

…plicas already exist on the same node

Fixes cockroachdb#60545

The allocator in some cases allows for a range to have a replica on multiple stores of the same node. If that happens, it should allow itself to fix the situation by removing one of the offending replicas. This was only half working due to an ordering problem in how the replicas appeared in the descriptor. It could remove the first replica, but not the second one.

Release note (bug fix): 20.2 introduced the ability to rebalance replicas between multiple stores on the same node. This change fixes a problem with that feature, where occasionally an intra-node rebalance would fail and a range would get stuck permanently under-replicated.

60633: release-20.2: kvserver: improve handling for removal of a replica, when multiple replicas already exist on the same node r=aayushshah15 a=lunevalex

Backport 1/1 commits from #60546.

/cc @cockroachdb/release

---

Fixes #60545

The allocator in some cases allows for a range to have a replica on multiple stores of the same node. If that happens, it should allow itself to fix the situation by removing one of the offending replicas. This was only half working due to an ordering problem in how the replicas appeared in the descriptor. It could remove the first replica, but not the second one.

Release note: None

Co-authored-by: Alex Lunev <alexl@cockroachlabs.com>

lunevalex added the C-bug label (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) Feb 12, 2021

lunevalex self-assigned this Feb 12, 2021

nvanbenschoten mentioned this issue Feb 12, 2021

Unresolvable "Raft log too large" state #60538

Closed

lunevalex mentioned this issue Feb 12, 2021

kvserver: improve handling for removal of a replica, when multiple replicas already exist on the same node #60546

Merged

craig bot closed this as completed in 306d2e9 Feb 16, 2021

lunevalex mentioned this issue Feb 16, 2021

release-20.2: kvserver: improve handling for removal of a replica, when multiple replicas already exist on the same node #60633

Merged

nvanbenschoten mentioned this issue Feb 27, 2021

kv: in v20.2, bootstrapping multiple stores can result in duplicate store IDs #61218

Closed

4 tasks
