
kvserver: unexpected replication change from 3 to 2 voters #64064

Closed
tbg opened this issue Apr 22, 2021 · 5 comments · Fixed by #64170
Labels
A-kv-replication (Relating to Raft, consensus, and coordination.)
branch-master (Failures and bugs on the master branch.)
C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.)
GA-blocker

Comments

@tbg
Member

tbg commented Apr 22, 2021

Describe the problem

When @aliher1911 and I looked into unhappy restore2TB runs in the context of #61396, we noticed unexpected replication changes. From a three-voter configuration, for some reason we're demoting a voter:

[n7,s7,r39158/4:x] change replicas (add ‹[]› remove ‹[(n5,s5):1VOTER_DEMOTING_LEARNER]›): existing descriptor r39158:‹/Table/54/1/52061454{-/0}› [(n5,s5):1, (n7,s7):4, (n10,s10):3, next=5, gen=744, sticky=1617706884.328040814,0]

At that point, in the full snippet, the range goes unavailable because one of the two remaining voters (s10) is waiting for a snapshot.

To Reproduce

Run the `restore2TB/nodes=10` roachtest.

This should reproduce on any SHA preceding #64060, such as d85d49d, when running restore2TB. It may not always happen, but we saw it frequently, at least in "unhappy" runs (as characterized by large pending snapshot counts).

Expected behavior

With 10 live nodes and atomic replication changes, there should never be a reason to move from a three-voter to a two-voter configuration. The only explanation I have is that n5 might have been considered dead for 5 minutes, which would possibly trigger this issue (?!), but this is essentially ruled out by the full snippet, which indicates that n5 was live a minute after the botched replication change (and it is thus unlikely to have been non-live for the preceding minutes).

@aliher1911 if you have full logs from any of these experiments, mind going through them to see if you have other examples of such replication changes, and if so, posting the complete log directory (Google Drive)?


cc @cockroachdb/kv

@tbg tbg added the C-bug and A-kv-replication labels Apr 22, 2021
@aliher1911
Contributor

I still have the full logs from that run, and I'll check if other runs show it too, just in case. Will upload later.

@tbg
Member Author

tbg commented Apr 23, 2021

@aayushshah15
Contributor

I could reproduce this scenario pretty readily with a `log.Fatal` inside `changeReplicasImpl` when it was about to execute such a change. @dt also noticed this today and, on his cluster, the range log for the unfortunate range looked like the following (notice the 4th item from the top):

[screenshot of the range log, 2021-04-22: the 4th entry shows a lone voter removal with reason `admin request`]

It seemed to indicate that the caller of `ChangeReplicas` (whoever it was) was explicitly asking it to remove just that voter. The `Reason: admin request` part is also important because it tells us that it cannot be the allocator somehow making an incorrect decision, as the allocator would indicate something like `range over-replicated` as the `Reason`.

This pointed fingers at the callers of `AdminRelocateRange` (or a bug inside `AdminRelocateRange` itself, but I ruled this out). Sure enough, it turns out it's the store rebalancer calling `AdminRelocateRange` with slices that contain duplicate targets:

rebalancing r44266 (0.09 qps) from (n9,s9):1,(n5,s5):2,(n4,s4):3 to ‹[n10,s10 n8,s8 n8,s8]› to better balance load

I see what the bug is; it was introduced in #61600. I'm misusing a slice (:sweat:). I'll get a fix out for it ASAP.
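To make the failure mode concrete, here is a minimal Go sketch of the slice-aliasing pitfall at play. This is a hypothetical reconstruction of the bug class, not the actual `StoreRebalancer` code: once an `append` grows a slice past its capacity, the runtime allocates a new backing array, and any other variable that aliased the old one silently keeps a stale view.

```go
package main

import "fmt"

func main() {
	// Two variables referring to one slice, as in the bug description.
	targets := make([]string, 1, 2)
	targets[0] = "n9,s9"
	alias := targets // shares the backing array, for now

	// While len stays within capacity, writes through one variable are
	// visible through the other: they share a single backing array.
	targets = append(targets, "n5,s5")
	targets[0] = "n10,s10"
	fmt.Println(alias[0]) // n10,s10 -- still aliased

	// This append exceeds cap, so Go silently allocates a new backing
	// array for `targets`; the two variables no longer alias each other.
	targets = append(targets, "n4,s4")
	targets[0] = "n8,s8"
	fmt.Println(alias[0]) // n10,s10 -- stale: the write landed elsewhere
}
```

A stale alias along these lines is plausibly how a target list with a repeated store, like the `[n10,s10 n8,s8 n8,s8]` above, could be assembled.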

Tobi, do you think we should mark this a release blocker until then?


Another unfortunate (but not "new") thing is that, because of this check, the store rebalancer seems to let itself relocate replicas with very low QPS (notice the QPS in my example^). This doesn't seem "bad" per se, because presumably the replicate queue would eventually do the same thing, but it feels to me like the store rebalancer should not be concerning itself with such a thing.

@aliher1911
Contributor

Logs from the run where we observed this behaviour: drive link
@tbg @aayushshah15

@tbg
Member Author

tbg commented Apr 23, 2021

Very nice, @aayushshah15! I'll mark this as a GA-blocker. It could well be a release blocker too - someone using a beta to test resilience might get miffed - but I think that is less than likely and if so should be easy to pin down if they contact us.

@tbg tbg added the branch-master, branch-release-21.1, and GA-blocker labels Apr 23, 2021
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Apr 24, 2021
Prior to this commit, we had a bug inside one of the methods used by the
`StoreRebalancer` where we had two variables referring to a slice that
was subsequently being appended to.

The code was making the implicit assumption that both of these slices
would point to the same underlying array, which is not true since any of
the additions mentioned above could result in the underlying array for
one of the slices being resized.

This commit fixes this bug.

Resolves cockroachdb#64064

Release note (bug fix): A bug in previous 21.1 betas allowed the store
rebalancer to spuriously downreplicate a range during normal operation.
This bug is now fixed.
craig bot pushed a commit that referenced this issue Apr 27, 2021
64170: kvserver: prevent the StoreRebalancer from downreplicating a range r=aayushshah15 a=aayushshah15

This PR contains 2 commits:

**kvserver: add safeguard against spurious calls to AdminRelocateRange**

This commit adds checks inside of `AdminRelocateRange` to fail if the
lists of relocation targets passed in by the caller contain duplicates.
This is supposed to act as a safeguard against catastrophic outcomes
like a range getting spuriously downreplicated due to malformed input.

Release note: None
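
As a rough illustration of the safeguard this first commit describes, here is a minimal sketch of what a duplicate-target check might look like. The type and function names are illustrative stand-ins, not CockroachDB's actual API:

```go
package main

import "fmt"

// Illustrative stand-ins for the real replication target types.
type StoreID int

type ReplicationTarget struct {
	NodeID  int
	StoreID StoreID
}

// validateNoDuplicateTargets fails fast if a relocation target list
// names the same store more than once.
func validateNoDuplicateTargets(targets []ReplicationTarget) error {
	seen := make(map[StoreID]struct{}, len(targets))
	for _, t := range targets {
		if _, dup := seen[t.StoreID]; dup {
			return fmt.Errorf("duplicate relocation target %+v", t)
		}
		seen[t.StoreID] = struct{}{}
	}
	return nil
}

func main() {
	// The malformed input from the log line above: s8 appears twice.
	targets := []ReplicationTarget{{10, 10}, {8, 8}, {8, 8}}
	fmt.Println(validateNoDuplicateTargets(targets))
}
```

Failing fast here turns a catastrophic downreplication into a loud, easily attributable error at the call site.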

**kvserver: prevent the StoreRebalancer from downreplicating a range**

Prior to this commit, we had a bug inside one of the methods used by the
`StoreRebalancer` where we had two variables referring to a slice that
was subsequently being appended to.

The code was making the implicit assumption that both of these slices
would point to the same underlying array, which is not true since any of
the additions mentioned above could result in the underlying array for
one of the slices being resized.

This commit fixes this bug.

Resolves #64064

Release note (bug fix): A bug in previous 21.1 betas allowed the store
rebalancer to spuriously downreplicate a range during normal operation.
This bug is now fixed.


Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Apr 27, 2021
Prior to this commit, we had a bug inside one of the methods used by the
`StoreRebalancer` where we had two variables referring to a slice that
was subsequently being appended to.

The code was making the implicit assumption that both of these slices
would point to the same underlying array, which is not true since any of
the additions mentioned above could result in the underlying array for
one of the slices being resized.

This commit fixes this bug.

Resolves cockroachdb#64064

Release note (bug fix): A bug in previous 21.1 betas allowed the store
rebalancer to spuriously downreplicate a range during normal operation.
This bug is now fixed.
@craig craig bot closed this as completed in b00f923 Apr 27, 2021
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue May 3, 2021
Previously, the `StoreRebalancer` would rebalance ranges off of a node if
it had more than the average number of replicas compared to the cluster
average. This was undesirable because the `StoreRebalancer` was not respecting
the {over,under}fullness thresholds that the replicateQueue does, which could
lead to redundant range relocations even in a well balanced cluster.

This commit is an opinionated fix to prevent this behavior and an attempt to
more clearly delineate the responsibilities of the replica rebalancing aspects
of the `StoreRebalancer` and the `replicateQueue`.

Noticed while looking into cockroachdb#64064.

Release note: None
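
For readers unfamiliar with the {over,under}fullness thresholds mentioned here, the following is a minimal sketch of mean-based thresholding. The constant and function names are assumptions for illustration, not the actual allocator code:

```go
package main

import "fmt"

// Tolerate a band around the cluster-mean range count before acting.
// The 5% figure is illustrative.
const rangeRebalanceThreshold = 0.05

func overfullRangeCount(mean float64) float64  { return mean * (1 + rangeRebalanceThreshold) }
func underfullRangeCount(mean float64) float64 { return mean * (1 - rangeRebalanceThreshold) }

// shouldRebalanceRanges reports whether a store's range count is far
// enough from the cluster mean to justify moving replicas at all.
func shouldRebalanceRanges(storeRanges, mean float64) bool {
	return storeRanges > overfullRangeCount(mean) ||
		storeRanges < underfullRangeCount(mean)
}

func main() {
	// A store slightly above the mean is left alone rather than
	// triggering redundant relocations in a well-balanced cluster.
	fmt.Println(shouldRebalanceRanges(102, 100)) // false: within the band
	fmt.Println(shouldRebalanceRanges(110, 100)) // true: overfull
}
```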
craig bot pushed a commit that referenced this issue May 4, 2021
64251: changefeedccl: support arrays in avro encoding r=HonoreDB a=HonoreDB

Arrays are one of a few remaining types we don't support
in avro changefeeds. This serializes them.
Sadly can't use @miretskiy's new optimization since we
need to load multiple maps with the same key into an array.

Release note (bug fix): Changefeeds on tables with array columns now support avro.

64559: kvserver: stop reconciling range count imbalance in the StoreRebalancer r=aayushshah15 a=aayushshah15

Previously, the `StoreRebalancer` would rebalance ranges off of a node if
it had more than the average number of replicas compared to the cluster
average. This was undesirable because the `StoreRebalancer` was not respecting
the {over,under}fullness thresholds that the replicateQueue does, which could
lead to redundant range relocations even in a well balanced cluster.

This commit is an opinionated fix to prevent this behavior and an attempt to
more clearly delineate the responsibilities of the replica rebalancing aspects
of the `StoreRebalancer` and the `replicateQueue`.

Noticed while looking into #64064.

Release note: None

64614: ui: update cluster-ui version r=maryliag a=maryliag

Update cluster-ui version to include changes made on cockroachdb/ui#294 and a minor fix to handle the case when no column is selected.

Release note (ui change): cluster-ui updated, now showing information about the database on the Statement Page and offering the ability to choose which columns to display.

64664: authors: add charlie to authors r=charlievieth a=charlievieth

Release note: none

64665: authors: add Wade Waldron to Authors r=jlinder a=WadeWaldron

Release note: None

64666: authors: add bryan@cockroachlabs.com to authors r=jlinder a=strangr525

release note: none

64668: authors: add theodore hyman to authors r=jlinder a=theodore-hyman

release note: None

Co-authored-by: Aaron Zinger <zinger@cockroachlabs.com>
Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Marylia Gutierrez <marylia@cockroachlabs.com>
Co-authored-by: Charlie Vieth <charlie@cockroachlabs.com>
Co-authored-by: Wade Waldron <wade@cockroachlabs.com>
Co-authored-by: Bryan Kwon <bryan@cockroachlabs.com>
Co-authored-by: Theodore Hyman <theodore@cockroachlabs.com>