rpc: don't enable internalClientAdapter until rpcServer is operational #39497

Conversation

nvanbenschoten
Member

Fixes #39415.

This commit fixes the issue we saw in #39415, where a node would attempt
to allocate a store ID for a new store on an existing node and would
get stuck. It would get stuck because the KV request to allocate the
store ID could be routed to the node's own replica for Range 1 (when
doing a range descriptor lookup) over the node's internalClientAdapter.
That KV request could then get stuck attempting to acquire a lease,
because we would still be blocking all Raft RPC traffic (see
rpcsAllowedWhileBootstrapping). As a result, the KV request would stall
indefinitely.
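
For readers less familiar with this code path, here is a minimal sketch
of the kind of gating that rpcsAllowedWhileBootstrapping performs. The
names and types below are hypothetical, not the actual CockroachDB
implementation: while the node is still initializing, only a small
allowlist of RPC methods is served and everything else, including Raft
traffic, is rejected at the server.

```go
// Sketch only; hypothetical names, not the actual CockroachDB code.
package rpcgate

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type rpcGate struct {
	operational atomic.Bool     // flipped to true once node init completes
	allowlist   map[string]bool // e.g. only the bootstrap/init RPCs
}

// unaryInterceptor rejects any RPC that is not on the allowlist while
// the node is still initializing.
func (g *rpcGate) unaryInterceptor(
	ctx context.Context, req interface{},
	info *grpc.UnaryServerInfo, handler grpc.UnaryHandler,
) (interface{}, error) {
	if !g.operational.Load() && !g.allowlist[info.FullMethod] {
		return nil, status.Error(codes.Unavailable, "node waiting for init; RPC not available")
	}
	return handler(ctx, req)
}
```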

The fix is to tie the operational status of the internalClientAdapter
to that of the rpcServer. If we don't want to serve external KV traffic
during this node-initialization period, then we shouldn't serve
internal KV traffic either.
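
The shape of the fix, sketched with the same hypothetical names as the
gate snippet above (batchRequest, batchResponse, and internalServer are
placeholders for the real Internal RPC types): the internalClientAdapter
consults the same operational signal as the rpcServer, so internal KV
requests fail fast during initialization instead of being routed into a
local replica that cannot acquire a lease.

```go
// Sketch of the fix, reusing the rpcGate from the previous snippet.
// Types here are placeholders, not the real Internal RPC types.
type internalServer interface {
	Batch(ctx context.Context, ba *batchRequest) (*batchResponse, error)
}

type batchRequest struct{}
type batchResponse struct{}

// internalClientAdapter routes "RPCs" to the local server without going
// over the network. It now refuses to do so until the rpcServer is
// operational, mirroring what a remote client would see.
type internalClientAdapter struct {
	gate   *rpcGate
	server internalServer
}

func (a internalClientAdapter) Batch(
	ctx context.Context, ba *batchRequest,
) (*batchResponse, error) {
	if !a.gate.operational.Load() {
		return nil, status.Error(codes.Unavailable, "node waiting for init; internal client unavailable")
	}
	return a.server.Batch(ctx, ba)
}
```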

DNM: this still needs a higher-level test verifying that the bootstrap
hang we saw when adding a store to a node with existing stores is no
longer possible. I have only verified this manually so far.

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@nvanbenschoten
Member Author

@rkruze did you try out the binary I sent you with this commit in it? Did it work?

@rkruze

rkruze commented Aug 16, 2019

We did try this out and were able to add additional stores to an existing node. Thanks!

@nvanbenschoten
Member Author

@andreimatei I don't think the solution here is quite right. This works for multi-node clusters because it prevents the bootstrapping process from routing an IncrementRequest (for store ID allocation) to an existing Replica on the existing Store that is unable to handle the request (because its Raft RPC handler is disabled). Instead, it forces the request to go to a different node in the cluster.

However, in a single-node cluster, we actually do need to use the existing Store to allocate new store IDs, because it holds the only Replica for the Range that contains the store ID key. I think this worked by accident in the past, because a single-node cluster doesn't need to send any Raft RPCs to perform writes.

This indicates that the whole process is a little broken. If adding a new Store may rely on the existing Stores on a node being operational then this bootstrapping process is being run in the wrong order.
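
To make the routing concern concrete, here is a hypothetical sketch of the store-ID allocation pattern (illustrative names only, not the real cockroach code paths): the new store's ID comes from a KV Increment on a shared counter key, so the request must reach a replica of the range containing that key that can actually serve writes.

```go
// Hypothetical sketch (illustrative names only, not the real cockroach
// code paths) of the store-ID allocation pattern under discussion.
package storeidsketch

import "context"

// kvSender stands in for whatever routes the KV request: the node's own
// internalClientAdapter or a remote node reached over gRPC.
type kvSender interface {
	Increment(ctx context.Context, key string, delta int64) (int64, error)
}

// allocateStoreID reserves the next store ID by bumping a cluster-wide
// counter key. If the Increment is routed to a local replica whose Raft
// traffic is still blocked during bootstrap, the call hangs (the failure
// mode in #39415). On a single-node cluster there is no other replica to
// route to, which is the ordering problem raised above.
func allocateStoreID(ctx context.Context, kv kvSender, counterKey string) (int64, error) {
	return kv.Increment(ctx, counterKey, 1)
}
```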

@irfansharif
Contributor

We've changed how we allocate store IDs as of #52526, so we'll have to re-work what this looks like. Thankfully we have @TheSamHuang to take a fresh stab at #39415!

@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/internalClientDisable branch October 1, 2020 04:08