storage: support adding Stores to existing Nodes #39415
I'm able to reproduce this on master. I can see that we're correctly detecting the bootstrapped and empty engine here: cockroach/pkg/server/server.go Lines 1375 to 1379 in 311f7e4
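To make the check being referenced a bit more concrete, here is a minimal Go sketch of classifying storage engines at startup into "already bootstrapped" and "empty, needs a store ID". The `Engine` interface, `ReadClusterIdent`, and `ErrNotBootstrapped` are assumptions for illustration, not CockroachDB's actual API:

```go
package storeinit

import "errors"

// ErrNotBootstrapped is a hypothetical sentinel returned by an engine that
// has never been initialized with a store identity.
var ErrNotBootstrapped = errors.New("engine not bootstrapped")

// Engine is a stand-in for a storage engine handle.
type Engine interface {
	// ReadClusterIdent returns the persisted store identity, or
	// ErrNotBootstrapped if the engine is still empty.
	ReadClusterIdent() (storeID int32, err error)
}

// partitionEngines splits engines into those that already carry a store ID
// (bootstrapped) and those that are empty and still need one allocated.
func partitionEngines(engines []Engine) (bootstrapped, empty []Engine, err error) {
	for _, eng := range engines {
		if _, err := eng.ReadClusterIdent(); err != nil {
			if errors.Is(err, ErrNotBootstrapped) {
				empty = append(empty, eng)
				continue
			}
			return nil, nil, err
		}
		bootstrapped = append(bootstrapped, eng)
	}
	return bootstrapped, empty, nil
}
```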
The startup thread appears to be stuck here:
The range lookup that it is trying to perform is stuck two meta lookups up in
This is bringing up memories of #37906.
This write is stuck waiting for replication here:
Time to logspy and figure out what's going on on that Range.
The replica on the restarted node keeps trying to call a Raft election but it's never getting a response. We see this in:
Meanwhile, the other two nodes seem to be having a hard time sending Raft RPCs to node 3:
This looks to be due to the following restriction on the RPCs that can be served during bootstrapping: cockroach/pkg/server/grpc_server.go Lines 65 to 70 in 2b23499
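For readers unfamiliar with that restriction, here is a minimal Go sketch of the general pattern: a gRPC interceptor that rejects everything outside a small allow-list until the node is declared operational. The method names in the map and the `rpcGateway` type are illustrative assumptions, not the actual `grpc_server.go` implementation:

```go
package rpcgate

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// rpcsAllowedWhileBootstrapping lists the few RPCs a node will serve before
// it is fully initialized (illustrative subset only).
var rpcsAllowedWhileBootstrapping = map[string]bool{
	"/cockroach.rpc.Heartbeat/Ping":             true,
	"/cockroach.server.serverpb.Init/Bootstrap": true,
}

type rpcGateway struct {
	operational int32 // set to 1 once node initialization completes
}

func (g *rpcGateway) setOperational() { atomic.StoreInt32(&g.operational, 1) }

// intercept rejects any RPC not on the allow-list until the node is
// operational; everything else gets codes.Unavailable.
func (g *rpcGateway) intercept(
	ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler,
) (interface{}, error) {
	if atomic.LoadInt32(&g.operational) == 0 && !rpcsAllowedWhileBootstrapping[info.FullMethod] {
		return nil, status.Error(codes.Unavailable, "node waiting for init; this RPC is not yet available")
	}
	return handler(ctx, req)
}
```

With a gate like this in place, a restarted node's own Raft traffic (and any KV request that lands on a local replica) is refused until initialization finishes, which is exactly why the store-ID allocation described below can deadlock.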
By the way, I tested this on
Adding
Fixes cockroachdb#39415.

This commit fixes the issue we saw in cockroachdb#39415, where a node would attempt to allocate a store ID for a new store on an existing node and would get stuck. The reason it would get stuck is that the KV request to allocate the store ID could be routed to the node's own replica for Range 1 (when doing a range descriptor lookup) over the node's internalClientAdapter. This KV request could then get stuck attempting to acquire a lease because we would still be blocking all Raft RPC traffic (see rpcsAllowedWhileBootstrapping). This would cause the KV request to stall indefinitely.

The fix is to tie the operationality of the internalClientAdapter to that of the rpcServer. If we don't want to be serving external KV traffic during this node initialization period, then we shouldn't be serving internal KV traffic either.

DNM: this still needs a higher-level test that verifies that the bug we saw when adding a store to a node with existing stores during bootstrap is no longer possible. I have only verified this manually so far.

Release note: None
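A rough Go sketch of the idea in that patch: the in-process KV client path should obey the same "operational" gate as the external gRPC server, so a request can't sneak onto a local replica while Raft traffic is still blocked. The types and fields here are assumptions for illustration only:

```go
package kvclient

import (
	"context"
	"errors"
)

type batchRequest struct{ /* elided */ }
type batchResponse struct{ /* elided */ }

// nodeServer is the local KV endpoint the adapter would normally call
// directly, bypassing gRPC.
type nodeServer interface {
	Batch(context.Context, *batchRequest) (*batchResponse, error)
}

// internalClientAdapter routes KV requests to the local node without going
// through gRPC. operational reports whether the RPC gateway has opened up.
type internalClientAdapter struct {
	server      nodeServer
	operational func() bool
}

func (a *internalClientAdapter) Batch(ctx context.Context, ba *batchRequest) (*batchResponse, error) {
	// Mirror the external gateway: if external KV RPCs would be refused,
	// refuse internal ones too instead of letting them stall indefinitely.
	if !a.operational() {
		return nil, errors.New("node not yet operational; retry against another node")
	}
	return a.server.Batch(ctx, ba)
}
```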
Here's a script to reproduce this issue using roachprod and a local cluster:
Will this be fixed in v20?
Hi @israellot. This has not yet been fixed in v20.1 but is still on our near-term roadmap. Do you mind describing the sequence of events that led you to this issue? Is it currently blocking you?
Hi,
I think we should pull … The point is that adding the additional stores should have completed by the time … The other option is to just make it async (in which case it can live anywhere), but I would try to avoid that as well.
Not exactly a starter project, but tagging it as such because it'd make for a good intern project this fall. Seems like we've been wanting to do this for a while.
Hm, I tried the following setup off of
Maybe all that's remaining here is to add a test. I'll try finding out when exactly this started working again (a strange bisect to do, ha).
Can't take any credit here. I can repro it in 19.2, but can't in 20.1 and beyond.
I don't know why this readily repros in 19.2 but not in 20.1+, but I'm pretty sure that the current code still has the same problems. We are bootstrapping additional stores (i.e. using KV) in cockroach/pkg/server/server.go Line 1565 in 50ffec8
It makes sense to me that the problem wouldn't occur reliably, though. If the bootstrap requests don't hit a local replica, all is well: the rest of the cluster is operational (in this example), so it will do the KV operation, and things smooth out. Be that as it may, there is still a general problem with allocating these extra NodeIDs in
Basically we have to allocate these extra nodeIDs asynchronously, so that's what we should do.
(or we do it after switching to modeOperational - however this will mean sequential restarts won't work, hence I opt for fully throwing it into a goroutine. BTW this is how it worked in the early days of CRDB until I got us into this mess with a well-meaning drive-by change)
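A short Go sketch of the "throw it into a goroutine" idea: allocate store IDs and bootstrap the extra engines in the background, after the node has started serving, instead of blocking node start on a KV round trip that may need this very node. `allocateStoreIDs` and `bootstrapStore` are hypothetical callbacks, not CockroachDB's actual functions:

```go
package storebootstrap

import (
	"context"
	"log"
)

type engine struct{ /* elided */ }

// bootstrapAdditionalStores kicks off store bootstrapping in the background.
// Running after startup means the KV layer (and the rest of the cluster) is
// reachable, so the ID allocation can't deadlock on this node's own gated RPCs.
func bootstrapAdditionalStores(
	ctx context.Context,
	emptyEngines []*engine,
	allocateStoreIDs func(ctx context.Context, n int) (firstID int32, err error),
	bootstrapStore func(ctx context.Context, eng *engine, id int32) error,
) {
	go func() {
		first, err := allocateStoreIDs(ctx, len(emptyEngines))
		if err != nil {
			log.Printf("unable to allocate store IDs: %v", err)
			return
		}
		for i, eng := range emptyEngines {
			if err := bootstrapStore(ctx, eng, first+int32(i)); err != nil {
				log.Printf("unable to bootstrap store: %v", err)
				return
			}
		}
	}()
}
```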
Asynchronously allocating store IDs/bootstrapping new stores makes sense to me. My reproduction steps above were just noise; I had assumed the problem was deterministic. I was able to repro it in 20.1+, though a bit less reliably.
We need to bootstrap additional stores asynchronously. Consider the range that houses the store ID allocator. When restarting the set of nodes that holds a quorum of these replicas with additional stores, those additional stores will require store IDs to get fully bootstrapped. But if we're gating node start (specifically opening up the RPC floodgates) on having all stores fully bootstrapped, we'll simply hang when trying to allocate store IDs. See TestAddNewStoresToExistingNodes and cockroachdb#39415 for more details.

Instead we opt to bootstrap additional stores asynchronously, and rely on the blocking function to signal to the caller that all stores have been fully bootstrapped.

Release note: None

Co-authored-by: Sam Huang <samh@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
55350: server: support adding stores to existing nodes r=irfansharif a=TheSamHuang

We need to bootstrap additional stores asynchronously. Consider the range that houses the store ID allocator. When restarting the set of nodes that holds a quorum of these replicas with additional stores, those additional stores will require store IDs to get fully bootstrapped. But if we're gating node start (specifically opening up the RPC floodgates) on having all stores fully bootstrapped, we'll simply hang when trying to allocate store IDs. See TestAddNewStoresToExistingNodes and #39415 for more details.

Instead we opt to bootstrap additional stores asynchronously, and rely on the blocking function to signal to the caller that all stores have been fully bootstrapped.

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
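The "blocking function" pattern described above can be sketched in a few lines of Go: start the asynchronous bootstrap and hand callers (and tests such as TestAddNewStoresToExistingNodes) a function they can block on until every additional store has been bootstrapped. Names here are illustrative, not the actual server code:

```go
package server

import "context"

// startAdditionalStoreBootstrap returns immediately after launching the
// background bootstrap; the returned wait function blocks until that work
// finishes or waitCtx is done.
func startAdditionalStoreBootstrap(
	ctx context.Context, bootstrapAll func(context.Context) error,
) (wait func(context.Context) error) {
	done := make(chan error, 1)
	go func() { done <- bootstrapAll(ctx) }()
	return func(waitCtx context.Context) error {
		select {
		case err := <-done:
			return err
		case <-waitCtx.Done():
			return waitCtx.Err()
		}
	}
}
```

Node start can proceed without waiting, while anything that genuinely needs all stores online (a test, or a later startup phase) calls the returned function.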
Describe the problem
Please describe the issue you observed, and any steps we can take to reproduce it:
In an existing cluster, if I take one node down and then add a new store to the node, the node comes back up but it is not able to connect to the cluster.
To Reproduce
What did you do? Describe in your own words.
If possible, provide steps to reproduce the behavior:
Expected behavior
The node should come back online and connect to the cluster with the new store.
Additional data / screenshots
If the problem is SQL-related, include a copy of the SQL query and the schema
of the supporting tables.
If a node in your cluster encountered a fatal error, supply the contents of the
log directories (at minimum of the affected node(s), but preferably all nodes).
Note that log files can contain confidential information. Please continue
creating this issue, but contact support@cockroachlabs.com to submit the log
files in private.
If applicable, add screenshots to help explain your problem.
Logs are here: cockroach.log
Environment:
Additional context
What was the impact?
Add any other context about the problem here.