sql: distsql plans against unavailable node #23601

Closed
solongordon opened this issue Mar 8, 2018 · 36 comments
@solongordon
Contributor

solongordon commented Mar 8, 2018

After stopping one node in a four node cluster, I noticed that distsql inconsistently continues to include that node in physical plans, which causes errors in query execution. This is reproducible on a local cluster.

# Start a local four node cluster.
roachprod create local --nodes 4
roachprod start local

# Create a table with leaseholders on all four nodes.
roachprod sql local:1 <<EOF
CREATE DATABASE IF NOT EXISTS d;
DROP TABLE IF EXISTS d.t;
CREATE TABLE d.t (x INT PRIMARY KEY);
INSERT INTO d.t VALUES (1), (2), (3), (4);
ALTER TABLE d.t SPLIT AT VALUES (2), (3), (4);
EOF

sleep 1

roachprod sql local:1 <<EOF
ALTER TABLE d.t TESTING_RELOCATE VALUES
  (ARRAY[1,2,3], 1),
  (ARRAY[2,3,4], 2),
  (ARRAY[3,4,1], 3),
  (ARRAY[4,1,2], 4);
EOF

# Stop one of the nodes.
roachprod stop local:4

Now if you repeatedly run SELECT * FROM d.t on node 1, about half the time it will give the correct result, but the other half it will throw an rpc error:

roachprod sql local:1 <<EOF
SELECT * FROM d.t;
EOF

pq: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure

And if you repeatedly request the distsql explain plan, about half the time it will incorrectly include the stopped node:

roachprod sql local:1 <<EOF
SELECT "URL" FROM [EXPLAIN (DISTSQL) SELECT * FROM d.t];
EOF

https://cockroachdb.github.io/distsqlplan/decode.html?eJzEkT1rwzAQhvf-jHdWwR_poqlrlrSEbkWDah3B4OjESYaW4P9eLA2pg1taAumou3vuOfGe4NnRzh4pQr-ihkIDhRYKGxiFINxRjCxzuwxv3Tt0pdD7MKa5bBQ6FoI-IfVpIGi82LeB9mQdCRQcJdsPWRCkP1r5eEwwkwKP6bwjJnsg6HpS33jO60fP4kjILZabaeWSHd9zuBhbFzcLcX2jDzY38rT_ENiKZ08xsI_0q0SqOVByByrpRx6lo2fhLmvK8ylzueAoptKty2Prcysf-BWuf4QfFnB1CTfXmNtr4M2fYDPdfQYAAP__FSJJrw==

The behavior persists after decommissioning the stopped node.

From looking at session traces, it looks to me like distsql always tries to plan against the stopped node. There is a race condition: sometimes it discovers during planning that the node is unhealthy and adjusts, but sometimes it doesn't and fails upon execution.

@solongordon solongordon added this to the 2.0 milestone Mar 8, 2018
@tbg
Member

tbg commented Mar 8, 2018

Good catch! Ping #21882

@tbg
Member

tbg commented Mar 8, 2018

The decommissioned node shouldn't have any leases and for non-DistSQL requests, this would be discovered and the lease holder cache cleared. Perhaps DistSQL is failing to close the loop back to DistSender to tell it to eject from the lease holder cache and so it's stuck trying again and again.

@tbg
Member

tbg commented Mar 8, 2018

PS the title suggests that this is specific to decommissioning, but it seems that the bug is rather that DistSQL plans against unavailable nodes. It shouldn't do that in the first place; I don't think there's anything special that decommissioning the nodes will do. That just makes sure they cease to have leases and data, so that there is no reason to ever plan on them unless there's a bug like described in my last post. Consider s/decommissioning/unavailable/ in the title.

@solongordon solongordon changed the title from "sql: distsql plans against decommissioned node" to "sql: distsql plans against unavailable node" Mar 8, 2018
@solongordon
Contributor Author

Done, thanks.

The error goes away if I force local execution via SET distsql=off, but comes back when I turn distsql back on. So if local execution updates the lease holder cache, maybe the cache isn't the issue.

I just confirmed that this issue is not present in v2.0-alpha.20180129, so it's at least a somewhat recent regression.

@solongordon
Contributor Author

It looks like the issue occurs as of #22658. (I was able to reproduce it with 0c433fe but not 61ac7fe.) @bdarnell any ideas about why that would cause this behavior?

@jordanlewis
Member

I wonder if the behavior change as of #22658 merely exposed a previously existing problem. It seems to me that a problem in which DistSQL plans on unavailable nodes would have existed both before and after that PR.

@solongordon
Contributor Author

I was thinking something similar. I wouldn't be surprised if DistSQL was already trying to plan on the unavailable node, but prior to that commit it would reliably figure out that it was unavailable and adjust the plan. I'll see if I can test out that hypothesis via tracing.

@solongordon
Contributor Author

Yep, this seems to be the case. Every time the query is run I see this in the trace:

querying next range at /Table/51/1/3
marking n3 as unhealthy for this plan: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
not planning on node 3. unhealthy: true, incompatible version: false

So why is DistSQL trying to plan on an unavailable node which is no longer a leaseholder or replica for any ranges? Presumably some cache is never getting updated.

@solongordon
Contributor Author

@tschottdorf's first instinct looks correct: DistSender.leaseHolderCache is serving up stale replica info. Interestingly I see stale cache hits for local execution as well though.

@solongordon
Contributor Author

Summing up some findings on this. I think this issue comes down to the intersection of a few things:

  • Stopping a node can leave the remaining nodes with a stale leaseholder cache which might never get updated. (More on that later.)
  • The DistSQL planner is OK with planning against nodes which don't yet have an RPC connection (see the sketch after this list):
    if err != nil && err != rpc.ErrNotConnected && err != rpc.ErrNotHeartbeated {
  • We recently updated the RPC layer to kick failed connections out of the pool rather than allowing gRPC to redial. So when a node is unavailable, it cycles in and out of the RPC connection pool as other nodes try to ping it.
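
For concreteness, here is a minimal, self-contained sketch of the health check described in the second bullet above. The names (nodeIsPlannable, connHealthFn, and the error sentinels) are hypothetical stand-ins, not the real rpc package: the point is only that a node we have never dialed still passes the check.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for rpc.ErrNotConnected and rpc.ErrNotHeartbeated.
var (
	errNotConnected   = errors.New("not connected")
	errNotHeartbeated = errors.New("not yet heartbeated")
)

// connHealthFn stands in for (*rpc.Context).ConnHealth.
type connHealthFn func(target string) error

// nodeIsPlannable mirrors the quoted condition: only a "real" health error
// excludes a node, so a node we have never dialed still looks plannable.
func nodeIsPlannable(connHealth connHealthFn, addr string) bool {
	err := connHealth(addr)
	if err != nil && err != errNotConnected && err != errNotHeartbeated {
		return false // known-bad connection: exclude from the plan
	}
	return true // healthy, or simply never dialed: allowed into the plan
}

func main() {
	// A node that has cycled out of the connection pool reports "not
	// connected", so it passes the check even though it may be down.
	notDialed := func(string) error { return errNotConnected }
	fmt.Println(nodeIsPlannable(notDialed, "n4:26257")) // true
}
```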

The sum effect is that the DistSQL planner will try to plan against an unavailable node which is no longer a leaseholder. If its connection is in an error state, the planner will plan on a different node and the query will succeed. But if it's in the "not connected" state, the planner will go ahead and plan on it, and the query errors out on execution.

How do we end up with a stale leaseholder cache? My theory is that DistSender needs to be smarter about updating the cache. Currently it only updates if it receives a NotLeaseHolderError:

case *roachpb.NotLeaseHolderError:
    ds.metrics.NotLeaseHolderErrCount.Inc(1)
    if lh := tErr.LeaseHolder; lh != nil {
        // If the replica we contacted knows the new lease holder, update the cache.
        ds.leaseHolderCache.Update(ctx, rangeID, lh.StoreID)
    }

Receiving a successful response from the leaseholder has no effect on the cache, even if we thought someone else was the leaseholder. We rely on the non-leaseholders to report this info, and they may very well have stale caches too.
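
To make that gap concrete, here is a small self-contained sketch, with hypothetical types (leaseHolderCache here is just a map, not DistSender's real cache), of what recording the replica that actually served a successful request could look like:

```go
package main

import "fmt"

type rangeID int
type storeID int

// leaseHolderCache is a toy stand-in for DistSender's leaseHolderCache.
type leaseHolderCache struct {
	m map[rangeID]storeID
}

func (c *leaseHolderCache) Update(r rangeID, s storeID) { c.m[r] = s }

func (c *leaseHolderCache) Lookup(r rangeID) (storeID, bool) {
	s, ok := c.m[r]
	return s, ok
}

// onSuccessfulResponse records the replica that actually served the request,
// closing the gap described above where only NotLeaseHolderError updates the cache.
func onSuccessfulResponse(c *leaseHolderCache, r rangeID, servedBy storeID) {
	if cached, ok := c.Lookup(r); !ok || cached != servedBy {
		c.Update(r, servedBy)
	}
}

func main() {
	c := &leaseHolderCache{m: map[rangeID]storeID{1: 4}} // stale: store 4 is on the stopped node
	onSuccessfulResponse(c, 1, 2)                        // store 2 actually served the read
	fmt.Println(c.m[1])                                  // 2
}
```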

So I think the action items here are:

  1. Investigate improvements to DistSender's leaseholder cache invalidation. (Maybe as simple as updating the cache when we successfully contact the leaseholder.)
  2. Update the DistSQL planner to not plan on nodes until they have a healthy heartbeat. (A liveness-based sketch follows this list.)
  3. Consider whether we are OK with unhealthy connections being repeatedly added to and kicked out of our RPC connection pool.
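
Here is a rough, self-contained sketch of what item 2 could look like if it were based on node liveness rather than connection health; the livenessRecord type and isLive method are hypothetical stand-ins, not the real node-liveness API.

```go
package main

import (
	"fmt"
	"time"
)

// livenessRecord is a stand-in for a gossiped node-liveness entry.
type livenessRecord struct {
	nodeID     int
	expiration time.Time
	draining   bool
}

// isLive reports whether the node's liveness record is still current.
func (l livenessRecord) isLive(now time.Time) bool {
	return !l.draining && now.Before(l.expiration)
}

func main() {
	now := time.Now()
	records := []livenessRecord{
		{nodeID: 1, expiration: now.Add(4 * time.Second)},
		{nodeID: 4, expiration: now.Add(-30 * time.Second)}, // stopped node: record expired
	}
	for _, r := range records {
		fmt.Printf("n%d plannable: %t\n", r.nodeID, r.isLive(now))
	}
}
```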

@vivekmenezes
Contributor

Even if you fix this, it looks like it will only make execution-path failures rarer. You still need to recover from the execution path seeing an unhealthy node.

related issue: #15637

@vivekmenezes
Contributor

moving to 2.1 since this is not a regression

@solongordon
Contributor Author

Good point about the execution path. But I'd personally still consider this behavior a regression. When a node becomes unavailable, distsql queries which would previously have included that node start erroring out ~50% of the time. That's not the case in v1.1.5.

@jordanlewis
Member

I agree with @solongordon that this is a regression. It's not acceptable to require retries half of the time after you decommission a node, which is supposed to be an ordinary operational command.

Fixing all 3 of these things probably can't happen by 2.0, though. Is there a simple fix that we can apply to prevent the user-visible issue, at least?

@solongordon
Contributor Author

A simple fix would be to change DistSQL not to plan against nodes whose RPC connection is in the ErrNotConnected state. I'm not sure if that would have broader implications though. Maybe @RaduBerinde would know?

@jordanlewis
Member

I suppose one problem with that is that those connections would never be resurrected after a transient network failure, unless there's an external actor trying to resurrect them.

@tbg
Member

tbg commented Mar 12, 2018

Yeah, checking for connection health ((*Ctx).ConnHealth(target string) error) alone isn't the right thing. The regression occurred because the connection cache was previously more stateful than it is now: it would hold on to any connection forever and would try it periodically, so checking connection health made sense. Now when the connection fails, we lose all knowledge about it.

ISTM that the most fundamental badness here is that we're not evicting the lease holders. Fixing that would also fix many of these errors but while it's probably not technically hard, perhaps there's some refactoring involved. (iirc the leaseholder cache lives above the code that tries the other replicas).

For the planning layer, rpcContext offers a preconfigured circuit breaker. The gossip layer has example usage. Making sure we don't plan on nodes for which the breaker is open should be a good fix for the other issue, but it motivates certain restructurings of the code so that the breaker is actually a per-target-address singleton used for circuit-breaking the connections.
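
A rough sketch of that idea, using a hypothetical per-node breaker interface rather than the real rpcContext breaker: nodes whose breaker is currently open are simply filtered out of the set of planning candidates.

```go
package main

import "fmt"

// breaker is a hypothetical, minimal circuit-breaker interface; the real
// rpcContext breaker has a richer API.
type breaker interface {
	Ready() bool // false while the breaker is open (recent failures)
}

type openBreaker struct{}

func (openBreaker) Ready() bool { return false }

type closedBreaker struct{}

func (closedBreaker) Ready() bool { return true }

// planningCandidates filters the candidate nodes down to those whose
// breaker is currently closed, i.e. nodes we have no reason to distrust.
func planningCandidates(nodes []int, breakerFor func(nodeID int) breaker) []int {
	var out []int
	for _, n := range nodes {
		if breakerFor(n).Ready() {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	breakers := map[int]breaker{
		1: closedBreaker{}, 2: closedBreaker{}, 3: closedBreaker{},
		4: openBreaker{}, // n4 is down; its breaker has tripped
	}
	fmt.Println(planningCandidates([]int{1, 2, 3, 4}, func(n int) breaker { return breakers[n] })) // [1 2 3]
}
```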

@RaduBerinde
Member

CC @andreimatei

@andreimatei
Contributor

My bad for not having a test for whatever happened here. But these tests are hard to write...
Solon, do you want to do something here? Otherwise I'll take it, but it'll have to wait a bit.

@a-robinson
Contributor

ISTM that the most fundamental badness here is that we're not evicting the lease holders. Fixing that would also fix many of these errors but while it's probably not technically hard, perhaps there's some refactoring involved. (iirc the leaseholder cache lives above the code that tries the other replicas).

Seconded. This is also a performance issue, as in #23543

@andreimatei
Contributor

When exactly do you suggest doing evictions from the leaseholder cache?

Receiving a successful response from the leaseholder has no effect on the cache, even if we thought someone else was the leaseholder. We rely on the non-leaseholders to report this info, and they may very well have stale caches too.

I'm not sure I follow this. If we manage to read from a range, either through the local engine or DistSQL, the correct leaseholder for that range should make it to the cache. In the DistSQL case, the gateway gets info for all the mis-planned ranges (ranges planned on a node other than the leaseholder).

@solongordon
Contributor Author

Hm, that's not the behavior I'm seeing. The leaseholder cache can remain stale even after successful reads for the range, both with local and distsql execution. I think that's related to the fact that DistSender only updates the cache based on NotLeaseHolderErrors.

@andreimatei
Contributor

If it's true that the cache remains stale, I think that's a bug. You're saying that the DistSender never updates the cache if the leaseholder it knows about is not responding (and also if we never contact it because the conn is not healthy)?

@solongordon
Copy link
Contributor Author

Yes, though I think it's a bit more specific than that. DistSender doesn't update the cache if the cached leaseholder is unavailable and the next replica it tries is the new leaseholder.
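
To illustrate that scenario (and the eviction that #23885 later added), here is a minimal sketch with hypothetical names; it is not DistSender's actual structure, just the shape of the fix: when the cached leaseholder returns an RPC error, evict it before falling through to the next replica.

```go
package main

import (
	"errors"
	"fmt"
)

type replica struct{ storeID int }

// sendToReplicas tries replicas in order. On an RPC error from the cached
// leaseholder it calls evict so the cache does not stay stale, then falls
// through to the next replica.
func sendToReplicas(
	replicas []replica,
	cachedLeaseholder int,
	send func(replica) error,
	evict func(storeID int),
) error {
	for _, r := range replicas {
		if err := send(r); err != nil {
			if r.storeID == cachedLeaseholder {
				// The piece that was missing: drop the stale cache entry.
				evict(r.storeID)
			}
			continue // try the next replica
		}
		return nil // success
	}
	return errors.New("sending to all replicas failed")
}

func main() {
	replicas := []replica{{storeID: 1}, {storeID: 2}, {storeID: 3}}
	down := map[int]bool{1: true} // s1 is the cached leaseholder but unreachable
	err := sendToReplicas(replicas, 1,
		func(r replica) error {
			if down[r.storeID] {
				return errors.New("rpc error: connection refused")
			}
			return nil
		},
		func(storeID int) { fmt.Printf("evicting cached leaseholder s%d\n", storeID) },
	)
	fmt.Println("send error:", err) // nil: s2 served the request
}
```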

tbg added a commit to tbg/cockroach that referenced this issue Mar 15, 2018
The recent PR cockroachdb#22658 introduced a regression in
`(*rpcContext).ConnHealth` which caused DistSQL to continue planning on
unavailable nodes for about an hour (`ttlNodeDescriptorGossip`) if the
leaseholder cache happened to not be updated by other non-DistSQL
requests.

Instead, consult node liveness and avoid planning on dead nodes. This
reduces the problem to a <10s window. The defunct `ConnHealth` mechanism
still protects against planning in some cases (supposedly due to a
once-per-second reconnection policy) and is retained for that reason,
with issue cockroachdb#23829 filed to decide its future.

NB: I'm not putting a release note since this was introduced after 1.1.
We released it in a beta, though, so it may be worth calling out there.

Touches cockroachdb#23601. (Not fixing it because this issue should only close
when there's a roachtest).

Release note (bug fix): NB: this fixes a regression introduced in
2.0-beta, and not present in 1.1: Avoid planning DistSQL queries against
unavailable nodes.
@solongordon solongordon removed their assignment Mar 15, 2018
craig bot added a commit that referenced this issue Apr 6, 2018
23916: backport-2.0: distsql: consult liveness during physical planning r=tschottdorf a=tschottdorf

Backport 1/1 commits from #23834.

/cc @cockroachdb/release

---

The recent PR #22658 introduced a regression in
`(*rpcContext).ConnHealth` which caused DistSQL to continue planning on
unavailable nodes for about an hour (`ttlNodeDescriptorGossip`) if the
leaseholder cache happened to not be updated by other non-DistSQL
requests.

Instead, consult node liveness and avoid planning on dead nodes. This
reduces the problem to a <10s window. The defunct `ConnHealth` mechanism
still protects against planning in some cases (supposedly due to a
once-per-second reconnection policy) and is retained for that reason,
with issue #23829 filed to decide its future.

NB: I'm not putting a release note since this was introduced after 1.1.
We released it in a beta, though, so it may be worth calling out there.

Touches #23601. (Not fixing it because this issue should only close
when there's a roachtest).

Release note (bug fix): NB: this fixes a regression introduced in
2.0-beta, and not present in 1.1: Avoid planning DistSQL queries against
unavailable nodes.
@knz knz added the S-2-temp-unavailability (Temp crashes or other availability problems. Can be worked around or resolved by restarting.), C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), and A-sql-execution (Relating to SQL execution.) labels May 14, 2018
@jordanlewis
Member

Is this still an issue? Should we revisit after #23834?

@tbg tbg removed their assignment Jul 24, 2018
@solongordon
Contributor Author

Looks like this is all set from a DistSQL perspective thanks to #23834. However, the stale leaseholder cache fix (#23885) never got merged, which could cause unnecessary performance hits when a node goes down.

Sending back to @tschottdorf in case he wants to wrap that up before closing this issue.

@solongordon solongordon removed the A-sql-execution (Relating to SQL execution.) label Aug 14, 2018
@solongordon solongordon assigned tbg and unassigned solongordon Aug 14, 2018
@solongordon solongordon added the C-performance (Perf of queries or internals. Solution not expected to change functional behavior.) label and removed the S-2-temp-unavailability (Temp crashes or other availability problems. Can be worked around or resolved by restarting.) and C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) labels Aug 14, 2018
tbg added a commit to tbg/cockroach that referenced this issue Aug 16, 2018
This addresses a situation in which we would not evict a stale
leaseholder for a long time. Consider the replicas [s1,s2,s3] and
s1 is down but is the cached leaseholder, while s2 is the actual lease
holder. The RPC layer will try s1, get an RPC error, try s2 and succeed.
Since there is no NotLeaseHolderError involved, the cache would not get
updated, and so every request pays the overhead of trying s1 first.

Fixes cockroachdb#23601.

Release note: None
craig bot pushed a commit that referenced this issue Aug 16, 2018
23885: kv: evict leaseholder on RPC error r=solongordon a=tschottdorf

This addresses a situation in which we would not evict a stale
leaseholder for a long time. Consider the replicas [s1,s2,s3] and
s1 is down but is the cached leaseholder, while s2 is the actual lease
holder. The RPC layer will try s1, get an RPC error, try s2 and succeed.
Since there is no NotLeaseHolderError involved, the cache would not get
updated, and so every request pays the overhead of trying s1 first.

WIP because needs testing.

Touches #23601.

Release note (bug fix): Improve request routing during node outages.

28609: opt: Make additional perf improvements r=andy-kimball a=andy-kimball

Make several more fixes:

1. Do not qualify column names in metadata, since that
requires expensive string formatting up-front (also cleanup
the factoring of this code, which had gotten messy).
2. Inline Metadata into Memo.
3. Inline logicalPropsBuilder into the Memo.

Together, these changes improve KV perf from:
```
Phases/kv-read/OptBuild  18.4µs ± 1%
```
to:
```
Phases/kv-read/OptBuild  17.8µs ± 1%
```

28661: storage: don't include RHS data in merge trigger r=bdarnell,tschottdorf a=benesch

Now that we require ranges to be collocated during a merge and the RHS
replicas to be up-to-date before the merge commits, we no longer need to
include a snapshot of the RHS in the merge trigger. We know that the
copy of the data that already exists in the local store is perfectly
up-to-date.

So, stop sending the data in the merge trigger.

Release note: None

28689: sqlbase: avoid using SERIAL in system tables r=knz a=knz

Needed for  #28575.

We'll soon want special behavior for SERIAL. We can't afford the
definition of system tables to be subject to a discussion about what
SERIAL means. So this patch ensures system tables don't use SERIAL.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>
Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>