sql: distsql plans against unavailable node #23601

Closed
solongordon opened this issue Mar 8, 2018 · 36 comments
@solongordon
Contributor

solongordon commented Mar 8, 2018

After stopping one node in a four node cluster, I noticed that distsql inconsistently continues to include that node in physical plans, which causes errors in query execution. This is reproducible on a local cluster.

# Start a local four node cluster.
roachprod create local --nodes 4
roachprod start local

# Create a table with leaseholders on all four nodes.
roachprod sql local:1 <<EOF
CREATE DATABASE IF NOT EXISTS d;
DROP TABLE IF EXISTS d.t;
CREATE TABLE d.t (x INT PRIMARY KEY);
INSERT INTO d.t VALUES (1), (2), (3), (4);
ALTER TABLE d.t SPLIT AT VALUES (2), (3), (4);
EOF

sleep 1

roachprod sql local:1 <<EOF
ALTER TABLE d.t TESTING_RELOCATE VALUES
  (ARRAY[1,2,3], 1),
  (ARRAY[2,3,4], 2),
  (ARRAY[3,4,1], 3),
  (ARRAY[4,1,2], 4);
EOF

# Stop one of the nodes.
roachprod stop local:4

Now if you repeatedly run SELECT * FROM d.t on node 1, about half the time it will give the correct result, but the other half it will throw an rpc error:

roachprod sql local:1 <<EOF
SELECT * FROM d.t;
EOF

pq: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure

And if you repeatedly request the distsql explain plan, about half the time it will incorrectly include the stopped node:

roachprod sql local:1 <<EOF
SELECT "URL" FROM [EXPLAIN (DISTSQL) SELECT * FROM d.t];
EOF

https://cockroachdb.github.io/distsqlplan/decode.html?eJzEkT1rwzAQhvf-jHdWwR_poqlrlrSEbkWDah3B4OjESYaW4P9eLA2pg1taAumou3vuOfGe4NnRzh4pQr-ihkIDhRYKGxiFINxRjCxzuwxv3Tt0pdD7MKa5bBQ6FoI-IfVpIGi82LeB9mQdCRQcJdsPWRCkP1r5eEwwkwKP6bwjJnsg6HpS33jO60fP4kjILZabaeWSHd9zuBhbFzcLcX2jDzY38rT_ENiKZ08xsI_0q0SqOVByByrpRx6lo2fhLmvK8ylzueAoptKty2Prcysf-BWuf4QfFnB1CTfXmNtr4M2fYDPdfQYAAP__FSJJrw==

The behavior persists after decommissioning the stopped node.

From looking at session traces, it looks to me like distsql always tries to plan against the stopped node. There is a race condition: sometimes it discovers during planning that the node is unhealthy and adjusts, but sometimes it doesn't and fails upon execution.

@solongordon solongordon added this to the 2.0 milestone Mar 8, 2018
@tbg
Member

tbg commented Mar 8, 2018

Good catch! Ping #21882

@tbg
Member

tbg commented Mar 8, 2018

The decommissioned node shouldn't have any leases and for non-DistSQL requests, this would be discovered and the lease holder cache cleared. Perhaps DistSQL is failing to close the loop back to DistSender to tell it to eject from the lease holder cache and so it's stuck trying again and again.

@tbg
Member

tbg commented Mar 8, 2018

PS the title suggests that this is specific to decommissioning, but it seems that the bug is rather that DistSQL plans against unavailable nodes. It shouldn't do that in the first place; I don't think there's anything special that decommissioning the nodes will do. That just makes sure they cease to have leases and data, so that there is no reason to ever plan on them unless there's a bug like described in my last post. Consider s/decommissioning/unavailable/ in the title.

@solongordon solongordon changed the title from "sql: distsql plans against decommissioned node" to "sql: distsql plans against unavailable node" Mar 8, 2018
@solongordon
Contributor Author

Done, thanks.

The error goes away if I force local execution via SET distsql=off, but comes back when I turn distsql back on. So if local execution updates the lease holder cache, maybe the cache isn't the issue.

I just confirmed that this issue is not present in v2.0-alpha.20180129, so it's at least a somewhat recent regression.

@solongordon
Contributor Author

It looks like the issue occurs as of #22658. (I was able to reproduce it with 0c433fe but not 61ac7fe.) @bdarnell any ideas about why that would cause this behavior?

@jordanlewis
Member

I wonder if the behavior change as of #22658 merely exposed a previously existing problem. It seems to me that a problem in which DistSQL plans on unavailable nodes would have existed both before and after that PR.

@solongordon
Contributor Author

I was thinking something similar. I wouldn't be surprised if DistSQL was already trying to plan on the unavailable node, but prior to that commit it would reliably figure out that it was unavailable and adjust the plan. I'll see if I can test out that hypothesis via tracing.

@solongordon
Contributor Author

Yep, this seems to be the case. Every time the query is run I see this in the trace:

querying next range at /Table/51/1/3
marking n3 as unhealthy for this plan: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
not planning on node 3. unhealthy: true, incompatible version: false

So why is DistSQL trying to plan on an unavailable node which is no longer a leaseholder or replica for any ranges? Presumably some cache is never getting updated.

@solongordon
Contributor Author

@tschottdorf's first instinct looks correct: DistSender.leaseHolderCache is serving up stale replica info. Interestingly I see stale cache hits for local execution as well though.

@solongordon
Contributor Author

Summing up some findings on this. I think this issue comes down to the intersection of a few things:

  • Stopping a node can leave the remaining nodes with a stale leaseholder cache which might never get updated. (More on that later.)
  • The DistSQL planner is OK with planning against nodes which don't yet have an RPC connection (see the sketch after this list):
    if err != nil && err != rpc.ErrNotConnected && err != rpc.ErrNotHeartbeated {
  • We recently updated the RPC layer to kick failed connections out of the pool rather than allowing gRPC to redial. So when a node is unavailable, it cycles in and out of the RPC connection pool as other nodes try to ping it.
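
For concreteness, here is a minimal, self-contained sketch of the health check described in the second bullet above. The names (nodeIsPlannable, connHealthFn, and the error sentinels) are hypothetical stand-ins, not the real rpc package: the point is only that a node we have never dialed still passes the check.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for rpc.ErrNotConnected and rpc.ErrNotHeartbeated.
var (
	errNotConnected   = errors.New("not connected")
	errNotHeartbeated = errors.New("not yet heartbeated")
)

// connHealthFn stands in for (*rpc.Context).ConnHealth.
type connHealthFn func(target string) error

// nodeIsPlannable mirrors the quoted condition: only a "real" health error
// excludes a node, so a node we have never dialed still looks plannable.
func nodeIsPlannable(connHealth connHealthFn, addr string) bool {
	err := connHealth(addr)
	if err != nil && err != errNotConnected && err != errNotHeartbeated {
		return false // known-bad connection: exclude from the plan
	}
	return true // healthy, or simply never dialed: allowed into the plan
}

func main() {
	// A node that has cycled out of the connection pool reports "not
	// connected", so it passes the check even though it may be down.
	notDialed := func(string) error { return errNotConnected }
	fmt.Println(nodeIsPlannable(notDialed, "n4:26257")) // true
}
```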

The sum effect is that the DistSQL planner will try to plan against an unavailable node which is no longer a leaseholder. If its connection is in an error state, the planner will plan on a different node and the query will succeed. But if it's in the "not connected" state, the planner will go ahead and plan on it, and the query errors out on execution.

How do we end up with a stale leaseholder cache? My theory is that DistSender needs to be smarter about updating the cache. Currently it only updates if it receives a NotLeaseHolderError:

case *roachpb.NotLeaseHolderError:
    ds.metrics.NotLeaseHolderErrCount.Inc(1)
    if lh := tErr.LeaseHolder; lh != nil {
        // If the replica we contacted knows the new lease holder, update the cache.
        ds.leaseHolderCache.Update(ctx, rangeID, lh.StoreID)
    }

Receiving a successful response from the leaseholder has no effect on the cache, even if we thought someone else was the leaseholder. We rely on the non-leaseholders to report this info, and they may very well have stale caches too.
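
To make that gap concrete, here is a small self-contained sketch, with hypothetical types (leaseHolderCache here is just a map, not DistSender's real cache), of what recording the replica that actually served a successful request could look like:

```go
package main

import "fmt"

type rangeID int
type storeID int

// leaseHolderCache is a toy stand-in for DistSender's leaseHolderCache.
type leaseHolderCache struct {
	m map[rangeID]storeID
}

func (c *leaseHolderCache) Update(r rangeID, s storeID) { c.m[r] = s }

func (c *leaseHolderCache) Lookup(r rangeID) (storeID, bool) {
	s, ok := c.m[r]
	return s, ok
}

// onSuccessfulResponse records the replica that actually served the request,
// closing the gap described above where only NotLeaseHolderError updates the cache.
func onSuccessfulResponse(c *leaseHolderCache, r rangeID, servedBy storeID) {
	if cached, ok := c.Lookup(r); !ok || cached != servedBy {
		c.Update(r, servedBy)
	}
}

func main() {
	c := &leaseHolderCache{m: map[rangeID]storeID{1: 4}} // stale: store 4 is on the stopped node
	onSuccessfulResponse(c, 1, 2)                        // store 2 actually served the read
	fmt.Println(c.m[1])                                  // 2
}
```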

So I think the action items here are:

  1. Investigate improvements to DistSender's leaseholder cache invalidation. (Maybe as simple as updating the cache when we successfully contact the leaseholder.)
  2. Update the DistSQL planner to not plan on nodes until they have a healthy heartbeat. (A liveness-based sketch follows this list.)
  3. Consider whether we are OK with unhealthy connections being repeatedly added to and kicked out of our RPC connection pool.
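
Here is a rough, self-contained sketch of what item 2 could look like if it were based on node liveness rather than connection health; the livenessRecord type and isLive method are hypothetical stand-ins, not the real node-liveness API.

```go
package main

import (
	"fmt"
	"time"
)

// livenessRecord is a stand-in for a gossiped node-liveness entry.
type livenessRecord struct {
	nodeID     int
	expiration time.Time
	draining   bool
}

// isLive reports whether the node's liveness record is still current.
func (l livenessRecord) isLive(now time.Time) bool {
	return !l.draining && now.Before(l.expiration)
}

func main() {
	now := time.Now()
	records := []livenessRecord{
		{nodeID: 1, expiration: now.Add(4 * time.Second)},
		{nodeID: 4, expiration: now.Add(-30 * time.Second)}, // stopped node: record expired
	}
	for _, r := range records {
		fmt.Printf("n%d plannable: %t\n", r.nodeID, r.isLive(now))
	}
}
```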

@vivekmenezes
Contributor

Even if you fix this, it looks like it will only make execution-path failures rarer. You still need to recover from the execution path seeing an unhealthy node.

related issue: #15637

@vivekmenezes
Contributor

moving to 2.1 since this is not a regression

@solongordon
Contributor Author

Good point about the execution path. But I'd personally still consider this behavior a regression. When a node becomes unavailable, distsql queries which would previously have included that node start erroring out ~50% of the time. That's not the case in v1.1.5.

@jordanlewis
Member

I agree with @solongordon that this is a regression. It's not acceptable to require retries half of the time after you decommission a node, which is supposed to be an ordinary operational command.

Fixing all 3 of these things probably can't happen by 2.0, though. Is there a simple fix that we can apply to prevent the user-visible issue, at least?

@solongordon
Contributor Author

A simple fix would be to change DistSQL not to plan against nodes whose RPC connection is in the ErrNotConnected state. I'm not sure if that would have broader implications though. Maybe @RaduBerinde would know?

@jordanlewis
Member

I suppose one problem with that is that those connections would never be resurrected after a transient network failure, unless there's an external actor trying to resurrect them.

@tbg
Member

tbg commented Mar 12, 2018

Yeah, checking for connection health ((*Ctx).ConnHealth(target string) error) alone isn't the right thing. The regression occurred because the connection cache was previously more stateful than it is now: it would hold on to any connection forever and would try it periodically, so checking connection health made sense. Now when the connection fails, we lose all knowledge about it.

ISTM that the most fundamental badness here is that we're not evicting the lease holders. Fixing that would also fix many of these errors but while it's probably not technically hard, perhaps there's some refactoring involved. (iirc the leaseholder cache lives above the code that tries the other replicas).

For the planning layer, rpcContext offers a preconfigured circuit breaker. The gossip layer has example usage. Making sure we don't plan on nodes for which the breaker is open should be a good fix for the other issue, but it motivates certain restructurings of the code so that the breaker is actually a per-target-address singleton used for circuit-breaking the connections.
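
A rough sketch of that idea, using a hypothetical per-node breaker interface rather than the real rpcContext breaker: nodes whose breaker is currently open are simply filtered out of the set of planning candidates.

```go
package main

import "fmt"

// breaker is a hypothetical, minimal circuit-breaker interface; the real
// rpcContext breaker has a richer API.
type breaker interface {
	Ready() bool // false while the breaker is open (recent failures)
}

type openBreaker struct{}

func (openBreaker) Ready() bool { return false }

type closedBreaker struct{}

func (closedBreaker) Ready() bool { return true }

// planningCandidates filters the candidate nodes down to those whose
// breaker is currently closed, i.e. nodes we have no reason to distrust.
func planningCandidates(nodes []int, breakerFor func(nodeID int) breaker) []int {
	var out []int
	for _, n := range nodes {
		if breakerFor(n).Ready() {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	breakers := map[int]breaker{
		1: closedBreaker{}, 2: closedBreaker{}, 3: closedBreaker{},
		4: openBreaker{}, // n4 is down; its breaker has tripped
	}
	fmt.Println(planningCandidates([]int{1, 2, 3, 4}, func(n int) breaker { return breakers[n] })) // [1 2 3]
}
```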

@RaduBerinde
Member

CC @andreimatei

@andreimatei
Contributor

My bad for not having a test for whatever happened here. But these tests are hard to write...
Solon, do you want to do something here? Otherwise I'll take it, but it'll have to wait a bit.

@a-robinson
Contributor

ISTM that the most fundamental badness here is that we're not evicting the lease holders. Fixing that would also fix many of these errors but while it's probably not technically hard, perhaps there's some refactoring involved. (iirc the leaseholder cache lives above the code that tries the other replicas).

Seconded. This is also a performance issue, as in #23543

@andreimatei
Contributor

When exactly do you suggest doing evictions from the leaseholder cache?

Receiving a successful response from the leaseholder has no effect on the cache, even if we thought someone else was the leaseholder. We rely on the non-leaseholders to report this info, and they may very well have stale caches too.

I'm not sure I follow this. If we manage to read from a range, either through the local engine or DistSQL, the correct leaseholder for that range should make it to the cache. In the DistSQL case, the gateway gets info for all the mis-planned ranges (ranges planned on a node other than the leaseholder).

@solongordon
Contributor Author

Hm, that's not the behavior I'm seeing. The leaseholder cache can remain stale even after successful reads for the range, both with local and distsql execution. I think that's related to the fact that DistSender only updates the cache based on NotLeaseHolderErrors.

@andreimatei
Contributor

If it's true that the cache remains stale, I think that's a bug. You're saying that the DistSender never updates the cache if the leaseholder it knows about is not responding (and also if we never contact it because the conn is not healthy)?

@solongordon
Copy link
Contributor Author

Yes, though I think it's a bit more specific than that. DistSender doesn't update the cache if the cached leaseholder is unavailable and the next replica it tries is the new leaseholder.
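
To illustrate that scenario (and the eviction that #23885 later added), here is a minimal sketch with hypothetical names; it is not DistSender's actual structure, just the shape of the fix: when the cached leaseholder returns an RPC error, evict it before falling through to the next replica.

```go
package main

import (
	"errors"
	"fmt"
)

type replica struct{ storeID int }

// sendToReplicas tries replicas in order. On an RPC error from the cached
// leaseholder it calls evict so the cache does not stay stale, then falls
// through to the next replica.
func sendToReplicas(
	replicas []replica,
	cachedLeaseholder int,
	send func(replica) error,
	evict func(storeID int),
) error {
	for _, r := range replicas {
		if err := send(r); err != nil {
			if r.storeID == cachedLeaseholder {
				// The piece that was missing: drop the stale cache entry.
				evict(r.storeID)
			}
			continue // try the next replica
		}
		return nil // success
	}
	return errors.New("sending to all replicas failed")
}

func main() {
	replicas := []replica{{storeID: 1}, {storeID: 2}, {storeID: 3}}
	down := map[int]bool{1: true} // s1 is the cached leaseholder but unreachable
	err := sendToReplicas(replicas, 1,
		func(r replica) error {
			if down[r.storeID] {
				return errors.New("rpc error: connection refused")
			}
			return nil
		},
		func(storeID int) { fmt.Printf("evicting cached leaseholder s%d\n", storeID) },
	)
	fmt.Println("send error:", err) // nil: s2 served the request
}
```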

tbg added a commit to tbg/cockroach that referenced this issue Mar 15, 2018
The recent PR cockroachdb#22658 introduced a regression in
`(*rpcContext).ConnHealth` which caused DistSQL to continue planning on
unavailable nodes for about an hour (`ttlNodeDescriptorGossip`) if the
leaseholder cache happened to not be updated by other non-DistSQL
requests.

Instead, consult node liveness and avoid planning on dead nodes. This
reduces the problem to a <10s window. The defunct `ConnHealth` mechanism
still protects against planning in some cases (supposedly due to a
once-per-second reconnection policy) and is retained for that reason,
with issue cockroachdb#23829 filed to decide its future.

NB: I'm not putting a release note since this was introduced after 1.1.
We released it in a beta, though, so it may be worth calling out there.

Touches cockroachdb#23601. (Not fixing it because this issue should only close
when there's a roachtest).

Release note (bug fix): NB: this fixes a regression introduced in
2.0-beta, and not present in 1.1: Avoid planning DistSQL queries against
unavailable nodes.
@solongordon solongordon removed their assignment Mar 15, 2018
craig bot added a commit that referenced this issue Apr 6, 2018
23916: backport-2.0: distsql: consult liveness during physical planning r=tschottdorf a=tschottdorf

Backport 1/1 commits from #23834.

/cc @cockroachdb/release

---

The recent PR #22658 introduced a regression in
`(*rpcContext).ConnHealth` which caused DistSQL to continue planning on
unavailable nodes for about an hour (`ttlNodeDescriptorGossip`) if the
leaseholder cache happened to not be updated by other non-DistSQL
requests.

Instead, consult node liveness and avoid planning on dead nodes. This
reduces the problem to a <10s window. The defunct `ConnHealth` mechanism
still protects against planning in some cases (supposedly due to a
once-per-second reconnection policy) and is retained for that reason,
with issue #23829 filed to decide its future.

NB: I'm not putting a release note since this was introduced after 1.1.
We released it in a beta, though, so it may be worth calling out there.

Touches #23601. (Not fixing it because this issue should only close
when there's a roachtest).

Release note (bug fix): NB: this fixes a regression introduced in
2.0-beta, and not present in 1.1: Avoid planning DistSQL queries against
unavailable nodes.
@knz knz added the S-2-temp-unavailability (Temp crashes or other availability problems. Can be worked around or resolved by restarting.), C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), and A-sql-execution (Relating to SQL execution.) labels May 14, 2018
@jordanlewis
Member

Is this still an issue? Should we revisit after #23834?

@tbg tbg removed their assignment Jul 24, 2018
@solongordon
Contributor Author

Looks like this is all set from a DistSQL perspective thanks to #23834. However, the stale leaseholder cache fix (#23885) never got merged, which could cause unnecessary performance hits when a node goes down.

Sending back to @tschottdorf in case he wants to wrap that up before closing this issue.

@solongordon solongordon removed the A-sql-execution (Relating to SQL execution.) label Aug 14, 2018
@solongordon solongordon assigned tbg and unassigned solongordon Aug 14, 2018
@solongordon solongordon added the C-performance (Perf of queries or internals. Solution not expected to change functional behavior.) label and removed the S-2-temp-unavailability (Temp crashes or other availability problems. Can be worked around or resolved by restarting.) and C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) labels Aug 14, 2018
tbg added a commit to tbg/cockroach that referenced this issue Aug 16, 2018
This addresses a situation in which we would not evict a stale
leaseholder for a long time. Consider the replicas [s1,s2,s3] and
s1 is down but is the cached leaseholder, while s2 is the actual lease
holder. The RPC layer will try s1, get an RPC error, try s2 and succeed.
Since there is no NotLeaseHolderError involved, the cache would not get
updated, and so every request pays the overhead of trying s1 first.

Fixes cockroachdb#23601.

Release note: None
craig bot pushed a commit that referenced this issue Aug 16, 2018
23885: kv: evict leaseholder on RPC error r=solongordon a=tschottdorf

This addresses a situation in which we would not evict a stale
leaseholder for a long time. Consider the replicas [s1,s2,s3] and
s1 is down but is the cached leaseholder, while s2 is the actual lease
holder. The RPC layer will try s1, get an RPC error, try s2 and succeed.
Since there is no NotLeaseHolderError involved, the cache would not get
updated, and so every request pays the overhead of trying s1 first.

WIP because needs testing.

Touches #23601.

Release note (bug fix): Improve request routing during node outages.

28609: opt: Make additional perf improvements r=andy-kimball a=andy-kimball

Make several more fixes:

1. Do not qualify column names in metadata, since that
requires expensive string formatting up-front (also cleanup
the factoring of this code, which had gotten messy).
2. Inline Metadata into Memo.
3. Inline logicalPropsBuilder into the Memo.

Together, these changes improve KV perf from:
```
Phases/kv-read/OptBuild  18.4µs ± 1%
```
to:
```
Phases/kv-read/OptBuild  17.8µs ± 1%
```

28661: storage: don't include RHS data in merge trigger r=bdarnell,tschottdorf a=benesch

Now that we require ranges to be collocated during a merge and the RHS
replicas to be up-to-date before the merge commits, we no longer need to
include a snapshot of the RHS in the merge trigger. We know that the
copy of the data that already exists in the local store is perfectly
up-to-date.

So, stop sending the data in the merge trigger.

Release note: None

28689: sqlbase: avoid using SERIAL in system tables r=knz a=knz

Needed for  #28575.

We'll soon want special behavior for SERIAL. We can't afford the
definition of system tables to be subject to a discussion about what
SERIAL means. So this patch ensures system tables don't use SERIAL.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>
Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>