-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: excessive not lease holder errors running tpcc #23543
Comments
The caching in DistSender is not very smart. We're pretty quick to invalidate the cache if we get any error (especially a timeout). So if the cluster is just barely able to handle its load, we might get a timeout, invalidate the cache, and then the next query would have a good chance of a NotLeaseHolderError. I don't think we're setting short deadlines on the TPCC requests themselves, but we do have short deadlines on some internal operations such as liveness updates. |
The not leaseholder errors seem to have decreased over time as the cluster has remained up. They are now down in the 1-2/sec range. So perhaps my efforts to populate the leaseholder cache were failing. |
What were these efforts exactly? |
Running |
@petermattis what's the appropriate (i.e. smallest) setting in which you think I could repro this? |
@tschottdorf I don't know. I'm going to set up a 3-node cluster for tpcc-1k tonight. I'll point you at it if it repros there. I don't recall seeing this problem at tpcc-2k or tpcc-5k, but I also wasn't paying attention. |
I set up a TPCC-1 3-node cluster and ran the following on all nodes before applying load:
After doing so, I turned on the loadgen. I saw only one not leaseholder error spike, which amounted to about 8 errors. After this point, I did not see a single error. One thing I'm curious about is how many nodes were in the cluster that saw this initial spike. @a-robinson, would you expect rebalancing to be less stable on a larger cluster? Also, doing this experiment brought up an interesting problem. When I tried it initially on a TPCC-5k cluster, it took about half an hour because of the large amount of data that each |
I saw this on a 24-node cluster. Testing tpcc-2k last night on an 8-node cluster also showed a spike, but it went away quickly.
Agreed. |
Moving to 2.1 as I don't think there is anything to do here for 2.0. |
No, I wouldn't expect it to be any less stable. I wish I had written something down about it, but I half-remember looking at @petermattis's roachprod cluster after he complained about this on slack and not seeing any rebalancing activity.
Agreed given where we are in the release cycle, but we should be more actively updating the leaseholder cache when a replica successfully serves a request, as alluded to by the discussion on #23601. |
@a-robinson Didn't you fix some bugs related to leaseholder errors? Can this issue be closed? |
I did fix a couple problems, although I also found a problem that it looks like I never opened an issue for that can cause large spikes of the errors. During a lease transfer, there can be a period of time in which the old leaseholder redirects requests to the new leaseholder, but the new leaseholder hasn't applied the lease yet and thus redirects requests to the old leaseholder. This can cause requests to ping pong back and forth from the dist sender to one replica then the other, over and over and over. When network latencies are low, this can lead to thousands of not-leaseholder errors. I managed to grab these verbose logs from logspy while looking into the spikes: https://gist.github.com/a-robinson/eb931e4987219060f3107e6bd371c625 Here's a brief excerpt demonstrating the problem:
The spike of errors: Proof of a lease transfer happening right around that time on that range: |
Yeah, this is exactly what I saw in #22837. We'll want to do something about this sometime soon. |
Alright, I'm going to close this in favor of #22837 then. Thanks @nvanbenschoten. |
29692: ui: various glue fixes r=vilterp,couchand a=benesch This PR restores the "no UI installed" message in short binaries: ![image](https://user-images.githubusercontent.com/882976/45196553-e713b880-b22a-11e8-928a-06c7a2da0f63.png) I came across a few minor nits that seemed worth fixing too. 30135: sql: add '2.0' setting for distsql r=jordanlewis a=jordanlewis The 2.0 setting for distsql (both a cluster setting and a session setting) instructs the executor to use the 2.0 method of determining how to execute a query: the query runs via local sql unless the query is both distributable and recommended to be distributed, in which case it runs via the distsql and is actually distributed. Release note (sql change): add the '2.0' value for both the distsql session setting and the sql.defaults.distsql cluster setting, which instructs the database to use the 2.0 'auto' behavior for determining whether queries run via distsql or not. 30148: storage: add new metrics for the RaftEntryCache r=nvanbenschoten a=nvanbenschoten Four new metrics are introduced: - `raft.entrycache.bytes` - `raft.entrycache.size` - `raft.entrycache.accesses` - `raft.entrycache.hits` 30163: kv: Don't evict from leaseholder cache on context cancellations r=a-robinson a=a-robinson This was a major contributor to the hundreds of NotLeaseHolderErrors per second that we see whenever we run tpc-c at scale. A non-essential batch request like a QueryTxn would get cancelled, causing the range to be evicted from the leaseholder cache and the next request to that range to have to guess at the leaseholder. This is effectively an extension of #26764 that we should have thought to inspect more closely at the time. Actually fixes #23543, which was not fully fixed before. Although I still haven't seen the errors drop all the way to 0, so I'm letting tpc-c 10k continue to run for a while longer to verify that they do. They are continuing to decrease about 15 minutes in. I don't think getting to 0 will be possible because there are still occasional splits and lease transfers), but it looks like it should be able to get down to single digit errors per second from the hundreds it was at before this change. Also, avoid doing unnecessary sorting by latency of replicas in the dist_sender in the common case when we know who the leaseholder is and plan on sending our request there. 30197: sql/parser: fix the action for empty rules r=knz a=knz Fixes #30141. Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com> Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com> Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
When starting up
tpcc
we see a spike in not lease holder errors that eventually settles down to a steady stream that never completely goes away. This was noticed on a cluster running tpcc-10k. I'm not sure if it happens on smaller tpcc datasets. There might be a simple explanation for this (e.g. the cluster has 54k ranges). My simple attempts to ensure that the leaseholder cache is fully populated on each node in the cluster were unable to completely eliminate the errors.The text was updated successfully, but these errors were encountered: