-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
txnwait: increase TxnLivenessThreshold significantly, reduce txn aborts under load #36748
txnwait: increase TxnLivenessThreshold significantly, reduce txn aborts under load #36748
Conversation
The changes in cockroachdb#33396 made it so that a `HeartbeatTxnRequest` that finds a missing transaction record will attempt to create the record. This means that it will discover if a transaction is not "committable" and return a TransactionAbortedError. Unfortunately, after a transaction commits and GCs its transaction record, it will also be considered not "committable". There will always be cases where a heartbeat request races with an EndTransaction request and incorrectly considers the transaction aborted (which is touched upon in a TODO a few lines down). However, in many of these cases, the coordinator already knows that the transaction is committed, so it doesn't need to attempt to roll back the transaction and clean up its intents. This commit checks for these cases and avoids sending useless rollbacks. The intention is to backport this to 19.1. Release note: None
1410680
to
385e082
Compare
This change increases the duration between transaction heartbeats before a transaction is considered expired from 2 seconds to 5 seconds. This has been found to dramatically reduce the frequency of transaction aborts when a cluster is under significant load. This is not expected to noticeably hurt cluster availability in the presence of dead nodes because we already have availability loss on the order of 9 seconds due to the epoch-based lease duration. This is especially important now that the hack in cockroachdb#25034 is gone. That hack was hiding some of this badness and giving transactions a bit more room to avoid being aborted. Release note: None
385e082
to
9147c58
Compare
@tbg mind weighing in on this? I'd like to get this baking on master as soon as we feel comfortable. The PR itself is almost all changes to tests outside of switching the constant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, 4 of 4 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/txn_interceptor_heartbeater.go, line 396 at r1 (raw file):
// If the txn is no longer pending, ignore the result of the heartbeat. if h.mu.txn.Status != roachpb.PENDING {
That's new, right? Just a cleanup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/txn_interceptor_heartbeater.go, line 396 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
That's new, right? Just a cleanup?
Yeah, this is just a cleanup. The commit message for it is pretty detailed.
36748: txnwait: increase TxnLivenessThreshold significantly, reduce txn aborts under load r=nvanbenschoten a=nvanbenschoten This change increases the duration between transaction heartbeats before a transaction is considered expired from 2 seconds to 5 seconds. This has been found to dramatically reduce the frequency of transaction aborts when a cluster is under significant load. This is not expected to noticeably hurt cluster availability in the presence of dead nodes because we already have availability loss on the order of 9 seconds due to the epoch-based lease duration. This is especially important now that the hack in #25034 is gone. That hack was hiding some of this badness and giving transactions a bit more room to avoid being aborted. Here's some testing with TPC-C 2300 on AWS with 3 `c5d.4xlarge` machines. I first ran with the the 2 second TxnLivenessThreshold and then with the 5 second: <img width="872" alt="Screen Shot 2019-04-10 at 4 51 28 PM" src="https://user-images.githubusercontent.com/5438456/55926033-ae332c00-5bdd-11e9-8d6d-184ba1d5ffcd.png"> The top graph is transaction aborts and the bottom graph is p99 service latency. During my tests I actually saw an even bigger difference most of the time. In this trial, leases were still moving around because I didn't use a ramp period (which skews the txn aborts when it stops). I ran another 15 minute run again half an hour later once the leases had balanced (again with a 5 second TxnLivenessThreshold) and saw the latencies I was more accustomed to seeing with the change: <img width="869" alt="Screen Shot 2019-04-10 at 5 59 19 PM" src="https://user-images.githubusercontent.com/5438456/55926104-09651e80-5bde-11e9-9540-9d7c697a9181.png"> cc. @danhhz, who picked this up back in January when monitoring the performance of an alpha release. I dug into the changes made in 8143b45 that seemed to be causing issues with transaction aborts but never developed a good theory for what could be causing them and had trouble reliably reproducing them. Little did I know that it wasn't what was added in 8143b45 that made the difference, it was what [was removed](8143b45#diff-3c31a6497771c33db423a4cead092979L130). The first commit is from #36729 and seems like a generally good change to make. There's no reason we should be sending async aborts if we already know that the transaction was finalized. This was probably also made a little worse by 8143b45 because we now check to see whether a transaction is commitable when its record is missing when evaluating a `HeartbeatTxn` requests. Before we would simply return a `TransactionNotFoundStatusError`, which was ignored. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Build succeeded |
37199: storage: propagate errors from contentionQueue, catch stalls in roachtest r=nvanbenschoten a=nvanbenschoten Informs #36089. The PR is split into a series of commits. The first fixes part of a bug that was causing #36089 to fail (thanks to #36748) and the second improves the test to have a more obvious failure condition for this class of bug in the future. The third, fifth, and sixth clean up code. Finally, the fourth fixes another bug that could cause issues with #36089. Before the first commit, requests could get stuck repeatedly attempting to push a transaction only to repeatedly find that they themselves were already aborted. The error would not propagate up to the transaction coordinator and the request would get stuck. This commit fixes this behavior by correctly propagating errors observed by the `contentionQueue`. The second commit bumps the TxnLivenessThreshold for clusters running `kv/contention/nodes=4` to 10 minutes. This is sufficiently large such that if at any point a transaction is abandoned then all other transactions will begin waiting for it, throughput will drop to 0 for 10 straight minutes, and the test will fail to achieve its minimum QPS requirement. The fourth commit instructs pushers in the `txnwait.Queue` to inform all other pushers that are waiting for the same transaction when it observes an ABORTED transaction. I never saw this cause issues with #36089, but it seems very possible that it could given frequent tscache rotations. 38397: sql: deflake TestLogic//crdb_internal/max_retry_counter r=knz a=knz Fixes #38062. Release note: None 38654: exec: Handle NULLS in TopK sorter r=rohany a=rohany This commit fixes NULLs in the TopK sorter by avoiding use of the vec copy method, which has a bug. Instead, we add a set method to the vec comparator, and use the templatized comparator to perform the sets that the TopK sorter needs. To facilitate this, we add an UnsetNull method to the Nulls object. However, use of this method results in HasNull() maybe returning true even if the vector doesn't have nulls. This behavior already occurs when selection vectors are used. Based on discussions with @solongordon and @asubiotto, this behavior is OK, and future PR's will attempt to make this behavior better, and address the bugs within the Vec Copy method. 38725: cli/dump: more clearly inform the user upon tables with no visible columns r=knz a=knz Informs #37768. Informs #28948. This is coming up quite often on support, lately again on gitter and forum https://forum.cockroachlabs.com/t/error-while-dumping-core-backup/2901/3. This PR aims to lessen the burden on support and propose a clear "next action" for the user. Before: ``` kena@kenax ~/cockroach % ./cockroach dump --insecure defaultdb CREATE TABLE t (, FAMILY "primary" (rowid) ); Error: pq: at or near "from": syntax error Failed running "dump" ``` After: ``` kena@kenax ~/cockroach % ./cockroach dump --insecure defaultdb CREATE TABLE t (, FAMILY "primary" (rowid) ); Error: table "defaultdb.public.t" has no visible columns HINT: To proceed with the dump, either omit this table from the list of tables to dump, drop the table, or add some visible columns. -- See: #37768 Failed running "dump" ``` Release note (cli change): `cockroach dump` will now more clearly refer to issue #37768 when it encounters a table with no visible columns, which (currently) cannot be dumped successfully. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com> Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
This change increases the duration between transaction heartbeats before a transaction is considered expired from 2 seconds to 5 seconds. This has been found to dramatically reduce the frequency of transaction aborts when a cluster is under significant load. This is not expected to noticeably hurt cluster availability in the presence of dead nodes because we already have availability loss on the order of 9 seconds due to the epoch-based lease duration.
This is especially important now that the hack in #25034 is gone. That hack was hiding some of this badness and giving transactions a bit more room to avoid being aborted.
Here's some testing with TPC-C 2300 on AWS with 3
c5d.4xlarge
machines. I first ran with the the 2 second TxnLivenessThreshold and then with the 5 second:The top graph is transaction aborts and the bottom graph is p99 service latency. During my tests I actually saw an even bigger difference most of the time. In this trial, leases were still moving around because I didn't use a ramp period (which skews the txn aborts when it stops). I ran another 15 minute run again half an hour later once the leases had balanced (again with a 5 second TxnLivenessThreshold) and saw the latencies I was more accustomed to seeing with the change:
cc. @danhhz, who picked this up back in January when monitoring the performance of an alpha release. I dug into the changes made in 8143b45 that seemed to be causing issues with transaction aborts but never developed a good theory for what could be causing them and had trouble reliably reproducing them. Little did I know that it wasn't what was added in 8143b45 that made the difference, it was what was removed.
The first commit is from #36729 and seems like a generally good change to make. There's no reason we should be sending async aborts if we already know that the transaction was finalized. This was probably also made a little worse by 8143b45 because we now check to see whether a transaction is commitable when its record is missing when evaluating a
HeartbeatTxn
requests. Before we would simply return aTransactionNotFoundStatusError
, which was ignored.