sql: TransactionStatusError: transaction deadline exceeded #18684

jordanlewis · 2017-09-21T21:07:28Z

When running TPCC with more than a single thread, occasionally a transaction will fail with the following error:

pq: TransactionStatusError: transaction deadline exceeded

What is this about? What is the referenced deadline? It seems like a bug that this message is propagated to the user, but what internal condition is causing this scenario and why?

Any ideas? cc @andreimatei @bdarnell

The text was updated successfully, but these errors were encountered:

bdarnell · 2017-09-21T21:09:18Z

I believe that message means a table descriptor lease has expired. cc @vivekmenezes @LEGO

lgo · 2017-09-21T23:54:51Z

@jordanlewis how frequent is the error? If it is frequent enough it will hopefully be easy to test whether the lease expiring and re-acquisition is the issue since I have an in-flight PR right now (relevant PR #18606, issue #17227).

That is peculiar that it happens specifically with more than one thread. There are a few places in the lease acquisition where I can see race conditions causing starvation but only if the lease is immediately released or maybe under heavy load. This could go either two ways: the above PR fixes it because lease acquisition is taking too long, or it's a race-condition causing a thread to be starved. I would think the former would happen with also one thread though, so it's less likely to be that.

(scratch that on the race conditions, I was misunderstanding parts of the code 😅)

vivekmenezes · 2017-09-22T01:48:22Z

This bug is happening because we are using a lease very close to its expiration deadline (we changed that recently to do that in #16842 by taking out the notion of MinLeaseDuration). I suspect we are seeing this here because a transaction with SNAPSHOT ISOLATION is using a table descriptor with an unexpired lease but is having its COMMIT pushed beyond the lease expiration (which is what the txn expiration is set to).

lego's PR will fix this because it will reacquire a lease way before it expires and thus transactions will use leases with expirations further out into the future (like a minute).

jordanlewis · 2017-09-22T02:01:22Z

All transactions in TPCC are set to Serializable isolation.

vivekmenezes · 2017-09-22T02:18:14Z

if you can log the txn.OrigTimestamp() and expiration in getTableVersion() after the call to AcquireByName() and panic instead of returning the error with the deadline timestamp and the txn timestamp we might get a better clue. Thanks!

vivekmenezes · 2017-09-22T02:24:25Z

of course there is also the possibility that we should not be using OrigTimestamp() when acquiring leases. I think I need to ask around as to what API to use to get the COMMIT timestamp for a transaction as it's running

vivekmenezes · 2017-09-22T16:48:25Z

Turns out using a timestamp near the deadline, while valid when picked, is bad in general, because the transaction timestamp can be pushed in the background by another transaction and a COMMIT will see the new pushed timestamp (when it reads the transaction record from the store) and return a violation of the deadline.

So #18606 is a good fix for this issue

jordanlewis · 2017-09-22T16:54:25Z

@LEGO it's pretty easy to reproduce that error with TPCC. Perhaps you could try it out on your branch? Instructions:

Start a local cockroach instance
Clone the cockroachdb/loadgen repo, enter the tpcc directory, go build
Load the data with ./tpcc -load
Once the data's loaded, it'll start the load generator automatically. Ctrl-C
Run with high concurrency: ./tpcc -concurrency=128. That usually triggers the above error for me.

andreimatei · 2017-09-22T17:03:48Z

#18606 seems like an improvement to things but ideally we wouldn't rely on some async process to ensure that our transactions can commit.
The general idea, I believe, should be that you can verify if a timestamp is valid or not for a txn (wrt schema changes) whenever you want. So, if some layer is about to say "deadline exceeded", it should also give SQL a chance to say "no, actually, the txn timestamp is still fine because I've verified that the descriptors are used are valid for the timestamp in question". I don't know what the best way to transcribe that into code would be, but we can investigate.

bdarnell · 2017-09-23T18:11:30Z

Prior to #16842, we checked the time left on the lease and would (synchronously) renew it if there was less than a minute left. Why was this removed? Without it it's easy to run into this error because there's no guarantee that there's a useful amount of time left on the lease. This is a regression compared to 1.0 so I'm tagging this for 1.1.

I think we need both MinLeaseDuration and #18606: synchronously renew the lease when it's too close to expiring, and asynchronously renew it somewhat before it reaches that point.

Additionally, to support arbitrarily-long-running transactions, we would need some way for the renewed lease to be fed into the coordinator (and from there probably written to the transaction record via HeartbeatTxn). That's much lower priority, though.

andreimatei · 2017-09-23T23:06:24Z

First of all, let's agree that these txn deadlines should only generate errors for SNAPSHOT transactions. For SERIALIZABLE txns, the TransactionRetryError caused by the push should take precedence. It is, in my opinion, a bug that this line generating the deadline error:

cockroach/pkg/storage/replica_command.go

Line 769 in 57d142d

if isEndTransactionExceedingDeadline(reply.Txn.Timestamp, *args) {

is above the lines just below that would generate the TransactionRetryError. I think this is the first thing that we should fix; then all the discussion remaining would be restricted to SNAPSHOT txns.
Second of all, I think we should turn these deadline errors into retriable errors so that SQL retries.

Third, I'd like to respond to Ben's message:

Without it it's easy to run into this error because there's no guarantee that there's a useful amount of time left on the lease. This is a regression compared to 1.0 so I'm tagging this for 1.1.

What you say is true, but just in case we're not all on the same page, note that this deadline does not prevent "arbitrarily-long-running transactions". It just prevents arbitrarily-big-pushes. You can have a transaction run arbitrarily long without changing its timestamp (and the deadline is checked against the txn timestamp, not the clock).

So I don't think that what we want to do is synchronously renew leases (or, rather, I'd ask you when do you renew the lease - when it's close to expiring or when the txn in question has been pushed close to the deadline, or even beyond it?). In all this discussion, keep in mind that the client may not even know that the transaction has been pushed.

Another thing I want to note is that, for a transaction to be able to commit at a particular timestamp, it doesn't necessarily need for the gateway to have valid leases on the descriptors that it used. Remember from the newer version of the descriptors and leases RFC that the conditions sufficient for a transaction to commit are divorced from the leases; they're about the validity window of the descriptors it used.

How about this: we augment the client.Txn interface with a callback through which KV can ask whomever has set the txn deadline (i.e. SQL) about whether the txn can commit at a timestamp T_pushed beyond the originally-set deadline. SQL would set this callback to check whether T_pushed is within the validity window of all the descriptors; it would do this by looking at the descriptor versions that the node knows about and, where it doesn't have enough information, it'd read the database. Reading the database may, as a byproduct, also renew a lease, but that's not essential.
The callback would be called by txn.Send() whenever it gets a "deadline exceeded" error. We'd rely on these errors only being generated by EndTransaction and so, if the callback says that we're still golden at the new timestamp, then client.Txn would re-issue the EndTransaction (just this, no other requests from the last batch) with an updated deadline.

Otherwise, what you're hinting at around renewing leases in the background and updating txn deadlines in the background sounds good - it'd prevent the "deadline error" from happening in most cases. But note that a lease expiration is not trivially translatable into a txn deadline: the deadline for a txn is set based on a set of descriptors not just one (and also, again, based on validity windows which may not be related to leases).

bdarnell · 2017-09-25T13:05:49Z

First of all, let's agree that these txn deadlines should only generate errors for SNAPSHOT transactions.

Agreed.

So I don't think that what we want to do is synchronously renew leases (or, rather, I'd ask you when do you renew the lease - when it's close to expiring or when the txn in question has been pushed close to the deadline, or even beyond it?)

I'd only synchronously get the lease when it is first accessed. After that we can asynchronously renew before expiration for as long as the transaction's open.

The callback would be called by txn.Send() whenever it gets a "deadline exceeded" error. We'd rely on these errors only being generated by EndTransaction and so, if the callback says that we're still golden at the new timestamp, then client.Txn would re-issue the EndTransaction (just this, no other requests from the last batch) with an updated deadline.

I don't like the idea of retrying after an EndTransaction. Per comments in the code, we'd like to formally abort the transaction when this happens (but this is currently prevented by other issues). If we did that, we wouldn't be able to retry. Maybe we just change the comments and guarantee that attempting to commit with a failed deadline is safe, but this seems subtle to me.

andreimatei · 2017-09-26T16:55:54Z

I don't like the idea of retrying after an EndTransaction. Per comments in the code, we'd like to formally abort the transaction when this happens (but this is currently prevented by other issues). If we did that, we wouldn't be able to retry. Maybe we just change the comments and guarantee that attempting to commit with a failed deadline is safe, but this seems subtle to me.

The comments you're referring to were written (presumably) under the belief that what we want to do when a transaction's deadline was exceeded is to abort the transaction. I don't think they should be used to inform decisions in the situation where we decide that we, in fact, do not want to abort.
To take this further, I in fact believe that we never want to abort; in the worst case, the transaction should be prepared for a retry (increment epoch, leave intents in place).

lgo · 2017-09-27T17:01:32Z

Merged #18824 which will help resolve the immediate issue at hand, the TransactionStatusError error. Do we want to close this issue and open another regarding the other problem around serializable restarts?

andreimatei · 2017-09-27T17:26:11Z

This issue has not been resolved by #18824; your PR has only reduced the incidence of the problem (a txn pushed sufficiently will still encounter the same error; you've just made it so that the required push is now expected to be larger than before).
For Serializable txns, I've sent a PR that makes them never encounter this error; it's linked above.

We can remove the 1.1 milestone, but otherwise the issue is still valid.
But, it seems to me like the patch in question still needs some work. I've left some comments on the original PR. If I'm right, I think we should rollback the cherry-pick until the patch gets a bit more love.

…iration Before this patch, if a Serializable txn was pushed and the push caused the txn to exceed its deadline, the client would get a "deadline exceeded" error back. What the client should have gotten is the retriable error, as that would allow it to retry. This patch fixes this by inverting the order of error generation. As a result, Serializable txns shouldn't see "deadline exceeded" errors any more. Touches cockroachdb#18684

vivekmenezes · 2018-11-01T14:34:03Z

One plan here is to always set the deadline at commit time so that the transaction uses the deadline of a later refreshed lease with an updated (refreshed) deadline that is more likely to succeed. This can be achieved by ensuring that the transaction Commit API always uses a deadline instead of making the deadline a property of the transaction itself. At commit time the sql layer can call the API with the latest refreshed deadline for all the descriptors used in the transaction.

The above will also get rid of needing to pass the transaction as a flag to the schema resolver API.

danhhz · 2019-02-13T20:57:32Z

FWIW, I just hit this in a local run of the tpcc/nodes=3/w=max roachtest running a binary at sha 72edf20.

vivekmenezes · 2019-02-13T20:59:55Z

@danhhz any chance you can add the error message because it now provides more information. Thanks

danhhz · 2019-02-13T21:07:39Z

Yup. I actually saw this twice. Here are both:

error in payment: ERROR: TransactionStatusError: transaction deadline exceeded by 1.653254848s (1550089726.208602717,1 > 1550089724.555347869,0), original timestamp 1m15.162854897s ago (1550089651.045747820,0): id=cf75031c key=/Table/56/1/428/0 rw=true pri=0.01464384 stat=PENDING epo=0 ts=1550089726.208602717,1 orig=1550089651.045747820,0 max=1550089651.045747820,0 wto=false seq=7 (REASON_UNKNOWN) (SQLSTATE XX000)

error in payment: ERROR: TransactionStatusError: transaction deadline exceeded by 7.360572099s (1550089138.588048773,1 > 1550089131.227476674,0), original timestamp 2m23.181141999s ago (1550088995.406906774,0): id=26d9e30d key=/Table/61/1/414/0 rw=true pri=0.07253801 stat=PENDING epo=0 ts=1550089138.588048773,1 orig=1550088995.406906774,0 max=1550088995.407374179,0 wto=false seq=7 (REASON_UNKNOWN) (SQLSTATE XX000)

vivekmenezes · 2019-02-13T21:26:58Z

thanks. This surely indicates to me that I need to first extend the deadline (reducing the chance of this error) before making this a retryable error.

Before this patch, they were opaque TransactionStatusErrors. The belief is that we should only be seeing such errors when a transaction is pushed by minutes. Shockingly, this seems to hapen enough in our tests, for example as described here: cockroachdb#18684 (comment) This patch marks the error as retriable, since it technically is. Touches cockroachdb#18684 Release note (sql change): "transaction deadline exceeded" errors are now returned to the client with a retriable code.

Before this patch, they were opaque TransactionStatusErrors. The belief is that we should only be seeing such errors when a transaction is pushed by minutes. Shockingly, this seems to hapen enough in our tests, for example as described here: cockroachdb#18684 (comment) This patch marks the error as retriable, since it technically is. This patch also changes the semantics of the EndTransactionRequest.Deadline field to make it exclusive so that it matches the nature of SQL leases. No migration needed. Touches cockroachdb#18684 Release note (sql change): "transaction deadline exceeded" errors are now returned to the client with a retriable code.

35284: storage,kv: make transaction deadline exceeded errors retriable r=andreimatei a=andreimatei Before this patch, they were opaque TransactionStatusErrors. The belief is that we should only be seeing such errors when a transaction is pushed by minutes. Shockingly, this seems to hapen enough in our tests, for example as described here: #18684 (comment) This patch marks the error as retriable, since it technically is. This patch also changes the semantics of the EndTransactionRequest.Deadline field to make it exclusive so that it matches the nature of SQL leases. No migration needed. Touches #18684 Release note (sql change): "transaction deadline exceeded" errors are now returned to the client with a retriable code. 35793: storage: fix TestRangeInfo flake and re-enable follower reads by default r=ajwerner a=ajwerner This PR addresses a test flake introduced by enabling follower reads in conjunction with #35130 which makes follower reads more generally possible in the face of lease transfer. Fixes #35758. Release note: None 35865: roachtest: Skip flaky jepsen nemesis r=tbg a=bdarnell See #35599 Release note: None Co-authored-by: Andrei Matei <andrei@cockroachlabs.com> Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com> Co-authored-by: Ben Darnell <ben@bendarnell.com>

vivekmenezes · 2019-03-18T20:40:39Z

@andreimatei I've confirmed that the tpcc tests in the presence of schema changes where schema versions are changes will retry transactions when they see this error. I think we should close this issue.

andreimatei · 2019-03-18T21:09:13Z

Well there's a lot of history in here about how we shouldn't be pushing transactions over their deadline
#18684 (comment)

and also an analysis of very long transactions: #18684 (comment)

If we close this, we should be opening separate issues for those I think.

jordanlewis · 2019-03-18T22:53:01Z

I don't think this issue is closeable until random workloads don't see "transaction deadline exceeded" anymore. #35337 (comment)

vivekmenezes · 2019-04-16T16:39:44Z

@andreimatei I've created an issue for early detection of the deadline exceeded condition. The other long running transaction analysis is no longer applicable.

@jordanlewis we are no longer seeing test flakes

Closing this issue!

jordanlewis · 2019-04-16T16:51:53Z

Could it be true? The troll is slain? Congratulations!

andreimatei · 2019-04-16T17:23:49Z

@nvanbenschoten do you want to extract the intent resolving pathology in #18684 (comment) into a separate issue?

nvanbenschoten · 2019-04-16T17:54:32Z

Pulled into #36876.

jordanlewis added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-performance Perf of queries or internals. Solution not expected to change functional behavior. labels Sep 21, 2017

vivekmenezes assigned lgo Sep 22, 2017

bdarnell added this to the 1.1 milestone Sep 23, 2017

andreimatei mentioned this issue Sep 26, 2017

storage: have serializable restarts take precedence over deadline expiration #18804

Merged

lgo mentioned this issue Sep 27, 2017

cherrypick-1.1: sql: Refresh table leases asynchronously on access #18824

Merged

andreimatei mentioned this issue Sep 27, 2017

cherrypick 1.1: storage: have serializable restarts take precedence over deadline expiration #18852

Merged

vivekmenezes mentioned this issue Nov 19, 2018

roachtest: schemachange/kv failed #32418

Closed

vivekmenezes mentioned this issue Feb 15, 2019

sql: set the deadline at commit time #34939

Closed

andreimatei mentioned this issue Feb 28, 2019

storage,kv: make transaction deadline exceeded errors retriable #35284

Merged

vivekmenezes mentioned this issue Apr 16, 2019

kv: client should look for deadline exceeded error not waiting until commit time #36871

Closed

vivekmenezes closed this as completed Apr 16, 2019

vivekmenezes mentioned this issue Apr 16, 2019

Sysbench test cases "tpcc" and "bulk_insert" against CockroachDB failed #35934

Closed

nvanbenschoten mentioned this issue Apr 16, 2019

storage: replayed large transactions must serially resolve each intent #36876

Closed

nvanbenschoten mentioned this issue Aug 31, 2021

sql: bulk UPDATE statement in implicit txn will retry indefinitely due to txn commit deadline errors #69089

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: TransactionStatusError: transaction deadline exceeded #18684

sql: TransactionStatusError: transaction deadline exceeded #18684

jordanlewis commented Sep 21, 2017

bdarnell commented Sep 21, 2017

lgo commented Sep 21, 2017 •

edited

Loading

vivekmenezes commented Sep 22, 2017

jordanlewis commented Sep 22, 2017

vivekmenezes commented Sep 22, 2017

vivekmenezes commented Sep 22, 2017 •

edited

Loading

vivekmenezes commented Sep 22, 2017

jordanlewis commented Sep 22, 2017

andreimatei commented Sep 22, 2017

bdarnell commented Sep 23, 2017

andreimatei commented Sep 23, 2017

bdarnell commented Sep 25, 2017

andreimatei commented Sep 26, 2017

lgo commented Sep 27, 2017

andreimatei commented Sep 27, 2017

vivekmenezes commented Nov 1, 2018

danhhz commented Feb 13, 2019

vivekmenezes commented Feb 13, 2019

danhhz commented Feb 13, 2019

vivekmenezes commented Feb 13, 2019

vivekmenezes commented Mar 18, 2019

andreimatei commented Mar 18, 2019

jordanlewis commented Mar 18, 2019

vivekmenezes commented Apr 16, 2019

jordanlewis commented Apr 16, 2019

andreimatei commented Apr 16, 2019

nvanbenschoten commented Apr 16, 2019

sql: TransactionStatusError: transaction deadline exceeded #18684

sql: TransactionStatusError: transaction deadline exceeded #18684

Comments

jordanlewis commented Sep 21, 2017

bdarnell commented Sep 21, 2017

lgo commented Sep 21, 2017 • edited Loading

vivekmenezes commented Sep 22, 2017

jordanlewis commented Sep 22, 2017

vivekmenezes commented Sep 22, 2017

vivekmenezes commented Sep 22, 2017 • edited Loading

vivekmenezes commented Sep 22, 2017

jordanlewis commented Sep 22, 2017

andreimatei commented Sep 22, 2017

bdarnell commented Sep 23, 2017

andreimatei commented Sep 23, 2017

bdarnell commented Sep 25, 2017

andreimatei commented Sep 26, 2017

lgo commented Sep 27, 2017

andreimatei commented Sep 27, 2017

vivekmenezes commented Nov 1, 2018

danhhz commented Feb 13, 2019

vivekmenezes commented Feb 13, 2019

danhhz commented Feb 13, 2019

vivekmenezes commented Feb 13, 2019

vivekmenezes commented Mar 18, 2019

andreimatei commented Mar 18, 2019

jordanlewis commented Mar 18, 2019

vivekmenezes commented Apr 16, 2019

jordanlewis commented Apr 16, 2019

andreimatei commented Apr 16, 2019

nvanbenschoten commented Apr 16, 2019

lgo commented Sep 21, 2017 •

edited

Loading

vivekmenezes commented Sep 22, 2017 •

edited

Loading