Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: TransactionStatusError: transaction deadline exceeded #18684

Closed
jordanlewis opened this issue Sep 21, 2017 · 107 comments
Closed

sql: TransactionStatusError: transaction deadline exceeded #18684

jordanlewis opened this issue Sep 21, 2017 · 107 comments
Assignees
Labels
A-schema-descriptors Relating to SQL table/db descriptor handling. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-performance Perf of queries or internals. Solution not expected to change functional behavior.
Milestone

Comments

@jordanlewis
Copy link
Member

When running TPCC with more than a single thread, occasionally a transaction will fail with the following error:

pq: TransactionStatusError: transaction deadline exceeded

What is this about? What is the referenced deadline? It seems like a bug that this message is propagated to the user, but what internal condition is causing this scenario and why?

Any ideas? cc @andreimatei @bdarnell

@jordanlewis jordanlewis added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-performance Perf of queries or internals. Solution not expected to change functional behavior. labels Sep 21, 2017
@bdarnell
Copy link
Contributor

I believe that message means a table descriptor lease has expired. cc @vivekmenezes @LEGO

@lgo
Copy link
Contributor

lgo commented Sep 21, 2017

@jordanlewis how frequent is the error? If it is frequent enough it will hopefully be easy to test whether the lease expiring and re-acquisition is the issue since I have an in-flight PR right now (relevant PR #18606, issue #17227).

That is peculiar that it happens specifically with more than one thread. There are a few places in the lease acquisition where I can see race conditions causing starvation but only if the lease is immediately released or maybe under heavy load. This could go either two ways: the above PR fixes it because lease acquisition is taking too long, or it's a race-condition causing a thread to be starved. I would think the former would happen with also one thread though, so it's less likely to be that.

(scratch that on the race conditions, I was misunderstanding parts of the code 😅)

@vivekmenezes
Copy link
Contributor

This bug is happening because we are using a lease very close to its expiration deadline (we changed that recently to do that in #16842 by taking out the notion of MinLeaseDuration). I suspect we are seeing this here because a transaction with SNAPSHOT ISOLATION is using a table descriptor with an unexpired lease but is having its COMMIT pushed beyond the lease expiration (which is what the txn expiration is set to).

lego's PR will fix this because it will reacquire a lease way before it expires and thus transactions will use leases with expirations further out into the future (like a minute).

@jordanlewis
Copy link
Member Author

All transactions in TPCC are set to Serializable isolation.

@vivekmenezes
Copy link
Contributor

if you can log the txn.OrigTimestamp() and expiration in getTableVersion() after the call to AcquireByName() and panic instead of returning the error with the deadline timestamp and the txn timestamp we might get a better clue. Thanks!

@vivekmenezes
Copy link
Contributor

vivekmenezes commented Sep 22, 2017

of course there is also the possibility that we should not be using OrigTimestamp() when acquiring leases. I think I need to ask around as to what API to use to get the COMMIT timestamp for a transaction as it's running

@vivekmenezes
Copy link
Contributor

Turns out using a timestamp near the deadline, while valid when picked, is bad in general, because the transaction timestamp can be pushed in the background by another transaction and a COMMIT will see the new pushed timestamp (when it reads the transaction record from the store) and return a violation of the deadline.

So #18606 is a good fix for this issue

@jordanlewis
Copy link
Member Author

@LEGO it's pretty easy to reproduce that error with TPCC. Perhaps you could try it out on your branch? Instructions:

  1. Start a local cockroach instance
  2. Clone the cockroachdb/loadgen repo, enter the tpcc directory, go build
  3. Load the data with ./tpcc -load
  4. Once the data's loaded, it'll start the load generator automatically. Ctrl-C
  5. Run with high concurrency: ./tpcc -concurrency=128. That usually triggers the above error for me.

@andreimatei
Copy link
Contributor

#18606 seems like an improvement to things but ideally we wouldn't rely on some async process to ensure that our transactions can commit.
The general idea, I believe, should be that you can verify if a timestamp is valid or not for a txn (wrt schema changes) whenever you want. So, if some layer is about to say "deadline exceeded", it should also give SQL a chance to say "no, actually, the txn timestamp is still fine because I've verified that the descriptors are used are valid for the timestamp in question". I don't know what the best way to transcribe that into code would be, but we can investigate.

@bdarnell
Copy link
Contributor

Prior to #16842, we checked the time left on the lease and would (synchronously) renew it if there was less than a minute left. Why was this removed? Without it it's easy to run into this error because there's no guarantee that there's a useful amount of time left on the lease. This is a regression compared to 1.0 so I'm tagging this for 1.1.

I think we need both MinLeaseDuration and #18606: synchronously renew the lease when it's too close to expiring, and asynchronously renew it somewhat before it reaches that point.

Additionally, to support arbitrarily-long-running transactions, we would need some way for the renewed lease to be fed into the coordinator (and from there probably written to the transaction record via HeartbeatTxn). That's much lower priority, though.

@bdarnell bdarnell added this to the 1.1 milestone Sep 23, 2017
@andreimatei
Copy link
Contributor

  1. First of all, let's agree that these txn deadlines should only generate errors for SNAPSHOT transactions. For SERIALIZABLE txns, the TransactionRetryError caused by the push should take precedence. It is, in my opinion, a bug that this line generating the deadline error:

    if isEndTransactionExceedingDeadline(reply.Txn.Timestamp, *args) {

    is above the lines just below that would generate the TransactionRetryError. I think this is the first thing that we should fix; then all the discussion remaining would be restricted to SNAPSHOT txns.

  2. Second of all, I think we should turn these deadline errors into retriable errors so that SQL retries.

Third, I'd like to respond to Ben's message:

Without it it's easy to run into this error because there's no guarantee that there's a useful amount of time left on the lease. This is a regression compared to 1.0 so I'm tagging this for 1.1.

What you say is true, but just in case we're not all on the same page, note that this deadline does not prevent "arbitrarily-long-running transactions". It just prevents arbitrarily-big-pushes. You can have a transaction run arbitrarily long without changing its timestamp (and the deadline is checked against the txn timestamp, not the clock).

So I don't think that what we want to do is synchronously renew leases (or, rather, I'd ask you when do you renew the lease - when it's close to expiring or when the txn in question has been pushed close to the deadline, or even beyond it?). In all this discussion, keep in mind that the client may not even know that the transaction has been pushed.

Another thing I want to note is that, for a transaction to be able to commit at a particular timestamp, it doesn't necessarily need for the gateway to have valid leases on the descriptors that it used. Remember from the newer version of the descriptors and leases RFC that the conditions sufficient for a transaction to commit are divorced from the leases; they're about the validity window of the descriptors it used.

How about this: we augment the client.Txn interface with a callback through which KV can ask whomever has set the txn deadline (i.e. SQL) about whether the txn can commit at a timestamp T_pushed beyond the originally-set deadline. SQL would set this callback to check whether T_pushed is within the validity window of all the descriptors; it would do this by looking at the descriptor versions that the node knows about and, where it doesn't have enough information, it'd read the database. Reading the database may, as a byproduct, also renew a lease, but that's not essential.
The callback would be called by txn.Send() whenever it gets a "deadline exceeded" error. We'd rely on these errors only being generated by EndTransaction and so, if the callback says that we're still golden at the new timestamp, then client.Txn would re-issue the EndTransaction (just this, no other requests from the last batch) with an updated deadline.

Otherwise, what you're hinting at around renewing leases in the background and updating txn deadlines in the background sounds good - it'd prevent the "deadline error" from happening in most cases. But note that a lease expiration is not trivially translatable into a txn deadline: the deadline for a txn is set based on a set of descriptors not just one (and also, again, based on validity windows which may not be related to leases).

@bdarnell
Copy link
Contributor

First of all, let's agree that these txn deadlines should only generate errors for SNAPSHOT transactions.

Agreed.

So I don't think that what we want to do is synchronously renew leases (or, rather, I'd ask you when do you renew the lease - when it's close to expiring or when the txn in question has been pushed close to the deadline, or even beyond it?)

I'd only synchronously get the lease when it is first accessed. After that we can asynchronously renew before expiration for as long as the transaction's open.

The callback would be called by txn.Send() whenever it gets a "deadline exceeded" error. We'd rely on these errors only being generated by EndTransaction and so, if the callback says that we're still golden at the new timestamp, then client.Txn would re-issue the EndTransaction (just this, no other requests from the last batch) with an updated deadline.

I don't like the idea of retrying after an EndTransaction. Per comments in the code, we'd like to formally abort the transaction when this happens (but this is currently prevented by other issues). If we did that, we wouldn't be able to retry. Maybe we just change the comments and guarantee that attempting to commit with a failed deadline is safe, but this seems subtle to me.

@andreimatei
Copy link
Contributor

I don't like the idea of retrying after an EndTransaction. Per comments in the code, we'd like to formally abort the transaction when this happens (but this is currently prevented by other issues). If we did that, we wouldn't be able to retry. Maybe we just change the comments and guarantee that attempting to commit with a failed deadline is safe, but this seems subtle to me.

The comments you're referring to were written (presumably) under the belief that what we want to do when a transaction's deadline was exceeded is to abort the transaction. I don't think they should be used to inform decisions in the situation where we decide that we, in fact, do not want to abort.
To take this further, I in fact believe that we never want to abort; in the worst case, the transaction should be prepared for a retry (increment epoch, leave intents in place).

@lgo
Copy link
Contributor

lgo commented Sep 27, 2017

Merged #18824 which will help resolve the immediate issue at hand, the TransactionStatusError error. Do we want to close this issue and open another regarding the other problem around serializable restarts?

@andreimatei
Copy link
Contributor

This issue has not been resolved by #18824; your PR has only reduced the incidence of the problem (a txn pushed sufficiently will still encounter the same error; you've just made it so that the required push is now expected to be larger than before).
For Serializable txns, I've sent a PR that makes them never encounter this error; it's linked above.

We can remove the 1.1 milestone, but otherwise the issue is still valid.
But, it seems to me like the patch in question still needs some work. I've left some comments on the original PR. If I'm right, I think we should rollback the cherry-pick until the patch gets a bit more love.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 27, 2017
…iration

Before this patch, if a Serializable txn was pushed and the push caused
the txn to exceed its deadline, the client would get a "deadline
exceeded" error back. What the client should have gotten is the
retriable error, as that would allow it to retry. This patch fixes this
by inverting the order of error generation. As a result, Serializable
txns shouldn't see "deadline exceeded" errors any more.

Touches cockroachdb#18684
andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 27, 2017
…iration

Before this patch, if a Serializable txn was pushed and the push caused
the txn to exceed its deadline, the client would get a "deadline
exceeded" error back. What the client should have gotten is the
retriable error, as that would allow it to retry. This patch fixes this
by inverting the order of error generation. As a result, Serializable
txns shouldn't see "deadline exceeded" errors any more.

Touches cockroachdb#18684
@vivekmenezes
Copy link
Contributor

One plan here is to always set the deadline at commit time so that the transaction uses the deadline of a later refreshed lease with an updated (refreshed) deadline that is more likely to succeed. This can be achieved by ensuring that the transaction Commit API always uses a deadline instead of making the deadline a property of the transaction itself. At commit time the sql layer can call the API with the latest refreshed deadline for all the descriptors used in the transaction.

The above will also get rid of needing to pass the transaction as a flag to the schema resolver API.

@danhhz
Copy link
Contributor

danhhz commented Feb 13, 2019

FWIW, I just hit this in a local run of the tpcc/nodes=3/w=max roachtest running a binary at sha 72edf20.

@vivekmenezes
Copy link
Contributor

@danhhz any chance you can add the error message because it now provides more information. Thanks

@danhhz
Copy link
Contributor

danhhz commented Feb 13, 2019

Yup. I actually saw this twice. Here are both:

error in payment: ERROR: TransactionStatusError: transaction deadline exceeded by 1.653254848s (1550089726.208602717,1 > 1550089724.555347869,0), original timestamp 1m15.162854897s ago (1550089651.045747820,0): id=cf75031c key=/Table/56/1/428/0 rw=true pri=0.01464384 stat=PENDING epo=0 ts=1550089726.208602717,1 orig=1550089651.045747820,0 max=1550089651.045747820,0 wto=false seq=7 (REASON_UNKNOWN) (SQLSTATE XX000)
error in payment: ERROR: TransactionStatusError: transaction deadline exceeded by 7.360572099s (1550089138.588048773,1 > 1550089131.227476674,0), original timestamp 2m23.181141999s ago (1550088995.406906774,0): id=26d9e30d key=/Table/61/1/414/0 rw=true pri=0.07253801 stat=PENDING epo=0 ts=1550089138.588048773,1 orig=1550088995.406906774,0 max=1550088995.407374179,0 wto=false seq=7 (REASON_UNKNOWN) (SQLSTATE XX000)

@vivekmenezes
Copy link
Contributor

thanks. This surely indicates to me that I need to first extend the deadline (reducing the chance of this error) before making this a retryable error.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Feb 28, 2019
Before this patch, they were opaque TransactionStatusErrors.
The belief is that we should only be seeing such errors when a
transaction is pushed by minutes. Shockingly, this seems to hapen enough
in our tests, for example as described here: cockroachdb#18684 (comment)

This patch marks the error as retriable, since it technically is.

Touches cockroachdb#18684

Release note (sql change): "transaction deadline exceeded" errors are
now returned to the client with a retriable code.
andreimatei added a commit to andreimatei/cockroach that referenced this issue Feb 28, 2019
Before this patch, they were opaque TransactionStatusErrors.
The belief is that we should only be seeing such errors when a
transaction is pushed by minutes. Shockingly, this seems to hapen enough
in our tests, for example as described here: cockroachdb#18684 (comment)

This patch marks the error as retriable, since it technically is.

Touches cockroachdb#18684

Release note (sql change): "transaction deadline exceeded" errors are
now returned to the client with a retriable code.
andreimatei added a commit to andreimatei/cockroach that referenced this issue Mar 17, 2019
Before this patch, they were opaque TransactionStatusErrors.
The belief is that we should only be seeing such errors when a
transaction is pushed by minutes. Shockingly, this seems to hapen enough
in our tests, for example as described here: cockroachdb#18684 (comment)

This patch marks the error as retriable, since it technically is.

This patch also changes the semantics of the
EndTransactionRequest.Deadline field to make it exclusive so that it
matches the nature of SQL leases. No migration needed.

Touches cockroachdb#18684

Release note (sql change): "transaction deadline exceeded" errors are
now returned to the client with a retriable code.
andreimatei added a commit to andreimatei/cockroach that referenced this issue Mar 18, 2019
Before this patch, they were opaque TransactionStatusErrors.
The belief is that we should only be seeing such errors when a
transaction is pushed by minutes. Shockingly, this seems to hapen enough
in our tests, for example as described here: cockroachdb#18684 (comment)

This patch marks the error as retriable, since it technically is.

This patch also changes the semantics of the
EndTransactionRequest.Deadline field to make it exclusive so that it
matches the nature of SQL leases. No migration needed.

Touches cockroachdb#18684

Release note (sql change): "transaction deadline exceeded" errors are
now returned to the client with a retriable code.
craig bot pushed a commit that referenced this issue Mar 18, 2019
35284: storage,kv: make transaction deadline exceeded errors retriable r=andreimatei a=andreimatei

Before this patch, they were opaque TransactionStatusErrors.
The belief is that we should only be seeing such errors when a
transaction is pushed by minutes. Shockingly, this seems to hapen enough
in our tests, for example as described here: #18684 (comment)

This patch marks the error as retriable, since it technically is.

This patch also changes the semantics of the
EndTransactionRequest.Deadline field to make it exclusive so that it
matches the nature of SQL leases. No migration needed.

Touches #18684

Release note (sql change): "transaction deadline exceeded" errors are
now returned to the client with a retriable code.

35793: storage: fix TestRangeInfo flake and re-enable follower reads by default r=ajwerner a=ajwerner

This PR addresses a test flake introduced by enabling follower reads in
conjunction with #35130 which makes follower reads more generally possible
in the face of lease transfer.

Fixes #35758.

Release note: None

35865: roachtest: Skip flaky jepsen nemesis r=tbg a=bdarnell

See #35599

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Ben Darnell <ben@bendarnell.com>
@vivekmenezes
Copy link
Contributor

@andreimatei I've confirmed that the tpcc tests in the presence of schema changes where schema versions are changes will retry transactions when they see this error. I think we should close this issue.

@andreimatei
Copy link
Contributor

Well there's a lot of history in here about how we shouldn't be pushing transactions over their deadline
#18684 (comment)

and also an analysis of very long transactions: #18684 (comment)

If we close this, we should be opening separate issues for those I think.

@jordanlewis
Copy link
Member Author

I don't think this issue is closeable until random workloads don't see "transaction deadline exceeded" anymore. #35337 (comment)

@vivekmenezes
Copy link
Contributor

@andreimatei I've created an issue for early detection of the deadline exceeded condition. The other long running transaction analysis is no longer applicable.

@jordanlewis we are no longer seeing test flakes

Closing this issue!

@jordanlewis
Copy link
Member Author

Could it be true? The troll is slain? Congratulations!

@andreimatei
Copy link
Contributor

@nvanbenschoten do you want to extract the intent resolving pathology in #18684 (comment) into a separate issue?

@nvanbenschoten
Copy link
Member

Pulled into #36876.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-schema-descriptors Relating to SQL table/db descriptor handling. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-performance Perf of queries or internals. Solution not expected to change functional behavior.
Projects
None yet
Development

No branches or pull requests