Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: don't perform one-phase commit transactions after restarts #40518

Merged
merged 5 commits into from
Sep 6, 2019

Conversation

nvanbenschoten
Copy link
Member

Fixes #40466.

This commit adjusts the Replica-side logic around executing one-phase commit transactions. It prevents batches that could be executed on the 1PC fast-path from doing so if they contain transactions in any epoch other than their first epoch. We saw in #40466 that this could lead to intent abandonment if the txn had written intents in earlier epochs.

I believe that this is a new issue that was introduced when we moved from using the presence of a BeginTxn and EndTxn in the same batch to detect 1PC transactions to using the sequence numbers of writes and the presence of an EndTxn in a batch to detect 1PC transactions. That change was necessary for parallel commits. Before that change, I think logic in DistSender was preventing us from hitting this case because it would always split EndTxn requests from the rest of its batch when a txn had been restarted:

// To ensure that we lay down intents to prevent starvation, always
// split the end transaction request into its own batch on retries.
// Txns requiring 1PC are an exception and should never be split.
if ba.Txn != nil && ba.Txn.Epoch > 0 && !require1PC {
splitET = true
}

This was no longer in use, as of 374413f.

Release note: None
Fixes cockroachdb#40466.

This commit adjusts the Replica-side logic around executing one-phase commit
transactions. It prevents batches that could be executed on the 1PC fast-path
from doing so if they contain transactions in any epoch other than their first
epoch. We saw in cockroachdb#40466 that this could lead to intent abandonment if the txn
had written intents in earlier epochs.

I believe that this is a new issue that was introduced when we moved from using
the presence of a BeginTxn and EndTxn in the same batch to detect 1PC transactions
to using the sequence numbers of writes and the presence of an EndTxn in a batch
to detect 1PC transactions. That change was necessary for parallel commits. Before
that change, I think logic in DistSender was preventing us from hitting this case
because it would always split EndTxn requests from the rest of its batch when a
txn had been restarted:
 https://github.com/cockroachdb/cockroach/blob/18bdfe1a691fa6d785d510e86d27cecdac9c436e/pkg/kv/dist_sender.go#L692-L697

Release note: None
This change correctly sets sequence numbers in TestIsOnePhaseCommit
so that we can test the detection of one-phase commit transactions
whose batches do not contain BeginTransaction requests.

Release note: None
@cockroach-teamcity
Copy link
Member

This change is Reviewable

As of cockroachdb#35284, this was already checked by IsEndTransactionTriggeringRetryError.

Release note: None
We can then use this to centralize its logic and avoid leaking it
around 1PC transaction handling.

Release note: None
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, 3 of 3 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, 3 of 3 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@nvanbenschoten
Copy link
Member Author

bors r+

craig bot pushed a commit that referenced this pull request Sep 6, 2019
40518: storage: don't perform one-phase commit transactions after restarts r=nvanbenschoten a=nvanbenschoten

Fixes #40466.

This commit adjusts the Replica-side logic around executing one-phase commit transactions. It prevents batches that could be executed on the 1PC fast-path from doing so if they contain transactions in any epoch other than their first epoch. We saw in #40466 that this could lead to intent abandonment if the txn had written intents in earlier epochs.

I believe that this is a new issue that was introduced when we moved from using the presence of a BeginTxn and EndTxn in the same batch to detect 1PC transactions to using the sequence numbers of writes and the presence of an EndTxn in a batch to detect 1PC transactions. That change was necessary for parallel commits. Before that change, I think logic in DistSender was preventing us from hitting this case because it would always split EndTxn requests from the rest of its batch when a txn had been restarted: https://github.com/cockroachdb/cockroach/blob/18bdfe1a691fa6d785d510e86d27cecdac9c436e/pkg/kv/dist_sender.go#L692-L697

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig
Copy link
Contributor

craig bot commented Sep 6, 2019

Build succeeded

@craig craig bot merged commit b738ed2 into cockroachdb:master Sep 6, 2019
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/1pcEpoch branch September 9, 2019 19:21
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Sep 26, 2019
Found while verifying that cockroachdb#40518 didn't negatively impact YCSB performance.
As it turns out, most txns that restart hit the check in DistSender instead,
and that ends up being critical for forward progress. Without it, transactions
appear to starve because they never write intents.

The commit doesn't make any changes there, but it does fix an issue where
1PC transactions were being incorrectly detected. The detection was ignoring
the fact that a 1PC attempt could be rejected by DistSender or a Replica.
We now tie the metric to whether an EndTransaction actually evaluated as
a 1PC txn instead of tying it to whether the EndTransaction _wanted_ to
be evaluated as a 1PC txn.

Release justification: low risk and improves metric reporting

Release note: None
craig bot pushed a commit that referenced this pull request Sep 26, 2019
41098: roachtest/jepsen: don't fail test if retrieving invoke.log fails r=nvanbenschoten a=nvanbenschoten

Fixes #41062.

Release justification: Testing only.

Release note: None

41104: kv: check for 1-phase commit after request, not before r=nvanbenschoten a=nvanbenschoten

Found while verifying that #40518 didn't negatively impact YCSB performance.
As it turns out, most txns that restart hit the check in DistSender instead,
and that ends up being critical for forward progress. Without it, transactions
appear to starve because they never write intents.

The commit doesn't make any changes there, but it does fix an issue where
1PC transactions were being incorrectly detected. The detection was ignoring
the fact that a 1PC attempt could be rejected by DistSender or a Replica.
We now tie the metric to whether an EndTransaction actually evaluated as
a 1PC txn instead of tying it to whether the EndTransaction _wanted_ to
be evaluated as a 1PC txn.

Release justification: low risk and improves metric reporting

Release note: None

41127: Add myself to AUTHORS r=miretskiy a=miretskiy

Authors += Myself

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

teamcity: failed test: TestTxnCoordSenderRetries
3 participants