Fix a corner case issue that causes inconsistent Coordinator states when lazy recovery happens before group commit #2135

komamitsu · 2024-07-31T08:53:49Z

Description

We found the following issue by Jepsen test that could happen in the current implementation of the group commit feature:

Let's say there is an ongoing transaction Tx1
- The group commit feature for the Coordinator table is enabled
- Tx1 has a transaction ID p99:c1 where p99 is a parent ID and c1 is a child ID
- Tx1 is managed in a group identified by p99
Tx1 is delayed too much for some reasons, and another transaction Tx2 reads a PREPARED record (tx_id: p99:c1) created by Tx1
Tx2 inserts a record tx_id: p99:c1, tx_child_ids:{}, tx_state: ABORTED to the Coordinator table to rollback the transaction
Tx1 is still alive and inserts a record tx_id: p99, tx_child_ids:{c1}, tx_state: COMMITTED to the Coordinator table. This record's partition key doesn't conflict with the existing record inserted by the lazy recovery
Both the lazy recovery and the original commit finished successfully, but it must not occur

This PR fixes the issue. For details, see Additional notes below.

Related issues and/or PRs

None

Changes made

Make lazy recovery insert a record tx_id: <parent tx ID>, tx_child_ids:{}, tx_state: ABORTED before inserting a record tx_id: p99:c1, tx_child_ids:{}, tx_state: ABORTED that the existing lazy recovery does to conflict with a record insertion by the original commit
- The additional insertion will be executed only when lazy recoveries read a PREPARE record created with the group commit feature enabled
Add and update tests
Add a TLA+ specification that focuses on the conflict of group commit and lazy recovery

Checklist

The following is a best-effort checklist. If any items in this checklist are not applicable to this PR or are dependent on other, unmerged PRs, please still mark the checkboxes after you have read and understood each item.

I have commented my code, particularly in hard-to-understand areas.
I have updated the documentation to reflect the changes.
Any remaining open issues linked to this PR are documented and up-to-date (Jira, GitHub, etc.).
Tests (unit, integration, etc.) have been added for the changes.
My changes generate no new warnings.
Any dependent changes in other PRs have been merged and published.

Additional notes (optional)

Lazy recoveries don't know which the transaction that created the PREPARE record is using, a parent ID or a full ID as tx_id partition key.

Case a) If a transaction becomes "ready for commit" in time, it'll be committed in a group with tx_id: <parent tx ID>.
Case b) If a transaction is delayed, it'll be committed in an isolated group with a full ID as tx_id: <full tx ID>.

If lazy recoveries only insert a record with tx_id: <full tx ID> to abort the transaction, it will not conflict with the group commit using tx_id: <parent tx ID> in case #a. Therefore, lazy recoveries first need to insert a record with tx_id: <parent tx ID> and empty tx_child_ids to the Coordinator table. We'll call this insertion lazy-recovery-abort-with-parent-id. This record is intended to conflict with a potential group commit considering case#1, even though it doesn't help in finding the coordinator state since tx_child_ids is empty.

Once the record insertion with tx_id: <parent tx ID> succeeds, the lazy recovery will insert another record with tx_id: <full tx ID>. We'll call this insertion lazy-recovery-abort-with-full-id. This record insertion is needed to conflict with a potential delayed group commit that has tx_id: <full tx ID> in case #b, and indicates the transaction is aborted.

Let's walk through all the cases.

A. The original commit with tx_id: <parent tx ID> succeeds in case #a, and then lazy recovery happens

The original commit with tx_id: <parent tx ID> succeeds
lazy-recovery-abort-with-parent-id fails
The transaction is treated as committed since the commit's tx_child_ids contains the transaction child ID

B. The original commit with tx_id: <parent tx ID> is in-progress in case #a, and lazy recovery happens first

lazy-recovery-abort-with-parent-id succeeds
The original commit with tx_id: <parent tx ID> fails
(If the lazy recovery crashes here, another lazy recovery will insert the below lazy-recovery-abort-with-full-id later)
lazy-recovery-abort-with-full-id succeeds
The transaction is treated as aborted because of lazy-recovery-abort-with-full-id

C. The original commit with tx_id: <full tx ID> is done in case #b, and then lazy recovery happens

The original commit with tx_id: <full tx ID> succeeds
lazy-recovery-abort-with-parent-id succeeds
lazy-recovery-abort-with-full-id fails
The transaction is treated as committed since the commit's tx_id is the transaction full ID

D. The original commit with tx_id: <full tx ID> is in-progress in case #b, and lazy recovery happens first

lazy-recovery-abort-with-parent-id succeeds
(If the lazy recovery crashes here and the original commit happens, the situation will be the same as C)
lazy-recovery-abort-with-full-id succeeds
The original commit with tx_id: <full tx ID> fails
The transaction is treated as aborted because of lazy-recovery-abort-with-full-id

Release notes

Fixed a corner case issue that causes inconsistent Coordinator states when lazy recovery happens before group commit

by group commit and lazy recovery

brfrn169 · 2024-08-02T07:56:04Z

@komamitsu Overall, LGTM. Thanks.

One question. For future optimization, can we perform the two writes lazy-recovery-abort-with-parent-id and lazy-recovery-abort-with-full-id concurrently?

komamitsu · 2024-08-02T09:36:33Z

@brfrn169 That's a really good point.

Concurrent writes for the two records might result in the following situation:

lazy-recovery-abort-with-full-id succeeds with tx_id: p99:c1
The original commit with tx_id: <parent tx ID> succeeds with tx_id: p99, tx_child_ids: [c1]
lazy-recovery-abort-with-parent-id fails with tx_id: p99, tx_child_ids: []
Finally, the both valid states, tx_id: p99, tx_child_ids: [c1], tx_state: COMMITTED and tx_id: p99:c1, tx_child_ids: [], tx_state: ABORTED unexpectedly co-exist

In other words, lazy-recovery-abort-with-parent-id is used as a lock to guarantee that only either of the following cases occurs:

The original commit with tx_id: <parent tx ID> succeeds with tx_id: p99, tx_child_ids: [c1], tx_state: COMMITTED
lazy-recovery-abort-with-full-id succeeds with tx_id: p99:c1, tx_child_ids: [], tx_state: ABORTED

brfrn169 · 2024-08-05T14:59:43Z

@komamitsu Thanks. I understand we can't perform the two writes concurrently.

brfrn169 · 2024-08-07T07:30:46Z

@komamitsu Sorry, one more question. If we put the child ID as a clustering key column of the record in the coordinator table, would the logic be simpler than the current one?

komamitsu · 2024-08-07T10:39:42Z

@brfrn169 Thanks for the question.

I think it depends on if ScalarDB supports INSERT operation with multiple conditions. If it's possible, lazy-recovery-abort-with-full-id and lazy-recovery-abort-with-parent-id can be merged into a single INSERT (tx_state: ABORTED) like this pseudo code.

// Assuming `tx_sub_id` contains child ID for delayed group commit, and also `tx_child_ids` exists for normal group commit
Insert(
  namespace("coordinator").table("state")
  .partitionKey("tx_id", "p99").clusteringKey("tx_sub_id", "c1")
  .columns(Column("tx_state", "ABORTED"))
  .conditions(
    And(
      IF_NOT_EXISTS(PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", "c1")),
      IF_NOT_EXISTS(PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", NULL))
    )
  ));

My understanding is it's not possible for now. So, I don't think it's beneficial to use a clustering key for child ID in terms of this case.

brfrn169 · 2024-08-08T01:47:48Z

@komamitsu We can do that as follows:

storage.mutate(
    Arrays.asList(
        Put.newBuilder()
            .namespace("coordinator")
            .table("state")
            .partitionKey(Key.ofText("tx_id", "p99"))
            .clusteringKey(Key.ofText("tx_sub_id", "c1"))
            .intValue("state", TransactionState.ABORTED.get())
            .condition(ConditionBuilder.putIfNotExists())
            .build(),
        Put.newBuilder()
            .namespace("coordinator")
            .table("state")
            .partitionKey(Key.ofText("tx_id", "p99"))
            .clusteringKey(Key.ofText("tx_sub_id", "null"))
            .intValue("state", TransactionState.ABORTED.get())
            .condition(ConditionBuilder.putIfNotExists())
            .build()));

ScalarDB Storage API can atomically mutate multiple records within a partition.

komamitsu · 2024-08-08T03:07:04Z

@brfrn169 Oh, I didn't know Storage API supports multiple atomic mutations. I just looked over the implementations and they seem good.

lazy-recovery-abort-with-full-id and lazy-recovery-abort-with-parent-id can be merged into a single INSERT (tx_state: ABORTED) like this pseudo code.

I was thinking of inserting a single record to coordinator.state table with checking 2 records when writing this comment. But, atomically inserting the 2 records probably work as well and it would be simpler than the current implementation of this PR.

brfrn169 · 2024-08-15T10:17:40Z

@komamitsu Sorry for the late reply.

I was thinking of inserting a single record to coordinator.state table with checking 2 records when writing this comment. But, atomically inserting the 2 records probably work as well and it would be simpler than the current implementation of this PR.

Okay, Thanks. Let's discuss it offline later.

brfrn169

Thank you for the offline discussion! As discussed, since having tx_child_ids as a clustering key in the coordinator.state table doesn't have significant advantages, I think the current implementation is the best approach for now. Thank you!

feeblefakie · 2024-08-27T10:32:40Z

@komamitsu @brfrn169
Very sorry for the late review.
I don't know what was discussed offline, so I might be asking the same question, but let me ask just in case.

I don't understand why we need to write two records, if using child ID as a clustering key column.

PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", "c1"))
PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", null))

Logically, the following information is enough to indicate that the group (p99) and the transaction (p99:c1) are committed.

PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", "c1"))

Can you clarify?

komamitsu · 2024-08-28T02:46:41Z

Logically, the following information is enough to indicate that the group (p99) and the transaction (p99:c1) are committed.

PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", "c1"))

@feeblefakie Your understanding is correct from the perspective of normal operation without considering conflicts between original commits and lazy recoveries.

PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", null))

This is needed to be issued by lazy recoveries only for conflicting with the original commit.

Let's see the above 2 cases in the description again.

A. The original commit with tx_id: <parent tx ID> succeeds in case #a, and then lazy recovery happens

The original commit with tx_id: <parent tx ID> succeeds
lazy-recovery-abort-with-parent-id fails
The transaction is treated as committed since the commit's tx_child_ids contains the transaction child ID

B. The original commit with tx_id: <parent tx ID> is in-progress in case #a, and lazy recovery happens first

lazy-recovery-abort-with-parent-id succeeds
The original commit with tx_id: <parent tx ID> fails
(If the lazy recovery crashes here, another lazy recovery will insert the below lazy-recovery-abort-with-full-id later)
lazy-recovery-abort-with-full-id succeeds
The transaction is treated as aborted because of lazy-recovery-abort-with-full-id

If we only insert a single record with full-id-like sub ID (PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", "c1"))), the original commit and lazy recovery don't conflict and will result in the inconsistent state.

Does that answer your question?

Torch3333

LGTM, thank you!

feeblefakie · 2024-09-03T02:24:33Z

@komamitsu Thank you for the explanation! I understood the fix, but I wondered if there could be better/simpler ways to handle it. That is because the tx_id column is used to have either a parent TX ID or a full TX ID, depending on the case, and I think it is pretty confusing.
I'm not fully sure, and it might be necessary for keeping the backward-compatibility.
Anyway, it's more like a design question/discussion, so I'm fine with the changes to the existing design.
Let's discuss it at some point. (cc: @brfrn169 )

feeblefakie

LGTM! Thank you!
(very sorry for the late reply)

…hen lazy recovery happens before group commit (#2135)

komamitsu added 14 commits July 25, 2024 16:18

Prevent both commit and abort from occurring

24e0f71

by group commit and lazy recovery

Fix a bug that happens in lazy-recovery

f4d43a6

Add CoordinatorWithGroupCommit.tla

0568b18

Add TODO

e899dff

Improve CoordinatorWithGroupCommit.tla

719a6e6

Improve CoordinatorWithGroupCommit.tla

1ddbcdb

Improve CoordinatorWithGroupCommit.tla

21e97fe

Consider PREPARED records

c99db07

Add CoordinatorWithGroupCommit.toolbox

91aae8f

Add a summary comment to the TLA+ spec

42f94d6

WIP

6c709f5

Add a huge comment

8d9b5df

Add and update unit tests

39c47fa

Add test cases that group commit created PREPARED

fc70320

komamitsu marked this pull request as ready for review July 31, 2024 09:04

komamitsu self-assigned this Jul 31, 2024

komamitsu added the bugfix label Jul 31, 2024

komamitsu requested review from feeblefakie, brfrn169 and Torch3333 July 31, 2024 09:04

brfrn169 approved these changes Aug 21, 2024

View reviewed changes

Torch3333 approved these changes Aug 28, 2024

View reviewed changes

feeblefakie approved these changes Sep 3, 2024

View reviewed changes

Merge branch 'master' into handle-early-lazy-recovery

94d21ad

brfrn169 merged commit 9257340 into master Sep 3, 2024
46 checks passed

brfrn169 deleted the handle-early-lazy-recovery branch September 3, 2024 04:42

feeblefakie pushed a commit that referenced this pull request Sep 3, 2024

Fix a corner case issue that causes inconsistent Coordinator states w…

93333b1

…hen lazy recovery happens before group commit (#2135)

feeblefakie mentioned this pull request Sep 3, 2024

Backport to branch(3) : Fix a corner case issue that causes inconsistent Coordinator states when lazy recovery happens before group commit #2197

Merged

feeblefakie pushed a commit that referenced this pull request Sep 3, 2024

Fix a corner case issue that causes inconsistent Coordinator states w…

21fde17

…hen lazy recovery happens before group commit (#2135)

feeblefakie mentioned this pull request Sep 3, 2024

Backport to branch(3.13) : Fix a corner case issue that causes inconsistent Coordinator states when lazy recovery happens before group commit #2198

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix a corner case issue that causes inconsistent Coordinator states when lazy recovery happens before group commit #2135

Fix a corner case issue that causes inconsistent Coordinator states when lazy recovery happens before group commit #2135

komamitsu commented Jul 31, 2024 •

edited

Loading

brfrn169 commented Aug 2, 2024 •

edited

Loading

komamitsu commented Aug 2, 2024

brfrn169 commented Aug 5, 2024

brfrn169 commented Aug 7, 2024

komamitsu commented Aug 7, 2024

brfrn169 commented Aug 8, 2024 •

edited

Loading

komamitsu commented Aug 8, 2024

brfrn169 commented Aug 15, 2024

brfrn169 left a comment

feeblefakie commented Aug 27, 2024 •

edited

Loading

komamitsu commented Aug 28, 2024 •

edited

Loading

Torch3333 left a comment

feeblefakie commented Sep 3, 2024 •

edited

Loading

feeblefakie left a comment

Fix a corner case issue that causes inconsistent Coordinator states when lazy recovery happens before group commit #2135

Fix a corner case issue that causes inconsistent Coordinator states when lazy recovery happens before group commit #2135

Conversation

komamitsu commented Jul 31, 2024 • edited Loading

Description

Related issues and/or PRs

Changes made

Checklist

Additional notes (optional)

Release notes

brfrn169 commented Aug 2, 2024 • edited Loading

komamitsu commented Aug 2, 2024

brfrn169 commented Aug 5, 2024

brfrn169 commented Aug 7, 2024

komamitsu commented Aug 7, 2024

brfrn169 commented Aug 8, 2024 • edited Loading

komamitsu commented Aug 8, 2024

brfrn169 commented Aug 15, 2024

brfrn169 left a comment

Choose a reason for hiding this comment

feeblefakie commented Aug 27, 2024 • edited Loading

komamitsu commented Aug 28, 2024 • edited Loading

Torch3333 left a comment

Choose a reason for hiding this comment

feeblefakie commented Sep 3, 2024 • edited Loading

feeblefakie left a comment

Choose a reason for hiding this comment

komamitsu commented Jul 31, 2024 •

edited

Loading

brfrn169 commented Aug 2, 2024 •

edited

Loading

brfrn169 commented Aug 8, 2024 •

edited

Loading

feeblefakie commented Aug 27, 2024 •

edited

Loading

komamitsu commented Aug 28, 2024 •

edited

Loading

feeblefakie commented Sep 3, 2024 •

edited

Loading