Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a corner case issue that causes inconsistent Coordinator states when lazy recovery happens before group commit #2135

Merged
merged 15 commits into from
Sep 3, 2024

Conversation

komamitsu
Copy link
Contributor

@komamitsu komamitsu commented Jul 31, 2024

Description

We found the following issue by Jepsen test that could happen in the current implementation of the group commit feature:

  • Let's say there is an ongoing transaction Tx1
    • The group commit feature for the Coordinator table is enabled
    • Tx1 has a transaction ID p99:c1 where p99 is a parent ID and c1 is a child ID
    • Tx1 is managed in a group identified by p99
  • Tx1 is delayed too much for some reasons, and another transaction Tx2 reads a PREPARED record (tx_id: p99:c1) created by Tx1
  • Tx2 inserts a record tx_id: p99:c1, tx_child_ids:{}, tx_state: ABORTED to the Coordinator table to rollback the transaction
  • Tx1 is still alive and inserts a record tx_id: p99, tx_child_ids:{c1}, tx_state: COMMITTED to the Coordinator table. This record's partition key doesn't conflict with the existing record inserted by the lazy recovery
  • Both the lazy recovery and the original commit finished successfully, but it must not occur
image

This PR fixes the issue. For details, see Additional notes below.

Related issues and/or PRs

None

Changes made

  • Make lazy recovery insert a record tx_id: <parent tx ID>, tx_child_ids:{}, tx_state: ABORTED before inserting a record tx_id: p99:c1, tx_child_ids:{}, tx_state: ABORTED that the existing lazy recovery does to conflict with a record insertion by the original commit
    • The additional insertion will be executed only when lazy recoveries read a PREPARE record created with the group commit feature enabled
  • Add and update tests
  • Add a TLA+ specification that focuses on the conflict of group commit and lazy recovery

Checklist

The following is a best-effort checklist. If any items in this checklist are not applicable to this PR or are dependent on other, unmerged PRs, please still mark the checkboxes after you have read and understood each item.

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes.
  • Any remaining open issues linked to this PR are documented and up-to-date (Jira, GitHub, etc.).
  • Tests (unit, integration, etc.) have been added for the changes.
  • My changes generate no new warnings.
  • Any dependent changes in other PRs have been merged and published.

Additional notes (optional)

Lazy recoveries don't know which the transaction that created the PREPARE record is using, a parent ID or a full ID as tx_id partition key.

Case a) If a transaction becomes "ready for commit" in time, it'll be committed in a group with tx_id: <parent tx ID>.
Case b) If a transaction is delayed, it'll be committed in an isolated group with a full ID as tx_id: <full tx ID>.

If lazy recoveries only insert a record with tx_id: <full tx ID> to abort the transaction, it will not conflict with the group commit using tx_id: <parent tx ID> in case #a. Therefore, lazy recoveries first need to insert a record with tx_id: <parent tx ID> and empty tx_child_ids to the Coordinator table. We'll call this insertion lazy-recovery-abort-with-parent-id. This record is intended to conflict with a potential group commit considering case#1, even though it doesn't help in finding the coordinator state since tx_child_ids is empty.

Once the record insertion with tx_id: <parent tx ID> succeeds, the lazy recovery will insert another record with tx_id: <full tx ID>. We'll call this insertion lazy-recovery-abort-with-full-id. This record insertion is needed to conflict with a potential delayed group commit that has tx_id: <full tx ID> in case #b, and indicates the transaction is aborted.

Let's walk through all the cases.

A. The original commit with tx_id: <parent tx ID> succeeds in case #a, and then lazy recovery happens

  • The original commit with tx_id: <parent tx ID> succeeds
  • lazy-recovery-abort-with-parent-id fails
  • The transaction is treated as committed since the commit's tx_child_ids contains the transaction child ID
image

B. The original commit with tx_id: <parent tx ID> is in-progress in case #a, and lazy recovery happens first

  • lazy-recovery-abort-with-parent-id succeeds
  • The original commit with tx_id: <parent tx ID> fails
  • (If the lazy recovery crashes here, another lazy recovery will insert the below lazy-recovery-abort-with-full-id later)
  • lazy-recovery-abort-with-full-id succeeds
  • The transaction is treated as aborted because of lazy-recovery-abort-with-full-id
image

C. The original commit with tx_id: <full tx ID> is done in case #b, and then lazy recovery happens

  • The original commit with tx_id: <full tx ID> succeeds
  • lazy-recovery-abort-with-parent-id succeeds
  • lazy-recovery-abort-with-full-id fails
  • The transaction is treated as committed since the commit's tx_id is the transaction full ID
image

D. The original commit with tx_id: <full tx ID> is in-progress in case #b, and lazy recovery happens first

  • lazy-recovery-abort-with-parent-id succeeds
  • (If the lazy recovery crashes here and the original commit happens, the situation will be the same as C)
  • lazy-recovery-abort-with-full-id succeeds
  • The original commit with tx_id: <full tx ID> fails
  • The transaction is treated as aborted because of lazy-recovery-abort-with-full-id
image

Release notes

Fixed a corner case issue that causes inconsistent Coordinator states when lazy recovery happens before group commit

@komamitsu komamitsu marked this pull request as ready for review July 31, 2024 09:04
@komamitsu komamitsu self-assigned this Jul 31, 2024
@brfrn169
Copy link
Collaborator

brfrn169 commented Aug 2, 2024

@komamitsu Overall, LGTM. Thanks.

One question. For future optimization, can we perform the two writes lazy-recovery-abort-with-parent-id and lazy-recovery-abort-with-full-id concurrently?

@komamitsu
Copy link
Contributor Author

@brfrn169 That's a really good point.

Concurrent writes for the two records might result in the following situation:

  • lazy-recovery-abort-with-full-id succeeds with tx_id: p99:c1
  • The original commit with tx_id: <parent tx ID> succeeds with tx_id: p99, tx_child_ids: [c1]
  • lazy-recovery-abort-with-parent-id fails with tx_id: p99, tx_child_ids: []
  • Finally, the both valid states, tx_id: p99, tx_child_ids: [c1], tx_state: COMMITTED and tx_id: p99:c1, tx_child_ids: [], tx_state: ABORTED unexpectedly co-exist

In other words, lazy-recovery-abort-with-parent-id is used as a lock to guarantee that only either of the following cases occurs:

  • The original commit with tx_id: <parent tx ID> succeeds with tx_id: p99, tx_child_ids: [c1], tx_state: COMMITTED
  • lazy-recovery-abort-with-full-id succeeds with tx_id: p99:c1, tx_child_ids: [], tx_state: ABORTED

@brfrn169
Copy link
Collaborator

brfrn169 commented Aug 5, 2024

@komamitsu Thanks. I understand we can't perform the two writes concurrently.

@brfrn169
Copy link
Collaborator

brfrn169 commented Aug 7, 2024

@komamitsu Sorry, one more question. If we put the child ID as a clustering key column of the record in the coordinator table, would the logic be simpler than the current one?

@komamitsu
Copy link
Contributor Author

@brfrn169 Thanks for the question.

I think it depends on if ScalarDB supports INSERT operation with multiple conditions. If it's possible, lazy-recovery-abort-with-full-id and lazy-recovery-abort-with-parent-id can be merged into a single INSERT (tx_state: ABORTED) like this pseudo code.

// Assuming `tx_sub_id` contains child ID for delayed group commit, and also `tx_child_ids` exists for normal group commit
Insert(
  namespace("coordinator").table("state")
  .partitionKey("tx_id", "p99").clusteringKey("tx_sub_id", "c1")
  .columns(Column("tx_state", "ABORTED"))
  .conditions(
    And(
      IF_NOT_EXISTS(PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", "c1")),
      IF_NOT_EXISTS(PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", NULL))
    )
  ));

My understanding is it's not possible for now. So, I don't think it's beneficial to use a clustering key for child ID in terms of this case.

@brfrn169
Copy link
Collaborator

brfrn169 commented Aug 8, 2024

@komamitsu We can do that as follows:

storage.mutate(
    Arrays.asList(
        Put.newBuilder()
            .namespace("coordinator")
            .table("state")
            .partitionKey(Key.ofText("tx_id", "p99"))
            .clusteringKey(Key.ofText("tx_sub_id", "c1"))
            .intValue("state", TransactionState.ABORTED.get())
            .condition(ConditionBuilder.putIfNotExists())
            .build(),
        Put.newBuilder()
            .namespace("coordinator")
            .table("state")
            .partitionKey(Key.ofText("tx_id", "p99"))
            .clusteringKey(Key.ofText("tx_sub_id", "null"))
            .intValue("state", TransactionState.ABORTED.get())
            .condition(ConditionBuilder.putIfNotExists())
            .build()));

ScalarDB Storage API can atomically mutate multiple records within a partition.

@komamitsu
Copy link
Contributor Author

@brfrn169 Oh, I didn't know Storage API supports multiple atomic mutations. I just looked over the implementations and they seem good.

lazy-recovery-abort-with-full-id and lazy-recovery-abort-with-parent-id can be merged into a single INSERT (tx_state: ABORTED) like this pseudo code.

I was thinking of inserting a single record to coordinator.state table with checking 2 records when writing this comment. But, atomically inserting the 2 records probably work as well and it would be simpler than the current implementation of this PR.

@brfrn169
Copy link
Collaborator

@komamitsu Sorry for the late reply.

I was thinking of inserting a single record to coordinator.state table with checking 2 records when writing this comment. But, atomically inserting the 2 records probably work as well and it would be simpler than the current implementation of this PR.

Okay, Thanks. Let's discuss it offline later.

Copy link
Collaborator

@brfrn169 brfrn169 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the offline discussion! As discussed, since having tx_child_ids as a clustering key in the coordinator.state table doesn't have significant advantages, I think the current implementation is the best approach for now. Thank you!

@feeblefakie
Copy link
Contributor

feeblefakie commented Aug 27, 2024

@komamitsu @brfrn169
Very sorry for the late review.
I don't know what was discussed offline, so I might be asking the same question, but let me ask just in case.

I don't understand why we need to write two records, if using child ID as a clustering key column.

PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", "c1"))
PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", null))

Logically, the following information is enough to indicate that the group (p99) and the transaction (p99:c1) are committed.

PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", "c1"))

Can you clarify?

@komamitsu
Copy link
Contributor Author

komamitsu commented Aug 28, 2024

Logically, the following information is enough to indicate that the group (p99) and the transaction (p99:c1) are committed.

PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", "c1"))

@feeblefakie Your understanding is correct from the perspective of normal operation without considering conflicts between original commits and lazy recoveries.

PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", null))

This is needed to be issued by lazy recoveries only for conflicting with the original commit.

Let's see the above 2 cases in the description again.


A. The original commit with tx_id: <parent tx ID> succeeds in case #a, and then lazy recovery happens

  • The original commit with tx_id: <parent tx ID> succeeds
  • lazy-recovery-abort-with-parent-id fails
  • The transaction is treated as committed since the commit's tx_child_ids contains the transaction child ID
image

B. The original commit with tx_id: <parent tx ID> is in-progress in case #a, and lazy recovery happens first

  • lazy-recovery-abort-with-parent-id succeeds
  • The original commit with tx_id: <parent tx ID> fails
  • (If the lazy recovery crashes here, another lazy recovery will insert the below lazy-recovery-abort-with-full-id later)
  • lazy-recovery-abort-with-full-id succeeds
  • The transaction is treated as aborted because of lazy-recovery-abort-with-full-id
image

If we only insert a single record with full-id-like sub ID (PartitionKey("tx_id", "p99"), ClusteringKey("tx_sub_id", "c1"))), the original commit and lazy recovery don't conflict and will result in the inconsistent state.

image

Does that answer your question?

Copy link
Contributor

@Torch3333 Torch3333 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

@feeblefakie
Copy link
Contributor

feeblefakie commented Sep 3, 2024

@komamitsu Thank you for the explanation! I understood the fix, but I wondered if there could be better/simpler ways to handle it. That is because the tx_id column is used to have either a parent TX ID or a full TX ID, depending on the case, and I think it is pretty confusing.
I'm not fully sure, and it might be necessary for keeping the backward-compatibility.
Anyway, it's more like a design question/discussion, so I'm fine with the changes to the existing design.
Let's discuss it at some point. (cc: @brfrn169 )

Copy link
Contributor

@feeblefakie feeblefakie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you!
(very sorry for the late reply)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants