@kzalys kzalys commented Oct 31, 2025

For issue https://issues.apache.org/jira/browse/CASSANDRA-20996

The auto-repair history management mechanism does not use LWTs (lightweight transactions) for all of its history table queries. As a result, it is possible to run into a few edge cases where a deleted node's repair history gets resurrected. For example, the following race condition can occur:

  1. Node A sends out a query to vote for Node D to be removed from the auto-repair history.
  2. Meanwhile, Node B sends out a query to clear Node D's delete hosts list (or to set force repair to true for Node D).
  3. Node A's vote is carried out and is now present in the auto-repair history table.
  4. Node C sees that enough nodes have voted to remove Node D and sends out a query to delete the auto-repair history for Node D.
  5. The delete query is executed, and a tombstone is inserted for Node D's auto-repair history.
  6. Finally, Node B's query to modify Node D's entry in the repair history is carried out. This mutation has a higher timestamp than the tombstone inserted by Node C, so Node D's auto-repair history gets resurrected.

This PR introduces a unit test to simulate these out-of-order deletions/upserts and updates the auto-repair history mutations to use LWTs in order to prevent the race conditions from happening.
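
For illustration only, the race above can be reproduced in a toy in-memory model. This is a minimal sketch, not the code or the test in this PR: `ToyHistoryTable`, its methods, and the timestamps are hypothetical, and the conditional write only approximates what an `IF EXISTS` LWT does (a real LWT evaluates its condition through Paxos rather than a local check).

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of one history row under last-write-wins. Not Cassandra code. */
final class RepairHistoryRaceSketch
{
    /** Hypothetical in-memory stand-in for the auto-repair history table. */
    static final class ToyHistoryTable
    {
        // host id -> timestamp of the newest live write
        private final Map<String, Long> liveWriteTs = new HashMap<>();
        // host id -> timestamp of the newest delete (tombstone)
        private final Map<String, Long> tombstoneTs = new HashMap<>();

        /** A row is visible only if its newest write is newer than its tombstone. */
        boolean exists(String host)
        {
            Long live = liveWriteTs.get(host);
            Long dead = tombstoneTs.get(host);
            return live != null && (dead == null || live > dead);
        }

        /** Unconditional upsert: always recorded, whatever the row's state. */
        void upsert(String host, long ts)
        {
            liveWriteTs.merge(host, ts, Long::max);
        }

        void delete(String host, long ts)
        {
            tombstoneTs.merge(host, ts, Long::max);
        }

        /** Approximates UPDATE ... IF EXISTS: applied only if the row is live. */
        boolean upsertIfExists(String host, long ts)
        {
            if (!exists(host))
                return false;
            upsert(host, ts);
            return true;
        }
    }

    /** Steps 1-6 with Node B using a plain upsert. Returns true if D is resurrected. */
    static boolean raceWithPlainUpsert()
    {
        ToyHistoryTable table = new ToyHistoryTable();
        table.upsert("D", 1); // Node D has repair history
        table.upsert("D", 2); // Node A's vote is applied
        table.delete("D", 3); // Node C's delete inserts a tombstone
        table.upsert("D", 4); // Node B's late write has a higher timestamp
        return table.exists("D");
    }

    /** Same ordering, but Node B's write is conditional on the row existing. */
    static boolean raceWithConditionalUpsert()
    {
        ToyHistoryTable table = new ToyHistoryTable();
        table.upsert("D", 1);
        table.upsert("D", 2);
        table.delete("D", 3);
        table.upsertIfExists("D", 4); // rejected: the row is already deleted
        return table.exists("D");
    }

    public static void main(String[] args)
    {
        System.out.println("plain upsert resurrects D: " + raceWithPlainUpsert());
        System.out.println("conditional upsert resurrects D: " + raceWithConditionalUpsert());
    }
}
```

In this model the plain upsert brings Node D's row back because its timestamp (4) is newer than the tombstone (3), while the conditional write is rejected and the row stays deleted.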

@kzalys kzalys changed the title Use LWTs for all auto-repair history mutations CASSANDRA-20996 Use LWTs for all auto-repair history mutations Oct 31, 2025
kzalys commented Nov 11, 2025

@jaydeepkumar1984 can I have a review on this please?

@@ -538,4 +538,42 @@ public void testSkipSystemTraces()
    {
        assertFalse(AutoRepairUtils.shouldConsiderKeyspace(Keyspace.open(SchemaConstants.TRACE_KEYSPACE_NAME)));
    }

    @Test
    public void testAutoRepairHistoryOutOfOrderDeleteRaceCondition()
Contributor

This test passes even without any of the changes because ADD_HOST_ID_TO_DELETE_HOSTS already had IF EXISTS. However, the changes in this PR are still necessary.
Please clarify in the description that the PR includes the test case and certain cases we missed earlier.

Contributor Author

Good point, I've extended this test to include cases that make this test fail without the changes in this PR.

Contributor Author

I've also updated the description.
