fix(scan_operations): add retry policy to cql query #9600

Open
wants to merge 1 commit into master from fix-scan-operations

Conversation

aleksbykov
Contributor

@aleksbykov aleksbykov commented Dec 22, 2024

The node where a scan operation was started could be
targeted by a disruptive nemesis. If the node was restarted/stopped
while the scan query was running, the scan operation would
be terminated, and the resulting error event and message would mark
the test as failed.

Add an ExponentialBackoffRetryPolicy to the CQL session,
which allows the query to be retried if the node was down;
once the node is back, the query finishes successfully.
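
For context, a minimal sketch of what an exponential-backoff retry policy for the Python driver can look like. The ExponentialBackoffRetryPolicy actually used in this PR is the one imported by the changed modules; everything in the class body below, including the parameter names (max_num_retries, min_interval, max_interval), is an illustrative assumption, while the hook signatures and decision constants come from the driver's cassandra.policies.RetryPolicy:

import time

from cassandra.policies import RetryPolicy


class ExponentialBackoffRetryPolicy(RetryPolicy):
    """Illustrative sketch only: retry failed requests, sleeping exponentially longer each attempt."""

    def __init__(self, max_num_retries=10, min_interval=1, max_interval=60):  # hypothetical parameters
        self.max_num_retries = max_num_retries
        self.min_interval = min_interval
        self.max_interval = max_interval

    def _backoff(self, retry_num):
        # 1s, 2s, 4s, ... capped at max_interval
        time.sleep(min(self.min_interval * (2 ** retry_num), self.max_interval))

    def on_read_timeout(self, query, consistency, required_responses,
                        received_responses, data_retrieved, retry_num):
        if retry_num >= self.max_num_retries:
            return self.RETHROW, None
        self._backoff(retry_num)
        return self.RETRY, consistency

    def on_unavailable(self, query, consistency, required_replicas, alive_replicas, retry_num):
        if retry_num >= self.max_num_retries:
            return self.RETHROW, None
        self._backoff(retry_num)
        # pick another coordinator; the current one may be the node taken down by the nemesis
        return self.RETRY_NEXT_HOST, consistency

    def on_request_error(self, query, consistency, error, retry_num):
        if retry_num >= self.max_num_retries:
            return self.RETHROW, None
        self._backoff(retry_num)
        return self.RETRY_NEXT_HOST, consistency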

Fixes: #9284

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@aleksbykov aleksbykov requested a review from fruch December 22, 2024 10:21
@aleksbykov aleksbykov added the backport/6.2, backport/2024.2 and backport/6.1 labels Dec 22, 2024
@aleksbykov aleksbykov marked this pull request as ready for review December 23, 2024 02:54
@@ -460,6 +474,9 @@ def execute_query(self, session, cmd: str,
| FullPartitionScanReversedOrderEvent]) -> None:
self.log.debug('Will run command %s', cmd)
validate_mapreduce_service_requests_start_time = time.time()
session.cluster.default_retry_policy = ExponentialBackoffRetryPolicy(**self._exp_backoff_retry_policy_params)
session.default_timeout = self._session_execution_timeout
Contributor


is there a difference between self._session_execution_timeout and self._request_default_timeout? maybe it could be reused?

@fruch
Contributor

fruch commented Dec 23, 2024

this is a replacement for @temichus trials in #9370 ?

@@ -120,6 +125,8 @@ def execute_query(
| FullPartitionScanReversedOrderEvent]) -> ResultSet:
# pylint: disable=unused-argument
self.log.debug('Will run command %s', cmd)
session.cluster.default_retry_policy = ExponentialBackoffRetryPolicy(**self._exp_backoff_retry_policy_params)
Contributor

@fruch fruch Dec 23, 2024


this is a bit weird: it comes next to the code executing the query, and not next to the code creating the session.

I would recommend consolidating the session-creation code into something like:

@contextmanager  # from contextlib; a @property cannot accept kwargs or yield as a context manager
def cql_connection(self, **kwargs):
    with self.fullscan_params.db_cluster.cql_connection_patient(
                    node=self.db_node,
                    user=self.fullscan_params.user,
                    password=self.fullscan_params.user_password, **kwargs) as session:
        session.cluster.default_retry_policy = ExponentialBackoffRetryPolicy(**self._exp_backoff_retry_policy_params)
        session.default_timeout = self._request_default_timeout
        yield session

there are way too many repetitions of applying this retry, and it should be applied across the board for all of the sessions.
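
For illustration, a hedged usage sketch of how callers could then pick up the retry policy and timeout from one place (names reused from the snippet above; cmd stands for the CQL string passed to execute_query):

# hypothetical caller of the suggested cql_connection helper
with self.cql_connection() as session:  # retry policy and timeout are already applied here
    result = session.execute(cmd)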

@temichus
Contributor

this is a replacement for @temichus trials in #9370 ?

yes

@aleksbykov aleksbykov force-pushed the fix-scan-operations branch 2 times, most recently from da75bfd to 12a71e2 on January 9, 2025 10:07
@aleksbykov aleksbykov force-pushed the fix-scan-operations branch from 12a71e2 to d90f122 on January 9, 2025 13:57
@@ -191,6 +193,18 @@ def fetch_result_pages(self, result, read_pages):
if read_pages > 0:
pages += 1

@contextmanager
def cql_connection(self, **kwargs):
node = kwargs.get("node", self.db_node)
Contributor


you need to pop the node out of kwargs if you want to specify it later,
or use setdefault

the unittest demonstrates that it's currently broken
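
For illustration, a minimal sketch of the setdefault variant (names reused from the diff above and from the earlier review snippet; this is not the actual fix pushed in this PR):

@contextmanager
def cql_connection(self, **kwargs):
    # default the target node only when the caller did not pass one,
    # so "node" stays inside kwargs and is not passed twice below
    kwargs.setdefault("node", self.db_node)
    with self.fullscan_params.db_cluster.cql_connection_patient(
            user=self.fullscan_params.user,
            password=self.fullscan_params.user_password,
            **kwargs) as session:
        session.cluster.default_retry_policy = ExponentialBackoffRetryPolicy(**self._exp_backoff_retry_policy_params)
        session.default_timeout = self._request_default_timeout
        yield session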

@fruch
Contributor

fruch commented Jan 13, 2025

@aleksbykov anything blocking this one ?

@yarongilor
Contributor

Various scan-operation errors have been repeating in Azure-3h for a while:

2024-12-25 03:32:29.893: (FullScanAggregateEvent Severity.ERROR) period_type=end event_id=d9d1ca8c-f6a1-4a31-8cfc-d0a5071f0bc0 during_nemesis=NodetoolDecommission duration=1m26s node=longevity-10gb-3h-master-db-node-9aaba4c5-eastus-4 select_from=keyspace1.standard1 message=FullScanAggregatesOperation operation failed: Fullscan failed - 'mapreduce_service_requests_dispatched_to_other_nodes' was not triggered

https://argus.scylladb.com/tests/scylla-cluster-tests/9aaba4c5-7332-4337-8abc-2ca5e180934c

2025-01-14 07:36:48.326: (FullScanAggregateEvent Severity.ERROR) period_type=end event_id=7c049364-179d-44b5-8dae-af8b787768c2 during_nemesis=RollingConfigChangeInternodeCompression duration=25s node=longevity-10gb-3h-master-db-node-df48f23c-eastus-4 select_from=keyspace1.standard1 message=FullScanAggregatesOperation operation failed, ReadTimeout error: ReadTimeout('Error from server: code=1200 [Coordinator node timed out waiting for replica nodes\' responses] message="Operation failed for keyspace1.standard1 - received 0 responses and 1 failures from 1 CL=ONE." info={\'consistency\': \'ONE\', \'required_responses\': 1, \'received_responses\': 0}')

https://argus.scylladb.com/tests/scylla-cluster-tests/df48f23c-a07f-4bc6-bd47-e5e2e40d764e
@fruch, @aleksbykov, please advise whether these errors are addressed by this PR (or by #9370?).
cc: @pehala

Successfully merging this pull request may close these issues.

Fix Fullscanoperation thread to choose only alive node