fix(scan_operations): add retry policy to cql query #9600

Open
wants to merge 1 commit into master from fix-scan-operations

Conversation

aleksbykov
Contributor

@aleksbykov aleksbykov commented Dec 22, 2024

The node where a scan operation was started could be
targeted by a disruptive nemesis. If the node was restarted/stopped
while the scan query was running, the scan operation would
be terminated, and the resulting error event and message would mark
the test as failed.

Add an ExponentialBackoffRetryPolicy to the CQL session,
which allows the query to be retried if the node was down;
once the node is back, the query finishes successfully.
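
For context, a minimal sketch of what an exponential-backoff retry policy for the Python driver can look like. The ExponentialBackoffRetryPolicy actually used in this PR is the one imported by the changed modules; everything in the class body below, including the parameter names (max_num_retries, min_interval, max_interval), is an illustrative assumption, while the hook signatures and decision constants come from the driver's cassandra.policies.RetryPolicy:

import time

from cassandra.policies import RetryPolicy


class ExponentialBackoffRetryPolicy(RetryPolicy):
    """Illustrative sketch only: retry failed requests, sleeping exponentially longer each attempt."""

    def __init__(self, max_num_retries=10, min_interval=1, max_interval=60):  # hypothetical parameters
        self.max_num_retries = max_num_retries
        self.min_interval = min_interval
        self.max_interval = max_interval

    def _backoff(self, retry_num):
        # 1s, 2s, 4s, ... capped at max_interval
        time.sleep(min(self.min_interval * (2 ** retry_num), self.max_interval))

    def on_read_timeout(self, query, consistency, required_responses,
                        received_responses, data_retrieved, retry_num):
        if retry_num >= self.max_num_retries:
            return self.RETHROW, None
        self._backoff(retry_num)
        return self.RETRY, consistency

    def on_unavailable(self, query, consistency, required_replicas, alive_replicas, retry_num):
        if retry_num >= self.max_num_retries:
            return self.RETHROW, None
        self._backoff(retry_num)
        # pick another coordinator; the current one may be the node taken down by the nemesis
        return self.RETRY_NEXT_HOST, consistency

    def on_request_error(self, query, consistency, error, retry_num):
        if retry_num >= self.max_num_retries:
            return self.RETHROW, None
        self._backoff(retry_num)
        return self.RETRY_NEXT_HOST, consistency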

Fixes: #9284

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@aleksbykov aleksbykov requested a review from fruch December 22, 2024 10:21
@aleksbykov aleksbykov added the backport/6.2, backport/2024.2 and backport/6.1 labels Dec 22, 2024
@aleksbykov aleksbykov marked this pull request as ready for review December 23, 2024 02:54
@@ -460,6 +474,9 @@ def execute_query(self, session, cmd: str,
| FullPartitionScanReversedOrderEvent]) -> None:
self.log.debug('Will run command %s', cmd)
validate_mapreduce_service_requests_start_time = time.time()
session.cluster.default_retry_policy = ExponentialBackoffRetryPolicy(**self._exp_backoff_retry_policy_params)
session.default_timeout = self._session_execution_timeout
Contributor


is there a difference between self._session_execution_timeout and self._request_default_timeout? maybe it could be reused?

@fruch
Contributor

fruch commented Dec 23, 2024

this is a replacement for @temichus trials in #9370 ?

@@ -120,6 +125,8 @@ def execute_query(
| FullPartitionScanReversedOrderEvent]) -> ResultSet:
# pylint: disable=unused-argument
self.log.debug('Will run command %s', cmd)
session.cluster.default_retry_policy = ExponentialBackoffRetryPolicy(**self._exp_backoff_retry_policy_params)
Contributor

@fruch fruch Dec 23, 2024


this is a bit weird: it comes next to the code executing the query, and not next to the code creating the session.

I would recommend consolidating the session-creation code into something like:

@contextmanager  # from contextlib; a @property cannot accept kwargs or yield as a context manager
def cql_connection(self, **kwargs):
    with self.fullscan_params.db_cluster.cql_connection_patient(
                    node=self.db_node,
                    user=self.fullscan_params.user,
                    password=self.fullscan_params.user_password, **kwargs) as session:
        session.cluster.default_retry_policy = ExponentialBackoffRetryPolicy(**self._exp_backoff_retry_policy_params)
        session.default_timeout = self._request_default_timeout
        yield session

there are way too many repetitions of applying this retry, and it should be applied across the board for all of the sessions.
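
For illustration, a hedged usage sketch of how callers could then pick up the retry policy and timeout from one place (names reused from the snippet above; cmd stands for the CQL string passed to execute_query):

# hypothetical caller of the suggested cql_connection helper
with self.cql_connection() as session:  # retry policy and timeout are already applied here
    result = session.execute(cmd)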

@temichus
Contributor

this is a replacement for @temichus trials in #9370 ?

yes

@aleksbykov aleksbykov force-pushed the fix-scan-operations branch 2 times, most recently from da75bfd to 12a71e2 on January 9, 2025 10:07
@aleksbykov aleksbykov force-pushed the fix-scan-operations branch from 12a71e2 to d90f122 on January 9, 2025 13:57
@@ -191,6 +193,18 @@ def fetch_result_pages(self, result, read_pages):
if read_pages > 0:
pages += 1

@contextmanager
def cql_connection(self, **kwargs):
node = kwargs.get("node", self.db_node)
Contributor


you need to pop the node out of kwargs if you want to specify it later,
or use setdefault

the unittest demonstrates that it's currently broken
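
For illustration, a minimal sketch of the setdefault variant (names reused from the diff above and from the earlier review snippet; this is not the actual fix pushed in this PR):

@contextmanager
def cql_connection(self, **kwargs):
    # default the target node only when the caller did not pass one,
    # so "node" stays inside kwargs and is not passed twice below
    kwargs.setdefault("node", self.db_node)
    with self.fullscan_params.db_cluster.cql_connection_patient(
            user=self.fullscan_params.user,
            password=self.fullscan_params.user_password,
            **kwargs) as session:
        session.cluster.default_retry_policy = ExponentialBackoffRetryPolicy(**self._exp_backoff_retry_policy_params)
        session.default_timeout = self._request_default_timeout
        yield session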

@fruch
Contributor

fruch commented Jan 13, 2025

@aleksbykov anything blocking this one ?

@yarongilor
Contributor

Various scan-operation errors have been repeating in Azure-3h for a while:

2024-12-25 03:32:29.893: (FullScanAggregateEvent Severity.ERROR) period_type=end event_id=d9d1ca8c-f6a1-4a31-8cfc-d0a5071f0bc0 during_nemesis=NodetoolDecommission duration=1m26s node=longevity-10gb-3h-master-db-node-9aaba4c5-eastus-4 select_from=keyspace1.standard1 message=FullScanAggregatesOperation operation failed: Fullscan failed - 'mapreduce_service_requests_dispatched_to_other_nodes' was not triggered

https://argus.scylladb.com/tests/scylla-cluster-tests/9aaba4c5-7332-4337-8abc-2ca5e180934c

2025-01-14 07:36:48.326: (FullScanAggregateEvent Severity.ERROR) period_type=end event_id=7c049364-179d-44b5-8dae-af8b787768c2 during_nemesis=RollingConfigChangeInternodeCompression duration=25s node=longevity-10gb-3h-master-db-node-df48f23c-eastus-4 select_from=keyspace1.standard1 message=FullScanAggregatesOperation operation failed, ReadTimeout error: ReadTimeout('Error from server: code=1200 [Coordinator node timed out waiting for replica nodes\' responses] message="Operation failed for keyspace1.standard1 - received 0 responses and 1 failures from 1 CL=ONE." info={\'consistency\': \'ONE\', \'required_responses\': 1, \'received_responses\': 0}')

https://argus.scylladb.com/tests/scylla-cluster-tests/df48f23c-a07f-4bc6-bd47-e5e2e40d764e
@fruch, @aleksbykov, please advise whether these errors are addressed by this PR (or by #9370?).
cc: @pehala

Successfully merging this pull request may close these issues.

Fix Fullscanoperation thread to choose only alive node