-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(scan_operations): add retry policy to cql query #9600
base: master
Are you sure you want to change the base?
Conversation
sdcm/scan_operation_thread.py
Outdated
@@ -460,6 +474,9 @@ def execute_query(self, session, cmd: str, | |||
| FullPartitionScanReversedOrderEvent]) -> None: | |||
self.log.debug('Will run command %s', cmd) | |||
validate_mapreduce_service_requests_start_time = time.time() | |||
session.cluster.default_retry_policy = ExponentialBackoffRetryPolicy(**self._exp_backoff_retry_policy_params) | |||
session.default_timeout = self._session_execution_timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a difference between self._request_default_timeout
and self._request_default_timeout
? maybe it could be reused?
sdcm/scan_operation_thread.py
Outdated
@@ -120,6 +125,8 @@ def execute_query( | |||
| FullPartitionScanReversedOrderEvent]) -> ResultSet: | |||
# pylint: disable=unused-argument | |||
self.log.debug('Will run command %s', cmd) | |||
session.cluster.default_retry_policy = ExponentialBackoffRetryPolicy(**self._exp_backoff_retry_policy_params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a bit weird it comes next to the code executing the query, and no the code creating the session.
I would recommend consolidating the session creating code into something like:
@property
def cql_connection(self, **kwargs):
with self.fullscan_params.db_cluster.cql_connection_patient(
node=self.db_node,
user=self.fullscan_params.user,
password=self.fullscan_params.user_password, **kwargs) as session:
session.cluster.default_retry_policy = ExponentialBackoffRetryPolicy(**self._exp_backoff_retry_policy_params)
session.default_timeout = self._request_default_timeout
yield session
there way too many repetitions of applying this retry, and it should be across the board for all of the sessions.
da75bfd
to
12a71e2
Compare
The node where scan operations was started could be used by disruptive nemesis. If node was restarted/stopped while scan query had been running, the scan operation would be terminated and error event and message will mark test as failed. Add to cql session ExponetionalBackoffRetryPolicy which allow to retry the query, if node was down and once it back, query will be succesfully finished Fixes: scylladb#9284
12a71e2
to
d90f122
Compare
@@ -191,6 +193,18 @@ def fetch_result_pages(self, result, read_pages): | |||
if read_pages > 0: | |||
pages += 1 | |||
|
|||
@contextmanager | |||
def cql_connection(self, **kwargs): | |||
node = kwargs.get("node", self.db_node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to pop out the node from kwargs, if you want to specify it later
or use setdefault
the unittest demonstrate it's currently broken
@aleksbykov anything blocking this one ? |
There are various errors of scan operation repeating in Azure-3h for a while:
https://argus.scylladb.com/tests/scylla-cluster-tests/9aaba4c5-7332-4337-8abc-2ca5e180934c
https://argus.scylladb.com/tests/scylla-cluster-tests/df48f23c-a07f-4bc6-bd47-e5e2e40d764e |
The node where scan operations was started could be
used by disruptive nemesis. If node was restarted/stopped
while scan query had been running, the scan operation would
be terminated and error event and message will mark
test as failed.
Add to cql session ExponentialBackoffRetryPolicy
which allow to retry the query, if node was down
and once it back, query will be succesfully finished
Fixes: #9284
Testing
java.io.IOException: Operation x10 on key(s) [32374e4d3230504b3030]: Data returned was not validated
. Errors related to Scan operations are not reproducedPR pre-checks (self review)
backport
labelsReminders
sdcm/sct_config.py
)unit-test/
folder)