CI Failure (timeout while acquiring leadership) in MaintenanceTest.test_maintenance_sticky #13650

Closed · abhijat opened this issue Sep 25, 2023 · 15 comments · Fixed by #21435

Labels: area/replication, ci-failure, ci-rca/test (CI Root Cause Analysis - Test Issue), kind/bug (Something isn't working)

abhijat (Contributor) commented Sep 25, 2023

https://buildkite.com/redpanda/redpanda/builds/37568

Module: rptest.tests.maintenance_test
Class: MaintenanceTest
Method: test_maintenance_sticky
Arguments: {
    "use_rpk": false
}
test_id:    MaintenanceTest.test_maintenance_sticky
status:     FAIL
run time:   225.985 seconds

TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/maintenance_test.py", line 228, in test_maintenance_sticky
    self._maintenance_disable(node)
  File "/root/tests/rptest/tests/maintenance_test.py", line 198, in _maintenance_disable
    wait_until(lambda: self._has_leadership_role(node),
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

JIRA Link: CORE-1465

abhijat added the kind/bug and ci-failure labels on Sep 25, 2023
abhijat (Contributor, Author) commented Sep 25, 2023

One warning stands out in the logs from docker-rp-11:

TRACE 2023-09-24 07:10:30,072 [shard 0:main] cluster - health_monitor_backend.cc:532 - unable to get node health report from 2 - rpc::errc::exponential_backoff
WARN  2023-09-24 07:10:30,072 [shard 0:main] cluster - health_monitor_backend.cc:542 - unable to get node health report from 2 - rpc::errc::exponential_backoff, marking node as down

abhijat changed the title from "CI Failure (timeout while acquiring leadership) in Class.method" to "CI Failure (timeout while acquiring leadership) in MaintenanceTest.test_maintenance_sticky" on Sep 25, 2023
ztlpn (Contributor) commented Jul 16, 2024

The test fails because there are only a few partitions (12) and the leadership balancer mutes them all after several ticks (it mutes every partition whose leadership it moves):

DEBUG 2024-06-02 08:22:23,922 [shard 0:main] cluster - leader_balancer.cc:511 - No leadership balance improvements found with total delta 32, number of muted groups 12

Once all groups are muted, the balancer can take no further action for some time, so the node that comes out of maintenance mode does not acquire any new leaders before the test times out (see the sketch below).

The fix, I think, is to increase the number of partitions.
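
Below is a minimal, illustrative Python sketch of that stall (toy code, not Redpanda or rptest source): every transfer mutes the group it moves, so with only 12 groups a handful of ticks mutes everything and the balancer makes no further progress until the mutes expire.

# Toy model of the stall: pick_transfer() stands in for one balancer move.
def pick_transfer(groups, muted):
    candidates = [g for g in groups if g not in muted]
    if not candidates:
        return None           # "No leadership balance improvements found"
    group = candidates[0]     # the real balancer picks the move that best reduces skew
    muted.add(group)          # the balancer mutes every group it moves
    return group

groups = set(range(12))       # only 12 partitions/raft groups in the failing runs
muted = set()
transfers = 0
while pick_transfer(groups, muted) is not None:
    transfers += 1
# After 12 transfers every group is muted, so a node leaving maintenance
# mode cannot be handed any leadership until the mute timeouts expire.
print(f"balancer stalls after {transfers} transfers")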

ztlpn added the ci-rca/test (CI Root Cause Analysis - Test Issue) label on Jul 16, 2024
ztlpn self-assigned this on Jul 16, 2024
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Jul 17, 2024
Previously, when this test ran in CI, there were not many partitions (12), and the
leadership balancer could mute them all after several ticks (it mutes the partitions
that it moves). After that it cannot take any further action for some time, and the
node that comes out of maintenance mode won't acquire any new leaders.

Increase the number of partitions in the test to avoid that.

Fixes: redpanda-data#13650
(cherry picked from commit a645618)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Jul 17, 2024
Previously, when this test ran in CI, there were not many partitions (12), and the
leadership balancer could mute them all after several ticks (it mutes the partitions
that it moves). After that it cannot take any further action for some time, and the
node that comes out of maintenance mode won't acquire any new leaders.

Increase the number of partitions in the test to avoid that.

Fixes: redpanda-data#13650
(cherry picked from commit a645618)
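
As a hedged sketch of the kind of change the fix describes, assuming an rptest-style topic declaration (the import path, the TopicSpec usage, and the partition counts here are assumptions for illustration, not the values from the actual commit): give the test topic enough partitions that some raft groups always remain unmuted.

# Assumed rptest-style topic spec; counts are illustrative only.
from rptest.clients.types import TopicSpec

# before: so few partitions that the leadership balancer can mute them all
# topics = [TopicSpec(partition_count=4, replication_factor=3)]

# after: enough partitions that unmuted groups remain at any tick, so a node
# leaving maintenance mode can reacquire leadership before the test's
# wait_until times out
topics = [TopicSpec(partition_count=20, replication_factor=3)]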