CI Failure (timeout while acquiring leadership) in MaintenanceTest.test_maintenance_sticky #13650

Closed · abhijat opened this issue Sep 25, 2023 · 15 comments · Fixed by #21435

Labels: area/replication, ci-failure, ci-rca/test (CI Root Cause Analysis - Test Issue), kind/bug (Something isn't working)

abhijat (Contributor) commented Sep 25, 2023

https://buildkite.com/redpanda/redpanda/builds/37568

Module: rptest.tests.maintenance_test
Class: MaintenanceTest
Method: test_maintenance_sticky
Arguments: {
    "use_rpk": false
}
test_id:    MaintenanceTest.test_maintenance_sticky
status:     FAIL
run time:   225.985 seconds

TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/maintenance_test.py", line 228, in test_maintenance_sticky
    self._maintenance_disable(node)
  File "/root/tests/rptest/tests/maintenance_test.py", line 198, in _maintenance_disable
    wait_until(lambda: self._has_leadership_role(node),
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

JIRA Link: CORE-1465

abhijat added the kind/bug and ci-failure labels on Sep 25, 2023
abhijat (Contributor, Author) commented Sep 25, 2023

One warning stands out in the logs from docker-rp-11:

TRACE 2023-09-24 07:10:30,072 [shard 0:main] cluster - health_monitor_backend.cc:532 - unable to get node health report from 2 - rpc::errc::exponential_backoff
WARN  2023-09-24 07:10:30,072 [shard 0:main] cluster - health_monitor_backend.cc:542 - unable to get node health report from 2 - rpc::errc::exponential_backoff, marking node as down

abhijat changed the title from "CI Failure (timeout while acquiring leadership) in Class.method" to "CI Failure (timeout while acquiring leadership) in MaintenanceTest.test_maintenance_sticky" on Sep 25, 2023
ztlpn (Contributor) commented Jul 16, 2024

The test fails because there are only a few partitions (12) and the leadership balancer mutes them all after several ticks (it mutes every partition whose leadership it moves):

DEBUG 2024-06-02 08:22:23,922 [shard 0:main] cluster - leader_balancer.cc:511 - No leadership balance improvements found with total delta 32, number of muted groups 12

Once all groups are muted, the balancer can take no further action for some time, so the node that comes out of maintenance mode does not acquire any new leaders before the test times out (see the sketch below).

The fix, I think, is to increase the number of partitions.
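
Below is a minimal, illustrative Python sketch of that stall (toy code, not Redpanda or rptest source): every transfer mutes the group it moves, so with only 12 groups a handful of ticks mutes everything and the balancer makes no further progress until the mutes expire.

# Toy model of the stall: pick_transfer() stands in for one balancer move.
def pick_transfer(groups, muted):
    candidates = [g for g in groups if g not in muted]
    if not candidates:
        return None           # "No leadership balance improvements found"
    group = candidates[0]     # the real balancer picks the move that best reduces skew
    muted.add(group)          # the balancer mutes every group it moves
    return group

groups = set(range(12))       # only 12 partitions/raft groups in the failing runs
muted = set()
transfers = 0
while pick_transfer(groups, muted) is not None:
    transfers += 1
# After 12 transfers every group is muted, so a node leaving maintenance
# mode cannot be handed any leadership until the mute timeouts expire.
print(f"balancer stalls after {transfers} transfers")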

ztlpn added the ci-rca/test (CI Root Cause Analysis - Test Issue) label on Jul 16, 2024
ztlpn self-assigned this on Jul 16, 2024
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Jul 17, 2024
Previously, when this test ran in CI, there were not many partitions (12), and the
leadership balancer could mute them all after several ticks (it mutes the partitions
that it moves). After that it cannot take any further action for some time, and the
node that comes out of maintenance mode won't acquire any new leaders.

Increase the number of partitions in the test to avoid that.

Fixes: redpanda-data#13650
(cherry picked from commit a645618)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Jul 17, 2024
Previously, when this test ran in CI, there were not many partitions (12), and the
leadership balancer could mute them all after several ticks (it mutes the partitions
that it moves). After that it cannot take any further action for some time, and the
node that comes out of maintenance mode won't acquire any new leaders.

Increase the number of partitions in the test to avoid that.

Fixes: redpanda-data#13650
(cherry picked from commit a645618)
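
As a hedged sketch of the kind of change the fix describes, assuming an rptest-style topic declaration (the import path, the TopicSpec usage, and the partition counts here are assumptions for illustration, not the values from the actual commit): give the test topic enough partitions that some raft groups always remain unmuted.

# Assumed rptest-style topic spec; counts are illustrative only.
from rptest.clients.types import TopicSpec

# before: so few partitions that the leadership balancer can mute them all
# topics = [TopicSpec(partition_count=4, replication_factor=3)]

# after: enough partitions that unmuted groups remain at any tick, so a node
# leaving maintenance mode can reacquire leadership before the test's
# wait_until times out
topics = [TopicSpec(partition_count=20, replication_factor=3)]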