Failure in PartitionBalancerTest.test_rack_awareness #5795

Closed
ztlpn opened this issue Aug 2, 2022 · 8 comments · Fixed by #5834 or #11342
Labels: ci-failure, kind/bug (Something isn't working), sev/medium (Bugs that do not meet criteria for high or critical, but are more severe than low)

Comments

ztlpn commented Aug 2, 2022

https://buildkite.com/redpanda/redpanda/builds/13399#0182598f-ee62-46da-900f-15a299898705

Module: rptest.tests.partition_balancer_test
Class: PartitionBalancerTest
Method: test_rack_awareness
Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 244, in test_rack_awareness
    self.run_node_stop_start_chain(steps=3,
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 132, in run_node_stop_start_chain
    additional_check()
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 234, in check_rack_placement
    assert (
AssertionError: bad rack placement {1, 2} for partition id: 0 (replicas: [3, 5, 6])
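
For context, the failing assertion verifies that each partition's replica set is spread across the expected number of distinct racks. Below is a minimal sketch of such a check with an illustrative node-to-rack layout; the names and layout are hypothetical, not the actual ducktape test code.

```python
# Hypothetical sketch of a rack-placement check in the spirit of
# check_rack_placement in partition_balancer_test.py (names and layout are
# illustrative, not the real test code).

def check_rack_placement(replicas_by_partition, rack_for_node):
    """replicas_by_partition: partition id -> list of replica node ids
    rack_for_node: node id -> rack id"""
    total_racks = len(set(rack_for_node.values()))
    for partition_id, replicas in replicas_by_partition.items():
        racks = {rack_for_node[n] for n in replicas}
        # Replicas should span as many distinct racks as possible.
        expected = min(len(replicas), total_racks)
        assert len(racks) == expected, (
            f"bad rack placement {racks} for partition id: {partition_id} "
            f"(replicas: {replicas})")


# Illustrative layout with 6 nodes over 3 racks; replicas [3, 5, 6] end up on
# only two racks, so the assertion fails with a message like the one above.
rack_for_node = {1: 3, 2: 3, 3: 1, 4: 1, 5: 2, 6: 2}
check_rack_placement({0: [3, 5, 6]}, rack_for_node)
```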
ztlpn added the kind/bug and ci-failure labels on Aug 2, 2022
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Aug 4, 2022
Rack awareness is enforced in Redpanda with a soft allocation constraint, so that partitions can still be allocated even if there are not enough nodes in distinct racks. While the rack awareness test was running, the partition balancer was able to calculate movements before the previously stopped node was reported alive. As a result, a partition could be allocated while two nodes were unavailable, and when those two nodes happened to be in the same rack, the rack awareness constraint could not be satisfied.

Fixes: redpanda-data#5795

Signed-off-by: Michal Maslanka <michal@redpanda.com>
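
To illustrate the soft-constraint behaviour described in the commit message, here is a minimal Python sketch (not Redpanda's C++ allocator): candidate nodes whose rack already holds a replica are penalized rather than excluded, so allocation still succeeds when fewer racks than replicas are available.

```python
# Sketch of a soft rack-diversity constraint (illustrative only; Redpanda's
# allocator is C++ and more involved). Nodes in a rack that already holds a
# replica are penalized, not excluded, so allocation never fails outright.

def pick_replicas(nodes, rack_for_node, replication_factor):
    """nodes: available node ids; returns replication_factor chosen node ids."""
    chosen, used_racks = [], set()
    for _ in range(replication_factor):
        candidates = [n for n in nodes if n not in chosen]

        # Soft constraint: prefer a node in an unused rack, but fall back to a
        # used rack instead of failing the allocation.
        def score(node):
            return 0 if rack_for_node[node] not in used_racks else 1

        best = min(candidates, key=score)
        chosen.append(best)
        used_racks.add(rack_for_node[best])
    return chosen


# With only two racks reachable (e.g. both nodes of one rack down), a
# 3-replica partition is still allocated, just without full rack diversity.
racks = {1: "A", 2: "A", 3: "B", 4: "B"}
print(pick_replicas([1, 2, 3, 4], racks, replication_factor=3))  # [1, 3, 2]
```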

dotnwat commented Jun 7, 2023

logs and stuff

https://ci-artifacts.dev.vectorized.cloud/vtools/7962/018893bf-3e5d-4e72-b50c-a51cc1bab43a/vbuild/ducktape/results/2023-06-07--001/PartitionBalancerTest/test_rack_awareness/6/

https://buildkite.com/redpanda/vtools/builds/7962#_

====================================================================================================
test_id:    rptest.tests.partition_balancer_test.PartitionBalancerTest.test_rack_awareness
status:     FAIL
run time:   2 minutes 35.302 seconds


    AssertionError("bad rack placement {'A', 'B'} for partition id: 0 (replicas: [1, 2, 3])")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/root/tests/rptest/services/cluster.py", line 79, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 490, in test_rack_awareness
    self.check_rack_placement(self.topic, rack_layout)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 456, in check_rack_placement
    assert (
AssertionError: bad rack placement {'A', 'B'} for partition id: 0 (replicas: [1, 2, 3])

dotnwat reopened this on Jun 7, 2023

ztlpn commented Jun 7, 2023

I looked briefly into this. The problem is that at some point the balancer marked both nodes in rack C as unavailable:

[INFO  - 2023-06-07 02:59:48,885 - partition_balancer_test - check - lineno:167]: partition balancer status: {'status': 'ready', 'violations': {'unavailable_nodes': [6, 5]}, 'seconds_since_last_tick': 0, 'current_reassignments_count': 32}, req_start: 1686106788.883177, start: 1686106776.8720794

even though node 5 was made available some time before:

[INFO  - 2023-06-07 02:59:36,819 - partition_balancer_test - make_available - lineno:246]: made docker-rp-5 available

From balancer logs it is clear that the balancer didn't notice that node 5 was up (redpanda-1):

DEBUG 2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:179 - node 4: 125 ms since last heartbeat
DEBUG 2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:179 - node 3: 125 ms since last heartbeat
DEBUG 2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:179 - node 6: 11032 ms since last heartbeat
INFO  2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:186 - node 6 is unresponsive, time since last status reply: 11032 ms
DEBUG 2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:179 - node 2: 125 ms since last heartbeat
DEBUG 2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:179 - node 5: 27217 ms since last heartbeat
INFO  2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:186 - node 5 is unresponsive, time since last status reply: 27217 ms
DEBUG 2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:179 - node 1: 0 ms since last heartbeat

This is most probably related to a change where we started relying on node_status_table for node availability checks. And it is the same problem as described here.
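
For illustration, the unresponsiveness decision in the log above boils down to a staleness check of this shape (a hedged sketch assuming a simple threshold on the time since the last status reply; the real logic lives in partition_balancer_planner.cc and node_status_table, and the threshold value here is made up):

```python
# Sketch of an availability check driven by "time since last status reply"
# (illustrative only; the real check is in partition_balancer_planner.cc).
# If the status table keeps a stale timestamp after a node comes back, the
# node stays on the unavailable list even though it is already up.
import time

UNAVAILABLE_TIMEOUT_MS = 7000  # hypothetical threshold, not Redpanda's default

def unavailable_nodes(last_status_reply_ms, now_ms=None):
    """last_status_reply_ms: node id -> timestamp (ms) of the last status reply."""
    if now_ms is None:
        now_ms = time.time() * 1000
    return [node for node, last_reply in last_status_reply_ms.items()
            if now_ms - last_reply > UNAVAILABLE_TIMEOUT_MS]


# A stale entry for node 5 (27217 ms old) keeps it listed as unavailable even
# though the node was made available about 11 seconds earlier.
now = 1_000_000
print(unavailable_nodes({1: now, 5: now - 27217, 6: now - 11032}, now_ms=now))  # [5, 6]
```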

Sev/medium as this is a Redpanda bug.

ztlpn added the sev/medium label on Jun 7, 2023

ztlpn commented Jun 7, 2023

Assigning @bharathv since he is already looking into this.

andijcr commented Jun 7, 2023

rystsov commented Jun 8, 2023

andijcr commented Jun 9, 2023

ztlpn commented Jun 14, 2023
