Failure in PartitionBalancerTest.test_rack_awareness #5795

Closed
ztlpn opened this issue Aug 2, 2022 · 8 comments · Fixed by #5834 or #11342
Labels: ci-failure, kind/bug (Something isn't working), sev/medium (Bugs that do not meet criteria for high or critical, but are more severe than low)

Comments

ztlpn commented Aug 2, 2022

https://buildkite.com/redpanda/redpanda/builds/13399#0182598f-ee62-46da-900f-15a299898705

Module: rptest.tests.partition_balancer_test
Class: PartitionBalancerTest
Method: test_rack_awareness
Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 244, in test_rack_awareness
    self.run_node_stop_start_chain(steps=3,
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 132, in run_node_stop_start_chain
    additional_check()
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 234, in check_rack_placement
    assert (
AssertionError: bad rack placement {1, 2} for partition id: 0 (replicas: [3, 5, 6])
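
For context, the failing assertion verifies that each partition's replica set is spread across the expected number of distinct racks. Below is a minimal sketch of such a check with an illustrative node-to-rack layout; the names and layout are hypothetical, not the actual ducktape test code.

```python
# Hypothetical sketch of a rack-placement check in the spirit of
# check_rack_placement in partition_balancer_test.py (names and layout are
# illustrative, not the real test code).

def check_rack_placement(replicas_by_partition, rack_for_node):
    """replicas_by_partition: partition id -> list of replica node ids
    rack_for_node: node id -> rack id"""
    total_racks = len(set(rack_for_node.values()))
    for partition_id, replicas in replicas_by_partition.items():
        racks = {rack_for_node[n] for n in replicas}
        # Replicas should span as many distinct racks as possible.
        expected = min(len(replicas), total_racks)
        assert len(racks) == expected, (
            f"bad rack placement {racks} for partition id: {partition_id} "
            f"(replicas: {replicas})")


# Illustrative layout with 6 nodes over 3 racks; replicas [3, 5, 6] end up on
# only two racks, so the assertion fails with a message like the one above.
rack_for_node = {1: 3, 2: 3, 3: 1, 4: 1, 5: 2, 6: 2}
check_rack_placement({0: [3, 5, 6]}, rack_for_node)
```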
ztlpn added the kind/bug and ci-failure labels on Aug 2, 2022
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Aug 4, 2022
Rack awareness is enforced in Redpanda with a soft allocation constraint, so that partitions can still be allocated even if there are not enough nodes in distinct racks. While the rack awareness test was running, the partition balancer was able to calculate movements before the previously stopped node was reported alive. As a result, a partition could be allocated while two nodes were unavailable, and when those two nodes happened to be in the same rack, the rack awareness constraint could not be satisfied.

Fixes: redpanda-data#5795

Signed-off-by: Michal Maslanka <michal@redpanda.com>
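
To illustrate the soft-constraint behaviour described in the commit message, here is a minimal Python sketch (not Redpanda's C++ allocator): candidate nodes whose rack already holds a replica are penalized rather than excluded, so allocation still succeeds when fewer racks than replicas are available.

```python
# Sketch of a soft rack-diversity constraint (illustrative only; Redpanda's
# allocator is C++ and more involved). Nodes in a rack that already holds a
# replica are penalized, not excluded, so allocation never fails outright.

def pick_replicas(nodes, rack_for_node, replication_factor):
    """nodes: available node ids; returns replication_factor chosen node ids."""
    chosen, used_racks = [], set()
    for _ in range(replication_factor):
        candidates = [n for n in nodes if n not in chosen]

        # Soft constraint: prefer a node in an unused rack, but fall back to a
        # used rack instead of failing the allocation.
        def score(node):
            return 0 if rack_for_node[node] not in used_racks else 1

        best = min(candidates, key=score)
        chosen.append(best)
        used_racks.add(rack_for_node[best])
    return chosen


# With only two racks reachable (e.g. both nodes of one rack down), a
# 3-replica partition is still allocated, just without full rack diversity.
racks = {1: "A", 2: "A", 3: "B", 4: "B"}
print(pick_replicas([1, 2, 3, 4], racks, replication_factor=3))  # [1, 3, 2]
```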

dotnwat commented Jun 7, 2023

logs and stuff

https://ci-artifacts.dev.vectorized.cloud/vtools/7962/018893bf-3e5d-4e72-b50c-a51cc1bab43a/vbuild/ducktape/results/2023-06-07--001/PartitionBalancerTest/test_rack_awareness/6/

https://buildkite.com/redpanda/vtools/builds/7962#_

====================================================================================================
test_id:    rptest.tests.partition_balancer_test.PartitionBalancerTest.test_rack_awareness
status:     FAIL
run time:   2 minutes 35.302 seconds


    AssertionError("bad rack placement {'A', 'B'} for partition id: 0 (replicas: [1, 2, 3])")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/root/tests/rptest/services/cluster.py", line 79, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 490, in test_rack_awareness
    self.check_rack_placement(self.topic, rack_layout)
  File "/root/tests/rptest/tests/partition_balancer_test.py", line 456, in check_rack_placement
    assert (
AssertionError: bad rack placement {'A', 'B'} for partition id: 0 (replicas: [1, 2, 3])

dotnwat reopened this on Jun 7, 2023

ztlpn commented Jun 7, 2023

I looked briefly into this. The problem is that at some point the balancer marked both nodes in rack C as unavailable:

[INFO  - 2023-06-07 02:59:48,885 - partition_balancer_test - check - lineno:167]: partition balancer status: {'status': 'ready', 'violations': {'unavailable_nodes': [6, 5]}, 'seconds_since_last_tick': 0, 'current_reassignments_count': 32}, req_start: 1686106788.883177, start: 1686106776.8720794

even though node 5 was made available some time before:

[INFO  - 2023-06-07 02:59:36,819 - partition_balancer_test - make_available - lineno:246]: made docker-rp-5 available

From balancer logs it is clear that the balancer didn't notice that node 5 was up (redpanda-1):

DEBUG 2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:179 - node 4: 125 ms since last heartbeat
DEBUG 2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:179 - node 3: 125 ms since last heartbeat
DEBUG 2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:179 - node 6: 11032 ms since last heartbeat
INFO  2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:186 - node 6 is unresponsive, time since last status reply: 11032 ms
DEBUG 2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:179 - node 2: 125 ms since last heartbeat
DEBUG 2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:179 - node 5: 27217 ms since last heartbeat
INFO  2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:186 - node 5 is unresponsive, time since last status reply: 27217 ms
DEBUG 2023-06-07 02:59:47,827 [shard 0] cluster - partition_balancer_planner.cc:179 - node 1: 0 ms since last heartbeat

This is most probably related to a change where we started relying on node_status_table for node availability checks. And it is the same problem as described here.
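
For illustration, the unresponsiveness decision in the log above boils down to a staleness check of this shape (a hedged sketch assuming a simple threshold on the time since the last status reply; the real logic lives in partition_balancer_planner.cc and node_status_table, and the threshold value here is made up):

```python
# Sketch of an availability check driven by "time since last status reply"
# (illustrative only; the real check is in partition_balancer_planner.cc).
# If the status table keeps a stale timestamp after a node comes back, the
# node stays on the unavailable list even though it is already up.
import time

UNAVAILABLE_TIMEOUT_MS = 7000  # hypothetical threshold, not Redpanda's default

def unavailable_nodes(last_status_reply_ms, now_ms=None):
    """last_status_reply_ms: node id -> timestamp (ms) of the last status reply."""
    if now_ms is None:
        now_ms = time.time() * 1000
    return [node for node, last_reply in last_status_reply_ms.items()
            if now_ms - last_reply > UNAVAILABLE_TIMEOUT_MS]


# A stale entry for node 5 (27217 ms old) keeps it listed as unavailable even
# though the node was made available about 11 seconds earlier.
now = 1_000_000
print(unavailable_nodes({1: now, 5: now - 27217, 6: now - 11032}, now_ms=now))  # [5, 6]
```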

Sev/medium as this is a Redpanda bug.

ztlpn added the sev/medium label on Jun 7, 2023

ztlpn commented Jun 7, 2023

Assigning @bharathv since he is already looking into this.

andijcr commented Jun 7, 2023

rystsov commented Jun 8, 2023

andijcr commented Jun 9, 2023

ztlpn commented Jun 14, 2023
