
CI Failure (Restart flag did not clear after restart) in ClusterConfigTest.test_restart #8328

Closed
dotnwat opened this issue Jan 20, 2023 · 6 comments · Fixed by #8399
Labels: area/controller · ci-failure · ci-ignore (Automatic ci analysis tools ignore this issue) · kind/bug (Something isn't working) · sev/low (Bugs which are non-functional paper cuts, e.g. typos, issues in log messages)



dotnwat commented Jan 20, 2023

https://buildkite.com/redpanda/redpanda/builds/46890#_

Module: rptest.tests.cluster_config_test
Class:  ClusterConfigTest
Method: test_restart
[DEBUG - 2023-01-20 02:20:04,625 - admin - _request - lineno:328]: Response OK, JSON: [{'node_id': 1, 'restart': True, 'config_version': 2, 'invalid': [], 'unknown': []}, {'node_id': 2, 'restart': True, 'config_version': 2, 'invalid': [], 'unknown': []}, {'node_id': 3, 'restart': True, 'config_version': 2, 'invalid': [], 'unknown': []}]
[ERROR - 2023-01-20 02:20:05,126 - cluster - wrapped - lineno:41]: Test failed, doing failure checks...
Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/cluster_config_test.py", line 279, in test_restart
    self._check_restart_clears()
  File "/root/tests/rptest/tests/cluster_config_test.py", line 248, in _check_restart_clears
    wait_until(lambda: self.admin.get_cluster_config_status()[0]['restart']
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Restart flag did not clear after restart

Notice above that the last config status query before timing out occurred at 02:20:04,625. Below, we can see that the restart flag was eventually flipped and the change was applied at 02:20:04,792 right after the last check.

TRACE 2023-01-20 02:20:04,604 [shard 0] cluster - config_manager.cc:498 - reconcile_status: sending status update to leader: {cluster_status: node 1, version: 2, restart: false (0 invalid, 0 unknown)}
TRACE 2023-01-20 02:20:04,618 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1659 - Append entries request: {raft_group:{0}, commit_index:{22}, term:{5}, prev_log_index:{22}, prev_log_term:{5}, last_visible_index:{22}}
TRACE 2023-01-20 02:20:04,619 [shard 0] storage - storage_resources.cc:215 - stm_take_bytes 463 += 107 (current 5368708657)
TRACE 2023-01-20 02:20:04,619 [shard 0] storage - storage_resources.cc:202 - configuration_manager_take_bytes 926 += 107 (current 5368708194)
TRACE 2023-01-20 02:20:04,621 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2261 - flushed offset updated: 23
DEBUG 2023-01-20 02:20:04,623 [shard 0] admin_api_server - admin_server.cc:365 - [_anonymous] GET http://docker-rp-15:9644/v1/cluster_config/status
TRACE 2023-01-20 02:20:04,623 [shard 0] admin_api_server - admin_server.cc:1167 - status: {cluster_status: node 1, version: 2, restart: true (0 invalid, 0 unknown)}
TRACE 2023-01-20 02:20:04,623 [shard 0] admin_api_server - admin_server.cc:1167 - status: {cluster_status: node 2, version: 2, restart: true (0 invalid, 0 unknown)}
TRACE 2023-01-20 02:20:04,624 [shard 0] admin_api_server - admin_server.cc:1167 - status: {cluster_status: node 3, version: 2, restart: true (0 invalid, 0 unknown)}
TRACE 2023-01-20 02:20:04,789 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1659 - Append entries request: {raft_group:{0}, commit_index:{23}, term:{5}, prev_log_index:{23}, prev_log_term:{5}, last_visible_index:{23}}
TRACE 2023-01-20 02:20:04,789 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2604 - Follower commit index updated 23
TRACE 2023-01-20 02:20:04,789 [shard 0] storage - readers_cache.cc:101 - {redpanda/controller/0} - trying to get reader for: {start_offset:{23}, max_offset:{23}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2023-01-20 02:20:04,790 [shard 0] storage - readers_cache.cc:130 - {redpanda/controller/0} - reader cache miss for: {start_offset:{23}, max_offset:{23}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2023-01-20 02:20:04,790 [shard 0] storage - readers_cache.cc:75 - {redpanda/controller/0} - adding reader [22,23]
TRACE 2023-01-20 02:20:04,792 [shard 0] cluster - config_manager.cc:860 - apply_status: updating node 1: {cluster_status: node 1, version: 2, restart: false (0 invalid, 0 unknown)}

The test will time out after 10 seconds, so what took so long? Here is where the polling starts in the ducktape test:

[DEBUG - 2023-01-20 02:19:55,028 - admin - _request - lineno:305]: Dispatching GET http://docker-rp-21:9644/v1/cluster_config/status

Four seconds later, sending the node status fails with a not_leader error. In this case the config manager sleeps for 5 seconds before trying again, so the initial delay plus the backoff has eaten up 9 out of 10 seconds.

TRACE 2023-01-20 02:19:59,603 [shard 0] cluster - config_manager.cc:498 - reconcile_status: sending status update to leader: {cluster_status: node 1, version: 2, restart: false (0 invalid, 0 unknown)}
TRACE 2023-01-20 02:19:59,604 [shard 0] cluster - config_frontend.cc:118 - set_status from node 1: {cluster_status: node 1, version: 2, restart: false (0 invalid, 0 unknown)}
INFO  2023-01-20 02:19:59,604 [shard 0] cluster - config_manager.cc:507 - reconcile_status: failed applying local status update: raft::errc::not_leader

Finally sending the update to the controller succeeds 5 seconds later, but this is only 25 milliseconds before the final timeout in the ducktape test, which does not leave much room for error.

TRACE 2023-01-20 02:20:04,604 [shard 0] cluster - config_manager.cc:498 - reconcile_status: sending status update to leader: {cluster_status: node 1, version: 2, restart: false (0 invalid, 0 unknown)}

And finally, why did the send that failed above not actually occur until 4 seconds into the start of the polling? Mostly just bad luck, I think. At startup there was no leader, which leads to a 5-second sleep in the reconcile status loop:

TRACE 2023-01-20 02:19:54,603 [shard 0] cluster - config_manager.cc:209 - Starting reconcile_status...
TRACE 2023-01-20 02:19:54,603 [shard 0] cluster - config_manager.cc:492 - reconcile_status: no leader to send status to

That puts us 4 seconds into the start of the ducktape polling, and unfortunately, at that point it looks as if a leadership transfer (or something similar) happened, resulting in the failure to update the controller.
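
To make the interaction concrete, here is a hypothetical, simplified sketch of a reconcile loop with a fixed backoff (this is not the actual config_manager.cc code; the 5-second interval and all helper names are assumptions for illustration). It walks the same timeline as the logs above: one "no leader" wait plus one not_leader retry already consumes roughly the whole 10-second ducktape budget.

```cpp
// Hypothetical sketch, not the real config_manager.cc implementation.
#include <chrono>
#include <cstdio>
#include <optional>
#include <thread>

enum class errc { success, not_leader };

static int attempt = 0;

// Simulated cluster state mirroring the timeline in the logs above:
// no leader at startup, then a send that races a leadership transfer,
// then a send that finally succeeds.
std::optional<int> get_leader_id() {
    if (attempt == 0) return std::nullopt; // startup: no leader elected yet
    return 1;
}

errc send_status_to_leader(int /*leader*/) {
    return attempt == 1 ? errc::not_leader : errc::success;
}

int main() {
    constexpr auto reconcile_interval = std::chrono::seconds(5); // assumed fixed backoff

    for (;;) {
        auto leader = get_leader_id();
        if (!leader) {
            std::puts("reconcile_status: no leader to send status to");
        } else if (send_status_to_leader(*leader) == errc::not_leader) {
            // Leadership moved between the lookup and the send; the loop
            // still waits out the full interval before retrying.
            std::puts("reconcile_status: failed applying local status update: not_leader");
        } else {
            std::puts("reconcile_status: status update applied");
            break;
        }
        ++attempt;
        std::this_thread::sleep_for(reconcile_interval); // ~5 s lost per failed round
    }
    return 0;
}
```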

JIRA Link: CORE-1146

dotnwat added the kind/bug and ci-failure labels on Jan 20, 2023

dotnwat commented Jan 20, 2023

@jcsp there are a couple of ways I can think of to make this more robust:

  1. increase the timeout in the ducktape test to account for this particular timing and leader transfer edge case.
  2. either poll more frequently in the reconcile loop in the case that not_leader is occurring, or subscribe for an up-call on leadership updates that will then asynchronously poke the reconcile loop when the leader info is available.

wdyt?

dotnwat added the sev/low and area/controller labels on Jan 20, 2023

jcsp commented Jan 20, 2023

I think option 2 makes sense: it should only be a few lines of code to pass in a ref to group_manager, register for leadership notifications, and then replace the sleep_abortable with a condition variable wait with a timeout.
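
For illustration only, here is a generic sketch of that pattern using std::condition_variable rather than the Seastar primitives the real code would use; the names (status_reconciler, on_leadership_change, wait_before_retry, the 5-second interval) are assumptions, not the actual redpanda API.

```cpp
// Generic illustration of "replace the fixed sleep with a condition variable
// wait that a leadership notification can interrupt". Not Seastar/redpanda code.
#include <chrono>
#include <condition_variable>
#include <mutex>

class status_reconciler {
public:
    // Registered as a leadership-notification callback (e.g. with the group
    // manager): wakes the reconcile loop as soon as leadership changes.
    void on_leadership_change() {
        {
            std::lock_guard<std::mutex> lk(_m);
            _leadership_changed = true;
        }
        _leadership_cv.notify_all();
    }

    // Replaces the unconditional sleep between reconcile attempts: wait at
    // most the reconcile interval, but return early on a leadership update.
    void wait_before_retry() {
        constexpr auto reconcile_interval = std::chrono::seconds(5); // assumed backoff
        std::unique_lock<std::mutex> lk(_m);
        _leadership_cv.wait_for(lk, reconcile_interval,
                                [this] { return _leadership_changed; });
        _leadership_changed = false;
    }

private:
    std::mutex _m;
    std::condition_variable _leadership_cv;
    bool _leadership_changed = false;
};
```

The effect is that a leadership transfer costs only the notification latency rather than the remainder of a fixed 5-second sleep, which is exactly the window this test lost.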


rystsov commented Jan 26, 2023

@dotnwat why do you think it's sev/low? (Not criticizing, asking mostly for educational purposes and to make sure we have an explicit reason recorded in case the conditions change and we need to re-evaluate.)


dotnwat commented Jan 27, 2023

@rystsov this is not a bug in Redpanda nor a problem with the test; it was bad luck on timing by a few milliseconds because of a leadership transfer. The fix could be considered more of an optimization.

dotnwat reopened this on Mar 27, 2024
dotnwat mentioned this issue on Mar 27, 2024
dotnwat added the ci-ignore label on Mar 29, 2024

dotnwat commented Mar 29, 2024

Updated the BK link to the most recent failure; the original one was so old it had been deleted in BK, so PT was failing.

piyushredpanda commented:

Not seen in at least two months, closing
