
CI Failure (Restart flag did not clear after restart) in ClusterConfigTest.test_restart #8328

Closed
dotnwat opened this issue Jan 20, 2023 · 6 comments · Fixed by #8399
Labels: area/controller · ci-failure · ci-ignore (Automatic ci analysis tools ignore this issue) · kind/bug (Something isn't working) · sev/low (Bugs which are non-functional paper cuts, e.g. typos, issues in log messages)



dotnwat commented Jan 20, 2023

https://buildkite.com/redpanda/redpanda/builds/46890#_

Module: rptest.tests.cluster_config_test
Class:  ClusterConfigTest
Method: test_restart
[DEBUG - 2023-01-20 02:20:04,625 - admin - _request - lineno:328]: Response OK, JSON: [{'node_id': 1, 'restart': True, 'config_version': 2, 'invalid': [], 'unknown': []}, {'node_id': 2, 'restart': True, 'config_version': 2, 'invalid': [], 'unknown': []}, {'node_id': 3, 'restart': True, 'config_version': 2, 'invalid': [], 'unknown': []}]
[ERROR - 2023-01-20 02:20:05,126 - cluster - wrapped - lineno:41]: Test failed, doing failure checks...
Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/cluster_config_test.py", line 279, in test_restart
    self._check_restart_clears()
  File "/root/tests/rptest/tests/cluster_config_test.py", line 248, in _check_restart_clears
    wait_until(lambda: self.admin.get_cluster_config_status()[0]['restart']
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Restart flag did not clear after restart

Notice above that the last config status query before timing out occurred at 02:20:04,625. Below, we can see that the restart flag was eventually flipped and the change was applied at 02:20:04,792 right after the last check.

TRACE 2023-01-20 02:20:04,604 [shard 0] cluster - config_manager.cc:498 - reconcile_status: sending status update to leader: {cluster_status: node 1, version: 2, restart: false (0 invalid, 0 unknown)}
TRACE 2023-01-20 02:20:04,618 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1659 - Append entries request: {raft_group:{0}, commit_index:{22}, term:{5}, prev_log_index:{22}, prev_log_term:{5}, last_visible_index:{22}}
TRACE 2023-01-20 02:20:04,619 [shard 0] storage - storage_resources.cc:215 - stm_take_bytes 463 += 107 (current 5368708657)
TRACE 2023-01-20 02:20:04,619 [shard 0] storage - storage_resources.cc:202 - configuration_manager_take_bytes 926 += 107 (current 5368708194)
TRACE 2023-01-20 02:20:04,621 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2261 - flushed offset updated: 23
DEBUG 2023-01-20 02:20:04,623 [shard 0] admin_api_server - admin_server.cc:365 - [_anonymous] GET http://docker-rp-15:9644/v1/cluster_config/status
TRACE 2023-01-20 02:20:04,623 [shard 0] admin_api_server - admin_server.cc:1167 - status: {cluster_status: node 1, version: 2, restart: true (0 invalid, 0 unknown)}
TRACE 2023-01-20 02:20:04,623 [shard 0] admin_api_server - admin_server.cc:1167 - status: {cluster_status: node 2, version: 2, restart: true (0 invalid, 0 unknown)}
TRACE 2023-01-20 02:20:04,624 [shard 0] admin_api_server - admin_server.cc:1167 - status: {cluster_status: node 3, version: 2, restart: true (0 invalid, 0 unknown)}
TRACE 2023-01-20 02:20:04,789 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1659 - Append entries request: {raft_group:{0}, commit_index:{23}, term:{5}, prev_log_index:{23}, prev_log_term:{5}, last_visible_index:{23}}
TRACE 2023-01-20 02:20:04,789 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2604 - Follower commit index updated 23
TRACE 2023-01-20 02:20:04,789 [shard 0] storage - readers_cache.cc:101 - {redpanda/controller/0} - trying to get reader for: {start_offset:{23}, max_offset:{23}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2023-01-20 02:20:04,790 [shard 0] storage - readers_cache.cc:130 - {redpanda/controller/0} - reader cache miss for: {start_offset:{23}, max_offset:{23}, min_bytes:0, max_bytes:18446744073709551615, type_filter:nullopt, first_timestamp:nullopt}
TRACE 2023-01-20 02:20:04,790 [shard 0] storage - readers_cache.cc:75 - {redpanda/controller/0} - adding reader [22,23]
TRACE 2023-01-20 02:20:04,792 [shard 0] cluster - config_manager.cc:860 - apply_status: updating node 1: {cluster_status: node 1, version: 2, restart: false (0 invalid, 0 unknown)}

The test will time out after 10 seconds, so what took so long? Here is where the polling starts in the ducktape test:

[DEBUG - 2023-01-20 02:19:55,028 - admin - _request - lineno:305]: Dispatching GET http://docker-rp-21:9644/v1/cluster_config/status

Four seconds later, sending the node status fails with a not_leader error. In this case the config manager sleeps for 5 seconds before trying again, so the initial delay plus the backoff has eaten up 9 out of 10 seconds.

TRACE 2023-01-20 02:19:59,603 [shard 0] cluster - config_manager.cc:498 - reconcile_status: sending status update to leader: {cluster_status: node 1, version: 2, restart: false (0 invalid, 0 unknown)}
TRACE 2023-01-20 02:19:59,604 [shard 0] cluster - config_frontend.cc:118 - set_status from node 1: {cluster_status: node 1, version: 2, restart: false (0 invalid, 0 unknown)}
INFO  2023-01-20 02:19:59,604 [shard 0] cluster - config_manager.cc:507 - reconcile_status: failed applying local status update: raft::errc::not_leader

Finally sending the update to the controller succeeds 5 seconds later, but this is only 25 milliseconds before the final timeout in the ducktape test, which does not leave much room for error.

TRACE 2023-01-20 02:20:04,604 [shard 0] cluster - config_manager.cc:498 - reconcile_status: sending status update to leader: {cluster_status: node 1, version: 2, restart: false (0 invalid, 0 unknown)}

And finally, why did the send that failed above not actually occur until 4 seconds into the start of the polling? Mostly just bad luck, I think. At startup there was no leader, which leads to a 5-second sleep in the reconcile status loop:

TRACE 2023-01-20 02:19:54,603 [shard 0] cluster - config_manager.cc:209 - Starting reconcile_status...
TRACE 2023-01-20 02:19:54,603 [shard 0] cluster - config_manager.cc:492 - reconcile_status: no leader to send status to

That puts us 4 seconds into the start of the ducktape polling, and unfortunately, at that point it looks as if a leadership transfer (or something similar) happened, resulting in the failure to update the controller.
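
To make the interaction concrete, here is a hypothetical, simplified sketch of a reconcile loop with a fixed backoff (this is not the actual config_manager.cc code; the 5-second interval and all helper names are assumptions for illustration). It walks the same timeline as the logs above: one "no leader" wait plus one not_leader retry already consumes roughly the whole 10-second ducktape budget.

```cpp
// Hypothetical sketch, not the real config_manager.cc implementation.
#include <chrono>
#include <cstdio>
#include <optional>
#include <thread>

enum class errc { success, not_leader };

static int attempt = 0;

// Simulated cluster state mirroring the timeline in the logs above:
// no leader at startup, then a send that races a leadership transfer,
// then a send that finally succeeds.
std::optional<int> get_leader_id() {
    if (attempt == 0) return std::nullopt; // startup: no leader elected yet
    return 1;
}

errc send_status_to_leader(int /*leader*/) {
    return attempt == 1 ? errc::not_leader : errc::success;
}

int main() {
    constexpr auto reconcile_interval = std::chrono::seconds(5); // assumed fixed backoff

    for (;;) {
        auto leader = get_leader_id();
        if (!leader) {
            std::puts("reconcile_status: no leader to send status to");
        } else if (send_status_to_leader(*leader) == errc::not_leader) {
            // Leadership moved between the lookup and the send; the loop
            // still waits out the full interval before retrying.
            std::puts("reconcile_status: failed applying local status update: not_leader");
        } else {
            std::puts("reconcile_status: status update applied");
            break;
        }
        ++attempt;
        std::this_thread::sleep_for(reconcile_interval); // ~5 s lost per failed round
    }
    return 0;
}
```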

JIRA Link: CORE-1146

dotnwat added the kind/bug and ci-failure labels on Jan 20, 2023

dotnwat commented Jan 20, 2023

@jcsp there are a couple of ways I can think of to make this more robust:

  1. increase the timeout in the ducktape test to account for this particular timing and leader transfer edge case.
  2. either poll more frequently in the reconcile loop in the case that not_leader is occurring, or subscribe for an up-call on leadership updates that will then asynchronously poke the reconcile loop when the leader info is available.

wdyt?

dotnwat added the sev/low and area/controller labels on Jan 20, 2023

jcsp commented Jan 20, 2023

I think option 2 makes sense: it should only be a few lines of code to pass in a ref to group_manager, register for leadership notifications, and then replace the sleep_abortable with a condition variable wait with a timeout.
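
For illustration only, here is a generic sketch of that pattern using std::condition_variable rather than the Seastar primitives the real code would use; the names (status_reconciler, on_leadership_change, wait_before_retry, the 5-second interval) are assumptions, not the actual redpanda API.

```cpp
// Generic illustration of "replace the fixed sleep with a condition variable
// wait that a leadership notification can interrupt". Not Seastar/redpanda code.
#include <chrono>
#include <condition_variable>
#include <mutex>

class status_reconciler {
public:
    // Registered as a leadership-notification callback (e.g. with the group
    // manager): wakes the reconcile loop as soon as leadership changes.
    void on_leadership_change() {
        {
            std::lock_guard<std::mutex> lk(_m);
            _leadership_changed = true;
        }
        _leadership_cv.notify_all();
    }

    // Replaces the unconditional sleep between reconcile attempts: wait at
    // most the reconcile interval, but return early on a leadership update.
    void wait_before_retry() {
        constexpr auto reconcile_interval = std::chrono::seconds(5); // assumed backoff
        std::unique_lock<std::mutex> lk(_m);
        _leadership_cv.wait_for(lk, reconcile_interval,
                                [this] { return _leadership_changed; });
        _leadership_changed = false;
    }

private:
    std::mutex _m;
    std::condition_variable _leadership_cv;
    bool _leadership_changed = false;
};
```

The effect is that a leadership transfer costs only the notification latency rather than the remainder of a fixed 5-second sleep, which is exactly the window this test lost.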


rystsov commented Jan 26, 2023

@dotnwat why do you think it's sev/low? (Not criticizing, asking mostly for educational purposes and to make sure we have an explicit reason recorded in case the conditions change and we need to re-evaluate.)


dotnwat commented Jan 27, 2023

@rystsov this is not a bug in Redpanda nor a problem with the test; it was bad luck on timing by a few milliseconds because of a leadership transfer. The fix could be considered more of an optimization.

dotnwat reopened this on Mar 27, 2024
dotnwat mentioned this issue on Mar 27, 2024
dotnwat added the ci-ignore label on Mar 29, 2024

dotnwat commented Mar 29, 2024

Updated the BK link to the most recent failure; the original one was so old it had been deleted in BK, so PT was failing.

piyushredpanda commented:

Not seen in at least two months, closing
